This package estimates linear models with high dimensional categorical variables and/or instrumental variables.

Its objective is similar to the Stata command `reghdfe`

and the R function `felm`

. The package is usually much faster than these two options. The package implements a novel algorithm, which combines projection methods with the conjugate gradient descent.

To install the package,

```
Pkg.add("FixedEffectModels")
```

To estimate a `@model`

, specify a formula with, eventually, a set of fixed effects with the argument `fe`

, a way to compute standard errors with the argument `vcov`

, and a weight variable with `weights`

.

```
using DataFrames, RDatasets, FixedEffectModels
df = dataset("plm", "Cigar")
df[:StatePooled] = pool(df[:State])
df[:YearPooled] = pool(df[:Year])
reg(df, @model(Sales ~ NDI, fe = StatePooled + YearPooled, weights = Pop, vcov = cluster(StatePooled)))
# =====================================================================
# Number of obs: 1380 Degrees of freedom: 31
# R2: 0.804 R2 within: 0.139
# F-Statistic: 13.3481 p-value: 0.000
# Iterations: 6 Converged: true
# =====================================================================
# Estimate Std.Error t value Pr(>|t|) Lower 95% Upper 95%
# ---------------------------------------------------------------------
# NDI -0.00526264 0.00144043 -3.65351 0.000 -0.00808837 -0.00243691
# =====================================================================
```

A typical formula is composed of one dependent variable, exogeneous variables, endogeneous variables, and instrumental variables.

`dependent variable ~ exogenous variables + (endogenous variables ~ instrumental variables)`

Fixed effect variables are indicated with the keyword argument

`fe`

. They must be of type PooledDataArray (use`pool`

to convert a variable to a`PooledDataArray`

).`df[:StatePooled] = pool(df[:State]) # one high dimensional fixed effect fe = StatePooled`

You can add an arbitrary number of high dimensional fixed effects, separated with

`+`

`df[:YearPooled] = pool(df[:Year]) fe = StatePooled + YearPooled`

Interact multiple categorical variables using

`&`

`fe = StatePooled&DecPooled`

Interact a categorical variable with a continuous variable using

`&`

`fe = StatePooled + StatePooled&Year`

Alternative, use

`*`

to add a categorical variable and its interaction with a continuous variable`fe = StatePooled*Year # equivalent to fe = StatePooled + StatePooled&year`

Standard errors are indicated with the keyword argument

`vcov`

.`vcov = robust vcov = cluster(StatePooled) vcov = cluster(StatePooled + YearPooled)`

weights are indicated with the keyword argument

`weights`

`weights = Pop`

Arguments of `@model`

are captured and transformed into expressions. If you want to program with `@model`

, use expression interpolations:

```
using DataFrames, RDatasets, FixedEffectModels
df = dataset("plm", "Cigar")
w = :Pop
reg(df, @model(Sales ~ NDI, weights = $(w)))
```

`reg`

returns a light object. It is composed of

- the vector of coefficients & the covariance matrix
- a boolean vector reporting rows used in the estimation
- a set of scalars (number of observations, the degree of freedoms, r2, etc)
- with the option
`save = true`

, a dataframe aligned with the initial dataframe with residuals and, if the model contains high dimensional fixed effects, fixed effects estimates.

Methods such as `predict`

, `residuals`

are still defined but require to specify a dataframe as a second argument. The problematic size of `lm`

and `glm`

models in R or Julia is discussed here, here, here here (and for absurd consequences, here and there).

Denote the model `y = X β + D θ + e`

where X is a matrix with few columns and D is the design matrix from categorical variables. Estimates for `β`

, along with their standard errors, are obtained in two steps:

`y, X`

are regressed on`D`

by one of these methods- MINRES on the normal equation with
`method = lsmr`

(with a diagonal preconditioner). - sparse factorization with
`method = cholesky`

or`method = qr`

(using the SuiteSparse library)

- MINRES on the normal equation with

The default method`lsmr`

, should be the fastest in most cases. If the method does not converge, frist please get in touch, I'd be interested to hear about your problem. Second use the `method = cholesky`

, which should do the trick.

Estimates for

`β`

, along with their standard errors, are obtained by regressing the projected`y`

on the projected`X`

(an application of the Frisch Waugh-Lovell Theorem)With the option

`save = true`

, estimates for the high dimensional fixed effects are obtained after regressing the residuals of the full model minus the residuals of the partialed out models on`D`

Baum, C. and Schaffer, M. (2013) *AVAR: Stata module to perform asymptotic covariance estimation for iid and non-iid data robust to heteroskedasticity, autocorrelation, 1- and 2-way clustering, and common cross-panel autocorrelated disturbances*. Statistical Software Components, Boston College Department of Economics.

Correia, S. (2014) *REGHDFE: Stata module to perform linear or instrumental-variable regression absorbing any number of high-dimensional fixed effects*. Statistical Software Components, Boston College Department of Economics.

Fong, DC. and Saunders, M. (2011) *LSMR: An Iterative Algorithm for Sparse Least-Squares Problems*. SIAM Journal on Scientific Computing

Gaure, S. (2013) *OLS with Multiple High Dimensional Category Variables*. Computational Statistics and Data Analysis

Kleibergen, F, and Paap, R. (2006) *Generalized reduced rank tests using the singular value decomposition.* Journal of econometrics

Kleibergen, F. and Schaffer, M. (2007) *RANKTEST: Stata module to test the rank of a matrix using the Kleibergen-Paap rk statistic*. Statistical Software Components, Boston College Department of Economics.

06/09/2015

13 days ago

173 commits