
This package estimates linear models with high dimensional categorical variables and/or instrumental variables.

Its objective is similar to the Stata command `reghdfe` and the R function `felm`. The package is usually much faster than these two options. It implements a novel algorithm that combines projection methods with conjugate gradient descent.

To install the package,

```
Pkg.add("FixedEffectModels")
```

To estimate a linear model, specify a formula and, optionally, a set of fixed effects with the argument `fe`, a way to compute standard errors with the argument `vcov`, and a weight variable with `weight`.

```
using DataFrames, RDatasets, FixedEffectModels
df = dataset("plm", "Cigar")
df[:StatePooled] = pool(df[:State])
df[:YearPooled] = pool(df[:Year])
@reg df Sales ~ NDI fe = StatePooled + YearPooled weight = Pop vcov = cluster(StatePooled)
# =====================================================================
# Number of obs 1380 Degree of freedom 93
# R2 0.245 R2 Adjusted 0.190
# F Stat 417.342 p-val 0.000
# Iterations 2 Converged: true
# =====================================================================
# Estimate Std.Error t value Pr(>|t|) Lower 95% Upper 95%
# ---------------------------------------------------------------------
# NDI -0.00568607 0.000278334 -20.429 0.000 -0.00623211 -0.00514003
# =====================================================================
```

A typical formula is composed of one dependent variable, exogenous variables, endogenous variables, and instrumental variables:

`dependent variable ~ exogenous variables + (endogenous variables ~ instrumental variables)`
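For instance, using the Cigar dataset loaded above, one could instrument `Price` with `Pimin` (a hypothetical specification, shown only to illustrate the syntax):

```
@reg df Sales ~ NDI + (Price ~ Pimin) fe = StatePooled + YearPooled
```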

Fixed effect variables are indicated with the keyword argument `fe`. They must be of type `PooledDataArray` (use `pool` to convert a variable to a `PooledDataArray`).

```
df[:StatePooled] = pool(df[:State])
# one high dimensional fixed effect
fe = StatePooled
```

You can add an arbitrary number of high dimensional fixed effects, separated with `+`:

```
df[:YearPooled] = pool(df[:Year])
fe = StatePooled + YearPooled
```

Interact multiple categorical variables using `&`:

```
fe = StatePooled&DecPooled
```

Interact a categorical variable with a continuous variable using `&`:

```
fe = StatePooled + StatePooled&Year
```

Alternatively, use `*` to add a categorical variable and its interaction with a continuous variable:

```
fe = StatePooled*Year  # equivalent to fe = StatePooled + StatePooled&Year
```

Standard errors are indicated with the keyword argument `vcov`:

```
vcov = robust()
vcov = cluster(StatePooled)
vcov = cluster(StatePooled + YearPooled)
```

Weights are indicated with the keyword argument `weight`:

```
weight = Pop
```

`reg` returns a light object. It is composed of:

- the vector of coefficients and the covariance matrix
- a boolean vector reporting the rows used in the estimation
- a set of scalars (number of observations, degrees of freedom, R2, etc.)
- with the option `save = true`, a dataframe aligned with the initial dataframe, containing residuals and, if the model contains high dimensional fixed effects, fixed effect estimates.
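For example, combining the options documented above, a call that keeps residuals (and estimated fixed effects) aligned with the original dataframe would look like:

```
result = @reg df Sales ~ NDI fe = StatePooled + YearPooled save = true
```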

Methods such as `predict` and `residuals` are still defined but require specifying a dataframe as a second argument. The problematic size of `lm` and `glm` models in R or Julia is discussed here, here, here, and here (and for absurd consequences, here and there).
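Based on the description above, usage looks like this (a sketch; `result` is assumed to hold the object returned by `@reg`):

```
result = @reg df Sales ~ NDI fe = StatePooled
predict(result, df)    # fitted values, computed against the supplied dataframe
residuals(result, df)  # residuals, computed against the supplied dataframe
```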

Denote the model `y = X β + D θ + e`, where X is a matrix with few columns and D is the design matrix from categorical variables. Estimates for `β`, along with their standard errors, are obtained in two steps.

First, `y` and `X` are regressed on `D` by one of these methods:

- MINRES on the normal equation with `method = :lsmr` (with a diagonal preconditioner)
- sparse factorization with `method = :cholesky` or `method = :qr` (using the SuiteSparse library)

The default method, `:lsmr`, should be the fastest in most cases. If the method does not converge, first, please get in touch; I'd be interested to hear about your problem. Second, use `method = :cholesky`, which should do the trick.

Second, estimates for `β`, along with their standard errors, are obtained by regressing the projected `y` on the projected `X` (an application of the Frisch-Waugh-Lovell theorem).

With the option `save = true`, estimates for the high dimensional fixed effects are obtained by regressing the residuals of the full model minus the residuals of the partialed out models on `D`.
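The two-step logic can be checked numerically on toy data. The sketch below uses only base Julia linear algebra and simulated data (not the package itself): partialling the dummies out of `y` and `X` and then regressing residuals on residuals recovers the same coefficient as the full joint regression.

```
# Toy data: y = 2*X + group effects + noise
n = 1000
g = rand(1:10, n)                               # a categorical variable with 10 levels
D = Float64[g[i] == k for i in 1:n, k in 1:10]  # its dummy (design) matrix
X = randn(n)
y = 2 * X + D * collect(1.0:10.0) + randn(n)

# One-step estimate: regress y on [X D] jointly; first coefficient is beta
beta_full = ([X D] \ y)[1]

# Two-step (Frisch-Waugh-Lovell): partial D out of y and X,
# then regress residuals on residuals
rX = X - D * (D \ X)
ry = y - D * (D \ y)
beta_fwl = sum(rX .* ry) / sum(rX .* rX)

# beta_full and beta_fwl coincide up to floating point error
```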

Baum, C. and Schaffer, M. (2013) *AVAR: Stata module to perform asymptotic covariance estimation for iid and non-iid data robust to heteroskedasticity, autocorrelation, 1- and 2-way clustering, and common cross-panel autocorrelated disturbances*. Statistical Software Components, Boston College Department of Economics.

Correia, S. (2014) *REGHDFE: Stata module to perform linear or instrumental-variable regression absorbing any number of high-dimensional fixed effects*. Statistical Software Components, Boston College Department of Economics.

Fong, DC. and Saunders, M. (2011) *LSMR: An Iterative Algorithm for Sparse Least-Squares Problems*. SIAM Journal on Scientific Computing

Gaure, S. (2013) *OLS with Multiple High Dimensional Category Variables*. Computational Statistics and Data Analysis

Kleibergen, F. and Paap, R. (2006) *Generalized reduced rank tests using the singular value decomposition.* Journal of Econometrics

Kleibergen, F. and Schaffer, M. (2007) *RANKTEST: Stata module to test the rank of a matrix using the Kleibergen-Paap rk statistic*. Statistical Software Components, Boston College Department of Economics.