Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation for Julia.



Build Status CovarianceMatrices CovarianceMatrices CovarianceMatrices Coverage Status codecov.io

Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation for Julia.


The package is registered on METADATA, so to install



This package provides types and methods useful to obtain consistent estimates of the long run covariance matrix of a random process.

Three classes of estimators are considered:

  1. HAC - heteroskedasticity and autocorrelation consistent (Andrews, 1996; Newey and West, 1994)
  2. HC - hetheroskedasticity (White, 1982)
  3. CRVE - cluster robust (Arellano, 1986)

The typical application of these estimators is to conduct robust inference about parameters of a model. This is accomplished by extending methods defined in StatsBase.jl and GLM.jl.

Quick tour

HAC (Heteroskedasticity and Autocorrelation Consistent)

Available kernel types are:

  • TruncatedKernel
  • BartlettKernel
  • ParzenKernel
  • TukeyHanningKernel
  • QuadraticSpectralKernel

For example, ParzenKernel(NeweyWest) return an instance of TruncatedKernel parametrized by NeweyWest, the type that corresponds to the optimal bandwidth calculated following Newey and West (1994). Similarly, ParzenKernel(Andrews) corresponds to the optimal bandwidth obtained in Andrews (1991). If the bandwidth is known, it can be directly passed, i.e. TruncatedKernel(2).

The examples below clarify the API, that is however relatively easy to use.

Long run variance of the regression coefficient

In the regression context, the function vcov does all the work:

vcov(::DataFrameRegressionModel, ::HAC; prewhite = true)

Consider the following artificial data (a regression with autoregressive error component):

using CovarianceMatrices
using DataFrames
n = 500
x = randn(n,5)
u = Array{Float64}(2*n)
u[1] = rand()
for j in 2:2*n
    u[j] = 0.78*u[j-1] + randn()
u = u[n+1:2*n]    
y = 0.1 + x*[0.2, 0.3, 0.0, 0.0, 0.5] + u            

df = DataFrame()
df[:y] = y
for j in enumerate([:x1, :x2, :x3, :x4, :x5])
    df[j[2]] = x[:,j[1]]

Using the data in df, the coefficient of the regression can be estimated using GLM

lm1 = glm(y~x1+x2+x3+x4+x5, df, Normal(), IdentityLink())

To get a consistent estimate of the long run variance of the estimated coefficients using a Quadratic Spectral kernel with automatic bandwidth selection à la Andrews

vcov(lm1, QuadraticSpectralKernel(Andrews), prewhite = false)

If one wants to estimate the long-time variance using the same kernel, but with a bandwidth selected à la Newey-West

vcov(lm1, QuadraticSpectralKernel(NeweyWest), prewhite = false)

The standard errors can be obtained by the stderr method

stderr(::DataFrameRegressionModel, ::HAC; prewhite::Bool)

Sometime is useful to access the bandwidth selected by the automatic procedures. This can be done using the optimalbw method

optimalbw(NeweyWest, QuadraticSpectralKernel, lm1; prewhite = false)
optimalbw(Andrews, QuadraticSpectralKernel, lm1; prewhite = false)

Long run variance of the average of the process

Sometime interest lies in estimating the long-run variance of the average of the process. At the moment this can be done by carrying out a regression on a constant (the sample mean of the realization of the process) and using vcov or stderr to obtain a consistent variance.

lm2 = glm(u~1, df, Normal(), IdentityLink())
vcov(lm1, ParzenKernel(NeweyWest), prewhite = false)
stderr(lm1, ParzenKernel(NeweyWest), prewhite = false)

HC (Heteroskedastic consistent)

As in the HAC case, vcov and stderr are the main functions. They know get as argument the type of robust variance being sought

vcov(::DataFrameRegressionModel, ::HC)

Where HC is an abstract type with the following concrete types:

  • HC0
  • HC1 (this is HC0 with the degree of freedom adjustment)
  • HC2
  • HC3
  • HC4
  • HC4m
  • HC5 html using CovarianceMatrices using DataFrames using GLM # A Gamma example, from McCullagh & Nelder (1989, pp. 300-2) # The weights are added just to test the interface and are not part # of the original data clotting = DataFrame( u = log([5,10,15,20,30,40,60,80,100]), lot1 = [118,58,42,35,27,25,21,19,18], lot2 = [69,35,26,21,18,16,13,12,12], w = 9*[1/8, 1/9, 1/25, 1/6, 1/14, 1/25, 1/15, 1/13, 0.3022039] ) wOLS = fit(GeneralizedLinearModel, lot1~u, clotting, Normal(), wts = array(clotting[:w])) vcov(wOLS, HC0 vcov(wOLS, HC1) vcov(wOLS, HC2) vcov(wOLS, HC3) vcov(wOLS, HC4) vcov(wOLS, HC4m) vcov(wOLS, HC5)

CRHC (Cluster robust heteroskedasticty consistent)

The API of this class of variance estimators is subject to change, so please use with care. The difficulty is that CRHC type needs to have access to the variable along which dimension the clustering mast take place. For the moment, the following approach works --- as long as no missing values are present in the original dataframe.

using RDatasets
df = dataset("plm", "Grunfeld")
lm = glm(Inv~Value+Capital, df, Normal(), IdentityLink())
vcov(lm, CRHC1(convert(Array, df[:Firm])))
stderr(lm, CRHC1(convert(Array, df[:Firm])))

First Commit


Last Touched

about 2 months ago


196 commits

Used By: