\chapter{Robust covariance matrix estimation}
\label{chap-robust-vcv}

\section{Introduction}
\label{vcv-intro}

Consider (once again) the linear regression model
%
\begin{equation}
\label{eq:ols-again}
y = X\beta + u
\end{equation}
%
where $y$ and $u$ are $T$-vectors, $X$ is a $T \times k$ matrix of
regressors, and $\beta$ is a $k$-vector of parameters.  As is well
known, the estimator of $\beta$ given by Ordinary Least Squares (OLS)
is
%
\begin{equation}
\label{eq:ols-betahat}
\hat{\beta} = (X'X)^{-1} X'y
\end{equation}
%
If the condition $E(u|X) = 0$ is satisfied, this is an unbiased
estimator; under somewhat weaker conditions the estimator is biased
but consistent.  It is straightforward to show that when the OLS
estimator is unbiased (that is, when $E(\hat{\beta}-\beta) = 0$), its
variance is
%
\begin{equation}
\label{eq:ols-varb}
\mbox{Var}(\hat{\beta}) = 
  E\left((\hat{\beta}-\beta)(\hat{\beta}-\beta)'\right) 
  = (X'X)^{-1} X' \Omega X (X'X)^{-1}
\end{equation}
%
where $\Omega = E(uu')$ is the covariance matrix of the error terms.

Under the assumption that the error terms are independently and
identically distributed (iid) we can write $\Omega = \sigma^2 I$,
where $\sigma^2$ is the (common) variance of the errors (and the
covariances are zero).  In that case (\ref{eq:ols-varb}) simplifies to
the ``classical'' formula,
%
\begin{equation}
\label{eq:ols-classical-varb}
\mbox{Var}(\hat{\beta}) = \sigma^2(X'X)^{-1}
\end{equation}

If the iid assumption is not satisfied, two things follow.  First, it
is possible in principle to construct a more efficient estimator than
OLS --- for instance some sort of Feasible Generalized Least Squares
(FGLS).  Second, the simple ``classical'' formula for the variance of
the least squares estimator is no longer correct, and hence the
conventional OLS standard errors --- which are just the square roots
of the diagonal elements of the matrix defined by
(\ref{eq:ols-classical-varb}) --- do not provide valid means of
statistical inference.

In the recent history of econometrics there are broadly two approaches
to the problem of non-iid errors.  The ``traditional'' approach is to
use an FGLS estimator.  For example, if the departure from the iid
condition takes the form of time-series dependence, and if one
believes that this could be modeled as a case of first-order
autocorrelation, one might employ an AR(1) estimation method such as
Cochrane--Orcutt, Hildreth--Lu, or Prais--Winsten.  If the problem is
that the error variance is non-constant across observations, one might
estimate the variance as a function of the independent variables and
then perform weighted least squares, using as weights the reciprocals
of the estimated variances.

While these methods are still in use, an alternative approach has
found increasing favor: that is, use OLS but compute standard errors
(or more generally, covariance matrices) that are robust with respect
to deviations from the iid assumption.  This is typically combined
with an emphasis on using large datasets --- large enough that the
researcher can place some reliance on the (asymptotic) consistency
property of OLS.  This approach has been enabled by the availability
of cheap computing power.  The computation of robust standard errors
and the handling of very large datasets were daunting tasks at one
time, but now they are unproblematic.  The other point favoring the
newer methodology is that while FGLS offers an efficiency advantage in
principle, it often involves making additional statistical assumptions
which may or may not be justified, which may not be easy to test
rigorously, and which may threaten the consistency of the estimator
--- for example, the ``common factor restriction'' that is implied by
traditional FGLS ``corrections'' for autocorrelated errors.

James Stock and Mark Watson's \textit{Introduction to Econometrics}
illustrates this approach at the level of undergraduate instruction:
many of the datasets they use comprise thousands or tens of thousands
of observations; FGLS is downplayed; and robust standard errors are
reported as a matter of course.  In fact, the discussion of the
classical standard errors (labeled ``homoskedasticity-only'') is
confined to an Appendix.

Against this background it may be useful to set out and discuss all
the various options offered by \app{gretl} in respect of robust
covariance matrix estimation.  The first point to notice is that
\app{gretl} produces ``classical'' standard errors by default (in all
cases apart from GMM estimation).  In script mode you can get robust
standard errors by appending the \option{robust} flag to estimation
commands.  In the GUI program the model specification dialog usually
contains a ``Robust standard errors'' check box, along with a
``configure'' button that is activated when the box is checked.  The
configure button takes you to a configuration dialog (which can also
be reached from the main menu bar: Tools $\rightarrow$ Preferences
$\rightarrow$ General $\rightarrow$ HCCME).  There you can select from
a set of possible robust estimation variants, and can also choose to
make robust estimation the default.
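
For example, the following script fragment first estimates a model
with classical standard errors and then re-estimates it with the
currently configured robust variant (the series names \texttt{y},
\texttt{x1} and \texttt{x2} are purely hypothetical):
%
\begin{code}
# hypothetical series names, for illustration only
ols y const x1 x2            # classical standard errors
ols y const x1 x2 --robust   # robust standard errors
\end{code}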

The specifics of the available options depend on the nature of the
data under consideration --- cross-sectional, time series or panel ---
and also, to some extent, on the choice of estimator.  (Although we
introduced robust standard errors in the context of OLS above, they
may be used in conjunction with other estimators too.)  The following
three sections of this chapter deal with matters that are specific to
the three sorts of data just mentioned.  Note that additional details
regarding covariance matrix estimation in the context of GMM are given
in chapter~\ref{chap:gmm}.

We close this introduction with a brief statement of what ``robust
standard errors'' can and cannot achieve.  They can provide for
asymptotically valid statistical inference in models that are
basically correctly specified, but in which the errors are not iid.
The ``asymptotic'' part means that they may be of little use in small
samples.  The ``correct specification'' part means that they are not a
magic bullet: if the error term is correlated with the regressors, so
that the parameter estimates themselves are biased and inconsistent,
robust standard errors will not save the day.


\section{Cross-sectional data and the HCCME}
\label{vcv-hccme}

With cross-sectional data, the most likely departure from iid errors
is heteroskedasticity (non-constant variance).\footnote{In some
  specialized contexts spatial autocorrelation may be an issue.  Gretl
  does not have any built-in methods to handle this and we will not
  discuss it here.}  In some cases one may be able to arrive at a
judgment regarding the likely form of the heteroskedasticity, and
hence to apply a specific correction.  The more common case, however,
is where the heteroskedasticity is of unknown form.  We seek an
estimator of the covariance matrix of the parameter estimates that
retains its validity, at least asymptotically, in the face of unspecified
heteroskedasticity.  It is not obvious, a priori, that this should be
possible, but White (1980) showed that
%
\begin{equation}
\label{eq:ols-varb-h}
\widehat{\mbox{Var}}_{\rm h}(\hat{\beta}) = 
       (X'X)^{-1} X' \hat{\Omega} X (X'X)^{-1}
\end{equation}
%
does the trick. (As usual in statistics, we need to say ``under
certain conditions'', but the conditions are not very restrictive.)
$\hat{\Omega}$ is in this context a diagonal matrix, whose non-zero
elements may be estimated using squared OLS residuals.  White referred
to (\ref{eq:ols-varb-h}) as a heteroskedasticity-consistent covariance
matrix estimator (HCCME).

Davidson and MacKinnon (2004, chapter 5) offer a useful discussion of
several variants on White's HCCME theme. They refer to the original
variant of (\ref{eq:ols-varb-h}) --- in which the diagonal elements of
$\hat{\Omega}$ are estimated directly by the squared OLS residuals,
$\hat{u}^2_t$ --- as HC$_0$.  (The associated standard errors are
often called ``White's standard errors''.)  The various refinements of
White's proposal share a common point of departure, namely the idea
that the squared OLS residuals are likely to be ``too small'' on
average.  This point is quite intuitive.  The OLS parameter estimates,
$\hat{\beta}$, satisfy by design the criterion that the sum of
squared residuals,
%
\[
\sum \hat{u}^2_t = \sum \left( y_t - X_t \hat{\beta} \right)^2
\]
%
is minimized for given $X$ and $y$.  Suppose that $\hat{\beta} \neq
\beta$.  This is almost certain to be the case: even if OLS is not
biased, it would be a miracle if the $\hat{\beta}$ calculated from any
finite sample were exactly equal to $\beta$.  But in that case the sum
of squares of the true, unobserved \textit{errors}, $\sum u^2_t = \sum
(y_t - X_t \beta)^2$ is bound to be greater than $\sum \hat{u}^2_t$.
The elaborated variants on HC$_0$ take this point on board as follows:
%
\begin{itemize}
\item HC$_1$: Applies a degrees-of-freedom correction, multiplying the
  HC$_0$ matrix by $T/(T-k)$.
\item HC$_2$: Instead of using $\hat{u}^2_t$ for the diagonal elements
  of $\hat{\Omega}$, uses $\hat{u}^2_t/(1-h_t)$, where $h_t =
  X_t(X'X)^{-1}X'_t$, the $t^{\rm th}$ diagonal element of the projection
  matrix, $P$, which has the property that $P\cdot y = \hat{y}$.  The
  relevance of $h_t$ is that if the variance of all the $u_t$ is
  $\sigma^2$, the expectation of $\hat{u}^2_t$ is $\sigma^2(1-h_t)$,
  or in other words, the ratio $\hat{u}^2_t/(1-h_t)$ has expectation
  $\sigma^2$.  As Davidson and MacKinnon show, $0\leq h_t <1$ for all
  $t$, so this adjustment cannot reduce the diagonal elements of
  $\hat{\Omega}$ and in general revises them upward.
\item HC$_3$: Uses $\hat{u}^2_t/(1-h_t)^2$.  The additional factor of
  $(1-h_t)$ in the denominator, relative to HC$_2$, may be justified
  on the grounds that observations with large variances tend to exert
  a lot of influence on the OLS estimates, so that the corresponding
  residuals tend to be under-estimated.  See Davidson and MacKinnon
  for a fuller explanation.
\end{itemize}
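
To make the formulas above concrete, here is a purely illustrative
hansl sketch of the four diagonal variants, on the assumption that a
matrix \texttt{X} holds the regressors and a column vector \texttt{u}
the corresponding OLS residuals.  (This is not how \app{gretl}
computes the HCCME internally; in particular, forming the $T \times T$
projection matrix is wasteful for large samples.)
%
\begin{code}
# illustrative only: X is T x k, u is T x 1
scalar T = rows(X)
scalar k = cols(X)
matrix XXi = invpd(X' * X)
matrix h = diag(X * XXi * X')        # leverage values h_t
matrix d0 = u .^ 2                   # HC0 diagonal
matrix d1 = d0 * T/(T-k)             # HC1
matrix d2 = d0 ./ (1 - h)            # HC2
matrix d3 = d0 ./ ((1 - h) .^ 2)     # HC3
matrix M = X .* d2                   # scale each row of X by its weight
matrix Vhc2 = XXi * (M' * X) * XXi   # e.g. the HC2 covariance matrix
\end{code}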

The relative merits of these variants have been explored by means of
both simulations and theoretical analysis.  Unfortunately there is not
a clear consensus on which is ``best''.  Davidson and MacKinnon argue
that the original HC$_0$ is likely to perform worse than the others;
nonetheless, ``White's standard errors'' are reported more often than
the more sophisticated variants and therefore, for reasons of
comparability, HC$_0$ is the default HCCME in \app{gretl}.

If you wish to use HC$_1$, HC$_2$ or HC$_3$ you can arrange for this
in either of two ways.  In script mode, you can do, for example,
%
\begin{code}
set hc_version 2
\end{code}
%
In the GUI program you can go to the HCCME configuration dialog, as
noted above, and choose any of these variants to be the default.
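
For example, the following fragment (with hypothetical series names)
selects HC$_2$ and then estimates a model with the corresponding
standard errors:
%
\begin{code}
set hc_version 2
ols y const x1 x2 --robust   # reports HC2 standard errors
\end{code}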


\section{Time series data and HAC covariance matrices}
\label{vcv-hac}

Heteroskedasticity may be an issue with time series data too, but it
is unlikely to be the only, or even the primary, concern.  

One form of heteroskedasticity is common in macroeconomic time series,
but is fairly easily dealt with.  That is, in the case of strongly
trending series such as Gross Domestic Product, aggregate consumption,
aggregate investment, and so on, higher levels of the variable in
question are likely to be associated with higher variability in
absolute terms.  The obvious ``fix'', employed in many
macroeconometric studies, is to use the logs of such series rather
than the raw levels.  Provided the \textit{proportional} variability
of such series remains roughly constant over time, the log
transformation is effective.  

Other forms of heteroskedasticity may resist the log transformation,
but may demand a special treatment distinct from the calculation of
robust standard errors.  We have in mind here \textit{autoregressive
  conditional} heteroskedasticity, for example in the behavior of
asset prices, where large disturbances to the market may usher in
periods of increased volatility.  Such phenomena call for specific
estimation strategies, such as GARCH (see chapter~\ref{chap:timeser}).

Despite the points made above, some residual degree of
heteroskedasticity may be present in time series data: the key point
is that in most cases it is likely to be combined with serial
correlation (autocorrelation), hence demanding a special treatment.
In White's approach, $\hat{\Omega}$, the estimated covariance matrix
of the $u_t$, remains conveniently diagonal: the variances,
$E(u^2_t)$, may differ by $t$ but the covariances, $E(u_t u_s)$, are
all zero.  Autocorrelation in time series data means that at least
some of the off-diagonal elements of $\hat{\Omega}$ should be
non-zero.  This introduces a substantial complication and requires
another piece of terminology; estimates of the covariance matrix that
are asymptotically valid in the face of both heteroskedasticity and
autocorrelation of the error process are termed HAC
(heteroskedasticity and autocorrelation consistent).

The issue of HAC estimation is treated in more technical terms in
chapter~\ref{chap:gmm}.  Here we try to convey some of the intuition
at a more basic level.  We begin with a general comment: residual
autocorrelation is not so much a property of the data, as a symptom of
an inadequate model.  Data may be persistent through time, and if we fit
a model that does not take this aspect into account properly, we end
up with a model with autocorrelated disturbances.  Conversely, it is
often possible to mitigate or even eliminate the problem of
autocorrelation by including relevant lagged variables in a time
series model, or in other words, by specifying the dynamics of the
model more fully.  HAC estimation should \textit{not} be seen as the
first resort in dealing with an autocorrelated error process.

That said, the ``obvious'' extension of White's HCCME to the case of
autocorrelated errors would seem to be this: estimate the off-diagonal
elements of $\hat{\Omega}$ (that is, the autocovariances, $E(u_t
u_s)$) using, once again, the appropriate OLS residuals:
$\hat{\omega}_{ts} = \hat{u}_t \hat{u}_s$.  This is basically right,
but demands an important amendment.  We seek a \textit{consistent}
estimator, one that converges towards the true $\Omega$ as the sample
size tends towards infinity.  This cannot work if we allow unbounded
serial dependence.  Bigger samples will enable us to estimate more of
the true $\omega_{ts}$ elements (that is, for $t$ and $s$ more widely
separated in time) but will \textit{not} contribute ever-increasing
information regarding the maximally separated $\omega_{ts}$ pairs,
since the maximal separation itself grows with the sample size.  To
ensure consistency, we have to confine our attention to processes
exhibiting temporally limited dependence, or in other words cut off
the computation of the $\hat{\omega}_{ts}$ values at some maximum
value of $p = t-s$ (where $p$ is treated as an increasing function of
the sample size, $T$, although it cannot increase in proportion to
$T$).

The simplest variant of this idea is to truncate the computation at
some finite lag order $p$, where $p$ grows as, say, $T^{1/4}$.  The
trouble with this is that the resulting $\hat{\Omega}$ may not be a
positive definite matrix.  In practical terms, we may end up with
negative estimated variances.  One solution to this problem is offered
by the Newey--West estimator (Newey and West, 1987), which assigns
declining weights to the sample autocovariances as the temporal
separation increases.

To understand this point it is helpful to look more closely at the
covariance matrix given in (\ref{eq:ols-varb-h}), namely,
%
\[
(X'X)^{-1} (X' \hat{\Omega} X) (X'X)^{-1}
\]
%
This is known as a ``sandwich'' estimator.  The bread, which appears
on both sides, is $(X'X)^{-1}$.  This is a $k \times k$ matrix, and is
also the key ingredient in the computation of the classical covariance
matrix.  The filling in the sandwich is
%
\[
\begin{array}{ccccc}
\hat{\Sigma} & = & X' & \hat{\Omega} & X \\
{\scriptstyle (k \times k)} & &
{\scriptstyle (k \times T)} & {\scriptstyle (T \times T)} & 
  {\scriptstyle (T \times k)}
\end{array}
\]
%
Since $\Omega = E(uu')$, the matrix being estimated here can also be
written as
\[
\Sigma = E(X'u\,u'X)
\]
%
which expresses $\Sigma$ as the long-run covariance of the random
$k$-vector $X'u$.   

From a computational point of view, it is not necessary or desirable
to store the (potentially very large) $T \times T$ matrix
$\hat{\Omega}$ as such.  Rather, one computes the sandwich filling 
by summation as
%
\[
\hat{\Sigma} = \hat{\Gamma}(0) + \sum_{j=1}^p w_j 
  \left(\hat{\Gamma}(j) + \hat{\Gamma}'(j) \right)
\]
%
where the $k \times k$ sample autocovariance matrix $\hat{\Gamma}(j)$,
for $j \geq 0$, is given by
\[
\hat{\Gamma}(j) = \frac{1}{T} \sum_{t=j+1}^T
  \hat{u}_t \hat{u}_{t-j}\, X'_t\, X_{t-j}
\]
and $w_j$ is the weight given to the autocovariance at lag $j > 0$.
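
As a rough hansl sketch of this computation (not the actual
\app{gretl} implementation), suppose \texttt{X} is the $T \times k$
regressor matrix, \texttt{u} the column vector of OLS residuals,
\texttt{p} the chosen bandwidth and \texttt{w} a $p$-vector holding
the kernel weights:
%
\begin{code}
function matrix hac_filling (const matrix X, const matrix u, const matrix w, int p)
    scalar T = rows(X)
    scalar k = cols(X)
    matrix S = zeros(k, k)
    loop j=0..p
        # sample autocovariance matrix, Gamma-hat(j)
        matrix G = zeros(k, k)
        loop t=j+1..T
            G += u[t] * u[t-j] * X[t,]' * X[t-j,]
        endloop
        G = G / T
        if j == 0
            S = G
        else
            S += w[j] * (G + G')
        endif
    endloop
    return S
end function
\end{code}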

This leaves two questions.  How exactly do we determine the maximum
lag length or ``bandwidth'', $p$, of the HAC estimator?  And how
exactly are the weights $w_j$ to be determined?  We will return to the
(difficult) question of the bandwidth shortly.  As regards the
weights, \app{Gretl} offers three variants.  The default is the
Bartlett kernel, as used by Newey and West.  This sets
\[
w_j = \left\{ \begin{array}{cc}
     1 - \frac{j}{p+1} & j \leq p \\
     0 & j > p
     \end{array}
    \right.
\]
so the weights decline linearly as $j$ increases.  The other two
options are the Parzen kernel and the Quadratic Spectral (QS) kernel.
For the Parzen kernel,
\[
w_j = \left\{ \begin{array}{cc}
    1 - 6a_j^2 + 6a_j^3 & 0 \leq a_j \leq 0.5 \\
    2(1 - a_j)^3 & 0.5 < a_j \leq 1 \\
    0 & a_j > 1
    \end{array}
    \right.
\]
where $a_j = j/(p+1)$, and for the QS kernel,
\[
w_j = \frac{25}{12\pi^2 d_j^2} 
   \left(\frac{\sin{m_j}}{m_j} - \cos{m_j} \right)
\]
where $d_j = j/p$ and $m_j = 6\pi d_j/5$.

Figure~\ref{fig:kernels} shows the weights generated by these kernels,
for $p=4$ and $j=1$ to 9.
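
As a further illustration, the following hansl function (offered
purely as a restatement of the formulas above, not as \app{gretl}
internals) returns the weight at lag \texttt{j} for a given kernel
and bandwidth \texttt{p}:
%
\begin{code}
function scalar kernel_weight (string kernel, int j, scalar p)
    if kernel == "bartlett"
        return (j <= p) ? 1 - j/(p+1) : 0
    elif kernel == "parzen"
        scalar a = j/(p+1)
        if a <= 0.5
            return 1 - 6*a^2 + 6*a^3
        elif a <= 1
            return 2*(1-a)^3
        else
            return 0
        endif
    else # quadratic spectral
        scalar d = j/p
        scalar m = 6*$pi*d/5
        return 25/(12*$pi^2*d^2) * (sin(m)/m - cos(m))
    endif
end function
\end{code}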

\begin{figure}[htbp]
\caption{Three HAC kernels}
\label{fig:kernels}
\centering
\includegraphics{figures/kernels}
\end{figure}

In \app{gretl} you select the kernel using the \texttt{set} command
with the \verb|hac_kernel| parameter:
%
\begin{code}
set hac_kernel parzen
set hac_kernel qs
set hac_kernel bartlett
\end{code}

\subsection{Selecting the HAC bandwidth}
\label{sec:hac-bw}

The asymptotic theory developed by Newey, West and others tells us in
general terms how the HAC bandwidth, $p$, should grow with the sample
size, $T$ --- that is, $p$ should grow in proportion to some
fractional power of $T$.  Unfortunately this is of little help to the
applied econometrician, working with a given dataset of fixed size.
Various rules of thumb have been suggested, and \app{gretl} implements
two such.  The default is $p = 0.75 T^{1/3}$, as recommended by Stock
and Watson (2003).  An alternative is $p = 4(T/100)^{2/9}$, as in
Wooldridge (2002b).  In each case one takes the integer part of the
result.  These variants are labeled \texttt{nw1} and \texttt{nw2},
respectively, in the context of the \texttt{set} command with the
\verb|hac_lag| parameter.  That is, you can switch to the version
given by Wooldridge with
%
\begin{code}
set hac_lag nw2
\end{code}
%
As shown in Table~\ref{tab:haclag} the choice between \texttt{nw1} and
\texttt{nw2} does not make a great deal of difference.

\begin{table}[htbp]
  \centering
  \begin{tabular}{ccc}
    $T$ & $p$ (\texttt{nw1}) & $p$ (\texttt{nw2}) \\[4pt]
50& 	2& 	3 \\
100& 	3& 	4 \\
150& 	3& 	4 \\
200& 	4& 	4 \\
300& 	5& 	5 \\
400& 	5& 	5 \\
  \end{tabular}
\caption{HAC bandwidth: two rules of thumb}
\label{tab:haclag}
\end{table}
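
The entries in the table are easily reproduced.  For example, for
$T=200$:
%
\begin{code}
scalar T = 200
scalar p1 = floor(0.75 * T^(1/3))     # nw1
scalar p2 = floor(4 * (T/100)^(2/9))  # nw2
printf "nw1 gives p = %g, nw2 gives p = %g\n", p1, p2
\end{code}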

You also have the option of specifying a fixed numerical value for
$p$, as in
%
\begin{code}
set hac_lag 6
\end{code}
%
In addition you can set a distinct bandwidth for use with the
Quadratic Spectral kernel (since this need not be an integer).  For
example,
%
\begin{code}
set qs_bandwidth 3.5
\end{code}
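
For example, to obtain HAC standard errors using the QS kernel with a
bandwidth of 3.5 one might do (with hypothetical series names):
%
\begin{code}
set hac_kernel qs
set qs_bandwidth 3.5
ols y const x1 x2 --robust
\end{code}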


\subsection{Prewhitening and data-based bandwidth selection}
\label{sec:hac-prewhiten}

An alternative approach is to deal with residual autocorrelation by
attacking the problem from two sides. The intuition behind the
technique known as \emph{VAR prewhitening} (Andrews and Monahan, 1992)
can be illustrated by a simple example.  Let $x_t$ be a sequence of
first-order autocorrelated random variables
%
\[
  x_t = \rho x_{t-1} + u_t
\]
%
The long-run variance of $x_t$ can be shown to be
%
\[
  V_{LR}(x_t) = \frac{V_{LR}(u_t)}{(1-\rho)^2}
\]
%
In most cases, $u_t$ is likely to be less autocorrelated than $x_t$,
so a smaller bandwidth should suffice.  Estimation of $V_{LR}(x_t)$ can
therefore proceed in three steps: (1) estimate $\rho$; (2) obtain a
HAC estimate of the long-run variance of the residuals $\hat{u}_t =
x_t - \hat{\rho} x_{t-1}$; and (3) divide the result by
$(1-\hat{\rho})^2$.
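
A minimal hansl sketch of these three steps, for a series \texttt{x}
assumed to follow the AR(1) process above, might run as follows (the
built-in \texttt{lrvar} function supplies an estimate of the long-run
variance):
%
\begin{code}
ols x x(-1) --quiet       # step 1: estimate rho
matrix b = $coeff
scalar rho = b[1]
series u = $uhat          # step 2: residuals from the AR(1) fit
smpl +1 ;                 # drop the first observation, lost to the lag
scalar V_u = lrvar(u)     # long-run variance of u
scalar V_x = V_u / (1 - rho)^2   # step 3: rescale
\end{code}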

The application of the above concept to our problem implies estimating
a finite-order Vector Autoregression (VAR) on the vector variables
$\xi_t = X_t \hat{u}_t$. In general, the VAR can be of any order, but
in most cases 1 is sufficient; the aim is not to build a watertight
model for $\xi_t$, but just to ``mop up'' a substantial part of the
autocorrelation.  Hence, the following VAR is estimated
%
\[
  \xi_t = A \xi_{t-1} + \varepsilon_t
\]
%
Then an estimate of the matrix $X'\Omega X$ can be recovered via
\[
  (I- \hat{A})^{-1} \hat{\Sigma}_{\varepsilon} (I- \hat{A}')^{-1}
\]
where $\hat{\Sigma}_{\varepsilon}$ is any HAC estimator, applied to the VAR
residuals.

You can ask for prewhitening in \app{gretl} using
%
\begin{code}
set hac_prewhiten on
\end{code}
%
There is at present no mechanism for specifying an order other than 1
for the initial VAR.

A further refinement is available in this context, namely data-based
bandwidth selection.  It makes intuitive sense that the HAC bandwidth
should not simply be based on the size of the sample, but should
somehow take into account the time-series properties of the data (and
also the kernel chosen).  A nonparametric method for doing this was
proposed by Newey and West (1994); a good concise account of the
method is given in Hall (2005).  This option can be invoked in gretl
via
%
\begin{code}
set hac_lag nw3
\end{code}
%
This option is the default when prewhitening is selected, but you can
override it by giving a specific numerical value for \verb|hac_lag|.
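
A typical invocation combining these settings might be (with
hypothetical series names):
%
\begin{code}
set hac_prewhiten on
set hac_lag nw3      # data-based bandwidth (the default with prewhitening)
ols y const x1 x2 --robust
\end{code}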

Even the Newey--West data-based method does not fully pin down the
bandwidth for any particular sample.  The first step involves 
calculating a series of residual covariances.  The length of this
series is given as a function of the sample size, but only up to a
scalar multiple --- for example, it is given as $O(T^{2/9})$ for the
Bartlett kernel.  \app{Gretl} uses an implied multiple of 1.  


\section{Special issues with panel data}
\label{sec:vcv-panel}

Since panel data have both a time-series and a cross-sectional
dimension one might expect that, in general, robust estimation of the
covariance matrix would require handling both heteroskedasticity and
autocorrelation (the HAC approach).  In addition, some special features of
panel data require attention.
\begin{itemize}
\item The variance of the error term may differ across the
  cross-sectional units.
\item The covariance of the errors across the units may be non-zero in
  each time period.
\item If the ``between'' variation is not removed, the errors may
  exhibit autocorrelation, not in the usual time-series sense but in
  the sense that the mean error for unit $i$ may differ from that of
  unit $j$.  (This is particularly relevant when estimation
  is by pooled OLS.)
\end{itemize}

\app{Gretl} currently offers two robust covariance matrix estimators
specifically for panel data.  These are available for models estimated
via fixed effects, pooled OLS, and pooled two-stage least
squares.  The default robust estimator is that suggested by Arellano
(2003), which is HAC provided the panel is of the ``large $n$, small
$T$'' variety (that is, many units are observed in relatively few
periods).  The Arellano estimator is
\[
\hat{\Sigma}_{\rm A} = 
\left(X^{\prime}X\right)^{-1}
\left( \sum_{i=1}^n X_i^{\prime} \hat{u}_i 
    \hat{u}_i^{\prime} X_i \right)
\left(X^{\prime}X\right)^{-1}
\]
where $X$ is the matrix of regressors (with the group means
subtracted, in the case of fixed effects), $\hat{u}_i$ denotes the
vector of residuals for unit $i$, and $n$ is the number of
cross-sectional units.  Cameron and Trivedi (2005) make a strong case
for using this estimator; they note that the ordinary White HCCME
can produce misleadingly small standard errors in the panel context
because it fails to take autocorrelation into account.

In cases where autocorrelation is not an issue, however, the estimator
proposed by Beck and Katz (1995) and discussed by Greene (2003,
chapter 13) may be appropriate.  This estimator, which takes into
account contemporaneous correlation across the units and
heteroskedasticity by unit, is
\[
\hat{\Sigma}_{\rm BK} = 
\left(X^{\prime}X\right)^{-1}
\left( \sum_{i=1}^n \sum_{j=1}^n \hat{\sigma}_{ij} X^{\prime}_iX_j \right)
\left(X^{\prime}X\right)^{-1}
\]
The covariances $\hat{\sigma}_{ij}$ are estimated via
\[
\hat{\sigma}_{ij} = \frac{\hat{u}^{\prime}_i \hat{u}_j}{T}
\]
where $T$ is the length of the time series for each unit.  Beck and
Katz call the associated standard errors ``Panel-Corrected Standard
Errors'' (PCSE).  This estimator can be invoked in \app{gretl} via the
command 
%
\begin{code}
set pcse on
\end{code}
%
The Arellano default can be re-established via 
%
\begin{code}
set pcse off
\end{code}
%
(Note that regardless of the \texttt{pcse} setting, the robust
estimator is not used unless the \option{robust} flag is given, or the
``Robust'' box is checked in the GUI program.)
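
For example, with a panel dataset in place (and hypothetical series
names) the two estimators might be compared as follows:
%
\begin{code}
panel y const x1 x2 --robust   # Arellano standard errors (the default)
set pcse on
panel y const x1 x2 --robust   # Beck-Katz PCSE
\end{code}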

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: