Sophie: gretl-1.9.4-1 x86

gretl-1.9.4-1.x86_64.rpm

\chapter{Panel data}
\label{chap-panel}

\section{Estimation of panel models}

\subsection{Pooled Ordinary Least Squares}
\label{pooled-est}

The simplest estimator for panel data is pooled OLS.  In most cases
this is unlikely to be adequate, but it provides a baseline for
comparison with more complex estimators.

If you estimate a model on panel data using OLS an additional test
item becomes available.  In the GUI model window this is the item
``panel diagnostics'' under the \textsf{Tests} menu; the script
counterpart is the \cmd{hausman} command.

To take advantage of this test, you should specify a model without any
dummy variables representing cross-sectional units.  The test compares
pooled OLS against the principal alternatives, the fixed effects and
random effects models.  These alternatives are explained in the
following section.

\subsection{The fixed and random effects models}
\label{panel-est}

In \app{gretl} version 1.6.0 and higher, the fixed and random effects
models for panel data can be estimated in their own right.  In the
graphical interface these options are found under the menu item
``Model/Panel/Fixed and random effects''.  In the command-line
interface one uses the \cmd{panel} command, with or without the
\option{random-effects} option.

This section explains the nature of these models and comments on their
estimation via \app{gretl}.

The pooled OLS specification may be written as 
\begin{equation}
\label{eq:pooled}
y_{it} = X_{it}\beta + u_{it}
\end{equation}
where $y_{it}$ is the observation on the dependent variable for
cross-sectional unit $i$ in period $t$, $X_{it}$ is a $1\times k$
vector of independent variables observed for unit $i$ in period $t$,
$\beta$ is a $k\times 1$ vector of parameters, and $u_{it}$ is an error
or disturbance term specific to unit $i$ in period $t$.

The fixed and random effects models have in common that they decompose
the unitary pooled error term, $u_{it}$.  For the \textsl{fixed effects}
model we write $u_{it} = \alpha_i + \varepsilon_{it}$, yielding
\begin{equation}
\label{eq:FE}
y_{it} = X_{it}\beta + \alpha_i + \varepsilon_{it}
\end{equation}
That is, we decompose $u_{it}$ into a unit-specific and time-invariant
component, $\alpha_i$, and an observation-specific error,
$\varepsilon_{it}$.\footnote{It is possible to break a third component
  out of $u_{it}$, namely $w_t$, a shock that is time-specific but
  common to all the units in a given period.  In the interest of
  simplicity we do not pursue that option here.}  The $\alpha_i$s are
then treated as fixed parameters (in effect, unit-specific
$y$-intercepts), which are to be estimated.  This can be done by
including a dummy variable for each cross-sectional unit (and
suppressing the global constant).  This is sometimes called the Least
Squares Dummy Variables (LSDV) method.  Alternatively, one can subtract
the group mean from each of variables and estimate a model without a
constant.  In the latter case the dependent variable may be written as
\[
\tilde{y}_{it} = y_{it} - \bar{y}_i
\]
The ``group mean'', $\bar{y}_i$, is defined as
\[
\bar{y}_i = \frac{1}{T_i} \sum_{t=1}^{T_i} y_{it}
\]
where $T_i$ is the number of observations for unit $i$.  An exactly
analogous formulation applies to the independent variables.  Given
parameter estimates, $\hat{\beta}$, obtained using such de-meaned data
we can recover estimates of the $\alpha_i$s using
\[
\hat{\alpha}_i = \frac{1}{T_i} \sum_{t=1}^{T_i} 
   \left(y_{it} - X_{it}\hat{\beta}\right)
\]

These two methods (LSDV, and using de-meaned data) are numerically
equivalent. \app{Gretl} takes the approach of de-meaning the data.  If
you have a small number of cross-sectional units, a large number of
time-series observations per unit, and a large number of regressors,
it is more economical in terms of computer memory to use LSDV.  If 
need be you can easily implement this manually.  For example,
%
\begin{code}
genr unitdum
ols y x du_*
\end{code}
%
(See Chapter~\ref{chap-genr} for details on \texttt{unitdum}).

The $\hat{\alpha}_i$ estimates are not printed as part of the standard
model output in \app{gretl} (there may be a large number of these, and
typically they are not of much inherent interest).  However you can
retrieve them after estimation of the fixed effects model if you
wish.  In the graphical interface, go to the ``Save'' menu in the
model window and select ``per-unit constants''.  In command-line mode,
you can do \texttt{genr} \textsl{newname} = \dollar{ahat}, where
\textsl{newname} is the name you want to give the series. 

For the \textsl{random effects} model we write $u_{it} = v_i +
\varepsilon_{it}$, so the model becomes
\begin{equation}
\label{eq:RE}
y_{it} = X_{it}\beta + v_i + \varepsilon_{it}
\end{equation}
In contrast to the fixed effects model, the $v_i$s are not treated as
fixed parameters, but as random drawings from a given probability
distribution.

The celebrated Gauss--Markov theorem, according to which OLS is the
best linear unbiased estimator (BLUE), depends on the assumption that
the error term is independently and identically distributed (IID).  In
the panel context, the IID assumption means that $E(u_{it}^2)$, in
relation to equation~\ref{eq:pooled}, equals a constant, $\sigma^2_u$,
for all $i$ and $t$, while the covariance $E(u_{is} u_{it})$ equals
zero for all $s \neq t$ and the covariance $E(u_{jt} u_{it})$ equals
zero for all $j \neq i$.

If these assumptions are not met --- and they are unlikely to be met
in the context of panel data --- OLS is not the most efficient
estimator.  Greater efficiency may be gained using generalized least
squares (GLS), taking into account the covariance structure of the
error term.  

Consider observations on a given unit $i$ at two different times
$s$ and $t$. From the hypotheses above it can be worked out that
$\mbox{Var}(u_{is}) = \mbox{Var}(u_{it}) = \sigma^2_{v} +
\sigma^2_{\varepsilon}$, while the covariance between $u_{is}$ and
$u_{it}$ is given by $E(u_{is}u_{it}) = \sigma^2_{v}$.

In matrix notation, we may group all the $T_i$ observations for unit
$i$ into the vector $\mathbf{y}_i$ and write it as
\begin{equation}
\label{eq:REvec}
\mathbf{y}_{i} = \mathbf{X}_{i} \beta + \mathbf{u}_i
\end{equation}
The vector $\mathbf{u}_i$, which includes all the disturbances for
individual $i$, has a variance--covariance matrix given by
\begin{equation}
\label{eq:CovMatUnitI}
  \mbox{Var}(\mathbf{u}_i) = \Sigma_i = \sigma^2_{\varepsilon} I + \sigma^2_{v} J
\end{equation}
where $J$ is a square matrix with all elements equal to 1. It can be
shown that the matrix
\[
  K_i = I - \frac{\theta}{T_i} J,
\]
where $\theta = 1 -
\sqrt{\frac{\sigma^2_{\varepsilon}}{\sigma^2_{\varepsilon} + T_i
    \sigma^2_{v}}}$, has the property
\[
  K_i \Sigma K_i' = \sigma^2_{\varepsilon} I
\]
It follows that the transformed system
\begin{equation}
\label{eq:REGLS}
K_i \mathbf{y}_{i} = K_i \mathbf{X}_{i} \beta + K_i \mathbf{u}_i
\end{equation}
satisfies the Gauss--Markov conditions, and OLS estimation of
(\ref{eq:REGLS}) provides efficient inference. But since 
\[
  K_i \mathbf{y}_{i} = \mathbf{y}_{i} - \theta \bar{\mathbf{y}}_{i}
\]
GLS estimation is equivalent to OLS using ``quasi-demeaned''
variables; that is, variables from which we subtract a fraction
$\theta$ of their average. Notice that for $\sigma^2_{\varepsilon} \to
0$, $\theta \to 1$, while for $\sigma^2_{v} \to 0$, $\theta \to 0$.
This means that if all the variance is attributable to the individual
effects, then the fixed effects estimator is optimal; if, on the other
hand, individual effects are negligible, then pooled OLS turns out,
unsurprisingly, to be the optimal estimator.

To implement the GLS approach we need to calculate $\theta$, which in
turn requires estimates of the variances $\sigma^2_{\varepsilon}$ and
$\sigma^2_v$.  (These are often referred to as the ``within'' and
``between'' variances respectively, since the former refers to
variation within each cross-sectional unit and the latter to variation
between the units).  Several means of estimating these magnitudes have
been suggested in the literature \citep[see][]{baltagi95}; \app{gretl} uses
the method of \cite{swamy72}: $\sigma^2_\varepsilon$ is
estimated by the residual variance from the fixed effects model, and
the sum $\sigma^2_\varepsilon + T_i \sigma^2_v$ is estimated as $T_i$
times the residual variance from the ``between'' estimator,
\[
\bar{y}_i = \bar{X}_i \beta + e_i
\]
The latter regression is implemented by constructing a data set
consisting of the group means of all the relevant variables.


\subsection{Choice of estimator}
\label{panel-choice}

Which panel method should one use, fixed effects or random effects?

One way of answering this question is in relation to the nature of the
data set.  If the panel comprises observations on a fixed and
relatively small set of units of interest (say, the member states of
the European Union), there is a presumption in favor of fixed effects.
If it comprises observations on a large number of randomly selected
individuals (as in many epidemiological and other longitudinal
studies), there is a presumption in favor of random effects.

Besides this general heuristic, however, various statistical
issues must be taken into account.

\begin{enumerate}

\item Some panel data sets contain variables whose values are specific
  to the cross-sectional unit but which do not vary over time.  If you
  want to include such variables in the model, the fixed effects
  option is simply not available.  When the fixed effects approach is
  implemented using dummy variables, the problem is that the
  time-invariant variables are perfectly collinear with the per-unit
  dummies.  When using the approach of subtracting the group means,
  the issue is that after de-meaning these variables are nothing but
  zeros.
\item A somewhat analogous prohibition applies to the random effects
  estimator.  This estimator is in effect a matrix-weighted average of
  pooled OLS and the ``between'' estimator.  Suppose we have
  observations on $n$ units or individuals and there are $k$
  independent variables of interest.  If $k>n$, the ``between''
  estimator is undefined --- since we have only $n$ effective
  observations --- and hence so is the random effects estimator.
\end{enumerate}

If one does not fall foul of one or other of the prohibitions
mentioned above, the choice between fixed effects and random effects
may be expressed in terms of the two econometric \textit{desiderata},
efficiency and consistency.  

From a purely statistical viewpoint, we could say that there is a
tradeoff between robustness and efficiency. In the fixed effects
approach, we do not make any hypotheses on the ``group effects'' (that
is, the time-invariant differences in mean between the groups) beyond
the fact that they exist --- and that can be tested; see below. As a
consequence, once these effects are swept out by taking deviations
from the group means, the remaining parameters can be estimated.

On the other hand, the random effects approach attempts to model the
group effects as drawings from a probability distribution instead of
removing them. This requires that individual effects are representable
as a legitimate part of the disturbance term, that is, zero-mean
random variables, uncorrelated with the regressors.

As a consequence, the fixed-effects estimator ``always works'', but at
the cost of not being able to estimate the effect of time-invariant
regressors.  The richer hypothesis set of the random-effects estimator
ensures that parameters for time-invariant regressors can be
estimated, and that estimation of the parameters for time-varying
regressors is carried out more efficiently.  These advantages, though,
are tied to the validity of the additional hypotheses. If, for
example, there is reason to think that individual effects may be
correlated with some of the explanatory variables, then the
random-effects estimator would be inconsistent, while fixed-effects
estimates would still be valid.  It is precisely on this principle
that the Hausman test is built (see below): if the fixed- and
random-effects estimates agree, to within the usual statistical margin
of error, there is no reason to think the additional hypotheses
invalid, and as a consequence, no reason \textit{not} to use the more
efficient RE estimator.

\subsection{Testing panel models}
\label{panel-tests}

If you estimate a fixed effects or random effects model in the
graphical interface, you may notice that the number of items available
under the ``Tests'' menu in the model window is relatively limited.
Panel models carry certain complications that make it difficult to
implement all of the tests one expects to see for models estimated on
straight time-series or cross-sectional data.  

Nonetheless, various panel-specific tests are printed along with the
parameter estimates as a matter of course, as follows.

When you estimate a model using \textsl{fixed effects}, you
automatically get an $F$-test for the null hypothesis that the
cross-sectional units all have a common intercept.  That is to say
that all the $\alpha_i$s are equal, in which case the pooled model
(\ref{eq:pooled}), with a column of 1s included in the $X$ matrix, is
adequate.

When you estimate using \textsl{random effects}, the Breusch--Pagan
and Hausman tests are presented automatically.  

The Breusch--Pagan test is the counterpart to the $F$-test mentioned
above.  The null hypothesis is that the variance of $v_i$ in
equation (\ref{eq:RE}) equals zero; if this hypothesis is not rejected,
then again we conclude that the simple pooled model is adequate.

The Hausman test probes the consistency of the GLS estimates.  The
null hypothesis is that these estimates are consistent --- that is,
that the requirement of orthogonality of the $v_i$ and the $X_i$ is
satisfied.  The test is based on a measure, $H$, of the ``distance''
between the fixed-effects and random-effects estimates, constructed
such that under the null it follows the $\chi^2$ distribution with
degrees of freedom equal to the number of time-varying regressors in
the matrix $X$.  If the value of $H$ is ``large'' this suggests that
the random effects estimator is not consistent and the fixed-effects
model is preferable.

There are two ways of calculating $H$, the matrix-difference method
and the regression method.  The procedure for the matrix-difference
method is this:
\begin{itemize}
\item Collect the fixed-effects estimates in a vector
  $\tilde{\beta}$ and the corresponding random-effects estimates in
  $\hat{\beta}$, then form the difference vector $(\tilde{\beta} -
  \hat{\beta})$. 
\item Form the covariance matrix of the difference vector as
  $\mbox{Var}(\tilde{\beta} - \hat{\beta}) = \mbox{Var}(\tilde{\beta})
  - \mbox{Var}(\hat{\beta}) = \Psi$, where $\mbox{Var}(\tilde{\beta})$
  and $\mbox{Var}(\hat{\beta})$ are estimated by the sample variance
  matrices of the fixed- and random-effects models
  respectively.\footnote{\cite{hausman78} showed that the covariance of
    the difference takes this simple form when $\hat{\beta}$ is an
    efficient estimator and $\tilde{\beta}$ is inefficient.}
\item Compute $H = \left(\tilde{\beta} - \hat{\beta}\right)' \Psi^{-1}
   \left(\tilde{\beta} - \hat{\beta}\right)$.
\end{itemize}

Given the relative efficiencies of $\tilde{\beta}$ and $\hat{\beta}$,
the matrix $\Psi$ ``should be'' positive definite, in which case $H$ is
positive, but in finite samples this is not guaranteed and of course
a negative $\chi^2$ value is not admissible.  The regression method
avoids this potential problem.  The procedure is:
\begin{itemize}
\item Treat the random-effects model as the restricted model, and
  record its sum of squared residuals as SSR$_r$.
\item Estimate via OLS an unrestricted model in which the dependent
  variable is quasi-demeaned $y$ and the regressors include both
  quasi-demeaned $X$ (as in the RE model) and the de-meaned variants
  of all the time-varying variables (i.e.\ the fixed-effects
  regressors); record the sum of squared residuals from this model
  as SSR$_u$.
\item Compute $H = n \left(\mbox{SSR}_r - \mbox{SSR}_u\right) /
  \mbox{SSR}_u$, where $n$ is the total number of observations used.
  On this variant $H$ cannot be negative, since adding additional
  regressors to the RE model cannot raise the SSR.
\end{itemize}

By default \app{gretl} computes the Hausman test via the regression
method, but it uses the matrix-difference method if you pass the
option \option{matrix-diff} to the \cmd{panel} command.

\subsection{Robust standard errors}
\label{panel-robust}

For most estimators, \app{gretl} offers the option of computing an
estimate of the covariance matrix that is robust with respect to
heteroskedasticity and/or autocorrelation (and hence also robust
standard errors).  In the case of panel data, robust covariance matrix
estimators are available for the pooled and fixed effects model but
not currently for random effects.  Please see
section~\ref{sec:vcv-panel} for details.


\section{Autoregressive panel models}
\label{panel-auto}

Special problems arise when a lag of the dependent variable is
included among the regressors in a panel model.  Consider a dynamic
variant of the pooled model (\ref{eq:pooled}):
\begin{equation}
\label{eq:pooled-dyn}
y_{it} = X_{it}\beta + \rho y_{it-1} + u_{it}
\end{equation}
First, if the error $u_{it}$ includes a group effect, $v_i$, then
$y_{it-1}$ is bound to be correlated with the error, since the value
of $v_i$ affects $y_i$ at all $t$.  That means that OLS applied to
(\ref{eq:pooled-dyn}) will be inconsistent as well as inefficient.
The fixed-effects model sweeps out the group effects and so overcomes
this particular problem, but a subtler issue remains, which applies to
both fixed and random effects estimation.  Consider the de-meaned
representation of fixed effects, as applied to the dynamic model,
\[
\tilde{y}_{it} = \tilde{X}_{it}\beta + \rho \tilde{y}_{i,t-1} 
  + \varepsilon_{it}
\]
where $\tilde{y}_{it} = y_{it} - \bar{y}_i$ and $\varepsilon_{it} =
u_{it} - \bar{u}_i$ (or $u_{it} - \alpha_i$, using the notation of
equation~\ref{eq:FE}).  The trouble is that $\tilde{y}_{i,t-1}$ will be
correlated with $\varepsilon_{it}$ via the group mean, $\bar{y}_i$.
The disturbance $\varepsilon_{it}$ influences $y_{it}$ directly, which
influences $\bar{y}_i$, which, by construction, affects the value of
$\tilde{y}_{it}$ for all $t$.  The same issue arises in relation to
the quasi-demeaning used for random effects.  Estimators which ignore
this correlation will be consistent only as $T \to \infty$ (in which
case the marginal effect of $\varepsilon_{it}$ on the group mean of 
$y$ tends to vanish).  

One strategy for handling this problem, and producing consistent
estimates of $\beta$ and $\rho$, was proposed by
\cite{anderson-hsiao81}.  Instead of de-meaning the data, they suggest
taking the first difference of (\ref{eq:pooled-dyn}), an alternative
tactic for sweeping out the group effects:
\begin{equation}
\label{eq:fe-dyn}
\Delta y_{it} = \Delta X_{it}\beta + \rho \Delta y_{i,t-1} 
  + \eta_{it}
\end{equation}
where $\eta_{it} = \Delta u_{it} = \Delta(v_i + \varepsilon_{it}) =
\varepsilon_{it} - \varepsilon_{i,t-1}$.  We're not in the clear yet,
given the structure of the error $\eta_{it}$: the disturbance
$\varepsilon_{i,t-1}$ is an influence on both $\eta_{it}$ and $\Delta
y_{i,t-1} = y_{it} - y_{i,t-1}$.  The next step is then to find an
instrument for the ``contaminated'' $\Delta y_{i,t-1}$. Anderson and
Hsiao suggest using either $y_{i,t-2}$ or $\Delta y_{i,t-2}$, both of
which will be uncorrelated with $\eta_{it}$ provided that the
underlying errors, $\varepsilon_{it}$, are not themselves serially
correlated.

The Anderson--Hsiao estimator is not provided as a built-in function
in \app{gretl}, since \app{gretl}'s sensible handling of lags and
differences for panel data makes it a simple application of regression
with instrumental variables --- see Example~\ref{anderson-hsiao},
which is based on a study of country growth rates by
\cite{nerlove99}.\footnote{Also see Clint Cummins' benchmarks page,
  \url{http://www.stanford.edu/~clint/bench/}.}
 
\begin{script}[htbp]
  \caption{The Anderson--Hsiao estimator for a dynamic panel model}
  \label{anderson-hsiao}
\begin{scode}
# Penn World Table data as used by Nerlove
open penngrow.gdt
# Fixed effects (for comparison)
panel Y 0 Y(-1) X
# Random effects (for comparison)
panel Y 0 Y(-1) X --random-effects
# take differences of all variables
diff Y X
# Anderson-Hsiao, using Y(-2) as instrument
tsls d_Y d_Y(-1) d_X ; 0 d_X Y(-2)
# Anderson-Hsiao, using d_Y(-2) as instrument
tsls d_Y d_Y(-1) d_X ; 0 d_X d_Y(-2)
\end{scode}
\end{script}

Although the Anderson--Hsiao estimator is consistent, it is not most
efficient: it does not make the fullest use of the available
instruments for $\Delta y_{i,t-1}$, nor does it take into account the
differenced structure of the error $\eta_{it}$.  It is improved upon
by the methods of \cite{arellano-bond91} and \cite{blundell-bond98}.
These methods are taken up in the next chapter.

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: