\chapter{Discrete and censored dependent variables}
\label{chap:discr-models}

This chapter deals with models for dependent variables that are
discrete or censored or otherwise limited (as in event counts or
durations, which must be positive) and that therefore call for
estimation methods other than the classical linear model. We discuss
several estimators (mostly based on the Maximum Likelihood principle),
adding some details and examples to complement the material on
these methods in the \emph{Gretl Command Reference}.

\section{Logit and probit models}
\label{sec:logit-probit}

It often happens that one wants to specify and estimate a model in
which the dependent variable is not continuous, but discrete. A
typical example is a model in which the dependent variable is the
occupational status of an individual (1 = employed, 0 = unemployed). A
convenient way of formalizing this situation is to consider the
variable $y_i$ as a Bernoulli random variable and analyze its
distribution conditional on the explanatory variables $x_i$.  That is,
%
\begin{equation}
  \label{eq:qr-Bernoulli}
  y_i = \left\{ 
    \begin{array}{ll} 
      1 & P_i \\ 0 & 1 - P_i 
    \end{array}
    \right.
\end{equation}
%
where $P_i = P(y_i = 1 | x_i) $ is a given function of the explanatory
variables $x_i$.

In most cases, the function $P_i$ is a cumulative distribution
function $F$, applied to a linear combination of the $x_i$s. In the
probit model, the normal cdf is used, while the logit model employs
the logistic function $\Lambda()$. Therefore, we have
%
\begin{eqnarray}
  \label{eq:qr-link}
  \textrm{probit} & \qquad & P_i = F(z_i) = \Phi(z_i)  \\
  \textrm{logit}  & \qquad & P_i = F(z_i) = \Lambda(z_i) = \frac{1}{1 + e^{-z_i}} \\
  & &z_i = \sum_{j=1}^k x_{ij} \beta_j
\end{eqnarray}
%
where $z_i$ is commonly known as the \emph{index} function. Note that
in this case the coefficients $\beta_j$ cannot be interpreted as the
partial derivatives of $E(y_i | x_i)$ with respect to
$x_{ij}$.  However, for a given value of $x_i$ it is possible to
compute the vector of ``slopes'', that is
\[
  \mathrm{slope}_j(\bar{x}) = \left. \pder{F(z)}{x_j} \right|_{z =
    \bar{z}}
\]
\app{Gretl} automatically computes the slopes, setting each
explanatory variable at its sample mean.
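
For instance, after estimating a binary logit one can reproduce such a
slope by hand from the estimated coefficients. The following fragment
is only an illustrative sketch, in which the names \texttt{y},
\texttt{x1} and \texttt{x2} are hypothetical:
\begin{code}
# sketch: slope of P(y=1|x) with respect to x1 at the sample means,
# assuming the model "logit y const x1 x2" has just been estimated
matrix b = $coeff
scalar zbar = {1, mean(x1), mean(x2)} * b
# logistic density evaluated at the mean index, times the coefficient
scalar slope_x1 = b[2] * exp(-zbar) / (1 + exp(-zbar))^2
\end{code}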

Another, equivalent way of thinking about this model is in terms of
an unobserved variable $y^*_i$ which can be described thus:
%
\begin{equation}
  \label{eq:qr-latent}
  y^*_i = \sum_{j=1}^k x_{ij} \beta_j + \varepsilon_i = z_i  +
  \varepsilon_i 
\end{equation}
%
We observe $y_i = 1$ whenever $y^*_i > 0$ and $y_i = 0$ otherwise. If
$\varepsilon_i$ is assumed to be normal, then we have the probit
model. The logit model arises if we assume that the density function
of $\varepsilon_i$ is
%
\[
  \lambda(\varepsilon_i) =
  \pder{\Lambda(\varepsilon_i)}{\varepsilon_i} =
  \frac{e^{-\varepsilon_i}}{(1 + e^{-\varepsilon_i})^2}
\]

Both the probit and logit model are estimated in \app{gretl} via
maximum likelihood, where the log-likelihood can be written as
\begin{equation}
  \label{eq:qr-loglik}
  L(\beta) = \sum_{y_i=0} \ln [ 1 - F(z_i)] + \sum_{y_i=1} \ln F(z_i),
\end{equation}
which is always negative, since $0 < F(\cdot) < 1$.  Since the score
equations do not have a closed form solution, numerical optimization
is used. In most cases, however, this is transparent to the user,
since only a few iterations are typically needed to ensure
convergence. The \option{verbose} switch can be used to track the
maximization algorithm.

\begin{script}[htbp]
  \caption{Estimation of simple logit and probit models}
  \label{simple-QR}
\begin{scode}
open greene19_1

logit GRADE const GPA TUCE PSI
probit GRADE const GPA TUCE PSI
\end{scode}
\end{script}

As an example, we reproduce the results given in chapter 21 of
\cite{greene00}, where the effectiveness of a program for teaching
economics is evaluated by the improvement in students' grades.
Running the code in example \ref{simple-QR} gives the following
output:
\begin{code}

Model 1: Logit estimates using the 32 observations 1-32
Dependent variable: GRADE

      VARIABLE       COEFFICIENT        STDERROR      T STAT       SLOPE
                                                                  (at mean)
  const               -13.0213           4.93132      -2.641
  GPA                   2.82611          1.26294       2.238      0.533859   
  TUCE                  0.0951577        0.141554      0.672      0.0179755  
  PSI                   2.37869          1.06456       2.234      0.449339   

  Mean of GRADE = 0.344
  Number of cases 'correctly predicted' = 26 (81.2%)
  f(beta'x) at mean of independent vars = 0.189
  McFadden's pseudo-R-squared = 0.374038
  Log-likelihood = -12.8896
  Likelihood ratio test: Chi-square(3) = 15.4042 (p-value 0.001502)
  Akaike information criterion (AIC) = 33.7793
  Schwarz Bayesian criterion (BIC) = 39.6422
  Hannan-Quinn criterion (HQC) = 35.7227

           Predicted
             0    1
  Actual 0  18    3
         1   3    8

Model 2: Probit estimates using the 32 observations 1-32
Dependent variable: GRADE

      VARIABLE       COEFFICIENT        STDERROR      T STAT       SLOPE
                                                                  (at mean)
  const                -7.45232          2.54247      -2.931
  GPA                   1.62581          0.693883      2.343      0.533347   
  TUCE                  0.0517288        0.0838903     0.617      0.0169697  
  PSI                   1.42633          0.595038      2.397      0.467908   

  Mean of GRADE = 0.344
  Number of cases 'correctly predicted' = 26 (81.2%)
  f(beta'x) at mean of independent vars = 0.328
  McFadden's pseudo-R-squared = 0.377478
  Log-likelihood = -12.8188
  Likelihood ratio test: Chi-square(3) = 15.5459 (p-value 0.001405)
  Akaike information criterion (AIC) = 33.6376
  Schwarz Bayesian criterion (BIC) = 39.5006
  Hannan-Quinn criterion (HQC) = 35.581

           Predicted
             0    1
  Actual 0  18    3
         1   3    8

\end{code}

In this context, the \dollar{uhat} accessor function takes on a special
meaning: it returns the generalized residuals as defined in
\citet*{gourieroux87}, which can be interpreted as unbiased estimators
of the latent disturbances $\varepsilon_i$. These are defined as
%
\begin{equation}
  \label{eq:QR-genres}
  u_i = \left\{
    \begin{array}{ll}
      y_i - \hat{P}_i & \textrm{for the logit model} \\
      y_i\cdot \frac{\phi(\hat{z}_i)}{\Phi(\hat{z}_i)} - 
      ( 1 - y_i ) \cdot \frac{\phi(\hat{z}_i)}{1 - \Phi(\hat{z}_i)}
      & \textrm{for the probit model} \\
    \end{array}
    \right.
\end{equation}

Among other uses, generalized residuals are often used for diagnostic
purposes.  For example, it is very easy to set up an omitted variables
test equivalent to the familiar LM test in the context of a linear
regression; example \ref{QR-add} shows how to perform a variable
addition test.

\begin{script}[htbp]
  \caption{Variable addition test in a probit model}
  \label{QR-add}
\begin{scode}
open greene19_1

probit GRADE const GPA PSI
series u = $uhat 
ols u const GPA PSI TUCE -q
printf "Variable addition test for TUCE:\n"
printf "Rsq * T = %g (p. val. = %g)\n", $trsq, pvalue(X,1,$trsq) 
\end{scode}
\end{script}

\subsection{The perfect prediction problem}
\label{sec:perfpred}

One curious characteristic of logit and probit models is that (quite
paradoxically) estimation is not feasible if a model fits the data
perfectly; this is called the \emph{perfect prediction problem}. The
reason why this problem arises is easy to see by considering equation
(\ref{eq:qr-loglik}): if for some vector $\beta$ and scalar $k$ it's
the case that $z_i < k$ whenever $y_i=0$ and $z_i > k$ whenever
$y_i=1$, the same thing is true for any multiple of $\beta$. Hence,
$L(\beta)$ can be made arbitrarily close to 0 simply by choosing
enormous values for $\beta$. As a consequence, the log-likelihood has
no maximum, despite being bounded.

\app{Gretl} has a mechanism for preventing the algorithm from
iterating endlessly in search of a non-existent maximum. One sub-case
of interest is when the perfect prediction problem arises because of
a single binary explanatory variable. In this case, the offending
variable is dropped from the model and estimation proceeds with the
reduced specification. Nevertheless, it may happen that no single
``perfect classifier'' exists among the regressors, in which case
estimation is simply impossible and the algorithm stops with an
error. This behavior is triggered during the iteration process if
\[
  \max_{i:\, y_i = 0} z_i \, < \, \min_{i:\, y_i = 1} z_i
\]
If this happens, unless your model is trivially mis-specified (like
predicting if a country is an oil exporter on the basis of oil
revenues), it is normally a small-sample problem: you probably just
don't have enough data to estimate your model. You may want to drop
some of your explanatory variables.

This problem is well analyzed in \cite{stokes04}; the results therein
are replicated in the example script \texttt{murder\_rates.inp}. 

\section{Ordered response models}
\label{sec:ordered}

These models constitute a simple variation on ordinary logit/probit
models, and are usually applied when the dependent variable is a
discrete and ordered measurement --- not simply binary, but on an
ordinal rather than an interval scale.  For example, this sort of
model may be applied when the dependent variable is a qualitative
assessment such as ``Good'', ``Average'' and ``Bad''.  

In the general case, consider an ordered response variable, $y$, that
can take on any of the $J+1$ values $\{0,1,2,\dots,J\}$.  We suppose, as
before, that underlying the observed response is a latent variable,
\[
  y^* = X\beta + \varepsilon = z + \varepsilon
\]
Now define ``cut points'', $\alpha_1 < \alpha_2 < \cdots < \alpha_J$,
such that
%
\begin{equation*}
  \begin{array}{ll}
    y = 0 & \textrm{if } \, y^* \leq \alpha_1 \\
    y = 1 & \textrm{if } \, \alpha_1 < y^* \leq \alpha_2 \\
    \vdots \\
    y = J & \textrm{if } \, y^* > \alpha_J \\
  \end{array}
\end{equation*}

For example, if the response takes on three values there will be two
such cut points, $\alpha_1$ and $\alpha_2$.

The probability that individual $i$ exhibits response $j$, conditional
on the characteristics $x_i$, is then given by
%
\begin{equation}
  \label{eq:QR-ordered}
  P(y_i = j \,|\, x_i) = \left\{
    \begin{array}{ll}
      P(y^* \leq \alpha_1 \,|\, x_i) = F(\alpha_1 - z_i) & \textrm{for }\, j = 0 \\
      P(\alpha_j < y^* \leq \alpha_{j+1} \,|\, x_i) = 
      F(\alpha_{j+1} - z_i) -  F(\alpha_j - z_i) & \textrm{for }\, 0 < j < J \\
      P(y^* > \alpha_J \,|\, x_i) = 1 -  F(\alpha_J - z_i) & \textrm{for }\, j = J 
    \end{array}
    \right.
\end{equation}
%
The unknown parameters $\alpha_j$ are estimated jointly with the
$\beta$s via maximum likelihood.  The $\hat{\alpha}_j$ estimates are
reported by \app{gretl} as \texttt{cut1}, \texttt{cut2} and so on.

In order to apply these models in \app{gretl}, the dependent variable
must either take on only non-negative integer values, or be explicitly
marked as discrete.  (In case the variable has non-integer values, it
will be recoded internally.)  Note that \app{gretl} does not provide a
separate command for ordered models: the \texttt{logit} and
\texttt{probit} commands automatically estimate the ordered version if
the dependent variable is acceptable, but not binary.
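
For instance, supposing \texttt{rating} is a series holding a small
number of ordered codes (a hypothetical name), the following sketch
marks it as discrete and estimates an ordered probit:
\begin{code}
# minimal sketch with hypothetical variable names
discrete rating               # mark the series as discrete
probit rating const x1 x2     # ordered probit, since rating is not binary
\end{code}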

Example \ref{ex:oprobit} reproduces the results presented in section
15.10 of \cite{wooldridge-panel}.  The question of interest in this
analysis is what difference it makes, to the allocation of assets
in pension funds, whether individual plan participants have a
choice in the matter.  The response variable is an ordinal measure of
the weight of stocks in the pension portfolio.  Having reported the
results of estimation of the ordered model, Wooldridge illustrates the
effect of the \texttt{choice} variable by reference to an ``average''
participant.  The example script shows how one can compute this effect
in \app{gretl}.  

\begin{script}[htbp]
  \caption{Ordered probit model}
  \label{ex:oprobit}
\begin{scode}
/*
  Replicate the results in Wooldridge, Econometric Analysis of Cross
  Section and Panel Data, section 15.10, using pension-plan data from
  Papke (AER, 1998).

  The dependent variable, pctstck (percent stocks), codes the asset
  allocation responses of "mostly bonds", "mixed" and "mostly stocks"
  as {0, 50, 100}.

  The independent variable of interest is "choice", a dummy indicating
  whether individuals are able to choose their own asset allocations.
*/

open pension.gdt

# demographic characteristics of participant
list DEMOG = age educ female black married
# dummies coding for income level
list INCOME = finc25 finc35 finc50 finc75 finc100 finc101

# Papke's OLS approach
ols pctstck const choice DEMOG INCOME wealth89 prftshr
# save the OLS choice coefficient 
choice_ols = $coeff(choice)

# estimate ordered probit
probit pctstck choice DEMOG INCOME wealth89 prftshr

k = $ncoeff
matrix b = $coeff[1:k-2]
a1 = $coeff[k-1]
a2 = $coeff[k]

/* 
   Wooldridge illustrates the 'choice' effect in the ordered probit 
   by reference to a single, non-black male aged 60, with 13.5 years 
   of education, income in the range $50K - $75K and wealth of $200K, 
   participating in a plan with profit sharing.
*/
matrix X = {60, 13.5, 0, 0, 0, 0, 0, 0, 1, 0, 0, 200, 1}

# with 'choice' = 0
scalar Xb = (0 ~ X) * b
P0 = cdf(N, a1 - Xb)
P50 = cdf(N, a2 - Xb) - P0
P100 = 1 - cdf(N, a2 - Xb)
E0 = 50 * P50 + 100 * P100

# with 'choice' = 1
Xb = (1 ~ X) * b
P0 = cdf(N, a1 - Xb)
P50 = cdf(N, a2 - Xb) - P0
P100 = 1 - cdf(N, a2 - Xb)
E1 = 50 * P50 + 100 * P100

printf "\nWith choice, E(y) = %.2f, without E(y) = %.2f\n", E1, E0
printf "Estimated choice effect via ML = %.2f (OLS = %.2f)\n", E1 - E0,
  choice_ols
\end{scode}
\end{script}

After estimating ordered models, the \dollar{uhat} accessor yields
generalized residuals as in binary models; additionally, the
\dollar{yhat} accessor function returns $\hat{z}_i$, so it is
possible to compute an unbiased estimator of the latent variable
$y^*_i$ simply by adding the two together.
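
In hansl terms this amounts to something like the following minimal
sketch, to be run right after estimating an ordered model:
\begin{code}
# sketch: estimate of the latent variable after an ordered model
series ystar_hat = $yhat + $uhat
\end{code}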

\section{Multinomial logit}
\label{sec:mlogit}

When the dependent variable is not binary and does not have a natural
ordering, \emph{multinomial} models are used.  Multinomial logit is
supported in \app{gretl} via the \verb|--multinomial| option to the
\texttt{logit} command.  Simple models can also be handled via the
\texttt{mle} command (see chapter \ref{chap:mle}). We give here an
example of such a model.  Let the dependent variable, $y_i$, take on
integer values $0,1,\dots,p$.  The probability that $y_i = k$ is given
by
\[
  P(y_i = k |  x_i) = \frac{\exp(x_i \beta_k)}{\sum_{j=0}^p \exp(x_i \beta_j)}
\]
For the purpose of identification one of the outcomes must be taken as
the ``baseline''; it is usually assumed that $\beta_0 = 0$, in which case
\[
  P(y_i = k |  x_i) = \frac{\exp(x_i \beta_k)}{1 + \sum_{j=1}^p \exp(x_i \beta_j)} 
\]
and
\[
  P(y_i = 0 |  x_i) = \frac{1}{1 + \sum_{j=1}^p \exp(x_i \beta_j)} .
\]
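
As a purely illustrative sketch, these probabilities can be computed
from a given coefficient matrix as follows; here \texttt{x} is assumed
to be a $1 \times k$ row vector of covariate values and \texttt{B} a
$k \times p$ matrix whose columns hold $\beta_1, \dots, \beta_p$ (with
$\beta_0 = 0$ as the baseline):
\begin{code}
# hedged sketch, assuming x (1 x k) and B (k x p) already exist
matrix num = exp(x*B)         # exp(x*beta_j) for j = 1..p
scalar den = 1 + sumr(num)    # normalizing constant
matrix P = (1 ~ num) / den    # probabilities for outcomes 0..p
\end{code}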

Example~\ref{ex:mlogit} reproduces Table 15.2 in
\cite{wooldridge-panel}, based on data on career choice from
\cite{keane97}.  The dependent variable is the occupational status of
an individual (0 = in school; 1 = not in school and not working; 2 =
working), and the explanatory variables are education and work
experience (linear and square) plus a ``black'' binary variable.  The
full data set is a panel; here the analysis is confined to a
cross-section for 1987.  For explanations of the matrix methods
employed in the script, see chapter~\ref{chap:matrices}.

\begin{script}[htbp]
  \caption{Multinomial logit}
  \label{ex:mlogit}
\begin{scode}
function series mlogitlogprobs (series y, matrix X, matrix theta)
  scalar n = max(y)
  scalar k = cols(X)
  matrix b = mshape(theta,k,n)
  matrix tmp = X*b
  series ret = -ln(1 + sumr(exp(tmp)))

  loop for i=1..n --quiet
    series x = tmp[,i]
    ret += (y=$i)? x : 0 
  endloop

  return ret
end function

open keane.gdt

# for the manual mle variant the dep. var. must be 0-based
status = status - 1
# and we must exclude missing values
smpl (year=87 && ok(status)) --restrict

matrix X = { const, educ, exper, expersq, black }
scalar k = cols(X)
matrix theta = zeros(2*k, 1)

mle loglik = mlogitlogprobs(status,X,theta)
  params theta
end mle --hessian

# Compare the built-in command (in this case we don't need
# status to be 0-based, and NAs are handled correctly)
smpl --full
status = status + 1
smpl (year=87) --restrict
logit status 0 educ exper expersq black --multinomial
\end{scode}
%$
\end{script}


\section{The Tobit model}
\label{sec:tobit}

The Tobit model is used when the dependent variable of a model is
\emph{censored}.\footnote{We assume here that censoring occurs from
  below at 0. Censoring from above, or at a point different from zero,
  can be rather easily handled by re-defining the dependent variable
  appropriately. For the more general case of two-sided censoring the
  \cmd{intreg} command may be used (see below).}  
Assume a latent variable $y^*_i$ can be described
as
%
\[
  y^*_i = \sum_{j=1}^k x_{ij} \beta_j + \varepsilon_i ,
\]
%
where $\varepsilon_i \sim N(0,\sigma^2)$. If $y^*_i$ were observable,
the model's parameters could be estimated via ordinary least squares.
Suppose, instead, that we observe $y_i$, defined as
%
\begin{equation}
  \label{eq:tobit}
  y_i = \left\{ 
    \begin{array}{ll} 
      y^*_i & \mathrm{for} \quad y^*_i > 0 \\ 
      0 & \mathrm{for} \quad y^*_i \le 0 
    \end{array}
    \right. 
\end{equation}
%
In this case, regressing $y_i$ on the $x_i$s does not yield
consistent estimates of the parameters $\beta$, because the
conditional mean $E(y_i|x_i)$ is not equal to $\sum_{j=1}^k x_{ij}
\beta_j$.  It can be shown that restricting the sample to non-zero
observations would not yield consistent estimates either. The solution
is to estimate the parameters via maximum likelihood. The syntax is
simply
%
\begin{code}
tobit depvar indvars
\end{code}

As usual, progress of the maximization algorithm can be tracked via
the \option{verbose} switch, while \dollar{uhat} returns the
generalized residuals. Note that in this case the generalized residual
is defined as $\hat{u}_i = E(\varepsilon_i | y_i = 0)$ for censored
observations, so the familiar equality $\hat{u}_i = y_i - \hat{y}_i$
only holds for uncensored observations, that is, when $y_i>0$.
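
The point can be verified along the following lines (a sketch with
hypothetical variable names):
\begin{code}
# hedged sketch: uhat = y - yhat only for uncensored observations
tobit y const x1 x2
series dev = $uhat - (y - $yhat)
smpl (y > 0) --restrict
summary dev     # essentially zero for the uncensored cases
\end{code}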

An important difference between the Tobit estimator and OLS is that
the consequences of non-normality of the disturbance term are much
more severe: non-normality implies inconsistency for the Tobit
estimator. For this reason, the output for the tobit model includes
the \cite{chesher-irish87} normality test by default.

\section{Interval regression}
\label{sec:intreg}

The interval regression model arises when the dependent variable is
unobserved for some (possibly all) observations; what we observe
instead is an interval in which the dependent variable lies.  In other
words, the data generating process is assumed to be
\[
  y^*_i = x_i \beta + \epsilon_i
\] 
but we only know that $m_i \le y^*_i \le M_i$, where the interval may
be left- or right-unbounded (but not both). If $m_i = M_i$, we
effectively observe $y^*_i$ and no information loss occurs. In
practice, each observation belongs to one of four categories:
\begin{enumerate}
\item left-unbounded, when $m_i = -\infty$,
\item right-unbounded, when $M_i = \infty$,
\item bounded, when $-\infty < m_i < M_i <\infty$ and
\item point observations when $m_i = M_i$.
\end{enumerate}

It is interesting to note that this model bears similarities to other
models in several special cases:
\begin{itemize}
\item When all observations are point observations the model trivially
  reduces to the ordinary linear regression model.
\item When $m_i = M_i$ for $y^*_i > 0$, while $m_i = -\infty$ and $M_i
  = 0$ otherwise, we have the Tobit model (see \ref{sec:tobit}); a
  small sketch of this equivalence is given after this list.
\item The interval model can be thought of as an ordered probit model
  (see \ref{sec:ordered}) in which the cut points (the $\alpha_j$
  coefficients in eq.\ \ref{eq:QR-ordered}) are observed and need not
  be estimated.
\end{itemize}
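
The second of these equivalences may be illustrated by a small sketch
(variable names hypothetical): for a dependent variable \texttt{y}
censored from below at zero, the bounds are built so that censored
cases are left-unbounded with upper limit 0, while positive values are
point observations.
\begin{code}
# hedged sketch of the Tobit/intreg equivalence (hypothetical names)
series lo = (y > 0) ? y : NA   # left-unbounded when y is censored at 0
series hi = y                  # upper bound: 0 if censored, y otherwise
intreg lo hi const x1 x2
tobit y const x1 x2            # should give essentially the same estimates
\end{code}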

The \app{gretl} command \texttt{intreg} estimates interval models by
maximum likelihood, assuming normality of the disturbance term
$\epsilon_i$.  Its syntax is
%
\begin{code}
intreg minvar maxvar X
\end{code}
%
where \texttt{minvar} contains the $m_i$ series, with \texttt{NA}s for
left-unbounded observations, and \texttt{maxvar} contains $M_i$, with
\texttt{NA}s for right-unbounded observations.  By default, standard
errors are computed using the negative inverse of the Hessian.  If the
\option{robust} flag is given, then QML or Huber--White standard
errors are calculated instead. In this case the estimated covariance
matrix is a ``sandwich'' of the inverse of the estimated Hessian and
the outer product of the gradient.

If the model specification contains regressors other than just a
constant, the output includes a chi-square statistic for testing the
joint null hypothesis that none of these regressors has any effect on
the outcome.  This is a Wald statistic based on the estimated
covariance matrix.  If you wish to construct a likelihood ratio test,
this is easily done by estimating both the full model and the null
model (containing only the constant), saving the log-likelihood in
both cases via the \dollar{lnl} accessor, and then referring twice the
difference between the two log-likelihoods to the chi-square
distribution with $k$ degrees of freedom, where $k$ is the number of
additional regressors (see the \texttt{pvalue} command in the
\GCR). An example is contained in the sample script \texttt{wtp.inp},
provided with the \app{gretl} distribution.
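
In hansl the test might look as follows, a sketch assuming that series
\texttt{lo} and \texttt{hi} and a single additional regressor
\texttt{x} exist (so $k=1$):
\begin{code}
# hedged sketch of the likelihood ratio test (hypothetical names)
intreg lo hi const x
scalar ll_u = $lnl             # unrestricted log-likelihood
intreg lo hi const
scalar ll_r = $lnl             # constant-only log-likelihood
scalar LR = 2 * (ll_u - ll_r)
pvalue X 1 LR                  # chi-square with 1 degree of freedom
\end{code}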

\begin{script}[ht]
  \caption{Interval model on artificial data}
  \label{ex:interval}
Input:
\begin{scodebit}
nulldata 100
# generate artificial data
set seed 201449 
x = normal()
epsilon = 0.2*normal()
ystar = 1 + x + epsilon
lo_bound = floor(ystar)
hi_bound = ceil(ystar)

# run the interval model
intreg lo_bound hi_bound const x

# estimate ystar
gen_resid = $uhat
yhat = $yhat + gen_resid
corr ystar yhat 
\end{scodebit}
Output (selected portions):
\begin{scodebit}
Model 1: Interval estimates using the 100 observations 1-100
Lower limit: lo_bound, Upper limit: hi_bound

             coefficient   std. error   t-ratio    p-value 
  ---------------------------------------------------------
  const       0.993762     0.0338325     29.37    1.22e-189 ***
  x           0.986662     0.0319959     30.84    8.34e-209 ***

Chi-square(1)        950.9270   p-value              8.3e-209
Log-likelihood      -44.21258   Akaike criterion     94.42517
Schwarz criterion    102.2407   Hannan-Quinn         97.58824

sigma = 0.223273
Left-unbounded observations: 0
Right-unbounded observations: 0
Bounded observations: 100
Point observations: 0

...

corr(ystar, yhat) = 0.98960092
Under the null hypothesis of no correlation:
 t(98) = 68.1071, with two-tailed p-value 0.0000
\end{scodebit}
\end{script}

As with the probit and Tobit models, after a model has been estimated
the \dollar{uhat} accessor returns the generalized residual, which is
an estimate of $\epsilon_i$: more precisely, it equals $y_i - x_i
\hat{\beta}$ for point observations and $E(\epsilon_i| m_i, M_i, x_i)$
otherwise. Note that it is possible to compute an unbiased predictor
of $y^*_i$ by adding this estimate to $x_i \hat{\beta}$. Script
\ref{ex:interval} shows an example. As a further similarity with
Tobit, the interval regression model may deliver inconsistent
estimates if the disturbances are non-normal; hence, the
\cite{chesher-irish87} test for normality is included by default here
too.


\section{Sample selection model}
\label{sec:heckit}

In the sample selection model (also known as ``Tobit II'' model),
there are two latent variables:
%
\begin{eqnarray}
  \label{eq:heckit1}
  y^*_i & = & \sum_{j=1}^k x_{ij} \beta_j + \varepsilon_i \\
  \label{eq:heckit2}
  s^*_i & = & \sum_{j=1}^p z_{ij} \gamma_j + \eta_i 
\end{eqnarray}
%
and the observation rule is given by
%
\begin{equation}
  \label{eq:tobitII}
  y_i = \left\{ 
    \begin{array}{ll} 
      y^*_i & \mathrm{for} \quad s^*_i > 0 \\ 
      \diamondsuit & \mathrm{for} \quad s^*_i \le 0 
    \end{array}
    \right. 
\end{equation}

In this context, the $\diamondsuit$ symbol indicates that for some
observations we simply do not have data on $y$: $y_i$ may be 0, or
missing, or anything else. A dummy variable $d_i$ is normally used to
set censored observations apart.

One of the most popular applications of this model in econometrics is
a wage equation coupled with a labor force participation equation: we
only observe the wage for the employed. If $y^*_i$ and $s^*_i$ were
(conditionally) independent, there would be no reason not to use OLS
for estimating equation (\ref{eq:heckit1}); otherwise, OLS does not
yield consistent estimates of the parameters $\beta_j$.

Since conditional independence between $y^*_i$ and $s^*_i$ is
equivalent to conditional independence between $\varepsilon_i$ and
$\eta_i$, one may model the co-dependence between $\varepsilon_i$ and
$\eta_i$ as 
\[
  \varepsilon_i = \lambda \eta_i + v_i ;
\]
substituting the above expression in (\ref{eq:heckit1}), you obtain
the model that is actually estimated:
\[
  y_i = \sum_{j=1}^k x_{ij} \beta_j + \lambda \hat{\eta}_i + v_i ,
\]
so the hypothesis that censoring does not matter is equivalent to the
hypothesis $H_0: \lambda = 0$, which can be easily tested.

The parameters can be estimated via maximum likelihood under the
assumption of joint normality of $\varepsilon_i$ and $\eta_i$;
however, a widely used alternative method yields the so-called
\emph{Heckit} estimator, named after \cite{heckman79}. The procedure
can be briefly outlined as follows: first, a probit model is fit on
equation (\ref{eq:heckit2}); next, the generalized residuals are
inserted in equation (\ref{eq:heckit1}) to correct for the effect of
sample selection.
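
The logic of the two-step procedure can be sketched by hand as follows
(variable names hypothetical; note that, unlike the built-in command,
plain OLS in the second step does not correct the standard errors for
the generated regressor):
\begin{code}
# hedged sketch of the Heckit two-step logic (hypothetical names)
probit d const z1 z2     # selection equation
series imr = $uhat       # generalized residuals (inverse Mills ratio when d=1)
smpl (d == 1) --restrict # keep the uncensored observations
ols y const x1 x2 imr    # the t-ratio on imr tests lambda = 0
\end{code}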

\app{Gretl} provides the \texttt{heckit} command to carry out
estimation; its syntax is
%
\begin{code}
heckit y X ; d Z
\end{code}
%
where \texttt{y} is the dependent variable, \texttt{X} is a list of
regressors, \texttt{d} is a dummy variable holding 1 for uncensored
observations and \texttt{Z} is a list of explanatory variables for the
censoring equation.

Since in most cases maximum likelihood is the method of
choice, by default \app{gretl} computes ML estimates. The 2-step
Heckit estimates can be obtained by using the \option{two-step}
option. After estimation, the \dollar{uhat} accessor contains the
generalized residuals. As in the ordinary Tobit model, the residuals
equal the difference between actual and fitted $y_i$ only for
uncensored observations (those for which $d_i = 1$).

Example \ref{ex:heckit} shows two estimates from the dataset used in
\cite{mroz87}: the first one replicates Table 22.7 in
\cite{greene03},\footnote{Note that the estimates given by \app{gretl}
  do not coincide with those found in the printed volume.  They do,
  however, match those found on the errata web page for Greene's book:
  \url{http://pages.stern.nyu.edu/~wgreene/Text/Errata/ERRATA5.htm}.}
while the second one replicates table 17.1 in \cite{wooldridge-panel}.

\begin{script}[htbp]
  \caption{Heckit model}
  \label{ex:heckit}
\begin{scode}
open mroz87.gdt

genr EXP2 = AX^2
genr WA2 = WA^2
genr KIDS = (KL6+K618)>0

# Greene's specification

list X = const AX EXP2 WE CIT
list Z = const WA WA2 FAMINC KIDS WE

heckit WW X ; LFP Z --two-step 
heckit WW X ; LFP Z 

# Wooldridge's specification

series NWINC = FAMINC - WW*WHRS
series lww = log(WW)
list X = const WE AX EXP2
list Z = X NWINC WA KL6 K618

heckit lww X ; LFP Z --two-step 
\end{scode}
\end{script}

\section{Count data}
\label{sec:count}

To be written.

\section{Duration models}
\label{sec:duration}

In some contexts we wish to apply econometric methods to measurements
of the duration of certain states. Classic examples include the
following:
\begin{itemize}
\item From engineering, the ``time to failure'' of electronic or
  mechanical components: how long do, say, computer hard drives
  last until they malfunction?
\item From the medical realm: how does a new treatment affect the
  time from diagnosis of a certain condition to exit from that
  condition (where ``exit'' might mean death or full recovery)?
\item From economics: the duration of strikes, or of spells of
  unemployment.
\end{itemize}

In each case we may be interested in how the durations are
distributed, and how they are affected by relevant covariates.  There
are several approaches to this problem; the one we discuss here ---
which is currently the only one supported by \app{gretl} --- is
estimation of a parametric model by means of Maximum Likelihood.  In
this approach we hypothesize that the durations follow some definite
probability law and we seek to estimate the parameters of that law,
factoring in the influence of covariates. 

We may express the density (PDF) of the durations as $f(t, X,
\theta)$, where $t$ is the length of time in the state in question,
$X$ is a matrix of covariates, and $\theta$ is a vector of parameters.
The likelihood for a sample of $n$ observations indexed by $i$ is then
\[
L = \prod_{i=1}^n f(t_i, x_i, \theta)
\]

Rather than working with the density directly, however, it is standard
practice to factor $f(\cdot)$ into two components, namely a
\emph{hazard function}, $\lambda$, and a \emph{survivor function},
$S$.  The survivor function gives the probability that a state lasts
at least as long as $t$; it is therefore $1 - F(t, X, \theta)$ where
$F$ is the CDF corresponding to the density $f(\cdot)$. The hazard
function addresses this question: given that a state has persisted as
long as $t$, what is the likelihood that it ends within a short
increment of time beyond $t$ --- that is, it ends between $t$ and $t +
\Delta$?  Taking the limit as $\Delta$ goes to zero, we end up with
the ratio of the density to the survivor function:\footnote{For a
  fuller discussion see, for example, \cite{davidson-mackinnon04}.}
\begin{equation}
\label{eq:surv-decomp}
\lambda(t, X, \theta) = \frac{f(t, X, \theta)}{S(t, X, \theta)}
\end{equation}
so the log-likelihood can be written as
\begin{equation}
\label{eq:surv-loglik}
\ell = \sum_{i=1}^n \log f(t_i, x_i, \theta) = 
\sum_{i=1}^n \left[ \log \lambda(t_i, x_i, \theta) + 
\log S(t_i, x_i, \theta) \right]
\end{equation}

One point of interest is the shape of the hazard function, in
particular its dependence (or not) on time since the state began.  If
$\lambda$ does not depend on $t$ we say the process in question exhibits
\emph{duration independence}: the probability of exiting the state at
any given moment neither increases nor decreases based simply on how
long the state has persisted to date. The alternatives are positive
duration dependence (the likelihood of exiting the state rises, the
longer the state has persisted) or negative duration dependence (exit
becomes less likely, the longer it has persisted).  Finally, the
behavior of the hazard with respect to time need not be monotonic;
some parameterizations allow for this possibility and some do not.

Since durations are inherently positive the probability distribution
used in modeling must respect this requirement, giving a density of
zero for $t \leq 0$.  Four common candidates are the exponential,
Weibull, log-logistic and log-normal, the Weibull being the most
common choice. The table below displays the density and the hazard
function for each of these distributions as they are commonly
parameterized, written as functions of $t$ alone. ($\phi$ and $\Phi$
denote, respectively, the Gaussian PDF and CDF.)

\begin{center}
\setlength\tabcolsep{1.5em}
\begin{tabular}{lll}
 & density, $f(t)$ & hazard, $\lambda(t)$ \\ [1ex]
Exponential & $\displaystyle 
\gamma \exp\,(-\gamma t)$ &$\displaystyle
\gamma$ \\ [1ex]
Weibull & $\displaystyle
\alpha\gamma^{\alpha}t^{\alpha-1}\exp\left[-(\gamma t)^\alpha\right]$ 
& $\displaystyle \alpha\gamma^{\alpha}t^{\alpha-1}$ \\ [1ex]
Log-logistic & $\displaystyle \gamma\alpha
\frac{(\gamma t)^{\alpha-1}}
{\left[1 + (\gamma t)^\alpha\right]^2}$ 
& $\displaystyle \gamma\alpha
\frac{(\gamma t)^{\alpha-1}}
{\left[1 + (\gamma t)^\alpha\right]}$ \\ [2.5ex]
Log-normal & $\displaystyle
\frac{1}{\sigma t} \phi\left[(\log t - \mu)/\sigma \right]$ 
& $\displaystyle
\frac{1}{\sigma t} \frac{\phi\left[(\log t - \mu)/\sigma \right]}
{\Phi\left[-(\log t - \mu)/\sigma \right]}$
\end{tabular}
\end{center}

The hazard is constant for the exponential distribution.  For the
Weibull, it is monotone increasing in $t$ if $\alpha > 1$, or
monotone decreasing for $\alpha < 1$. (If $\alpha = 1$ the Weibull
collapses to the exponential.)  The log-logistic and log-normal
distributions allow the hazard to vary with $t$ in a non-monotonic
fashion.

Covariates are brought into the picture by allowing them to govern one
of the parameters of the density, so that the durations are not
identically distributed across cases.  For example, when using the
log-normal distribution it is natural to make $\mu$, the expected
value of $\log t$, depend on the covariates, $X$.  This is typically
done via a linear index function: $\mu = X\beta$.

Note that the expressions for the log-normal density and hazard
contain the term $(\log t - \mu)/\sigma$.  Replacing $\mu$ with
$X\beta$ this becomes $(\log t - X\beta)/\sigma$.  It turns out that
this constitutes a useful simplifying change of variables for all of
the distributions discussed here. As in \cite{kalbfleisch02}, we
define
\[
w_i \equiv (\log t_i - x_i\beta)/\sigma
\]
The interpretation of the scale factor, $\sigma$, in this expression
depends on the distribution. For the log-normal, $\sigma$ represents
the standard deviation of $\log t$; for the Weibull and the
log-logistic it corresponds to $1/\alpha$; and for the exponential it
is fixed at unity. For distributions other than the log-normal,
$X\beta$ corresponds to $-\log \gamma$, or in other words $\gamma =
\exp(-X\beta)$.

With this change of variables, the density and survivor functions may
be written compactly as follows (the exponential is the same as the
Weibull).

\begin{center}
\setlength\tabcolsep{1.5em}
\begin{tabular}{lll}
 & density, $f(w_i)$ & survivor, $S(w_i)$ \\ [4pt]
Weibull & 
$\exp\left(w_i - e^{w_i}\right)$ & $\exp(-e^{w_i})$
\\ [4pt]
Log-logistic & 
$e^{w_i} \left(1 + e^{w_i}\right)^{-2}$ 
& $\left(1 + e^{w_i}\right)^{-1}$ \\ [1ex]
Log-normal & $\phi(w_i)$ & $\Phi(-w_i)$
\end{tabular}
\end{center}

In light of the above we may think of the generic parameter vector
$\theta$, as in $f(t, X, \theta)$, as composed of the coefficients on
the covariates, $\beta$, plus (in all cases apart from the exponential)
the additional parameter $\sigma$.

A complication in estimation of $\theta$ is posed by ``incomplete
spells''. That is, in some cases the state in question may not have
ended at the time the observation is made (e.g.\ some workers remain
unemployed, some components have not yet failed).  If we use $t_i$ to
denote the time from entering the state to either (a) exiting the
state or (b) the observation window closing, whichever comes first,
then all we know of the ``right-censored'' cases (b) is that the
duration was at least as long as $t_i$. This can be handled by
rewriting the log-likelihood (compare \ref{eq:surv-loglik}) as
\begin{equation}
\label{eq:surv-loglik2}
\ell = \sum_{i=1}^n \left\{ \delta_i\log S\left(w_i\right)
+ \left(1-\delta_i\right) 
\left[-\log\sigma + \log f\left(w_i\right)\right] \right\}
\end{equation}
where $\delta_i$ equals 1 for censored cases (incomplete spells), and
0 for complete observations. The rationale for this is that the
log-density equals the sum of the log hazard and the log survivor
function, but for the incomplete spells only the survivor function
contributes to the likelihood. So in (\ref{eq:surv-loglik2}) we are
adding up the log survivor function alone for the incomplete cases,
plus the full log density for the completed cases.

\subsection{Implementation in \app{gretl} and illustration}

The \texttt{duration} command accepts a list of series on the usual
pattern: dependent variable followed by covariates. If right-censoring
is present in the data this should be represented by a dummy variable
corresponding to $\delta_i$ above, separated from the covariates by
a semicolon. For example,
\begin{code}
duration durat 0 X ; cens
\end{code}
where \texttt{durat} measures durations, \texttt{0} represents the
constant (which is required for such models), \texttt{X} is a named
list of regressors, and \texttt{cens} is the censoring dummy.

By default the Weibull distribution is used; you can substitute any of
the other three distributions discussed here by appending one of the
option flags \verb|--exponential|, \verb|--loglogistic| or 
\verb|--lognormal|.

Interpreting the coefficients in a duration model requires some care,
and we will work through an illustrative case. The example comes from
section 20.3 of \cite{wooldridge-panel}, and it concerns criminal
recidivism.\footnote{Germán Rodríguez of Princeton University has a
  page discussing this example and displaying estimates from
  \app{Stata} at \url{http://data.princeton.edu/pop509a/recid1.html}.}
The data (filename \texttt{recid.gdt}) pertain to a sample of 1,445
convicts released from prison between July 1, 1977 and June 30,
1978. The dependent variable is the time in months until they are
again arrested. The information was gathered retrospectively by
examining records in April 1984; the maximum possible length of
observation is 81 months.  Right-censoring is important: when the data
were compiled, about 62 percent had not been arrested.  The dataset
contains several covariates, which are described in the data file; we
will focus below on interpretation of the \texttt{married} variable, a
dummy which equals 1 if the respondent was married when imprisoned.

Example~\ref{ex:duration1} shows the \app{gretl} commands for a
Weibull model along with most of the output.  Consider first the scale
factor, $\sigma$. The estimate is 1.241 with a standard error of
0.048.  (We don't print a $z$ score and $p$-value for this term since
$H_0: \sigma = 0$ is not of interest.)  Recall that $\sigma$
corresponds to $1/\alpha$; we can be confident that $\alpha$ is less
than 1, so recidivism displays negative duration dependence.  This
makes sense: it is plausible that if a past offender manages to stay
out of trouble for an extended period his risk of engaging in crime
again diminishes. (The exponential model would therefore not be
appropriate in this case.)

\begin{script}[htbp]
  \caption{Weibull model for recidivism data}
Input:
\begin{scodebit}
open recid.gdt
list X = workprg priors tserved felon alcohol drugs \
 black married educ age
duration durat 0 X ; cens
duration durat 0 X ; cens --lognormal
\end{scodebit}
Partial output:
\begin{scodebit}
Model 1: Duration (Weibull), using observations 1-1445
Dependent variable: durat

             coefficient   std. error      z      p-value 
  --------------------------------------------------------
  const       4.22167      0.341311      12.37    3.85e-35 ***
  workprg    -0.112785     0.112535      -1.002   0.3162  
  priors     -0.110176     0.0170675     -6.455   1.08e-10 ***
  tserved    -0.0168297    0.00213029    -7.900   2.78e-15 ***
  felon       0.371623     0.131995       2.815   0.0049   ***
  alcohol    -0.555132     0.132243      -4.198   2.69e-05 ***
  drugs      -0.349265     0.121880      -2.866   0.0042   ***
  black      -0.563016     0.110817      -5.081   3.76e-07 ***
  married     0.188104     0.135752       1.386   0.1659  
  educ        0.0289111    0.0241153      1.199   0.2306  
  age         0.00462188   0.000664820    6.952   3.60e-12 ***
  sigma       1.24090      0.0482896                      

Chi-square(10)       165.4772   p-value              2.39e-30
Log-likelihood      -1633.032   Akaike criterion     3290.065

Model 2: Duration (log-normal), using observations 1-1445
Dependent variable: durat

             coefficient   std. error       z      p-value 
  ---------------------------------------------------------
  const       4.09939      0.347535      11.80     4.11e-32 ***
  workprg    -0.0625693    0.120037      -0.5213   0.6022  
  priors     -0.137253     0.0214587     -6.396    1.59e-10 ***
  tserved    -0.0193306    0.00297792    -6.491    8.51e-11 ***
  felon       0.443995     0.145087       3.060    0.0022   ***
  alcohol    -0.634909     0.144217      -4.402    1.07e-05 ***
  drugs      -0.298159     0.132736      -2.246    0.0247   **
  black      -0.542719     0.117443      -4.621    3.82e-06 ***
  married     0.340682     0.139843       2.436    0.0148   **
  educ        0.0229194    0.0253974      0.9024   0.3668  
  age         0.00391028   0.000606205    6.450    1.12e-10 ***
  sigma       1.81047      0.0623022                       

Chi-square(10)       166.7361   p-value              1.31e-30
Log-likelihood      -1597.059   Akaike criterion     3218.118

\end{scodebit}
\label{ex:duration1}
\end{script}

On a priori grounds, however, we may doubt the monotonic decline in
hazard that is implied by the Weibull specification. Even if a person
is liable to return to crime, it seems relatively unlikely that he
would do so straight out of prison. In the data, we find that only 2.6
percent of those followed were rearrested within 3 months. The
log-normal specification, which allows the hazard to rise and then
fall, may be more appropriate.  Using the \texttt{duration} command
again with the same covariates but the \verb|--lognormal| flag, we get
a log-likelihood of $-1597$ as against $-1633$ for the Weibull,
confirming that the log-normal gives a better fit.

Let us now focus on the \texttt{married} coefficient, which is
positive in both specifications but larger and more sharply estimated
in the log-normal variant. The first thing is to get the
interpretation of the sign right.  Recall that $X\beta$ enters
negatively into the intermediate variable $w$. The Weibull hazard is
increasing in $w_i$ (it is proportional to $e^{w_i}$), so being married
reduces the hazard of re-offending, or in other words lengthens the
expected duration out of prison.  The same qualitative interpretation
applies for the log-normal.

To get a better sense of the married effect, it is useful to show its
impact on the hazard across time. We can do this by plotting the
hazard for two values of the index function $X\beta$: in each case the
values of all the covariates other than \texttt{married} are set to
their means (or some chosen values) while \texttt{married} is set
first to 0 then to 1. Example~\ref{ex:hazard-plots} provides a script
that does this, and the resulting plots are shown in
Figure~\ref{fig:hazard-plots}. Note that when computing the hazards we
need to multiply by the Jacobian of the transformation from $t_i$ to
$w_i = (\log t_i - x_i\beta)/\sigma$, namely $1/t$.  Note also that
the estimate of $\sigma$ is available via the accessor \verb|$sigma|,
but it is also present as the last element in the coefficient vector
obtained via \verb|$coeff|.

A further difference between the Weibull and log-normal specifications
is illustrated in the plots. The Weibull is an instance of a
\emph{proportional hazard} model. This means that for any sets of
values of the covariates, $x_i$ and $x_j$, the ratio of the associated
hazards is invariant with respect to duration. In this example the
Weibull hazard for unmarried individuals is always 1.1637 times that
for married. In the log-normal variant, on the other hand, this ratio
gradually declines from 1.6703 at one month to 1.1766 at 100 months.


\begin{script}[htbp]
  \caption{Create plots showing conditional hazards}
\begin{scode}
open recid.gdt -q

# leave 'married' separate for analysis
list X = workprg priors tserved felon alcohol drugs \
 black educ age

# Weibull variant
duration durat 0 X married ; cens
# coefficients on all Xs apart from married
matrix beta_w = $coeff[1:$ncoeff-2]
# married coefficient
scalar mc_w = $coeff[$ncoeff-1]
scalar s_w = $sigma

# Log-normal variant
duration durat 0 X married ; cens --lognormal
matrix beta_n = $coeff[1:$ncoeff-2]
scalar mc_n = $coeff[$ncoeff-1]
scalar s_n = $sigma

list allX = 0 X
# evaluate X\beta at means of all variables except marriage
scalar Xb_w = meanc({allX}) * beta_w
scalar Xb_n = meanc({allX}) * beta_n

# construct two plot matrices
matrix mat_w = zeros(100, 3)
matrix mat_n = zeros(100, 3)

loop t=1..100 -q
  # first column, duration
  mat_w[t, 1] = t
  mat_n[t, 1] = t
  wi_w = (log(t) - Xb_w)/s_w
  wi_n = (log(t) - Xb_n)/s_n
  # second col: hazard with married = 0
  mat_w[t, 2] = (1/t) * exp(wi_w)
  mat_n[t, 2] = (1/t) * pdf(z, wi_n) / cdf(z, -wi_n)
  wi_w = (log(t) - (Xb_w + mc_w))/s_w
  wi_n = (log(t) - (Xb_n + mc_n))/s_n
  # third col: hazard with married = 1
  mat_w[t, 3] = (1/t) * exp(wi_w)
  mat_n[t, 3] = (1/t) * pdf(z, wi_n) / cdf(z, -wi_n)
endloop

colnames(mat_w, "months unmarried married")
colnames(mat_n, "months unmarried married")

gnuplot 2 3 1 --with-lines --supp --matrix=mat_w --output=weibull.plt
gnuplot 2 3 1 --with-lines --supp --matrix=mat_n --output=lognorm.plt
\end{scode}
\label{ex:hazard-plots}
\end{script}

\begin{figure}[htbp]
\centering
\includegraphics[scale=0.85]{figures/weibull}
\includegraphics[scale=0.85]{figures/lognorm}
\caption{Recidivism hazard estimates for married and unmarried
  ex-convicts}
\label{fig:hazard-plots}
\end{figure}

\subsection{Alternative representations of the Weibull model}

One point to watch out for with the Weibull duration model is that the
estimates may be represented in different ways.  The representation
given by \app{gretl} is sometimes called the \textit{accelerated
  failure-time} (AFT) metric. An alternative that one sometimes sees
is the \textit{log relative-hazard} metric; in fact this is the metric
used in Wooldridge's presentation of the recidivism example.  To get
from AFT estimates to log relative-hazard form it is necessary to
multiply the coefficients by $-\sigma^{-1}$. For example, the
\texttt{married} coefficient in the Weibull specification as shown
here is 0.188104 and $\hat{\sigma}$ is 1.24090, so the alternative
value is $-0.152$, which is what Wooldridge shows
(\citeyear{wooldridge-panel}, Table 20.1).
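
The conversion is easily scripted. The following minimal sketch is
meant to be run right after the Weibull estimation, relying on the
fact that the last element of \dollar{coeff} is $\hat{\sigma}$:
\begin{code}
# sketch: convert AFT coefficients to the log relative-hazard metric
matrix aft = $coeff[1:$ncoeff-1]   # drop sigma, the last element
matrix lrh = -aft / $sigma
print lrh
\end{code}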

\subsection{Fitted values and residuals}

By default, \app{gretl} computes fitted values (accessible via
\dollar{yhat}) as the conditional mean of duration.  The formulae
are shown below (where $\Gamma$ denotes the gamma function, and the
exponential variant is just Weibull with $\sigma = 1$).

\begin{center}
\setlength\tabcolsep{1em}
\begin{tabular}{ccc}
Weibull & Log-logistic & Log-normal \\ [4pt]
$\exp(X\beta)\Gamma(1 + \sigma)$ &
$\displaystyle \exp(X\beta)\frac{\pi \sigma}{\sin(\pi \sigma)}$ &
$\exp(X\beta + \sigma^2/2)$
\end{tabular}
\end{center}

The expression given for the log-logistic mean, however, is valid only
for $\sigma < 1$; otherwise the expectation is undefined, a point that
is not noted in all software.\footnote{The \texttt{predict} adjunct to
  the \texttt{streg} command in \app{Stata} 10, for example, gaily
  produces large negative values for the log-logistic mean in duration
  models with $\sigma > 1$.}

Alternatively, if the \verb|--medians| option is given, \app{gretl}'s
\texttt{duration} command will produce conditional medians as the
content of \dollar{yhat}.  For the Weibull the median is
$\exp(X\beta)(\log 2)^\sigma$; for the log-logistic and log-normal it
is just $\exp(X\beta)$.
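
For instance, mean and median predictions can be retrieved side by
side, as in this sketch which assumes the \texttt{recid.gdt} data and
the list \texttt{X} from Example~\ref{ex:duration1} are in place:
\begin{code}
# sketch: conditional means versus conditional medians (Weibull case)
duration durat 0 X ; cens
series dur_mean = $yhat
duration durat 0 X ; cens --medians
series dur_median = $yhat
print dur_mean dur_median --byobs
\end{code}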

The values we give for the accessor \dollar{uhat} are generalized
(Cox--Snell) residuals, computed as the integrated hazard function,
which equals the negative of the log of the survivor function:
\[
\hat{u}_i = \Lambda(t_i, x_i, \theta) = -\log S(t_i, x_i, \theta)
\]
Under the null of correct specification of the model these generalized
residuals should follow the unit exponential distribution, which has
mean and variance both equal to 1 and density $f(u) = e^{-u}$. See
\cite{cameron-trivedi05} for further discussion.
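
A quick informal check is therefore to inspect the first two moments
of the generalized residuals after estimation, as in this sketch:
\begin{code}
# sketch: Cox-Snell residuals should have mean and variance close to 1
# if the duration model is correctly specified
series cs = $uhat
summary cs
\end{code}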

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: