\chapter{Data files} \label{datafiles} \section{Native format} \label{native-format} \app{gretl} has its own format for data files. Most users will probably not want to read or write such files outside of \app{gretl} itself, but occasionally this may be useful and full details on the file formats are given in Appendix~\ref{app-datafile}. \section{Other data file formats} \label{other-formats} \app{gretl} will read various other data formats. \begin{itemize} \item Plain text (ASCII) files. These can be brought in using \app{gretl}'s ``File, Open Data, Import ASCII\dots{}'' menu item, or the \cmd{import} script command. For details on what \app{gretl} expects of such files, see Section~\ref{scratch}. \item Comma-Separated Values (CSV) files. These can be imported using \app{gretl}'s ``File, Open Data, Import CSV\dots{}'' menu item, or the \cmd{import} script command. See also Section~\ref{scratch}. \item Spreadsheets: MS \app{Excel}, \app{Gnumeric} and Open Document (ODS). These are also brought in using \app{gretl}'s ``File, Open Data, Import'' menu. The requirements for such files are given in Section~\ref{scratch}. \item \app{Stata} data files (\texttt{.dta}). \item \app{SPSS} data files (\texttt{.sav}). \item \app{Eviews} workfiles (\texttt{.wf1}).\footnote{See \url{http://www.ecn.wfu.edu/eviews_format/}.} \item \app{JMulTi} data files. \end{itemize} When you import data from the ASCII or CSV formats, \app{gretl} opens a ``diagnostic'' window, reporting on its progress in reading the data. If you encounter a problem with ill-formatted data, the messages in this window should give you a handle on fixing the problem. As of version 1.7.5, \app{gretl} also offers ODBC connectivity. Be warned: this is a recent feature meant for somewhat advanced users; it may still have a few rough edges and there is no GUI interface for this yet. Interested readers will find more information in Appendix~\ref{chap:odbc}.
For the convenience of anyone wanting to carry out more complex data analysis, \app{gretl} has a facility for writing out data in the native formats of GNU \app{R}, \app{Octave}, \app{JMulTi} and \app{PcGive} (see Appendix~\ref{app-advanced}). In the GUI client this option is found under the ``File, Export data'' menu; in the command-line client use the \cmd{store} command with the appropriate option flag. \section{Binary databases} \label{dbase} For working with large amounts of data \app{gretl} is supplied with a database-handling routine. A \emph{database}, as opposed to a \emph{data file}, is not read directly into the program's workspace. A database can contain series of mixed frequencies and sample ranges. You open the database and select series to import into the working dataset. You can then save those series in a native format data file if you wish. Databases can be accessed via \app{gretl}'s menu item ``File, Databases''. For details on the format of \app{gretl} databases, see Appendix~\ref{app-datafile}. \subsection{Online access to databases} \label{online-data} As of version 0.40, \app{gretl} is able to access databases via the internet. Several databases are available from Wake Forest University. Your computer must be connected to the internet for this option to work. Please see the description of the ``data'' command under \app{gretl}'s Help menu. \tip{Visit the \app{gretl} \href{http://gretl.sourceforge.net/gretl_data.html}{data page} for details and updates on available data.} \subsection{Foreign database formats} \label{RATS} Thanks to Thomas Doan of \emph{Estima}, who made available the specification of the database format used by RATS 4 (Regression Analysis of Time Series), \app{gretl} can handle such databases --- or at least, a subset of same, namely time-series databases containing monthly and quarterly series. \app{Gretl} can also import data from \app{PcGive} databases. 
These take the form of a pair of files, one containing the actual data (with suffix \texttt{.bn7}) and one containing supplementary information (\texttt{.in7}). \section{Creating a data file from scratch} \label{scratch} There are several ways of doing this: \begin{enumerate} \item Find, or create using a text editor, a plain text data file and open it with \app{gretl}'s ``Import ASCII'' option. \item Use your favorite spreadsheet to establish the data file, save it in Comma Separated Values format if necessary (this should not be necessary if the spreadsheet format is MS Excel, Gnumeric or Open Document), then use one of \app{gretl}'s ``Import'' options. \item Use \app{gretl}'s built-in spreadsheet. \item Select data series from a suitable database. \item Use your favorite text editor or other software tools to create a data file in \app{gretl} format independently. \end{enumerate} Here are a few comments and details on these methods. \subsection{Common points on imported data} Options (1) and (2) involve using \app{gretl}'s ``import'' mechanism. For \app{gretl} to read such data successfully, certain general conditions must be satisfied: \begin{itemize} \item The first row must contain valid variable names. A valid variable name is at most 15 characters long; starts with a letter; and contains nothing but letters, numbers and the underscore character, \verb+_+. (Longer variable names will be truncated to 15 characters.) Qualifications to the above: First, in the case of an ASCII or CSV import, if the file contains no row with variable names the program will automatically add names, \verb+v1+, \verb+v2+ and so on. Second, by ``the first row'' is meant the first \emph{relevant} row. In the case of ASCII and CSV imports, blank rows and rows beginning with a hash mark, \verb+#+, are ignored.
In the case of Excel and Gnumeric imports, you are presented with a dialog box where you can select an offset into the spreadsheet, so that \app{gretl} will ignore a specified number of rows and/or columns. \item Data values: these should constitute a rectangular block, with one variable per column (and one observation per row). The number of variables (data columns) must match the number of variable names given. See also section~\ref{missing-data}. Numeric data are expected, but in the case of importing from ASCII/CSV, the program offers limited handling of character (string) data: if a given column contains character data only, consecutive numeric codes are substituted for the strings, and once the import is complete a table is printed showing the correspondence between the strings and the codes. \item Dates (or observation labels): Optionally, the \emph{first} column may contain strings such as dates, or labels for cross-sectional observations. Such strings have a maximum of 8 characters (as with variable names, longer strings will be truncated). A column of this sort should be headed with the string \verb+obs+ or \verb+date+, or the first row entry may be left blank. For dates to be recognized as such, the date strings must adhere to one or other of a set of specific formats, as follows. For \emph{annual} data: 4-digit years. For \emph{quarterly} data: a 4-digit year, followed by a separator (either a period, a colon, or the letter \verb+Q+), followed by a 1-digit quarter. Examples: \verb+1997.1+, \verb+2002:3+, \verb+1947Q1+. For \emph{monthly} data: a 4-digit year, followed by a period or a colon, followed by a two-digit month. Examples: \verb+1997.01+, \verb+2002:10+. \end{itemize} CSV files can use comma, space or tab as the column separator. When you use the ``Import CSV'' menu item you are prompted to specify the separator. In the case of ``Import ASCII'' the program attempts to auto-detect the separator that was used. 
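To make these requirements concrete, here is a small example of a well-formed CSV file for quarterly data (the file contents are invented for illustration): the first row holds valid variable names, the first column is headed \verb+obs+ and contains recognizable quarterly date strings, and the data values form a rectangular block with one variable per column.

\begin{code}
obs,income,price
1990:1,1032.5,3.1
1990:2,1041.2,2.9
1990:3,1049.8,3.4
1990:4,1055.0,3.2
\end{code}

A file of this sort can be opened directly via the ``File, Open Data, Import CSV\dots{}'' menu item, and \app{gretl} should recognize the data as a quarterly time series.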
If you use a spreadsheet to prepare your data you are able to carry out various transformations of the ``raw'' data with ease (adding things up, taking percentages or whatever): note, however, that you can also do this sort of thing easily --- perhaps more easily --- within \app{gretl}, by using the tools under the ``Add'' menu. \subsection{Appending imported data} You may wish to establish a \app{gretl} dataset piece by piece, by incremental importation of data from other sources. This is supported via the ``File, Append data'' menu items: \app{gretl} will check the new data for conformability with the existing dataset and, if everything seems OK, will merge the data. You can add new variables in this way, provided the data frequency matches that of the existing dataset. Or you can append new observations for data series that are already present; in this case the variable names must match up correctly. Note that by default (that is, if you choose ``Open data'' rather than ``Append data''), opening a new data file closes the current one. \subsection{Using the built-in spreadsheet} Under \app{gretl}'s ``File, New data set'' menu you can choose the sort of dataset you want to establish (e.g.\ quarterly time series, cross-sectional). You will then be prompted for starting and ending dates (or observation numbers) and the name of the first variable to add to the dataset. After supplying this information you will be faced with a simple spreadsheet into which you can type data values. In the spreadsheet window, clicking the right mouse button will invoke a popup menu which enables you to add a new variable (column), to add an observation (append a row at the foot of the sheet), or to insert an observation at the selected point (move the data down and insert a blank row.) Once you have entered data into the spreadsheet you import these into \app{gretl}'s workspace using the spreadsheet's ``Apply changes'' button. 
Please note that \app{gretl}'s spreadsheet is quite basic and has no support for functions or formulas. Data transformations are done via the ``Add'' or ``Variable'' menus in the main \app{gretl} window. \subsection{Selecting from a database} Another alternative is to establish your dataset by selecting variables from a database. Begin with \app{gretl}'s ``File, Databases'' menu item. This has four forks: ``Gretl native'', ``RATS 4'', ``PcGive'' and ``On database server''. You should be able to find the file \verb+fedstl.bin+ in the file selector that opens if you choose the ``Gretl native'' option --- this file, which contains a large collection of US macroeconomic time series, is supplied with the distribution. You won't find anything under ``RATS 4'' unless you have purchased RATS data.\footnote{See \href{http://www.estima.com/}{www.estima.com}} If you do possess RATS data you should go into \app{gretl}'s ``Tools, Preferences, General'' dialog, select the Databases tab, and fill in the correct path to your RATS files. If your computer is connected to the internet you should find several databases (at Wake Forest University) under ``On database server''. You can browse these remotely; you also have the option of installing them onto your own computer. The initial remote databases window has an item showing, for each file, whether it is already installed locally (and if so, if the local version is up to date with the version at Wake Forest). Assuming you have managed to open a database you can import selected series into \app{gretl}'s workspace by using the ``Series, Import'' menu item in the database window, or via the popup menu that appears if you click the right mouse button, or by dragging the series into the program's main window. \subsection{Creating a gretl data file independently} It is possible to create a data file in one or other of \app{gretl}'s own formats using a text editor or software tools such as \app{awk}, \app{sed} or \app{perl}. 
This may be a good choice if you have large amounts of data already in machine readable form. You will, of course, need to study the \app{gretl} data formats (XML format or ``traditional'' format) as described in Appendix~\ref{app-datafile}. \section{Structuring a dataset} \label{sec:data-structure} Once your data are read by \app{gretl}, it may be necessary to supply some information on the nature of the data. We distinguish between three kinds of datasets: \begin{enumerate} \item Cross section \item Time series \item Panel data \end{enumerate} The primary tool for doing this is the ``Data, Dataset structure'' menu entry in the graphical interface, or the \texttt{setobs} command for scripts and the command-line interface. \subsection{Cross sectional data} \label{sec:cross-section-data} By a cross section we mean observations on a set of ``units'' (which may be firms, countries, individuals, or whatever) at a common point in time. This is the default interpretation for a data file: if \app{gretl} does not have sufficient information to interpret data as time-series or panel data, they are automatically interpreted as a cross section. In the unlikely event that cross-sectional data are wrongly interpreted as time series, you can correct this by selecting the ``Data, Dataset structure'' menu item. Click the ``cross-sectional'' radio button in the dialog box that appears, then click ``Forward''. Click ``OK'' to confirm your selection. \subsection{Time series data} \label{sec:timeser-data} When you import data from a spreadsheet or plain text file, \app{gretl} will make fairly strenuous efforts to glean time-series information from the first column of the data, if it looks at all plausible that such information may be present. If time-series structure is present but not recognized, again you can use the ``Data, Dataset structure'' menu item. 
Select ``Time series'' and click ``Forward''; select the appropriate data frequency and click ``Forward'' again; then select or enter the starting observation and click ``Forward'' once more. Finally, click ``OK'' to confirm the time-series interpretation if it is correct (or click ``Back'' to make adjustments if need be). Besides the basic business of getting a data set interpreted as time series, further issues may arise relating to the frequency of time-series data. In a gretl time-series data set, all the series must have the same frequency. Suppose you wish to make a combined dataset using series that, in their original state, are not all of the same frequency. For example, some series are monthly and some are quarterly. Your first step is to formulate a strategy: Do you want to end up with a quarterly or a monthly data set? A basic point to note here is that ``compacting'' data from a higher frequency (e.g.\ monthly) to a lower frequency (e.g.\ quarterly) is usually unproblematic. You lose information in doing so, but in general it is perfectly legitimate to take (say) the average of three monthly observations to create a quarterly observation. On the other hand, ``expanding'' data from a lower to a higher frequency is not, in general, a valid operation. In most cases, then, the best strategy is to start by creating a data set of the \textit{lower} frequency, and then to compact the higher frequency data to match. When you import higher-frequency data from a database into the current data set, you are given a choice of compaction method (average, sum, start of period, or end of period). In most instances ``average'' is likely to be appropriate. You \textit{can} also import lower-frequency data into a high-frequency data set, but this is generally not recommended. What \app{gretl} does in this case is simply replicate the values of the lower-frequency series as many times as required. 
For example, suppose we have a quarterly series with the value 35.5 in 1990:1, the first quarter of 1990. On expansion to monthly, the value 35.5 will be assigned to the observations for January, February and March of 1990. The expanded variable is therefore useless for fine-grained time-series analysis, outside of the special case where you know that the variable in question does in fact remain constant over the sub-periods. When the current data frequency is appropriate, \app{gretl} offers both ``Compact data'' and ``Expand data'' options under the ``Data'' menu. These options operate on the whole data set, compacting or expanding all series. They should be considered ``expert'' options and should be used with caution. \subsection{Panel data} \label{sec:panel-data} Panel data are inherently three dimensional --- the dimensions being variable, cross-sectional unit, and time-period. For example, a particular number in a panel data set might be identified as the observation on capital stock for General Motors in 1980. (A note on terminology: we use the terms ``cross-sectional unit'', ``unit'' and ``group'' interchangeably below to refer to the entities that compose the cross-sectional dimension of the panel. These might, for instance, be firms, countries or persons.) For representation in a textual computer file (and also for gretl's internal calculations) the three dimensions must somehow be flattened into two. This ``flattening'' involves taking layers of the data that would naturally stack in a third dimension, and stacking them in the vertical dimension. \app{Gretl} always expects data to be arranged ``by observation'', that is, such that each row represents an observation (and each variable occupies one and only one column). In this context the flattening of a panel data set can be done in either of two ways: \begin{itemize} \item Stacked time series: the successive vertical blocks each comprise a time series for a given unit.
\item Stacked cross sections: the successive vertical blocks each comprise a cross-section for a given period. \end{itemize} You may input data in whichever arrangement is more convenient. Internally, however, \app{gretl} always stores panel data in the form of stacked time series. When you import panel data into \app{gretl} from a spreadsheet or comma separated format, the panel nature of the data will not be recognized automatically (most likely the data will be treated as ``undated''). A panel interpretation can be imposed on the data using the graphical interface or via the \cmd{setobs} command. In the graphical interface, use the menu item ``Data, Dataset structure''. In the first dialog box that appears, select ``Panel''. In the next dialog you have a three-way choice. The first two options, ``Stacked time series'' and ``Stacked cross sections'' are applicable if the data set is already organized in one of these two ways. If you select either of these options, the next step is to specify the number of cross-sectional units in the data set. The third option, ``Use index variables'', is applicable if the data set contains two variables that index the units and the time periods respectively; the next step is then to select those variables. For example, a data file might contain a country code variable and a variable representing the year of the observation. In that case \app{gretl} can reconstruct the panel structure of the data regardless of how the observation rows are organized. The \cmd{setobs} command has options that parallel those in the graphical interface. If suitable index variables are available you can do, for example % \begin{code} setobs unitvar timevar --panel-vars \end{code} % where \texttt{unitvar} is a variable that indexes the units and \texttt{timevar} is a variable indexing the periods. 
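As a concrete sketch (the file name and variable names here are hypothetical), suppose a plain-text data file \texttt{panel.csv} contains a country code variable \texttt{country} and a year variable \texttt{year} alongside the data proper. The panel interpretation could then be imposed in a script as follows:

\begin{code}
# hypothetical index variables "country" and "year"
open panel.csv
setobs country year --panel-vars
\end{code}

Since \app{gretl} stores panel data internally as stacked time series, it will sort the observations into that arrangement itself, regardless of the row order in the original file.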
Alternatively you can use the form \verb+setobs+ \textsl{freq} \verb+1:1+ \textsl{structure}, where \textsl{freq} is replaced by the ``block size'' of the data (that is, the number of periods in the case of stacked time series, or the number of units in the case of stacked cross-sections) and \textsl{structure} is either \option{stacked-time-series} or \option{stacked-cross-section}. Two examples are given below: the first is suitable for a panel in the form of stacked time series with observations from 20 periods; the second for stacked cross sections with 5 units.
%
\begin{code}
setobs 20 1:1 --stacked-time-series
setobs 5 1:1 --stacked-cross-section
\end{code}

\subsubsection{Panel data arranged by variable}

Publicly available panel data sometimes come arranged ``by variable.'' Suppose we have data on two variables, \varname{x1} and \varname{x2}, for each of 50 states in each of 5 years (giving a total of 250 observations per variable). One textual representation of such a data set would start with a block for \varname{x1}, with 50 rows corresponding to the states and 5 columns corresponding to the years. This would be followed, vertically, by a block with the same structure for variable \varname{x2}. A fragment of such a data file is shown below, with quinquennial observations 1965--1985. Imagine the table continued for 48 more states, followed by another 50 rows for variable \varname{x2}.

\begin{center}
\begin{tabular}{rrrrrr}
\varname{x1} \\
   & 1965  & 1970  & 1975  & 1980  & 1985 \\
AR & 100.0 & 110.5 & 118.7 & 131.2 & 160.4\\
AZ & 100.0 & 104.3 & 113.8 & 120.9 & 140.6\\
\end{tabular}
\end{center}

If a datafile with this sort of structure is read into \app{gretl},\footnote{Note that you will have to modify such a datafile slightly before it can be read at all.
The line containing the variable name (in this example \varname{x1}) will have to be removed, and so will the initial row containing the years, otherwise they will be taken as numerical data.} the program will interpret the columns as distinct variables, so the data will not be usable ``as is.'' But there is a mechanism for correcting the situation, namely the \cmd{stack} function within the \cmd{genr} command. Consider the first data column in the fragment above: the first 50 rows of this column constitute a cross-section for the variable \varname{x1} in the year 1965. If we could create a new variable by stacking the first 50 entries in the second column underneath the first 50 entries in the first, we would be on the way to making a data set ``by observation'' (in the first of the two forms mentioned above, stacked cross-sections). That is, we'd have a column comprising a cross-section for \varname{x1} in 1965, followed by a cross-section for the same variable in 1970. The following gretl script illustrates how we can accomplish the stacking, for both \varname{x1} and \varname{x2}. We assume that the original data file is called \texttt{panel.txt}, and that in this file the columns are headed with ``variable names'' \varname{p1}, \varname{p2}, \dots, \varname{p5}. (The columns are not really variables, but in the first instance we ``pretend'' that they are.)

\begin{code}
open panel.txt
genr x1 = stack(p1..p5) --length=50
genr x2 = stack(p1..p5) --offset=50 --length=50
setobs 50 1:1 --stacked-cross-section
store panel.gdt x1 x2
\end{code}

The second line illustrates the syntax of the \cmd{stack} function. The double dots within the parentheses indicate a range of variables to be stacked: here we want to stack all 5 columns (for all 5 years). The full data set contains 100 rows; in the stacking of variable \varname{x1} we wish to read only the first 50 rows from each column: we achieve this by adding \verb+--length=50+.
Note that if you want to stack a non-contiguous set of columns you can give a comma-separated list of variable names, as in
%
\begin{code}
genr x = stack(p1,p3,p5)
\end{code}
%
or you can provide within the parentheses the name of a previously created list (see chapter~\ref{chap-persist}). On line 3 we do the stacking for variable \varname{x2}. Again we want a \texttt{length} of 50 for the components of the stacked series, but this time we want gretl to start reading from the 50th row of the original data, and we specify \verb+--offset=50+. Line 4 imposes a panel interpretation on the data; finally, we save the data in gretl format, with the panel interpretation, discarding the original ``variables'' \varname{p1} through \varname{p5}. The illustrative script above is appropriate when the number of variables to be processed is small. When there are many variables in the data set it will be more efficient to use a command loop to accomplish the stacking, as shown in the following script. The setup is presumed to be the same as in the previous example (50 units, 5 periods), but with 20 variables rather than 2.

\begin{code}
open panel.txt
loop for i=1..20
  genr k = ($i - 1) * 50
  genr x$i = stack(p1..p5) --offset=k --length=50
endloop
setobs 50 1:1 --stacked-cross-section
store panel.gdt x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 \
  x11 x12 x13 x14 x15 x16 x17 x18 x19 x20
\end{code}

\subsubsection{Panel data marker strings} It can be helpful with panel data to have the observations identified by mnemonic markers. A special function in the \texttt{genr} command is available for this purpose. In the example above, suppose all the states are identified by two-letter codes in the left-most column of the original datafile. When the stacking operation is performed, these codes will be stacked along with the data values. If the first row is marked \texttt{AR} for Arkansas, then the marker \texttt{AR} will end up being shown on each row containing an observation for Arkansas.
That's all very well, but these markers don't tell us anything about the date of the observation. To rectify this we could do:

\begin{code}
genr time
genr year = 1960 + (5 * time)
genr markers = "%s:%d", marker, year
\end{code}

The first line generates a 1-based index representing the period of each observation, and the second line uses the \texttt{time} variable to generate a variable representing the year of the observation. The third line contains this special feature: if (and only if) the name of the new ``variable'' to generate is \texttt{markers}, the portion of the command following the equals sign is taken as a C-style format string (which must be wrapped in double quotes), followed by a comma-separated list of arguments. The arguments will be printed according to the given format to create a new set of observation markers. Valid arguments are either the names of variables in the dataset, or the string \texttt{marker} which denotes the pre-existing observation marker. The format specifiers which are likely to be useful in this context are \texttt{\%s} for a string and \texttt{\%d} for an integer. Strings can be truncated: for example \texttt{\%.3s} will use just the first three characters of the string. To chop initial characters off an existing observation marker when constructing a new one, you can use the syntax \texttt{marker + n}, where \texttt{n} is a positive integer: in that case the first \texttt{n} characters will be skipped. After the commands above are processed, then, the observation markers will look like, for example, \texttt{AR:1965}, where the two-letter state code and the year of the observation are spliced together with a colon. \section{Missing data values} \label{missing-data} These are represented internally as \verb+DBL_MAX+, the largest floating-point number that can be represented on the system (which is likely to be at least 10 to the power 300, and so should not be confused with legitimate data values).
In a native-format data file they should be represented as \verb+NA+. When importing CSV data \app{gretl} accepts several common representations of missing values including $-$999, the string \verb+NA+ (in upper or lower case), a single dot, or simply a blank cell. Blank cells should, of course, be properly delimited, e.g.\ \verb+120.6,,5.38+, in which the middle value is presumed missing. As for handling of missing values in the course of statistical analysis, \app{gretl} does the following: \begin{itemize} \item In calculating descriptive statistics (mean, standard deviation, etc.) under the \cmd{summary} command, missing values are simply skipped and the sample size adjusted appropriately. \item In running regressions \app{gretl} first adjusts the beginning and end of the sample range, truncating the sample if need be. Missing values at the beginning of the sample are common in time series work due to the inclusion of lags, first differences and so on; missing values at the end of the range are not uncommon due to differential updating of series and possibly the inclusion of leads. \end{itemize} If \app{gretl} detects any missing values ``inside'' the (possibly truncated) sample range for a regression, the result depends on the character of the dataset and the estimator chosen. In many cases, the program will automatically skip the missing observations when calculating the regression results. In this situation a message is printed stating how many observations were dropped. On the other hand, the skipping of missing observations is not supported for all procedures: exceptions include all autoregressive estimators, system estimators such as SUR, and nonlinear least squares. In the case of panel data, the skipping of missing observations is supported only if their omission leaves a balanced panel. If missing observations are found in cases where they are not supported, \app{gretl} gives an error message and refuses to produce estimates. 
In case missing values in the middle of a dataset present a problem, the \cmd{misszero} function (use with care!) is provided under the \cmd{genr} command. By doing \cmd{genr foo = misszero(bar)} you can produce a series \cmd{foo} which is identical to \cmd{bar} except that any missing values become zeros. Then you can use carefully constructed dummy variables to, in effect, drop the missing observations from the regression while retaining the surrounding sample range.\footnote{\cmd{genr} also offers the inverse function to \cmd{misszero}, namely \cmd{zeromiss}, which replaces zeros in a given series with the missing observation code.} \section{Maximum size of data sets} \label{data-limits} Basically, the size of data sets (both the number of variables and the number of observations per variable) is limited only by the characteristics of your computer. \app{Gretl} allocates memory dynamically, and will ask the operating system for as much memory as your data require. Obviously, then, you are ultimately limited by the size of RAM. Aside from the multiple-precision OLS option, gretl uses double-precision floating-point numbers throughout. The size of such numbers in bytes depends on the computer platform, but is typically eight. To give a rough notion of magnitudes, suppose we have a data set with 10,000 observations on 500 variables. That's 5 million floating-point numbers or 40 million bytes. If we define the megabyte (MB) as $1024 \times 1024$ bytes, as is standard in talking about RAM, it's slightly over 38 MB. The program needs additional memory for workspace, but even so, handling a data set of this size should be quite feasible on a current PC, which at the time of writing is likely to have at least 256 MB of RAM. If RAM is not an issue, there is one further limitation on data size (though it's very unlikely to be a binding constraint). 
That is, variables and observations are indexed by signed integers, and on a typical PC these will be 32-bit values, capable of representing a maximum positive value of $2^{31} - 1 = 2,147,483,647$. The limits mentioned above apply to \app{gretl}'s ``native'' functionality. There are tighter limits with regard to two third-party programs that are available as add-ons to \app{gretl} for certain sorts of time-series analysis including seasonal adjustment, namely \app{TRAMO/SEATS} and \app{X-12-ARIMA}. These programs employ a fixed-size memory allocation, and can't handle series of more than 600 observations. \section{Data file collections} \label{collections} If you're using \app{gretl} in a teaching context you may be interested in adding a collection of data files and/or scripts that relate specifically to your course, in such a way that students can browse and access them easily. There are three ways to access such collections of files: \begin{itemize} \item For data files: select the menu item ``File, Open data, Sample file'', or click on the folder icon on the \app{gretl} toolbar. \item For script files: select the menu item ``File, Script files, Practice file''. \end{itemize} When a user selects one of the items: \begin{itemize} \item The data or script files included in the gretl distribution are automatically shown (this includes files relating to Ramanathan's \emph{Introductory Econometrics} and Greene's \emph{Econometric Analysis}). \item The program looks for certain known collections of data files available as optional extras, for instance the datafiles from various econometrics textbooks (Davidson and MacKinnon, Gujarati, Stock and Watson, Verbeek, Wooldridge) and the Penn World Table (PWT 5.6). (See \href{http://gretl.sourceforge.net/gretl_data.html}{the data page} at the gretl website for information on these collections.) If the additional files are found, they are added to the selection windows. 
\item The program then searches for valid file collections (not necessarily known in advance) in these places: the ``system'' data directory, the system script directory, the user directory, and all first-level subdirectories of these. For reference, typical values for these directories are shown in Table~\ref{tab-colls}. (Note that \texttt{PERSONAL} is a placeholder that is expanded by Windows, corresponding to ``My Documents'' on English-language systems.) \end{itemize} \begin{table}[htbp] \begin{center} \begin{tabular}{lll} & \multicolumn{1}{c}{\textit{Linux}} & \multicolumn{1}{c}{\textit{MS Windows}} \\ system data dir & {\small \verb+/usr/share/gretl/data+} & {\small \verb+c:\Program Files\gretl\data+} \\ system script dir & {\small \verb+/usr/share/gretl/scripts+} & {\small \verb+c:\Program Files\gretl\scripts+} \\ user dir & {\small \verb+$HOME/gretl+} & {\small \verb+PERSONAL\gretl+}\\ \end{tabular} \end{center} \caption{Typical locations for file collections} \label{tab-colls} \end{table} Any valid collections will be added to the selection windows. So what constitutes a valid file collection? This comprises either a set of data files in \app{gretl} XML format (with the \verb+.gdt+ suffix) or a set of script files containing gretl commands (with \verb+.inp+ suffix), in each case accompanied by a ``master file'' or catalog. The \app{gretl} distribution contains several example catalog files, for instance the file \verb+descriptions+ in the \verb+misc+ sub-directory of the \app{gretl} data directory and \verb+ps_descriptions+ in the \verb+misc+ sub-directory of the scripts directory. If you are adding your own collection, data catalogs should be named \verb+descriptions+ and script catalogs should be named \verb+ps_descriptions+. In each case the catalog should be placed (along with the associated data or script files) in its own specific sub-directory (e.g.\ \url{/usr/share/gretl/data/mydata} or \verb+c:\userdata\gretl\data\mydata+).
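For example, a hypothetical collection named \verb+mydata+ on a Linux system might be laid out as follows (the data file names are illustrative only):

\begin{code}
/usr/share/gretl/data/mydata/
    descriptions      <- the catalog file for the collection
    wages.gdt         <- data files in gretl XML format
    prices.gdt
\end{code}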
The syntax of the (plain text) description files is straightforward. Here, for example, are the first few lines of gretl's ``misc'' data catalog: \begin{code} # Gretl: various illustrative datafiles "arma","artificial data for ARMA script example" "ects_nls","Nonlinear least squares example" "hamilton","Prices and exchange rate, U.S. and Italy" \end{code} The first line, which must start with a hash mark, contains a short name, here ``Gretl'', which will appear as the label for this collection's tab in the data browser window, followed by a colon, followed by an optional short description of the collection. Subsequent lines contain two elements, separated by a comma and wrapped in double quotation marks. The first is a datafile name (leave off the \verb+.gdt+ suffix here) and the second is a short description of the content of that datafile. There should be one such line for each datafile in the collection. A script catalog file looks very similar, except that each line contains three fields: a filename (without its \verb+.inp+ suffix), a brief description of the econometric point illustrated in the script, and a brief indication of the nature of the data used. Again, here are the first few lines of the supplied ``misc'' script catalog: \begin{code} # Gretl: various sample scripts "arma","ARMA modeling","artificial data" "ects_nls","Nonlinear least squares (Davidson)","artificial data" "leverage","Influential observations","artificial data" "longley","Multicollinearity","US employment" \end{code} If you want to make your own data collection available to users, these are the steps: \begin{enumerate} \item Assemble the data, in whatever format is convenient. \item Convert the data to \app{gretl} format and save as \verb+gdt+ files. It is probably easiest to convert the data by importing them into the program from plain text, CSV, or a spreadsheet format (MS Excel or Gnumeric) then saving them.
You may wish to add descriptions of the individual variables (the ``Variable, Edit attributes'' menu item), and add information on the source of the data (the ``Data, Edit info'' menu item). \item Write a descriptions file for the collection using a text editor. \item Put the datafiles plus the descriptions file in a subdirectory of the \app{gretl} data directory (or user directory). \item If the collection is to be distributed to other people, package the data files and catalog in some suitable manner, e.g.\ as a zipfile. \end{enumerate} If you assemble such a collection, and the data are not proprietary, we would encourage you to submit the collection for packaging as a \app{gretl} optional extra. %%% Local Variables: %%% mode: latex %%% TeX-master: "gretl-guide" %%% End: