\chapter[\pxm\ Generalities]{\pxm\
Generalities}

\label{chap:polyxmass-generalities}

In this chapter, I wish to introduce some general concepts around the
\pxm\ program. 

\renewcommand{\sectitle}{General \pxm\ Concepts}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

The \pxm\ mass spectrometry software suite has been designed to be
able to ``work'' with every polymer on earth. Well, in a certain way
this is true\dots\ A more faithful account of the \pxm' capabilities
would be: ``\emph{The \pxm\ software suite works with whatever polymer
  chemistry the user cares to define; the more accurate the polymer
  chemistry definition, the more accurate \pxm\ will be}''. Does it
sound like much of the responsibility for the proper functioning of the
\pxm\ framework is in the hands of the user?  That is true! However,
with \pxm\ the user has a framework at hand to define polymer
chemistries that suit his or her needs.

The main concept that drove the design of the entire \pxm\ framework
is \emph{abstraction}. Indeed, for the program to be able to
understand a variety of possibly very different polymers, it had to be
written using some \emph{abstraction layer} between the way masses are
computed and the way the polymer is described ``in memory''. This
abstraction layer is implemented by using a ``polymer chemistry
definition-driven'' set of functionalities. The polymer chemistry
definition drives all the mass computations, all the polymer sequence
editing, all the polymer chemistry reactions\dots\ This is how the
\pxm\ software suite makes it possible to handle any polymer type. To
implement this abstraction paradigm, the \pxm\ mass spectrometry
framework was designed to be modular, as described below.

The \pxm\ mass spectrometry software suite comprises the following
packages (not all of them installing actual binary/executable
programs):

\begin {enumerate}
\item \pxmbin\ (and \pxmbincommon\ in some distributions, like Debian)
  {\footnotesize this is the binary package containing the
    \progname{\pxmng} binary program. This is where the user will
    spend most of his time, doing either polymer chemistry definitions
    (\pxd\ menu), mass calculations (\pxc\ menu) or real polymer
    sequence chemical simulations along with mass spectrometry
    simulations (\pxe\ menu)};
\item \pxmcommon\ {\footnotesize this is a non-binary package where
    the essential configuration/data files are stored, like the
    scripts that are used to update the catalogues of available
    polymer chemistry or atom definitions. This package comes with the
    basic atom definition file and an example polymer chemistry
    definition (``protein'')};
\item \pxmdata\ {\footnotesize this is an \emph{optional} non-binary
    package where other example polymer chemistry definitions are
    delivered, so that the user might learn how to prepare other
    packages to submit to the \pxm\ development team for incorporation
    in the \pxm\ software suite as official packages};
\end {enumerate} 

\noindent In the rest of this manual we shall call ``module'' a set of
functionalities that are aimed at a specific task: for example, all
the functionalities that are accessible in the \progname{\pxmng} binary program
with the aim of defining polymer chemistries will be called the ``\pxd
module'' and will be triggered by using the menu tree rooted at the
``\pxd'' menu item.

\bigskip

The fact that the \pxm\ software suite is able to handle any polymer
chemistry is, as we said above, due to its ability to interface a
polymer sequence with a polymer chemistry definition. To explain this
clearly, imagine a protein sequence that would be this tetrapeptide:
``ATGC'', which reads as ``AlanineThreonineGlycineCysteine''. Now
imagine the same ``ATGC'' sequence but as a DNA sequence, which reads
as ``AdenineThymineGuanineCytosine''.  The two sequences would be
entered in a sequence editor by keying in the following key sequence:
\kbdKey{A}\kbdKey{T}\kbdKey{G}\kbdKey{C}. Of course, while the
sequence is identical in both cases, you would expect the masses for
the DNA sequence to be much higher than the masses for the protein
sequence. 

This is where ``abstraction'' comes in, and modularity also: in order
to let the user perform the required computations as flexibly as
possible, she first defines two different polymer chemistries: the
first named ``protein'' and the second named ``dna''. In each of the
two distinct polymer chemistry definitions, the user will enter a
formula corresponding to each monomer (A,T,G,C).  Of course, the
monomer formula for a Threonine is very different from the one for a
Thymine.  
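
\noindent As a rough illustration, the same four one-letter codes
could map to the monomer formulae tabulated below. This is only a
sketch: the formulae are the usual residue compositions from general
chemistry (the free amino acid, or the deoxynucleoside monophosphate,
minus one water molecule). They are assumptions made for this example
and are not copied from the definition files shipped with \pxm, which
may differ.

\begin{center}
  \begin{tabular}{lll}
    \hline
    Code & ``protein'' definition & ``dna'' definition \\
    \hline
    A & $\mathrm{C_3H_5NO}$ (alanine) &
        $\mathrm{C_{10}H_{12}N_5O_5P}$ (adenine nucleotide) \\
    T & $\mathrm{C_4H_7NO_2}$ (threonine) &
        $\mathrm{C_{10}H_{13}N_2O_7P}$ (thymine nucleotide) \\
    G & $\mathrm{C_2H_3NO}$ (glycine) &
        $\mathrm{C_{10}H_{12}N_5O_6P}$ (guanine nucleotide) \\
    C & $\mathrm{C_3H_5NOS}$ (cysteine) &
        $\mathrm{C_9H_{12}N_3O_6P}$ (cytosine nucleotide) \\
    \hline
  \end{tabular}
\end{center}

\noindent Summing the four monomers, and ignoring the end groups and
any ionization, gives $\mathrm{C_{12}H_{20}N_4O_5S}$ (monoisotopic mass
of about 332.12) for the tetrapeptide, against
$\mathrm{C_{39}H_{49}N_{15}O_{24}P_4}$ (about 1235.20) for the
tetranucleotide.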

The definition of the polymer chemistry is performed in the \pxd\
module that is accessible in the \pxmng\ program under the ``\pxd''
menu item.  Once a polymer chemistry definition is saved, it may be
made available to the system (we'll see how this is done). And when a
polymer chemistry definition is made available to the system, any new
polymer sequence may be created that abides by this polymer chemistry
definition.

By defining precisely the chemical behaviour of a polymer type, and
making an association between a given polymer chemistry definition and
a polymer sequence, the user makes use of the \emph {abstraction
  layer} that we mentioned above.  Once this is well understood, the
originality of the \pxm software framework is understood. This is
precisely what sets \pxm apart from the other mass
spectrometry-related software offerings.

Since the different functionalities offered by the \pxm\ framework are
well confined in three distinct modules, all accessible from the
\progname{\pxmng} binary program, but sitting in clearly distinct menu
trees, we'll review each of these ``modules'' in later chapters.

Before going on with the description of the different modules, I would
like to introduce some other more chemistry-oriented concepts that are
going to be used throughout the \pxm framework.

\renewcommand{\sectitle}{On Formulae And Chemical Reactions}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

Users of any of the \pxm' modules will very frequently make use of
formulae or of chemical reactions. These two chemical entities are not
identical in \pxm. While a formula represents a chemical state (a
monomer has a given formula, and does not change it), a chemical
reaction is something much more dynamic, I should say ``active''.

This difference is very important in \pxm. Let's take an example: the
Lysyl monomer (we call a protein ``residue'' a ``monomer'') has the
following formula: $\mathrm{C_6H_{12}N_2O}$. If I wish to acetylate
this Lysyl monomer, the reaction will read this way: ``An acetic acid
molecule will condense onto the amine of the Lysyl side chain''. This
can also read: ---\textsl{``An acetyl group enters the Lysyl side
  chain while a hydrogen atom leaves the Lysyl side chain; water is
  lost in the process''}. If we wanted to put this into a more
chemistry-oriented representation, we could write this:

\[\mathrm{R-NH_2 + CH_3COOH \rightleftharpoons R-NH-CO-CH_3 + H_2O}\]

This can be stated more briefly as: ``$\mathrm
{-H_2O+CH_3COOH}$''. This is exactly what \pxm\ calls an \emph
{``actionformula''} or, for brevity, an \emph {``actform''}, so called
because actions are associated with formulae: here the $\mathrm
{H_2O}$ formula is associated with the $\mathrm {-}$ action, which indicates
that the water molecule leaves the molecules being reacted, while the
$\mathrm {CH_3COOH}$ formula is associated with the $\mathrm {+}$ action,
which means that the acetic acid molecule enters the target
molecule. The net result is thus, as stated earlier: ---\textsl{``An
  acetyl group enters the Lysyl side chain while a hydrogen atom
  leaves the Lysyl side chain; water is lost in the process''}.

In the \pxm\ framework, the \emph {formula} and \emph {actform}
chemical entities are \emph {not} interchangeable.
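
As a worked illustration of the difference, the acetylation actform
can be reduced to a net elemental gain. This is a sketch only: the
monoisotopic atomic masses used below (C~=~12.000000, H~=~1.007825,
O~=~15.994915) are the standard values and are assumed here rather
than taken from the \pxm\ atom definition file.

\[\mathrm{-H_2O + CH_3COOH \;=\; -H_2O + C_2H_4O_2 \;=\; +C_2H_2O}\]

Applied to the formula of the Lysyl monomer, this net gain yields the
formula of the acetylated monomer:

\[\mathrm{C_6H_{12}N_2O + C_2H_2O \;=\; C_8H_{14}N_2O_2}\]

The corresponding monoisotopic mass increment is:

\[\Delta m = 2 \times 12.000000 + 2 \times 1.007825 + 15.994915 = 42.010565\]

The formula describes what the monomer \emph{is}, while the actform
describes what \emph{happens} to it.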


\renewcommand{\sectitle}{The \pxm\ Framework Data Format}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

All the data in the \pxm\ framework are stored on disk as
\fileformat{XML}-formatted files. \fileformat{XML} is the
\emph{eXtensible Markup Language}. This ``language'' makes it possible to
describe the structure of a document. Have you ever opened an
\fileformat{HTML} file with a text editor? If so, you have certainly
seen some markup like \verb|<H1>This is the title</H1>|. The browser
that loads this file will understand (because it has been programmed
to do so) that the title ``This is the title'' is to be displayed on
the screen using a bold sans-serif font, for example. Well, let us
just say that the \fileformat{XML} file format is an immensely more
powerful equivalent of \fileformat{HTML}.

There would be a lot\dots a lot to say about \fileformat{XML} and
\emph{Document Type Definition}\/s: I'll refrain from entering into
the details.

The big advantage of using the \fileformat{XML} format in \pxm\ is
that it is a text format, and not a binary one. This means that any
data in the \pxm\ package is human-readable (even if the
\fileformat{XML} syntax makes the data a bit difficult to read, it is
actually possible). Try to read one polymer chemistry definition
\filename{.xml} file from the \pxmdata\ package (say, the \filename
{dna-sample.xml} file, for example), and you'll see that this is
pure text (the same applies to the \filename{.pxm} polymer sequence
files in the same package).  The advantages of using text file formats,
with respect to binary file formats, are:

\begin{itemize}
\item if somebody sends you a file and you do not have the program
  that made it, you still can extract information from the file,
  because it is readable with any text editor;
\item if a text file (such as your most important polymer sequence
  \fileformat {XML} file) gets corrupted for some reason
  (\textit{e.g.} during backup on a faulty medium, or whatever) you will
  still be able to extract from the corrupted file all the bits of
  information that surround the portion that is corrupted, thus
  minimizing the data loss. This is usually far harder with binary
  files, which often become unusable as soon as a single part of them is
  corrupted;
\item imagine you would like to write a simple script that would
  allow you to find ---in a given directory--- all the sequence files
  that contain the ``myo'' character string in the polymer's name
  field (in \fileformat{XML} a field is called \emph{element}). You
  can do it easily \emph{without} asking anybody for the file format
  specification ---because your sequence files are just text files.
\end{itemize} 

As an example of how simple it is, I'll just write a \software{bash}
shell script below that I'll save into the \filename{polname-find.sh}
file in order to execute it afterwards.  This is what the shell script
looks like in the
\filename{polname-find.sh} file:\\
 
\prompt\ \command{cat} \filename{polname-find.sh} \kbdEnterKey\ 
\begin{alltt}\textsl{ 
    # print each matching line, then the name of the file holding it
    for i in *.pxm
    do
      if grep "<name>.*myo.*</name>" "\$i"
      then
        echo "in file \$i"
      fi
    done}
\end{alltt}

Now we should make this brand new file executable so we can run it:

\prompt\ \command{chmod} \option{u+x} \filename{polname-find.sh}
\kbdEnterKey\  

Upon execution of this script, the output looks like this:

\prompt\ \filename{./polname-find.sh} \kbdEnterKey\

\begin{alltt}\textsl{
    <name>myoglobin-horse</name>
    in file myoglob-h.pxm
    <name>myosin-chicken</name>
    in file myos-chck.pxm
    <name>myo-fragment1</name>
    in file myofrag1.pxm
    <name>apomyoglobin-rabbit</name>
    in file apomyo-rbt.pxm}
\end{alltt}

The script has gone through all the \filename{*.pxm} files and for
each file has searched for a start tag \verb|<name>| followed by some
string containing ``myo'' followed by the end tag \verb|</name>|. If
``myo'' is found, the corresponding line is printed to the screen, and
the name of the file containing this pattern is printed also. 

With a binary file format this would have been far more difficult. This
little script lets you screen a big database in a snap. That's the power of
\OSname{UNIX} and \OSname{UNIX}-like operating systems.

\renewcommand{\sectitle}{Editing the Data in \pxm\ Files}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

The aim of \pxm\ is to let people use the software the way they like,
with no preconception on the way they interact with it. The
\fileformat {XML} files (polymer sequence or polymer chemistry
definition files) can be edited using the graphical interface but also
using a simple text editor. Figure~\ref{fig:poldef-comp-gtk-emacs}
shows two rather different means to the same end: editing a polymer
chemistry definition file.  The Document Type Definition (DTD) is not
shown on the right pane of the figure, since it is at the top of the
file being displayed. This DTD will help the user to determine how to
edit the file in a safe way, by telling where each element is
authorized to be, and so on\dots\ You'll need to learn \fileformat{XML}
if you wish to understand the DTD (a Sunday afternoon will suffice).
Usually, the safest way to do any editing is by using the graphical
interface, not because the \pxm\ framework understands the edited data
better this way, but because the graphical interface layout (acting
like a data correctness censor) just prevents the user from writing
badly-formed data directly in the \fileformat{XML} file.

\begin{figure}
  \begin{center}
    \includegraphics[scale=0.7]{figures/raster/poldef-comp-gtk-emacs.png} 
  \end{center}
  \caption[Graphical and text editing of a polymer chemistry 
  definition]{\textbf{Comparison of a graphical and a text way of
      editing a polymer chemistry definition file.} The left pane
    shows the graphical interface that is exposed to the user when
    defining a polymer. The right pane shows the same \fileformat{XML}
    file opened in the \progname{Emacs} editor with the
    \fileformat{XML} editing mode switched on.}
  \label{fig:poldef-comp-gtk-emacs}
\end{figure}

The example shown in Figure~\ref{fig:poldef-comp-gtk-emacs} can be
transposed to the polymer sequence \fileformat{XML} files in the very
same way. Of course the whole process that leads to ``creating'' a new
polymer chemistry definition is going to be explained in detail in a
later chapter (see chapter~\ref{chap:polyxdef},
page~\pageref{chap:polyxdef}).


\renewcommand{\sectitle}{General Polymer Element Naming Policy}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

\label{sect:polymer_element_naming_policy}

Unless otherwise specified, it is \emph{strongly} suggested \emph{not}
to insert any non-alphanumeric or non-ASCII character (space, \%, \#,
\$\dots) in the strings that the user enters to identify polymer
chemistry definition items. This means that, for example, the user
must refrain from using such characters for the
atom name and symbol, the name, the code or the formula of the
monomers or of the modifications, or of the cleavage specifications,
or of the fragmentation specifications\dots\ Usually, the accepted
delimiting characters are `-' and `\_'. It is important not to
cripple these polymer data for two main reasons:

\begin{itemize}
\item so that the program performs smoothly (some parsing processes
  rely on specific characters (like `\#' or `\%', for example) to
  isolate sub-strings from larger strings);
\item so that the results can be easily and clearly displayed when
  the time comes to print all the data.
\end{itemize}


\renewcommand{\sectitle}{Graphical Interface Design}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

For those coming to \OSname{UNIX} after having used \OSname{MS
  Windows} (like me), I would like to state some general graphical
interface design specificities of the \OSname{UNIX} world. The
\OSname{MS Windows} graphical environment was designed in such a way
that the user is very strictly restricted to a narrow path each time
she initiates an action. That policy has often led to arbitrary
limitations in the design of software running on the \OSname{MS
  Windows} systems.

This is not going to be exactly the same with a \OSname{UNIX}
graphical environment: you almost certainly are going to quickly have
a great number of windows opened on your desktop; you are the one who
knows when to close a results window, not the program designer. When a
window is opened, it is not going to be systematically required that
it be closed before opening another one. This has a simple reason:
imagine that you wanted to compare the oligomers generated by using
two different enzymes on the same polymer sequence; you'll need both
results windows to be opened at the same time, otherwise how could
the oligomers be compared? That reasoning holds for a number of
situations, and, yes, you'll be responsible for closing the
windows you do not need anymore!

This general behaviour is highly desirable, since it indeed allows the
user to make comparisons between the data from two different
experiments right after having generated the data. But this behaviour
introduces a risk: how will it be possible to ascertain that any given
set of peptides does come from the cleavage of the first protein using
cleaving-agent-1 and not from the cleavage of the first protein using
cleaving-agent-2? In other words: how are you going to recognize which
results window contains the peptides of the first cleavage, and which
results window contains the peptides obtained from the second
cleavage?  There is an answer: each time a window is displayed ---if
there is a risk of ambiguity--- it will show the identity number
(\guilabel{ID number}) of the polymer to which it is related. This ID
number is nothing else but the \emph{unique} memory address of the
polymer sequence editing context to which the window is related.

In any situation where an ambiguity exists about the identity of the
data generated on any given polymer sequence, a traceability system is
used, as shown in Figure~\ref{fig:identities-follow-results}.

\begin{figure}
  \begin{center}
    \includegraphics[scale=1.75]
    {figures/raster/identities-follow-results.png}
  \end{center}
  \caption[Identity of polymer sequences]{\textbf{Unambiguous
      identification of polymer sequences and related data.} When a
    polymer sequence is loaded/created, it is assigned a numeric value
    that unambiguously identifies it {\small (for the programmer, this
      is the pointer to the polymer structure)}. Each time a window is
    displayed that contains data pertaining to any given polymer
    sequence (oligomers generated by cleavage of a given polymer
    sequence, for example), it is given a reference to the polymer
    whence the data came, and this reference is the polymer's identity
    number. This is clearly visible here: the polymer sequence has a
    given \guilabel{ID Number} and all the related windows display
    that same number.  Note that the cleavage results data have
    another \guilabel{ID Number} that is later used to trace the mass
    find results data (last bottom window).}
  \label{fig:identities-follow-results}
\end{figure}


\renewcommand{\sectitle}{Feedback From \pxm\ To The User: The Console Window}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

Something very specific to the \OSname{UNIX} and \OSname{UNIX}-like
systems (and that I really like) is the fact that the programs are
usually designed to be ``verbose'' (if the user asks for it). The usual
means of giving feedback in other systems is to pop up a ``dialog''
window in which a message is displayed, and the user has to
acknowledge (typically by clicking on a button widget) in order to
close the dialog window. \pxm\ has been implemented with the
``console'' philosophy in mind: every message that it wishes to ``hand
out'' to the user is sent to the terminal window from which the
program was started.

There are two levels of very important messages: the \emph{CRITICAL}
and the \emph{ERROR} level messages. The CRITICAL-level messages
indicate that the time has come to make a quick save of all the data,
because something bad might happen. ERROR-level messages cannot even
be read in the console window, because they cause the program to
abort.  These aborts are deliberate on \pxm' part, because the
error is so bad that the program would crash sooner or later anyway.

Each time a message (of any importance level) is issued to the user,
the console window is presented to the user (if it was hidden or
minimized, this console window is shown afresh). Figure~\ref{fig:polyxmass-console-wnd}
shows the console window with a warning.

\begin{figure}
  \begin{center}
    \includegraphics[scale=1.75]{figures/raster/polyxmass-console-wnd.png}
  \end{center}
  \caption[The console window]{\textbf{The console window where any
      messages to the user are displayed.} Depending on the importance
    level of the message being issued to the user, the color will be
    more or less ``reddish''.}
  \label{fig:polyxmass-console-wnd}
\end{figure}



\renewcommand{\sectitle}{Window Management}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

\pxm\ is powerful and flexible: any number of polymer sequences can be
opened at any given time, and any number of simulations might be
performed on any of these polymer sequences. This might lead to a huge
number of windows opened on the desktop at any given time. There are
two main types of windows:

\begin{itemize}

\item windows that do not display results. These windows are typically
  windows where the user is provided with options to perform some
  action. For example, one such window might be the window that allows
  the user to select an enzyme when an enzymatic cleavage is required
  on a polymer sequence;

\item windows that are responsible for displaying a polymer sequence,
  the results of some simulation or of any computation. For example, a
  window displaying results (\emph{results window}, for short) might
  be the window that displays all the oligomers obtained upon cleavage
  of a polymer sequence using one enzymatic agent; or the window where
  all the fragmentation oligomers are displayed after the gas-phase
  fragmentation of a polymer sequence.  Other examples are the windows
  where the monomeric composition of the polymer sequence is displayed
  and where the pH/pKa/pI computation results are displayed\dots

\end{itemize}
  
In order to ease the management of all the results windows opened at
any given time, a window management facility was devised. Its
incarnation is shown in
Figure~\vref{fig:polyxmass-window-management}. This window is opened
from the main program's window menu:

\centerline{\guimenu{View}\guimenuitem{Window List}\\}


\begin{figure}
  \begin{center}
    \includegraphics[scale=1.75]
    {figures/raster/polyxmass-window-management.png}
  \end{center}
  \caption[The window management facility]{\textbf{The window
      management facility.} Each time a polymer sequence window ---or
    a window where results are displayed--- is opened, it is
    registered and appears in the treeview on the left of the depicted
    window. The user can then select any window of interest and
    perform actions on this window.}
  \label{fig:polyxmass-window-management}
\end{figure}

The window management operations include the following actions, which
apply to the window item currently selected in the \guilabel{Available
  Windows} treeview on the left of the window:

\begin{itemize}

\item \guilabel{Show Window:} {\footnotesize force a hidden/minimized
    window to show itself;}
  
\item \guilabel{Hide Window:} {\footnotesize force a window to hide
    itself.}

\end {itemize}
  

\noindent As soon as a window that is listed in the
\guilabel{Available Windows} treeview is closed, its corresponding
item in the treeview is removed.









\cleardoublepage


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "polyxmass"
%%% End: