\chapter[\pxm\ Generalities]{\pxm\ Generalities}
\label{chap:polyxmass-generalities}

In this chapter, I wish to introduce some general concepts around the \pxm\ program.

\renewcommand{\sectitle}{General \pxm\ Concepts}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

The \pxm\ mass spectrometry software suite has been designed to be able to ``work'' with every polymer on earth. Well, in a certain way this is true\dots\ A more faithful account of the \pxm' capabilities would be: ``\emph{The \pxm\ software suite works with whatever polymer chemistry the user cares to define; the more accurate the polymer chemistry definition, the more accurate \pxm\ will be}''. Sounds like much of the responsibility for the proper functioning of the \pxm\ framework is in the hands of the user? That is true! However, with \pxm\ the user has a framework at hand to define polymer chemistries so as to suit his needs.

The main concept that drove the design of the entire \pxm\ framework is \emph{abstraction}. Indeed, for the program to be able to understand a variety of possibly very different polymers, it had to be written using some \emph{abstraction layer} between the way masses are computed and the way the polymer is described ``in memory''. This abstraction layer is implemented by using a ``polymer chemistry definition-driven'' set of functionalities. The polymer chemistry definition drives all the mass computations, all the polymer sequence editing, all the polymer chemistry reactions\dots\ This is how the \pxm\ software suite makes it possible to handle any polymer type.

To implement this abstraction paradigm, the \pxm\ mass spectrometry framework was designed to be modular, as described below.
The \pxm\ mass spectrometry software suite comprises the following packages (not all of them install actual binary/executable programs):

\begin{enumerate}
\item \pxmbin\ (and \pxmbincommon\ in some distributions, like Debian) {\footnotesize this is the binary package that provides the \progname{\pxmng} binary program. This is where the user will spend most of his time: doing either polymer chemistry definitions (\pxd menu), mass calculations (\pxc menu) or real polymer sequence chemical simulations along with mass spectrometry simulations (\pxe menu)};
\item \pxmcommon\ {\footnotesize this is a non-binary package where the essential configuration/data files are stored, like the scripts that are used to update the catalogues of available polymer chemistry or atom definitions. This package comes with the basic atom definition file and an example polymer chemistry definition (``protein'')};
\item \pxmdata\ {\footnotesize this is an \emph{optional} non-binary package where other example polymer chemistry definitions are delivered, so that the user might learn how to prepare other packages to submit to the \pxm\ development team for incorporation in the \pxm\ software suite as official packages}.
\end{enumerate}

\noindent In the rest of this manual we shall call ``module'' a set of functionalities that are aimed at a specific task: for example, all the functionalities that are accessible in the \progname{\pxmng} binary program with the aim of defining polymer chemistries will be called the ``\pxd module'' and will be triggered by using the menu tree rooted at the ``\pxd'' menu item.

\bigskip

The fact that the \pxm\ software suite is able to handle any polymer chemistry is, as we said above, due to its ability to interface a polymer sequence with a polymer chemistry definition. To explain this clearly, imagine a protein sequence that would be this tetrapeptide: ``ATGC'', which reads as ``AlanineThreonineGlycineCysteine''.
Now imagine the same ``ATGC'' sequence but as a DNA sequence, which reads as ``AdenineThymineGuanineCytosine''. The two sequences would be entered in a sequence editor by keying in the following key sequence: \kbdKey{A}\kbdKey{T}\kbdKey{G}\kbdKey{C}. Of course, while the sequence is identical in both cases, you'd expect the masses for the DNA sequence to be much higher than the masses for the protein sequence. This is where ``abstraction'' comes in, and modularity also: in order to let the user perform the required computations as flexibly as possible, she first defines two different polymer chemistries: the first named ``protein'' and the second named ``dna''. In each of the two distinct polymer chemistry definitions, the user will enter a formula corresponding to each monomer (A, T, G, C). Of course, the monomer formula for a Threonine is very different from the one for a Thymine.

The definition of the polymer chemistry is performed in the \pxd module that is accessible in the \pxmng\ program under the ``\pxd'' menu item. Once a polymer chemistry definition is saved, it may be made available to the system (we'll see how this is done). And when a polymer chemistry definition is made available to the system, any new polymer sequence may be created that abides by this polymer chemistry definition.

By defining precisely the chemical behaviour of a polymer type, and making an association between a given polymer chemistry definition and a polymer sequence, the user makes use of the \emph{abstraction layer} that we mentioned above. Once this is well understood, the originality of the \pxm\ software framework is understood. This is precisely what sets \pxm\ apart from the other mass spectrometry-related software offerings.
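
To put numbers on the ``ATGC'' example, here is a worked check. It is only a sketch under stated assumptions: the monomer masses are standard monoisotopic values computed from the usual elemental compositions (amino acid residues for the protein; 2'-deoxynucleoside monophosphate residues for the DNA), and each chain is assumed to be capped with one water molecule. None of these values is read from the definition files shipped with \pxm, whose own capping conventions may differ.
\[
\begin{array}{lrcr}
\textrm{``protein'':} & 71.03711 + 101.04768 + 57.02146 + 103.00919 + 18.01056 & = & 350.12600 \\
\textrm{``dna'':} & 313.05761 + 304.04604 + 329.05252 + 289.04637 + 18.01056 & = & 1253.21310
\end{array}
\]
The same four keystrokes thus yield a molecule of about 350~Da under the first definition and of about 1253~Da under the second; only the polymer chemistry definition has changed.
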
Since the different functionalities offered by the \pxm\ framework are well confined in three distinct modules, all accessible from the \progname{\pxmng} binary program, but sitting in clearly distinct menu trees, we'll review each of these ``modules'' in later chapters. Before going on with the description of the different modules, I would like to introduce some other more chemistry-oriented concepts that are going to be used throughout the \pxm\ framework.

\renewcommand{\sectitle}{On Formulae And Chemical Reactions}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

It is very common for a user of any of the \pxm' modules to make use of formulae or of chemical reactions. These two chemical entities are not identical in \pxm. While a formula represents a chemical status (a monomer has a given formula, and does not change it), a chemical reaction is something much more dynamic, I should say ``active''. This difference is very important in \pxm. Let's take an example: the Lysyl monomer (we call a protein ``residue'' a ``monomer'') has the following formula: $\mathrm{C_6H_{12}N_2O}$. If I wish to acetylate this Lysyl monomer, the reaction will read this way: ``An acetic acid molecule will condense onto the amine of the Lysyl side chain''. This can also read: \textsl{``An acetyl group enters the Lysyl side chain while a hydrogen atom leaves the Lysyl side chain; water is lost in the process''}. If we wanted to put this into a more chemistry-oriented representation, we could write this:
\[\mathrm{R-NH_2 + CH_3COOH \rightleftharpoons R-NH-CO-CH_3 + H_2O}\]
This is stated more briefly as: ``$\mathrm {-H_2O+CH_3COOH}$''.
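
Applied to the Lysyl monomer formula given above, this shorthand can be checked by hand. The following is a worked sketch using standard monoisotopic atomic masses; it is plain arithmetic, not the output of \pxm:
\[
\mathrm{C_6H_{12}N_2O \; - \; H_2O \; + \; C_2H_4O_2 \; = \; C_8H_{14}N_2O_2}
\]
\[
128.09496 - 18.01056 + 60.02113 = 170.10553
\]
The net change is $\mathrm{+C_2H_2O}$ (about $+42.0106$~Da), which is all that the compact notation ``$\mathrm{-H_2O+CH_3COOH}$'' encodes.
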
This is exactly what \pxm\ calls an \emph{``actionformula''} or, for brevity, an \emph{``actform''}, simply because actions are associated with formulae: here the $\mathrm{H_2O}$ formula is associated with the `$-$' action, which indicates that a water molecule leaves the reacting molecules, while the $\mathrm{CH_3COOH}$ formula is associated with the `$+$' action, which means that the acetic acid molecule enters the target molecule. The net result is thus, as stated earlier: \textsl{``An acetyl group enters the Lysyl side chain while a hydrogen atom leaves the Lysyl side chain; water is lost in the process''}. In the \pxm\ framework, the \emph{formula} and \emph{actform} chemical entities are \emph{not} interchangeable.

\renewcommand{\sectitle}{The \pxm\ Framework Data Format}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

All the data in the \pxm\ framework are stored on disk as \fileformat{XML}-formatted files. \fileformat{XML} is the \emph{eXtensible Markup Language}. This ``language'' makes it possible to describe the structure of a document. Have you ever opened an \fileformat{HTML} file with a text editor? If so, you have certainly seen some markup like \verb|<H1>This is the title</H1>|. The browser that loads this file will understand (because it has been programmed to do so) that the title ``This is the title'' is to be displayed on the screen using a bold sans-serif font, for example. Well, let us just say that the \fileformat{XML} file format is an immensely more powerful equivalent of \fileformat{HTML}. There would be a lot\dots\ a lot to say about \fileformat{XML} and \emph{Document Type Definition}\/s: I'll refrain from entering into the details. The big advantage of using the \fileformat{XML} format in \pxm\ is that it is a text format, and not a binary one.
This means that any data in the \pxm\ package is human-readable (the \fileformat{XML} syntax makes the data a bit difficult to read, but it is actually possible). Try to read one polymer chemistry definition \filename{.xml} file from the \pxmdata\ package (say, the \filename{dna-sample.xml} file, for example), and you'll see that this is pure text (the same applies to the \filename{.pxm} polymer sequence files in the same package). The advantages of using text file formats, with respect to binary file formats, are:

\begin{itemize}
\item if somebody sends you a file and you do not have the program that made it, you still can extract information from the file, because it is readable with any text editor;
\item if a text file (such as your most important polymer sequence \fileformat{XML} file) gets corrupted for some reason (\textit{e.g.} during a backup onto a faulty medium) you will still be able to extract from the corrupted file all the bits of information that surround the portion that is corrupted, thus minimizing the data loss. This is much harder with binary files, which often become unusable as soon as a single part of them is corrupted;
\item imagine you would like to write a simple script that would allow you to find, in a given directory, all the sequence files that contain the ``myo'' character string in the polymer's name field (in \fileformat{XML} a field is called an \emph{element}). You can do it easily \emph{without} asking anybody for the file format specification, because your sequence files are just text files.
\end{itemize}

As an example of how simple it is, I'll just write a \software{bash} shell script below that I'll save into the \filename{polname-find.sh} file in order to execute it afterwards. This is what the shell script looks like in the \filename{polname-find.sh} file:\\
\prompt\ \command{cat} \filename{polname-find.sh} \kbdEnterKey\
\begin{alltt}\textsl{
for i in *.pxm
do
  grep "<name>.*myo.*</name>" "\$i"
  if [ \$? -eq 0 ]
  then
    echo "in file \$i"
  fi
done}
\end{alltt}

Now we should make this brand new file executable so we can run it:

\prompt\ \command{chmod} \option{u+x} \filename{polname-find.sh} \kbdEnterKey\

Upon execution of this script, the output looks like this:

\prompt\ \filename{./polname-find.sh} \kbdEnterKey\
\begin{alltt}\textsl{
<name>myoglobin-horse</name>
in file myoglob-h.pxm
<name>myosin-chicken</name>
in file myos-chck.pxm
<name>myo-fragment1</name>
in file myofrag1.pxm
<name>apomyoglobin-rabbit</name>
in file apomyo-rbt.pxm}
\end{alltt}

The script has gone through all the \filename{*.pxm} files and for each file has searched for a start tag \verb|<name>| followed by some string containing ``myo'' followed by the end tag \verb|</name>|. If ``myo'' is found, the corresponding line is printed to the screen, and the name of the file containing this pattern is printed also. With an undocumented binary file format this would have been far more difficult. This little script lets you screen a big database in a snap. That's the power of \OSname{UNIX} and \OSname{UNIX}-like operating systems.

\renewcommand{\sectitle}{Editing the Data in \pxm\ Files}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

The aim of \pxm\ is to let people use the software the way they like, with no preconception about the way they interact with it. The \fileformat{XML} files (polymer sequence or polymer chemistry definition files) can be edited using the graphical interface but also using a simple text editor. Figure~\ref{fig:poldef-comp-gtk-emacs} shows two rather different means to the same end: editing a polymer chemistry definition file. The Document Type Definition (DTD) is not shown in the right pane of the figure, since it is at the top of the file being displayed.
This DTD will help the user to determine how to edit the file in a safe way, by telling where each element is authorized to be, and so on\dots\ You'll need to learn \fileformat{XML} if you wish to understand the DTD (a Sunday afternoon will suffice). Usually, the safest way to do any editing is by using the graphical interface, not because the \pxm\ framework understands the edited data better this way, but because the graphical interface layout (acting like a data correctness censor) just prevents the user from writing badly-formed data directly in the \fileformat{XML} file.

\begin{figure}
\begin{center}
\includegraphics[scale=0.7]{figures/raster/poldef-comp-gtk-emacs.png}
\end{center}
\caption[Graphical and text editing of a polymer chemistry definition]{\textbf{Comparison of a graphical and a text way of editing a polymer chemistry definition file.} The left pane shows the graphical interface that is exposed to the user when defining a polymer. The right pane shows the same \fileformat{XML} file opened in the \progname{Emacs} editor with the \fileformat{XML} editing mode switched on.}
\label{fig:poldef-comp-gtk-emacs}
\end{figure}

The example shown in Figure~\ref{fig:poldef-comp-gtk-emacs} can be transposed to the polymer sequence \fileformat{XML} files in the very same way. Of course, the whole process that leads to ``creating'' a new polymer chemistry definition is explained in detail in a later chapter (see chapter~\ref{chap:polyxdef}, page~\pageref{chap:polyxdef}).

\renewcommand{\sectitle}{General Polymer Element Naming Policy}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}
\label{sect:polymer_element_naming_policy}

Unless otherwise specified, it is \emph{strongly} suggested \emph{not} to insert any non-alphanumeric or non-ASCII character (space, \%, \#, \$\dots) in the strings that the user enters to identify polymer chemistry definition items.
This means that, for example, the user must refrain from using non-alphanumeric or non-ASCII characters for the atom name and symbol, the name, the code or the formula of the monomers or of the modifications, or of the cleavage specifications, or of the fragmentation specifications\dots\ Usually, the accepted delimiting characters are `-' and `\_'. It is important not to clutter these polymer data with such characters for two main reasons:

\begin{itemize}
\item so that the program performs smoothly (some parsing processes rely on specific characters (like `\#' or `\%', for example) to extract sub-strings from larger strings);
\item so that the results can be easily and clearly displayed when the time comes to print all the data.
\end{itemize}

\renewcommand{\sectitle}{Graphical Interface Design}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

For those coming to \OSname{UNIX} after having used \OSname{MS Windows} (like me), I would like to state some general graphical interface design specificities of the \OSname{UNIX} world. The \OSname{MS Windows} graphical environment was designed in such a way that the user is strictly confined to a narrow path each time she initiates an action. That policy has often led to arbitrary limitations in the design of software running on the \OSname{MS Windows} systems. This is not going to be exactly the same with a \OSname{UNIX} graphical environment: you almost certainly are going to quickly have a great number of windows open on your desktop; you are the one who knows when to close a results window, not the program designer. When a window is opened, it is not going to be systematically required that it be closed before opening another one. This has a simple reason: imagine that you wanted to compare the oligomers generated by using two different enzymes on the same polymer sequence; you'll need both results windows to be open at the same time, otherwise, how could the oligomers be compared?
That reasoning is true for a number of situations, and ---yes--- you'll be responsible for closing the windows you do not need anymore! This general behaviour is highly desirable, since it indeed allows the user to make comparisons between the data from two different experiments right after having generated the data. But this behaviour introduces a risk: how will it be possible to ascertain that any given set of peptides does come from the cleavage of the first protein using cleaving-agent-1 and not from the cleavage of the first protein using cleaving-agent-2? In other words: how are you going to recognize which results window contains the peptides of the first cleavage, and which results window contains the peptides obtained from the second cleavage? There is an answer: each time a window is displayed ---if there is a risk of ambiguity--- it will show the identity number (\guilabel{ID number}) of the polymer to which it is related. This ID number is nothing else but the \emph{unique} memory address of the polymer sequence editing context to which the window is related. In any situation where an ambiguity exists about the identity of the data generated on any given polymer sequence, a traceability system is used, as shown in Figure~\ref{fig:identities-follow-results}. \begin{figure} \begin{center} \includegraphics[scale=1.75] {figures/raster/identities-follow-results.png} \end{center} \caption[Identity of polymer sequences]{\textbf{Unambiguous identification of polymer sequences and related data.} When a polymer sequence is loaded/created, it is assigned a numeric value that unambiguously identifies it {\small (for the programmer, this is the pointer to the polymer structure)}. Each time a window is displayed that contains data pertaining to any given polymer sequence (oligomers generated by cleavage of a given polymer sequence, for example), it is given a reference to the polymer whence the data came, and this reference is the polymer's identity number. 
This is clearly visible here: the polymer sequence has a given \guilabel{ID Number} and all the related windows display that same number. Note that the cleavage results data have another \guilabel{ID Number} that is later used to trace the mass find results data (bottom-most window).}
\label{fig:identities-follow-results}
\end{figure}

\renewcommand{\sectitle}{Feedback From \pxm\ To The User: The Console Window}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

Something very specific to the \OSname{UNIX} and \OSname{UNIX}-like systems (and that I really like) is the fact that the programs are usually designed to be ``verbose'' (if the user asks for it). The usual means of giving feedback in other systems is to pop up a ``dialog'' window in which a message is displayed, and the user has to acknowledge it (typically by clicking a button widget) in order to close the dialog window. \pxm\ has been implemented with the ``console'' philosophy in mind: every message that it wishes to ``hand out'' to the user is sent to the terminal window from which the program was started. There are two levels of very important messages: the \emph{CRITICAL} and the \emph{ERROR} level messages. The CRITICAL-level messages indicate that the time has come to make a quick save of all the data, because something bad might happen. ERROR-level messages cannot even be read in the console window, because they cause the program to abort. These aborts are deliberate on the \pxm' part, because the error is so severe that the program would crash sooner or later anyway.

Each time a message (of any importance level) is issued to the user, the console window is presented to the user (if it was hidden or minimized, this console window is shown afresh). Figure~\ref{fig:polyxmass-console-wnd} shows the console window with a warning.
\begin{figure}
\begin{center}
\includegraphics[scale=1.75]{figures/raster/polyxmass-console-wnd.png}
\end{center}
\caption[The console window]{\textbf{The console window where any messages to the user are displayed.} Depending on the importance level of the message being issued to the user, the color will be more or less ``reddish''.}
\label{fig:polyxmass-console-wnd}
\end{figure}

\renewcommand{\sectitle}{Window Management}
\section*{\sectitle}
\addcontentsline{toc}{section}{\numberline{}\sectitle}

\pxm\ is powerful and flexible: any number of polymer sequences can be opened at any given time, and any number of simulations might be performed on any of these polymer sequences. This might lead to a huge number of windows open on the desktop at any given time. There are two main types of windows:

\begin{itemize}
\item windows that do not display results. These windows are typically windows where the user is provided with options to perform some action. For example, one such window might be the window that allows the user to select an enzyme when an enzymatic cleavage is required on a polymer sequence;
\item windows that are responsible for displaying a polymer sequence, the results of some simulation or of any computation. For example, a window displaying results (\emph{results window}, for short) might be the window that displays all the oligomers obtained upon cleavage of a polymer sequence using one enzymatic agent; or the window where all the fragmentation oligomers are displayed after the gas-phase fragmentation of a polymer sequence. Other examples are the windows where the monomeric composition of the polymer sequence is displayed and where the pH/pKa/pI computation results are displayed\dots
\end{itemize}

In order to ease the management of all the results windows open at any given time, a window management facility was devised. Its incarnation is shown in Figure~\vref{fig:polyxmass-window-management}.
This window is called by using the main program's window menu
\centerline{\guimenu{View}\guimenuitem{Window List}\\}

\begin{figure}
\begin{center}
\includegraphics[scale=1.75]{figures/raster/polyxmass-window-management.png}
\end{center}
\caption[The window management facility]{\textbf{The window management facility.} Each time a polymer sequence window, or a window where results are displayed, is opened, it is registered and appears in the treeview on the left of the depicted window. The user can then select any window of interest and perform actions on this window.}
\label{fig:polyxmass-window-management}
\end{figure}

The window management operations include the following actions, which apply to the window item currently selected in the \guilabel{Available Windows} treeview on the left of the window:

\begin{itemize}
\item \guilabel{Show Window:} {\footnotesize force a hidden/minimized window to show itself;}
\item \guilabel{Hide Window:} {\footnotesize force a window to hide itself.}
\end{itemize}

\noindent As soon as a window that is listed in the \guilabel{Available Windows} treeview is closed, its corresponding item in the treeview is removed.

\cleardoublepage

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "polyxmass"
%%% End: