MPE (Multi-Processing Environment) ---------------------------------- Version 2.4.6. March, 2008 Mathematics and Computer Science Division Argonne National Laboratory I. INTRODUCTION ---------------- The Multi-Processing Environment (MPE) attempts to provide programmers with a complete suite of performance analysis tools for their MPI programs based on post processing approach. These tools include a set of profiling libraries, a set of utility programs, and a set of graphical tools. The first set of tools to be used with user MPI programs is profiling libraries which provide a collection of routines that create log files. These log files can be created manually by inserting MPE calls in the MPI program, or automatically by linking with the appropriate MPE libraries, or by combining the above two methods. Currently, the MPE offers the following 4 profiling libraries. 1) Tracing Library - Traces all MPI calls. Each MPI call is preceded by a line that contains the rank in MPI_COMM_WORLD of the calling process, and followed by another line indicating that the call has completed. Most send and receive routines also indicate the values of count, tag, and partner (destination for sends, source for receives). Output is to standard output. 2) Animation Libraries - A simple form of real-time program animation that requires X window routines. 3) Logging Libraries - The most useful and widely used profiling libraries in MPE. These libraries form the basis to generate log files from user MPI programs. There are several different log file formats available in MPE. The default log file format is CLOG2. It is a low overhead logging format, a simple collection of single timestamp events. The old format ALOG, which is not being developed for years, is not distributed here. The powerful visualization format is SLOG-2, stands for Scalable LOGfile format version II which is a total redesign of the original SLOG format. SLOG-2 allows for much improved scalability for visualization purpose. CLOG2 file can be easily converted to SLOG-2 file through the new SLOG-2 viewer, Jumpshot-4. The MPI logging library is now thread-safe through the use of a global mutex over the the MPE logging library which is not yet thread-safe. 4) Collective and datatype checking library - An argument consistency checking library for MPI collective calls. It checks for datatype, root, and various argument consistency in MPI collective calls. If an error is detected, a backtrace of the callstack (on the supported platform) will be printed to locate the offended call. The set of utility programs in MPE includes log format converter (e.g. clogTOslog2), logfile print (e.g. slog2print) and logfile viewer and convertor (e.g. jumpshot). These new tools, clog2TOslog2, slog2print and jumpshot(Jumpshot-4) replace old tools, clog2slog, slog_print and logviewer (i.e. Jumpshot-2 and Jumpshot-3). For more information of various logfile formats and their viewers, see http://www.mcs.anl.gov/perfvis II. CONFIGURATION ----------------- MPE can be configured and installed as an extension to most MPI standard -compliant MPI implementations, e.g. MPICH-2, MPICH, OpenMPI, LAM/MPI, SGI's MPI, HP-UX's MPI, IBM's MPI, Cray's MPI and NEC's MPI. MPE has been integrated into MPICH and MPE2 has been integrated seamlessly into MPICH-2, so MPEx will be installed automatically during MPICHx's installation process. For details of configuration of MPE, see the INSTALL or INSTALL.cross file. III. INSTALLATION INSTRUCTIONS ------------------------------- For details of installation instruction/examples of MPE, see the INSTALL file. IV. EXAMPLE PROGRAMS ---------------------- As previously noted, the MPE library is composed of 3 different profiling libraries. During configure, the compiler's library linkage flags and appropriate libraries are determined. These variables are first substituted in the Makefiles in the directories, mpe2/src/wrappers/test, mpe2/src/graphics/contrib/test and mpe2/src/collchk/test. The Makefiles for mpe2/src/wrappers/test and mpe2/src/graphics/contrib/test are then installed into directory share/ as examples_logging/ and examples_graphics/ during the final installation process. The following are some of the crucial variables: LOG_LIBS = library flag that links with the logging libraries TRACE_LIBS = library flag that links with the tracing library ANIM_LIBS = library flag that links with the animation library COLLCHK_LIBS = library flag that links with the collective and datatype checking library The variable FLIB_PATH is the compiler's library path needed to link fortran MPI programs with the logging library. During make, small test programs cpi.c, cpilog.c and fpilog.f will be linked with each of the above libraries. In the output from Make, a message will be printed to indicate the status of each attempted link test. The success of these linkage tests will also be included in the Make output. If the linkage tests are successful, then these library linkage flags can be used for your programs as well. The following example programs are also included in the distribution: mpe/src/graphics/contrib/mandel is a Mandelbrot program that uses the MPE graphics package. mpe/src/graphics/contrib/mastermind is a program for solving the Mastermind puzzle in parallel. These programs should work on all MPI implementations, but have not been extensively tested. V. MPEINSTALL -------------- A 'mpeinstall' script is created during configuration. If configuring with MPICH and MPICH2, then the 'mpiinstall' script will invoke the 'mpeinstall' script. However, 'mpeinstall' can also be used by itself. This is only optional and is of use only if you wish to install the MPE library in a public place so that others may use it. Final install directory will consist of an include, lib, bin, sbin and share subdirectories. Examples and various logfile viewers will be installed under share. VI. USAGE --------- The final install directory contains the following subdirectories. include/ contains all the include files that user program needs to read. lib/ contains all the libraries that user program needs to link with. bin/ contains all the utility programs that user needs to use. doc/ contains available MPE documentation, e.g. Jumpshot-4's userguide. sbin/ contains the MPE uninstall script to uninstall the installation. share/ contains user read-only data. Besides share/examples_logging/ and share/examples_graphics/, user usually does NOT need to know the details of other subdirectories. In terms of usage of MPE, user usually only need to know about the files that have been installed in include/, lib/ and bin/. VI. a) CUSTOMIZING LOGFILES --------------------------- In addition to using the predefined MPE logging libraries to log all MPI calls, MPE logging calls can be inserted into user's MPI program to define and log states. These states are called User-Defined states. States may be nested, allowing one to define a state describing a user routine that contains several MPI calls, and display both the user-defined state and the MPI operations contained within it. The simplest way to insert user-defined states is as follows: 1) Get handles from MPE logging library: MPE_Log_get_state_eventIDs() has to be used to get unique event IDs (MPE logging handles). This is important if you are writing a library that uses the MPE logging routines from the MPE system. PS. Hardwiring the eventIDs is considered a bad idea since it may cause eventID confict and so the practice isn't supported. Older MPE libraries provide MPE_Log_get_event_number() which is still being supported but has been deprecated. Users are strongly urged to use MPE_Log_get_state_eventIDs() instead. 2) Set the logged state's characteristics: MPE_Describe_state() sets the name and color of the states. 3) Log the events of the logged states: MPE_Log_event() are called twice to log the user-defined states. Below is a simple example that uses the 3 steps outlined above. \begin{verbatim} int eventID_begin, eventID_end; ... MPE_Log_get_state_eventIDs( &eventID_begin, &eventID_end ); ... MPE_Describe_state( eventID_begin, eventID_end, "Multiplication", "red" ); ... MyAmult( Matrix m, Vector v ) { /* Log the start event of the red "Multiplication" state */ MPE_Log_event( eventID_begin, 0, NULL ); ... ... Amult code, including MPI calls ... ... /* Log the end event of the red "Multiplication" state */ MPE_Log_event( eventID_end, 0, NULL ); } \end{verbatim} The logfile generated by this code will have the MPI routines nested within the routine MyAmult(). Besides user-defined states, MPE2 also provides support for user-defined events which can be defined through use of MPE_Log_get_solo_eventID() and MPE_Describe_event. For more details, e.g. see cpilog.c. If the MPE logging library, liblmpe.a, is NOT linked with the user program, MPE_Init_log() and MPE_Finish_log() need to be used before and after all the MPE calls. Sample programs cpilog.c and fpilog.f are available to illustrate the use of these MPE routines. They are in the MPE source directory, mpe2/src/wrappers/test or the installed directory, share/examples_logging to illustrate the use of these MPE routines. For futher linking information, see section "Convenient Compiler Wrappers". For undefined user-defined state, i.e. corresponding MPE_Describe_state() has not been issued, new jumpshot (Jumpshot-4) may display the legend name as "UnknownType-INDEX" where INDEX is the internal MPE category index. VI. b) ENVIRONMENTAL VARIABLES ------------------------------ For MPE logging, MPE_TMPDIR and MPE_LOGFILE_PREFIX are 2 environment variables that most users find to be very useful. So it is recommended to set these 2 env. variables before launching the MPI program during logging : CLOG_BLOCK_SIZE: The integer value determines the clog2 buffer block size which set the least minimum clog2 file size. If CLOG_BLOCK_SIZE is not set, 64K per block is assumed. CLOG_BUFFERED_BLOCKS: The integer value determines the number of blocks witin the CLOG2's internal buffer. Together with CLOG_BLOCK_SIZE, CLOG_BUFFERED_BLOCKS determines how often the internal buffer is flushed to the disk. The total buffer size is determined by the product of CLOG_BLOCK_SIZE and CLOG_BUFFERED_BLOCKS. These 2 environmental variables allows user to minimize MPE2 logging overhead when large local memory is available. The default value is 128. MPE_TMPDIR: MPE_TMPDIR takes precedence over TMPDIR. It specifies a directory to be used as temporary storage for each process. By default, when MPE_TMPDIR and TMPDIR are NOT set, /tmp will be used. When user needs to generate a very large logfile for long-running MPI job, user needs to make sure that MPE_TMPDIR(or TMPDIR) is big enough to hold the temporary local logfile which will be deleted if the merged logfile can be created successfully. In order to minimize the overhead of the logging to the MPI program, it is highly recommended user to use a *local* file system for TMPDIR. Note : The final merged logfile will be written back to the file system where process 0 is. MPE_SAME_TMPDIR: The boolean value determines whether MPE_TMPDIR will be the same across the whole MPI_COMM_WORLD. By default, MPE_SAME_TMPDIR is set to true and only rank 0's MPE_TMPDIR will be broadcasted to every process. There is scalability implication on whether MPE_SAME_TMPDIR is true. When MPE_SAME_TMPDIR=false, MPE_TMPDIR could be set differently on different processes and hence mkstemp() will be called once on each process to create temporary filename. Calling mkstemp() on the same filesystem, e.g. $HOME, for each process is not scalable, i.e. the shared filesystem may be hanging. MPE_SAME_TMPDIR=false is provided for cases that MPE_TMPDIR has to be set differently on different process but it is in general not scalable when MPI_COMM_WORLD's size >> 1K. MPE_DELETE_LOCALFILE: The boolean value determines whether to delete the temporary local clog2 file. When this flag is set to true, user needs to collect from the temporary clog2 files from each slave node's MPE_TMPDIR. Then separate serial programs, clog2_join and clog2_repair, can be used to merge the local clog2 files. This process is useful e.g. when MPI_Finalize() fails to complete properly, e.g. due to user program overwritten to MPE/MPI internal data structures. MPE_CLOCKS_SYNC: The boolean value determines the behavior of MPE_Log_sync_clocks() and the default clock synchronization at the end of logging. Users may way to force MPE clock synchronization when the MPI implementation has buggy clock synchronization mechanism, e.g. Some versions of BG/L MPI's MPI_WTIME_IS_GLOBAL is incorrectly set to true when 64-ways or 256-ways partition is used. MPE_SYNC_ALGORITHM: specifies the clock synchronization algorithm. The accepted values are "DEFAULT", "SEQ", "BITREE" and "ALTNGBR". SEQ: a O(N) steps algorithm and is non-scalable and slowest but is also the most accurate. BITREE: a O(log2(N)) steps algorithm, scalable and much faster than SEQ but less accurate than SEQ. A good compromise. ALTNGBR: a O(1) steps algorithm, perfectly parallel is the fastest of 3 algorithms supported. It is also the least accurate. DEFAULT: uses SEQ when the number of processes <= 16. uses BITREE when number of processes > 16. MPE_SYNC_FREQUENCY: specifies the number of iterations of selected clock synchronization. In general, the higher of MPI_SYNC_FREQUENCY, the higher the probability of obtaining a accurate measurement of all the clocks, i.e. less error. Keep in mind, this is generally not a guarantee and is highly dependent of the system noise. The default is 3. MPE_LOGFILE_PREFIX: specifies the filename prefix of the output logfile. The file extension will be determined by the output logfile format, i.e. MPE_LOG_FORMAT. MPE_LOG_FORMAT: determines the format of the logfile generated from the execution of application linked with MPE logging libraries. The allowed value for MPE_LOG_FORMAT is CLOG2 only. So there is no need to use this variable at the moment. MPE_LOG_OVERHEAD: The boolean value determines to log MPE/CLOG2's internal profiling state CLOG_Buffer_write2disk(). The default setting is yes. CLOG_Buffer_write2disk labels region in each process that MPE/CLOG2 spends on flushing logging data in the memory to the disk. The frequency and location of CLOG_Buffer_write2disk state can be altered by changing CLOG_BLOCK_SIZE and/or CLOG_BUFFERED_BLOCKS. MPE_LOG_RANK2PROCNAME: The boolean value determines if a .clog2.pnm file that contains the MPI_COMM_WORLD rank to processor name determined by MPI_Get_processor_name(). The default is No. MPE_WRAPPERS_ADD_LDFLAGS: The variable tells MPE wrappers, mpecc/mpefc (includes mpich2's mpicc and friends), to use LDFLAGS added by MPE, e.g. -Wl,--export-dynamic. The default is yes. User can override it by setting it to "no" or "false". MPE_USE_FCONSTS_IN_MPIH : The boolean value determines if MPI_F_* constants from mpi.h will be used. By default, MPE computes the MPI_F_* constants instead of reading from mpi.h. The affected constants are MPI_F_STATUS(ES)_IGNORE. Possible boolean values are "true", "false", "yes" and "no" in either all lower or upper cases. For MPE X11 graphics, environment variables DISPLAY set in each process is read during MPE_Open_graphics. DISPLAY: determines where MPE X11 graphics on each process is connected to. VI. c) EXAMPLE MAKEFILE ----------------------- The install directories, share/examples_logging, share/example_graphics and share/example_collchk contain some very useful and simple example programs. The Makefiles in these directories illustrate the usage of MPE routines and how to link with various MPE libraries. In most cases, users can simply copy the share/examples_logging/Makefile to their home directory, and do a "make" to compile the suggested targets. Users don't need to copy the .c and .f files when MPE has been compiled with a MAKE that has VPATH support. The created executables can be launched with mpiexec or mpirun from the MPI implementation to generate sample logfiles. VI. d) UTILITY PROGRAMS ----------------------- In bin/, user can find several useful utility programs when manipulating logfiles. These includes log format converters, log format print programs, and logfile display program, Log Format Converters --------------------- clog2TOslog2 : a CLOG2 to SLOG-2 logfile convertor. For more details, do "clog2TOslog2 -h". rlogTOslog2 : a RLOG to SLOG-2 logfile convertor. For more details, do "rlogTOslog2 -h". Where RLOG is an internal MPICH2 logging format. logconvertor : a standalone GUI based convertor that invokes clog2TOslog2 or rlogTOslog2 based on logfile extension. The GUI also shows the progress of the conversion. The same convertor can be invoked from within the logfile viewer, jumpshot. slog2filter : a SLOG-2 to SLOG-2 logfile convertor. It allows for removal unwanted categories (when used with slog2print -c). It also allows for changing of the SLOG-2 internal structure, e.g. modify the duration of preview drawable. The tool reads and writes SLOG-2 file of same version. slog2updater: a SLOG-2 file format update utility. It is essentially a slog2filter that reads in older SLOG-2 file and writes out the latest SLOG-2 file format. Log Format Print Programs ------------------------- clog2_print : a stdout print program for CLOG file. Java version is named as clogprint. clog2_join : a clog2 serial merging program that merges clog2 files 1) temporary local clog2 files which all are from the same MPI_COMM_WORLD. 2) merged clog2 files from each MPI_COMM_WORLDs (Incomplete!, timestamps are not sync'ed yet.) clog2_repair : a clog2 repair program that tries to fix the missing data of a clog2 file (when the MPI program that is being profiled aborts) so that the file can be processed by other tools like clog2TOslog2. rlog_print : a stdout print program for SLOG-2 file. slog2print : a stdout print program for SLOG-2 file. Log File Display Program ------------------------ jumpshot : the Jumpshot-4 launcher script. Jumpshot-4 does logfile conversion as well as visualization. To view a logfile, say fpilog.slog2, do jumpshot fpilog.slog2 The command will select and invoke Jumpshot-4 to display the content of SLOG-2 file if Jumpshot-4 has been built and installed successfully. One can also do jumpshot fpilog.clog2 or jumpshot barrier.rlog Both will invoke the logfile convertor first before visualization. Collective and Datatype Checking -------------------------------- Linking an MPI application with the collective and datatype checking library as follows mpicc -o mpi_pgm *.o -L<mpe2_libdir> -lmpe_collchk. Or using compiler wrappers (more details in next section), e.g. (with mpich2's compiler wrapper) mpicc -mpe=mpicheck -o mpi_pgm *.o (with MPE's compiler wrapper) mpecc -mpe=mpicheck -o mpi_pgm *.o Convenient Compiler Wrappers ---------------------------- Standalone MPE installation with non-MPICH2 will see 2 convenient compiler wrappers mpecc and mpefc which mimic the typical usage of mpicc/mpif77/mpif90 in MPICH* by providing convenient compilation and linking switches. mpecc is for C program, and mpefc is for Fortran program. Typically, user can use mpecc as follows: mpecc -mpilog -o cpilog cpilog.c which is equivalent to doing mpicc -o cpilog cpilog.c -L<mpe2_libdir> -llmpe -lmpe Available MPE profiling options for "mpecc" and "mpefc" are as follows: -mpilog : Automatic MPI and MPE user-defined states logging. This links against -llmpe -lmpe. -mpitrace : Trace MPI program with printf. This links against -ltmpe. -mpianim : Animate MPI program in real-time. This links against -lampe -lmpe. -mpicheck : Check MPI Program with the Collective & Datatype Checking library. This links against -lmpe_collchk. -graphics : Use MPE graphics routines with X11 library. This links against -lmpe <X11 libraries>. -log : MPE user-defined states logging. This links against -lmpe. -nolog : Nullify MPE user-defined states logging. This links against -lmpe_null. -help : Print this help page. MPE has been seamlessly integrated into MPICH2 and MPICH distributions, In MPICH2, all the convenient compilation and linking switches described above are provided through -mpe= option in mpicc/mpicxx/mpif77/mpif90. For instance, to compile and link cpilog with automatic MPI logging library can be done as follows mpicc -mpe=mpilog -o cpilog cpilog.c which is equivalent to the mpecc command mpecc -mpilog -o cpilog cpilog.c In MPICH(the old one), the compiler wrappers, mpicc/mpiCC/mpif77/mpif90 do not provide all the convenient switches listed above, only 3 of them are available. These options are : -mpitrace - to compile and link with tracing library. -mpianim - to compile and link with animation libraries. -mpilog - to compile and link with logging libraries. For instance, the following command creates executable, {\tt fpilog}, which generates logfile when it is executed. mpif77 -mpilog -o fpilog fpilog.f VII. Using MPE in MPICHx ------------------------ VII. a) Inheritance of Environmental Variables ---------------------------------------------- MPE relies on certain environmental variables (e.g. MPE_TMPDIR). These variables determine how MPE behaves. It is important to make sure that all the MPI processes receive the intended value of environmental variables. The complication of this issue comes from the fact that different MPI implementations have different ways of passing environmental varaiable. For instance, MPICH contains many different devices for different platforms, some of these devices have their unique way of passing of environmental variables to other processes. The often used devices, like ch_p4 and ch_shmem, do not require special attention to pass the value of the environmental variable to spawned processes. The spawned process inherits the value from the launching process when the environmental variable in the launching process has been set. But this is NOT true for all the devices, for instance, the ch_p4mpd device requires special option of mpirun to set environmental variables to all processes. mpirun -np N fpilog -MPDENV- MPE_LOGFILE_PREFIX=fpilog In this example, the option -MPDENV- is needed to make sure that all processes have their environmental variable, MPE_LOGFILE_PREFIX, set to the desirable output logfile prefix. In MPICH2, when using MPD as a process manage, passing MPE_LOGFILE_PREFIX and MPE_TMPDIR can be done as follows: mpiexec -env MPE_LOGFILE_PREFIX <output-logname-prefix> \ -env MPE_TMPDIR <local-tmp-dir> -n 32 <executable-name> Also, with MPE X11 graphics library, the local DISPLAY variable set on each process is read (so ssh tunnelling can be used), for examples assume an mpd ring has been set up by on 2 machines, schwinn and triumph, as follows: cat > mpd.hosts << EOF schwinn triumph EOF mpdboot -n 2 -f mpd.hosts Now launch MPE X11 graphics sample code, cxgraphics, as follows: mpiexec -host schwinn -env DISPLAY <display_0> cxgraphics : \ -host triumph -env DISPLAY <display_1> cxgraphics Where <display_0> and <display_1> are the local DISPLAY variable echoed from consoles connected to schwinn and triumph respectively. For other MPI implementations, how environmental variables are passed remains unchanged. User needs to get familar with the runtime environment and set the environmental variables appropriately. VII. b) Viewing Logfiles ------------------------ MPE's install directory structure is the same as MPICH's and MPICH-2's. So all MPE's utility programs will be located in the bin/ directory of MPICH and MPICH-2.