The LANL-Trace tracing mechanism is designed to integrate with MPI and capture parallel app I/O using either strace or ltrace.
Installing and running LANL-Trace is straightforward, and the code should work with only minor modification. Everything is driven by scripts/mpirun, and you can use modules/mpitrace to put the scripts in your path. Once scripts/mpirun is in your path, run your normal mpirun [whatever] command and trace output should be produced.
However, the trace code attempts to build an executable that measures time skew and time drift before and after your MPI program runs. This build sometimes fails. If it does, find the command that failed and run it by hand. The next time you run mpirun, it will use the executable you just built instead of rebuilding it.
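To make the terms skew and drift concrete, here is a minimal Python sketch of how the two quantities could be derived from clock offsets sampled before and after a run. It is not LANL-Trace's implementation: the function names `skew_and_drift` and `offset_at`, the sample values, and the assumption that each node's offset against a reference clock is available in seconds are all hypothetical and chosen for illustration only.

```python
def skew_and_drift(offset_before, offset_after, elapsed):
    """Return (skew, drift_rate) for one node.

    offset_before, offset_after: node clock minus reference clock, in seconds,
        sampled before and after the MPI program ran (hypothetical inputs).
    elapsed: run time in seconds as measured on the reference clock.
    """
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    # Skew: how far apart the two clocks were when the run started.
    skew = offset_before
    # Drift rate: how fast that gap grew, in seconds per second of run time.
    drift_rate = (offset_after - offset_before) / elapsed
    return skew, drift_rate


def offset_at(skew, drift_rate, seconds_since_start):
    """Estimate the node's clock offset partway through the run.

    Assumes the drift is linear between the two samples.
    """
    return skew + drift_rate * seconds_since_start


if __name__ == "__main__":
    # Made-up sample: the node was 0.5 s ahead before a 100 s run, 0.75 s after.
    skew, drift_rate = skew_and_drift(0.5, 0.75, 100.0)
    print(skew, drift_rate, offset_at(skew, drift_rate, 40.0))
```

Under the linear-drift assumption, subtracting `offset_at(...)` from a node-local timestamp gives an approximate reference-clock time, which is what makes events from different nodes comparable.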
You will also need to change the paths in scripts/mpirun and modules/mpitrace. The other files in the directory can be ignored; they are left over from abandoned attempts. This mechanism should work with Open MPI. For MPICH it is essentially the same, except that you will want to use the standard MPICH dbg=foo method and copy the mpirun_dbg.ltrace file to your MPICH bin.
When the trace mechanism runs, it produces a number of files. A timing file attempts to capture the drift and skew of the distributed clocks. A dirinfo file records the free space of the storage system before and after the trace and attempts to query various other aspects of the storage system. A SUMMARY file shows the command that was run along with its arguments. In addition, three files are produced for each process.
The machinename.pid.trace file contains the raw trace data, the machinename.pid.out file contains the standard output of that process, and the machinename.pid.summary file contains some simple summary and profiling information. Some of this information is also available in the raw trace data, but some is not. Because the tool is intended for I/O only, it attempts to construct a filtering regex that captures only I/O calls. The summary file therefore also lists system calls that were seen by the underlying strace mechanism but not logged by LANL-Trace. If you are interested in any of these, edit the filter regex accordingly.
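The filtering idea can be sketched in Python as follows. This is an illustration only, not the regex that LANL-Trace actually builds: the list of I/O calls in `IO_CALLS`, the helper names `build_io_filter` and `split_trace`, and the sample lines are all assumptions, and the line parser only handles the common strace-style `name(args) = result` form.

```python
import re
from collections import Counter

# Hypothetical set of I/O system calls; the real filter may differ.
IO_CALLS = ("open", "openat", "close", "read", "write", "lseek", "fsync")

# strace-style line: optional prefix tokens (pid, timestamp), then "name(".
CALL_NAME = re.compile(r"^(?:[^\s(]+\s+)*([A-Za-z_]\w*)\(")


def build_io_filter(calls=IO_CALLS):
    """Build a regex that matches exactly the given system call names."""
    names = "|".join(re.escape(name) for name in calls)
    return re.compile(r"^(?:" + names + r")$")


def split_trace(lines, io_filter):
    """Return (captured_lines, skipped_counts) for strace-style lines.

    captured_lines: lines whose call name matches the filter.
    skipped_counts: how often each filtered-out call appeared, similar in
        spirit to the uncaptured calls listed in the per-process summary.
    """
    captured = []
    skipped = Counter()
    for line in lines:
        match = CALL_NAME.match(line)
        if match is None:
            continue  # not a system call line, e.g. an exit notice
        name = match.group(1)
        if io_filter.match(name):
            captured.append(line)
        else:
            skipped[name] += 1
    return captured, skipped


if __name__ == "__main__":
    sample = [
        'open("/tmp/data", O_RDONLY) = 3',
        'read(3, "abc", 3) = 3',
        'mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0000000000',
        'getpid() = 4242',
        'close(3) = 0',
        '+++ exited with 0 +++',
    ]
    captured, skipped = split_trace(sample, build_io_filter())
    print(len(captured), dict(skipped))
```

Widening the filter then amounts to adding a name to the call list, for example `build_io_filter(IO_CALLS + ("mmap",))`, which mirrors the advice to edit the filter regex when a skipped call turns out to be of interest.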
In conclusion, LANL-Trace provides insight into the I/O behavior of parallel applications. Its straightforward installation, together with its detailed trace output and per-process summaries, makes it useful to developers and researchers looking to optimize their parallel applications.
Version 1.0.0: N/A