FAQ: General run-time tuning

Table of contents:

  1. What is the Modular Component Architecture (MCA)?
  2. What are MCA parameters?
  3. What frameworks are in Open MPI?
  4. What frameworks are in Open MPI v1.2 (and prior)?
  5. What frameworks are in Open MPI v1.3?
  6. How do I know what components are in my Open MPI installation?
  7. How do I install my own components into an Open MPI installation?
  8. How do I know what MCA parameters are available?
  9. How do I set the value of MCA parameters?
  10. What are Aggregate MCA (AMCA) parameter files?
  11. How do I select which components are used?
  12. What is processor affinity? Does Open MPI support it?
  13. What is memory affinity? Does Open MPI support it?
  14. How do I tell Open MPI to use processor and/or memory affinity?
  15. Does Open MPI support calling fork() or system() in MPI processes?
  16. I want to run some performance benchmarks with Open MPI. How do I do that?


1. What is the Modular Component Architecture (MCA)?

The Modular Component Architecture (MCA) is the backbone for much of Open MPI's functionality. It is a series of frameworks, components, and modules that are assembled at run-time to create an MPI implementation.

Frameworks: An MCA framework manages zero or more components at run time and is targeted at a specific task (e.g., provide MPI collective operation functionality). Each MCA framework supports a single component type, but may support multiple versions of that type. The framework uses the services from the MCA base functionality to find and/or load components.

Components: An MCA component is an implementation of a framework's interface. It is a standalone collection of code that can be bundled into a plugin that can be inserted into the Open MPI code base, either at run-time and/or compile-time.

Modules: An MCA module is an instance of a component (in the C++ sense of the word "instance"; an MCA component is analogous to a C++ class). For example, if a node running an Open MPI application has multiple ethernet NICs, the Open MPI application will contain one TCP MPI point-to-point component, but two TCP point-to-point modules.

Frameworks, components, and modules can be dynamic or static. That is, they can be available as plugins or they may be compiled statically into libraries (e.g., libmpi).


2. What are MCA parameters?

MCA parameters are the basic unit of run-time tuning for Open MPI. They are simple "key = value" pairs that are used extensively throughout the code base. The general rules of thumb that the developers use are:

  • Instead of using a constant for an important value, make it an MCA parameter
  • If a task can be implemented in multiple, user-discernible ways, implement as many as possible and make choosing between them be an MCA parameter

For example, an easy MCA parameter to describe is the boundary between short and long messages in TCP wire-line transmissions. "Short" messages are sent eagerly whereas "long" messages use a rendezvous protocol. The decision point between these two protocols is the overall size of the message (in bytes). By making this value an MCA parameter, it can be changed at run-time by the user or system administrator to use a sensible value for a particular environment or set of hardware (e.g., a value suitable for 100 Mbps Ethernet is probably not suitable for Gigabit Ethernet, and may require a different value for 10 Gigabit Ethernet).
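As a rough illustration of the decision point described above, the following Python sketch picks a protocol from a message size and a tunable limit. The function name `choose_protocol`, the argument `eager_limit`, and the inclusive comparison at the boundary are assumptions made for illustration; they are not taken from Open MPI's code.

```python
def choose_protocol(message_size, eager_limit):
    """Return "eager" for short messages and "rendezvous" for long ones.

    eager_limit stands in for a run-time tunable value (hypothetical name);
    treating the boundary as inclusive is an assumption of this sketch.
    """
    if message_size < 0 or eager_limit < 0:
        raise ValueError("sizes must be non-negative")
    return "eager" if message_size <= eager_limit else "rendezvous"


# The same 64 KiB message falls on different sides of the boundary
# depending on how the limit has been tuned.
for limit in (32 * 1024, 128 * 1024):
    print(limit, choose_protocol(64 * 1024, limit))
```

Because the limit is an argument rather than a constant, the same code can behave sensibly on different networks, which is the point of exposing such values as MCA parameters.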

Note that MCA parameters may be set in several different ways (described in another FAQ entry). This allows, for example, system administrators to fine-tune the Open MPI installation for their hardware / environment such that normal users can simply use the default values.

More specifically, HPC environments -- and the applications that run on them -- tend to be unique. Providing extensive run-time tuning capabilities through MCA parameters allows the customization of Open MPI to each system's / user's / application's particular needs.


3. What frameworks are in Open MPI?

There are three types of frameworks in Open MPI: those in the MPI layer (OMPI), those in the run-time layer (ORTE), and those in the operating system / platform layer (OPAL).

The specific list of frameworks varies between each major release series of Open MPI. See the FAQ entries on the frameworks in Open MPI v1.2 (and prior) and in Open MPI v1.3 for the lists for those versions.


4. What frameworks are in Open MPI v1.2 (and prior)?

The comprehensive list of frameworks in Open MPI is continually being augmented. As of August 2005, here is the current list:

OMPI frameworks

  • allocator: Memory allocator
  • bml: BTL management layer (managing multiple devices)
  • btl: Byte transfer layer (point-to-point byte movement)
  • coll: MPI collective algorithms
  • io: MPI-2 I/O functionality
  • mpool: Memory pool management
  • pml: Point-to-point management layer (fragmenting, reassembly, top-layer protocols, etc.)
  • osc: MPI-2 one-sided communication
  • ptl: (outdated / deprecated) MPI point-to-point transport layer
  • rcache: Memory registration management
  • topo: MPI topology information

ORTE frameworks

  • errmgr: Error manager
  • gpr: General purpose registry
  • iof: I/O forwarding
  • ns: Name server
  • oob: Out-of-band communication
  • pls: Process launch subsystem
  • ras: Resource allocation subsystem
  • rds: Resource discovery subsystem
  • rmaps: Resource mapping subsystem
  • rmgr: Resource manager (upper meta layer for all other Resource frameworks)
  • rml: Remote messaging layer (routing of OOB messages)
  • schema: Name schemas
  • sds: Startup discovery services
  • soh: State of health

OPAL frameworks

  • maffinity: Memory affinity
  • memory: Memory hooks
  • paffinity: Processor affinity
  • timer: High-resolution timers


5. What frameworks are in Open MPI v1.3?

The comprehensive list of frameworks in Open MPI is continually being augmented. As of November 2008, here is the current list in the Open MPI v1.3 series:

OMPI frameworks

  • allocator: Memory allocator
  • bml: BTL management layer
  • btl: MPI point-to-point Byte Transfer Layer, used for MPI point-to-point messages on some types of networks
  • coll: MPI collective algorithms
  • crcp: Checkpoint/restart coordination protocol
  • dpm: MPI-2 dynamic process management
  • io: MPI-2 I/O
  • mpool: Memory pooling
  • mtl: Matching transport layer, used for MPI point-to-point messages
  • osc: MPI-2 one-sided communications
  • pml: MPI point-to-point management layer
  • pubsub: MPI-2 publish/subscribe management
  • rcache: Memory registration cache
  • topo: MPI topology routines

ORTE frameworks

  • errmgr: RTE error manager
  • ess: RTE environment-specific services
  • filem: Remote file management
  • grpcomm: RTE group communications
  • iof: I/O forwarding
  • odls: OpenRTE daemon local launch subsystem
  • oob: Out of band messaging
  • plm: Process lifecycle management
  • ras: Resource allocation system
  • rmaps: Resource mapping system
  • rml: RTE message layer
  • routed: Routing table for the RML
  • snapc: Snapshot coordination

OPAL frameworks

  • backtrace: Debugging call stack backtrace support
  • carto: Cartography (host/network mapping) support
  • crs: Checkpoint and restart service
  • installdirs: Installation directory relocation services
  • maffinity: Memory affinity
  • memchecker: Run-time memory checking
  • memcpy: Memory copy support
  • memory: Memory management hooks
  • paffinity: Processor affinity
  • timer: High-resolution timers


6. How do I know what components are in my Open MPI installation?

The ompi_info command, in addition to providing a wealth of configuration information about your Open MPI installation, will list all components (and the frameworks that they belong to) that are available. These include system-provided components as well as user-provided components.


7. How do I install my own components into an Open MPI installation?

By default, Open MPI looks in two places for components at run-time (in order):

  1. $prefix/lib/openmpi/: This is the system-provided components directory, part of the installation tree of Open MPI itself.
  2. $HOME/.openmpi/components/: This is where users can drop their own components that will automatically be "seen" by Open MPI at run-time. This is ideal for developmental, private, or otherwise unstable components.

Note that the directories and search ordering used for finding components in Open MPI are, themselves, controlled by an MCA parameter. Setting mca_component_path (a colon-delimited list of directories) changes this value.

Note also that components are only used on nodes where they are "visible." Hence, if your $prefix/lib/openmpi/ is a directory on a local disk that is not shared via a network filesystem to other nodes where you run MPI jobs, then components that are installed to that directory will only be used by MPI jobs running on the local node.

More specifically: components have the same visibility as normal files. If you need a component to be available to all nodes where you run MPI jobs, then you need to ensure that it is visible on all nodes (typically either by installing it on all nodes for non-networked filesystem installs, or by installing it in a directory that is visible to all nodes via a networked filesystem). Open MPI does not automatically send components to remote nodes when MPI jobs are run.


8. How do I know what MCA parameters are available?

The ompi_info command can list the parameters for a given component, all the parameters for a specific framework, or all parameters. Most parameters contain a description of the parameter; all will show the parameter's current value.

For example:

shell$ ompi_info --param all all

Shows all the MCA parameters for all components that ompi_info finds, whereas:

shell$ ompi_info --param btl all

Shows all the MCA parameters for all BTL components that ompi_info finds. Finally:

shell$ ompi_info --param btl tcp

Shows all the MCA parameters for the TCP BTL component.


9. How do I set the value of MCA parameters?

There are four main ways to set MCA parameters, which are searched in the following order.

  1. Command line: The highest-precedence method is setting MCA parameters on the command line. For example:

    shell$ mpirun --mca mpi_show_handle_leaks 1 -np 4 a.out
    

    This sets the MCA parameter mpi_show_handle_leaks to the value of 1 before running a.out with four processes. In general, the format used on the command line is "--mca <param_name> <value>".

    Note that when setting multi-word values, you need to use quotes to ensure that the shell and Open MPI understand that they are a single value. For example:

    shell$ mpirun --mca param "value with multiple words" ...
    

  2. Environment variable: Next, environment variables are searched. Any environment variable named OMPI_MCA_<param_name> will be used. For example, the following has the same effect as the previous example (for sh-flavored shells):

    shell$ OMPI_MCA_mpi_show_handle_leaks=1
    shell$ export OMPI_MCA_mpi_show_handle_leaks
    shell$ mpirun -np 4 a.out
    

    Or, for csh-flavored shells:

    shell% setenv OMPI_MCA_mpi_show_handle_leaks 1
    shell% mpirun -np 4 a.out
    

    Note that setting environment variables to values with multiple words requires quoting, such as:

    # sh-flavored shells
    shell$ OMPI_MCA_param="value with multiple words"
    
    # csh-flavored shells
    shell% setenv OMPI_MCA_param "value with multiple words"
    

  3. Aggregate MCA parameter files: Simple text files can be used to set MCA parameter values for a specific application. See the FAQ entry on Aggregate MCA (AMCA) parameter files (Open MPI version 1.3 and higher).
  4. Files: Finally, simple text files can be used to set MCA parameter values. Parameters are set one per line (comments are permitted). For example:

    # This is a comment
    # Set the same MCA parameter as in previous examples
    mpi_show_handle_leaks = 1
    

    Note that quotes are not necessary for setting multi-word values in MCA parameter files. Indeed, if you use quotes in the MCA parameter file, they will be used as part of the value itself. For example:

    # The following two values are different:
    param1 = value with multiple words
    param2 = "value with multiple words"
    

    By default, two files are searched (in order):

    1. $HOME/.openmpi/mca-params.conf: The user-supplied set of values takes the highest precedence.
    2. $prefix/etc/openmpi-mca-params.conf: The system-supplied set of values has a lower precedence.

    More specifically, the MCA parameter mca_param_files specifies a colon-delimited path of files to search for MCA parameters. Files to the left have lower precedence; files to the right are higher precedence.

    Keep in mind that, just like components, these parameter files are only relevant where they are "visible" (see the FAQ entry on installing your own components). Specifically, Open MPI does not read all the values from these files during startup and then send them to all nodes in the job -- the files are read on each node during each process's startup. This is intended behavior: it allows for per-node customization, which is especially relevant in heterogeneous environments.
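The file format described above (one parameter per line, comments permitted, quotes kept as part of the value) can be sketched in Python as follows. This is only an illustration of the rules stated in this entry, not Open MPI's actual parser; the function name `parse_mca_params` is hypothetical, and handling only whole-line comments is an assumption.

```python
def parse_mca_params(text):
    """Parse "name = value" lines into a dict (illustrative only).

    Assumptions of this sketch: comments occupy whole lines starting
    with "#", blank lines are skipped, and a later line overrides an
    earlier one for the same name.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'name = value', got: {raw!r}")
        # Quotes are deliberately left untouched: they are part of the value.
        params[name.strip()] = value.strip()
    return params


example = """\
# This is a comment
mpi_show_handle_leaks = 1
param1 = value with multiple words
param2 = "value with multiple words"
"""
parsed = parse_mca_params(example)
for name, value in parsed.items():
    print(name, "->", repr(value))
```

Printing the values with `repr` makes the difference between param1 and param2 visible: the second one carries the quote characters.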


10. What are Aggregate MCA (AMCA) parameter files?

Starting with version 1.3, aggregate MCA (AMCA) parameter files contain MCA parameter key/value pairs similar to the $HOME/.openmpi/mca-params.conf file described in the FAQ entry on setting the value of MCA parameters.

The motivation behind AMCA parameter sets came from the realization that for certain applications a large number of MCA parameters are required for the application to run well and/or as the user expects. Since these MCA parameters are application specific (or even application run specific) they should not be set in a global manner, but only pulled in as determined by the user.

MCA parameters set in AMCA parameter files will override any MCA parameters supplied in global parameter files (e.g., $HOME/.openmpi/mca-params.conf), but not command line or environment parameters.
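The resulting precedence (command line, then environment, then AMCA files, then global parameter files) can be modeled with a small Python sketch. Each source is represented by a plain dict, and the function name `resolve_mca_param` is hypothetical; this shows the ordering described in this FAQ, not how Open MPI implements it.

```python
def resolve_mca_param(name, cmdline=None, environ=None, amca=None, files=None):
    """Return (value, source) from the highest-precedence source that sets name.

    Each argument is a plain dict standing in for one source; environ uses
    the OMPI_MCA_<param_name> naming described in this FAQ.
    """
    sources = [
        ("command line", cmdline or {}, name),
        ("environment", environ or {}, "OMPI_MCA_" + name),
        ("AMCA file", amca or {}, name),
        ("parameter file", files or {}, name),
    ]
    for label, mapping, key in sources:
        if key in mapping:
            return mapping[key], label
    return None, None


value, source = resolve_mca_param(
    "mpi_leave_pinned",
    environ={"OMPI_MCA_mpi_leave_pinned": "0"},
    amca={"mpi_leave_pinned": "1"},
    files={"mpi_leave_pinned": "1"},
)
print(value, source)
```

Here the environment value is chosen even though both an AMCA file and a global file set the same parameter.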

AMCA parameter files are typically supplied on the command line via the -am option.

For example, consider an AMCA parameter file called foo.conf placed in the same directory as the application a.out. A user will typically run the application as:

shell$ mpirun -np 2 a.out

To use the foo.conf AMCA parameter file this command line changes to:

shell$ mpirun -np 2 -am foo.conf a.out

If the user wants to override a parameter set in foo.conf they can add it to the command line as seen below.

shell$ mpirun -np 2 -am foo.conf -mca btl tcp,self a.out

AMCA parameter files can be coupled if more than one file is to be used. If we have another AMCA parameter file called bar.conf that we want to use we add it to the command line as follows:

shell$ mpirun -np 2 -am foo.conf:bar.conf a.out

AMCA parameter files are loaded in priority order: files listed earlier take priority over files listed later. In this example, the foo.conf AMCA file has priority over the bar.conf file. So if the bar.conf file sets mpi_leave_pinned=0 and the foo.conf file sets mpi_leave_pinned=1, then the foo.conf value (mpi_leave_pinned=1) will be used.

The location of an AMCA parameter file is resolved in a way similar to how the shell resolves commands. If no path operator is provided (e.g., foo.conf), then Open MPI will search the $SYSCONFDIR/amca-param-sets directory and then the current working directory. If a relative path is specified (e.g., ./foo.conf or baz/foo.conf), then only that path will be searched. If an absolute path is specified (e.g., /bip/boop/foo.conf), then only that path will be searched.

Though the typical use case for AMCA parameter files is to be specified on the command line, they can also be set as MCA parameters in the environment. The MCA parameter (mca_base_param_file_prefix) contains a ':' separated list of AMCA parameter files exactly as they would be passed to the -am command line option. The MCA parameter (mca_base_param_file_path) specifies the path to search for AMCA files with relative paths. By default this is $SYSCONFDIR/amca-param-sets/:$CWD.
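The lookup rules for AMCA file names can be sketched as a pure Python function. The names `amca_candidates` and `split_am_option` are hypothetical, the directories in the example are made up, and resolving relative paths against the current working directory is an assumption of this sketch.

```python
import posixpath


def amca_candidates(name, sysconfdir, cwd):
    """List the paths that would be tried for one AMCA file name.

    sysconfdir and cwd are passed in explicitly so the sketch stays a
    pure function; POSIX-style paths are assumed.
    """
    if posixpath.isabs(name):
        # Absolute path: only that path is searched.
        return [name]
    if "/" in name:
        # Relative path with a directory part: only that path is searched.
        return [posixpath.normpath(posixpath.join(cwd, name))]
    # Bare file name: the default search path, in order.
    return [
        posixpath.join(sysconfdir, "amca-param-sets", name),
        posixpath.join(cwd, name),
    ]


def split_am_option(value):
    """Split the colon-separated list given to -am into file names."""
    return [part for part in value.split(":") if part]


for name in split_am_option("foo.conf:./bar.conf:/bip/boop/foo.conf"):
    print(name, amca_candidates(name, "/opt/openmpi/etc", "/home/user/run"))
```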


11. How do I select which components are used?

Each MCA framework has a top-level MCA parameter that helps guide which components are selected to be used at run-time. Specifically, there is an MCA parameter of the same name as each MCA framework that can be used to include or exclude components from a given run.

For example, the btl MCA parameter is used to control which BTL components are used (i.e., MPI point-to-point communications; see the FAQ entries on frameworks in Open MPI for a full list of MCA frameworks). It can take as a value a comma-separated list of components with the optional prefix "^". For example:

# Tell Open MPI to exclude the tcp and openib BTL components
# and implicitly include all the rest
shell$ mpirun --mca btl ^tcp,openib ...

# Tell Open MPI to include *only* the components listed here and
# implicitly ignore all the rest (i.e., the loopback, shared memory,
# and Myrinet/GM MPI point-to-point components):
shell$ mpirun --mca btl self,sm,gm ...

Note that ^ can only be the prefix of the entire value because the inclusive and exclusive behavior are mutually exclusive. Specifically, since the exclusive behavior means "use all components except these," it does not make sense to mix it with the inclusive behavior of not specifying it (i.e., "use all of these components"). Hence, something like this:

shell$ mpirun --mca btl self,sm,openib,^tcp ...

does not make sense because it says both "use only the self, sm, and openib components" and "use all components except tcp" and will result in an error.

Just as with all MCA parameters, the btl parameter (and all framework parameters) can be set in multiple different ways.
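The include/exclude semantics of a framework parameter value can be modeled in a few lines of Python. This is a sketch of the rules described in this entry only; the function name `select_components` and the list of available components are hypothetical, and real component selection involves more than name filtering.

```python
def select_components(spec, available):
    """Apply an include or exclude list to the available component names.

    spec is a comma-separated list; a leading "^" makes it an exclude
    list. A "^" anywhere else is rejected, mirroring the rule above.
    """
    exclusive = spec.startswith("^")
    body = spec[1:] if exclusive else spec
    names = [n.strip() for n in body.split(",") if n.strip()]
    if any("^" in n for n in names):
        raise ValueError("'^' may only prefix the entire value")
    if exclusive:
        return [c for c in available if c not in names]
    return [c for c in available if c in names]


available = ["self", "sm", "tcp", "openib", "gm"]  # made-up installation
print(select_components("^tcp,openib", available))
print(select_components("self,sm,gm", available))
```

With this made-up list of components, both calls happen to select the same three components, one by exclusion and one by inclusion.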


12. What is processor affinity? Does Open MPI support it?

Processor affinity is when a process is "bound" to a specific processor. That is, the operating system will only allow that process to run on that processor. On multi-processor machines, this can help improve performance by not letting the operating system move processes between processors. In the worst case, it will simply remove "jitter" from performance characteristics due to the OS moving processes (i.e., performance characteristics should be quite consistent between multiple runs). In the best case, it can dramatically improve performance.

Open MPI supports processor affinity on a variety of systems. You can run the "ompi_info" command and look for "paffinity" components to see if your system is supported. For example:

$ ompi_info | grep paffinity
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0)

Note that processor affinity should not be used when a node is over-subscribed (i.e., more processes are launched than there are processors). This can lead to a serious degradation in performance (even more than simply oversubscribing the node). Open MPI will usually detect this situation and automatically disable the use of processor affinity (and display run-time warnings to this effect).

Also see the FAQ entry on telling Open MPI to use processor and/or memory affinity.


13. What is memory affinity? Does Open MPI support it?

Memory affinity is only relevant for Non-Uniform Memory Access (NUMA) machines, such as "big iron" SGI and Cray machines, or many models of multi-processor Opteron machines. In a NUMA architecture, memory is physically distributed throughout the machine even though it is virtually treated as a single address space. That is, memory may be physically local to one or more processors -- and therefore remote to other processors.

Simply put: some memory will be faster to access (for a given process) than others.

Open MPI supports general and specific memory affinity, meaning that it generally tries to allocate all memory local to the processor that asked for it. When shared memory is used for communication, Open MPI uses memory affinity to make certain pages local to specific processes in order to minimize memory network/bus traffic.

Open MPI supports memory affinity on a variety of systems. You can run the "ompi_info" command and look for "maffinity" components to see if your system is supported. For example:

$ ompi_info | grep maffinity
           MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)

Note that memory affinity support is enabled only when processor affinity is enabled. Specifically: using memory affinity does not make sense if processor affinity is not enabled because processes may allocate local memory and then move to a different processor, potentially remote from the memory that it just allocated.

Also see the FAQ entry on telling Open MPI to use processor and/or memory affinity.


14. How do I tell Open MPI to use processor and/or memory affinity?

Assuming that your system supports processor and memory affinity (check ompi_info for "paffinity" and "maffinity" components), you can explicitly tell Open MPI to use them when running MPI jobs.

Note that memory affinity support is enabled only when processor affinity is enabled. Specifically: using memory affinity does not make sense if processor affinity is not enabled because processes may allocate local memory and then move to a different processor, potentially remote from the memory that it just allocated.

Also note that processor and memory affinity is meaningless (but harmless) on uniprocessor machines.

Open MPI 1.2 only offers coarse-grained controls for processor affinity. As such, it is best if the processes in an Open MPI job using processor affinity are the only intensive processes running on the nodes being used for the job. Specifically, since most schedulers do not (yet) provide information on which processors should be used for specific processes, Open MPI can only assume that its processes are "alone" on the node and it can exclusively claim CPUs starting with the first one.

Hence, if two processor-affinity-enabled jobs are running on the same node, they will both attempt to claim the first processor(s) on the node, resulting in CPU thrashing (and severely degraded performance).

Remember: only set processor affinity if you know that you have sole use of the nodes and you only run one job at a time on those nodes.

Open MPI 1.3 supports run-time environments that automatically tell jobs which processors to run on, but most schedulers that support this handle processor affinity themselves (therefore making Open MPI's processor affinity support unnecessary) and/or do not indicate to the job which processors should be used in a publicly accessible manner (i.e., Open MPI is not given this information). As such, assuming that Open MPI jobs are "alone" on a node seemed a "good enough" workaround in the interim.

To enable processor (and potentially memory) affinity, set the MCA parameter "mpi_paffinity_alone" to 1. For example:

$ mpirun --mca mpi_paffinity_alone 1 -np 4 a.out

Just like any other MCA parameter, mpi_paffinity_alone can be set via any of the normal MCA parameter mechanisms.

Presumably, this job is running on a single 4-way SMP or two 2-way SMPs. Setting mpi_paffinity_alone will tell Open MPI to bind each process to a specific processor and, if memory affinity is supported, to attempt to use general and specific memory affinity as described in the FAQ entry on memory affinity.

Finally, note that Open MPI will automatically disable processor affinity on any node that is oversubscribed (i.e., where more Open MPI processes are launched in a single job on a node than it has processors) and will print out warnings to that effect.

Note, however, that processor affinity is not exclusionary with Degraded performance mode. Degraded mode is usually only used when oversubscribing nodes (i.e., running more processes on a node than it has processors -- see this FAQ entry for more details about oversubscribing, as well as a definition of Degraded performance mode). It is possible to manually select Degraded performance mode and use processor affinity as long as you are not oversubscribing.

Open MPI 1.3 and higher also offers a more granular process affinity setting that includes a slot mapping method based on specifications provided in a rankfile. The syntax of the rankfile is similar to that of a hostfile, with the addition of slot specifications for each rank in the following format:

rank N=hostA slot=cpu_num

rank M=hostB slot=socket_num:core_num

Consider the following example:

shell$ mpirun -np 5 -hostfile hostfile -rf rankfile ./app

or

shell$ mpirun -np 5 -hostfile hostfile --mca rmaps_rank_file_path rankfile ./app

shell$ cat rankfile
rank 0=host1 slot=2
rank 1=host2 slot=1-3,0
rank 2=host4 slot=1:0
rank 4=host3 slot=0:*
rank 3=host5 slot=0:1,1:0-2

This means that:

  • rank 0 will run on host1 bound to CPU2
  • rank 1 will run on host2 bound to CPUs from CPU1 to CPU3 and CPU0
  • rank 2 will run on host4 bound to socket1 core0
  • rank 4 will run on host3 bound to any core on socket0
  • rank 3 will run on host5 bound to socket0:core1 and socket1:core0, socket1:core1, socket1:core2
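To make the slot syntax concrete, here is a small Python sketch that parses the example rankfile into explicit bindings. It only mirrors the forms shown above (single CPUs, CPU ranges, socket:core pairs, core ranges, and "*"); the function names are hypothetical, the "p" physical prefix is not handled, and this is not Open MPI's own parser.

```python
import re

RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$")


def expand(item):
    """Expand "3" to [3] and "0-2" to [0, 1, 2]."""
    first, sep, last = item.partition("-")
    return list(range(int(first), int(last) + 1)) if sep else [int(first)]


def parse_slot(spec):
    """Return a list of ("cpu", n) or ("socket", s, core) tuples.

    core is an int, or "*" meaning any core on that socket.
    """
    bindings = []
    for item in spec.split(","):
        socket, sep, cores = item.partition(":")
        if not sep:
            bindings += [("cpu", n) for n in expand(item)]
        elif cores == "*":
            bindings.append(("socket", int(socket), "*"))
        else:
            bindings += [("socket", int(socket), c) for c in expand(cores)]
    return bindings


def parse_rankfile(text):
    """Map each rank number to (host, bindings)."""
    ranks = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = RANK_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized rankfile line: {line!r}")
        rank, host, slot = match.groups()
        ranks[int(rank)] = (host, parse_slot(slot))
    return ranks


rankfile = """\
rank 0=host1 slot=2
rank 1=host2 slot=1-3,0
rank 2=host4 slot=1:0
rank 4=host3 slot=0:*
rank 3=host5 slot=0:1,1:0-2
"""
for rank, (host, bindings) in sorted(parse_rankfile(rankfile).items()):
    print(rank, host, bindings)
```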

Notes:

  • It is strongly recommended that you provide a full rankfile when using the slot mapping affinity setting; otherwise, there is a very high probability of processor oversubscription and performance degradation.
  • The hosts specified in the rankfile must be known to mpirun, either via a list of hosts in a hostfile or as obtained from a resource manager.
  • The number of processes (np) must be provided on the mpirun command line.
  • By default, the numbering of the sockets and cores is given in terms of their logical processor IDs, sequentially numbered starting with zero. In most cases, the logical number will directly correlate to the same physical socket and/or core ID. However, there are cases where this isn't true, typically due to either unpopulated sockets or idled cores. If you want a process to bind to a specific physical processor in these cases, you can put "p" for "physical" before the socket:core pair. For example, rank 4=host3 slot=p3:2 will bind rank 4 to the physical socket3 : physical core2 pair.

If you are running one process per host, then you can specify the slot directly on the command line with --slot-list:

shell$ mpirun -np 4 -hostfile hostfile --slot-list 0:1 ./app

Note that running more than one process of a job on the same host will oversubscribe the CPU, since all of them will be bound to the same processor (in the example above, socket0:core1).

Using threads and setting paffinity can be achieved by reserving the same number of slots as the number of threads for each process.

Example:

Two threads per process:

rank 0=host1 slot=0,1

Four threads per process:

rank 0=host1 slot=0,1,2,3

Note that the threads, however, will not be bound to a specific processor. OMPI only supports process level affinity - thus, all threads from the process will be restricted to the listed processors, but no one thread is bound to any specific processor within that list.
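The "one slot per thread" rule lends itself to generating rankfile entries programmatically. The Python sketch below does so under simplifying assumptions that are not from the FAQ: one rank per host, and CPUs numbered consecutively from 0. The function names are hypothetical.

```python
def rankfile_line(rank, host, cpus):
    """Format one rankfile entry that reserves the given CPUs for a rank."""
    slots = ",".join(str(cpu) for cpu in cpus)
    return f"rank {rank}={host} slot={slots}"


def rankfile_for_threads(hosts, threads_per_process):
    """Give each rank as many consecutive CPUs as it has threads.

    Assumes one rank per host and CPUs numbered from 0 (both are
    simplifications made for this sketch).
    """
    cpus = range(threads_per_process)
    return "\n".join(
        rankfile_line(rank, host, cpus) for rank, host in enumerate(hosts)
    )


print(rankfile_for_threads(["host1", "host2"], 4))
```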


15. Does Open MPI support calling fork() or system() in MPI processes?

It depends on a lot of factors, including (but not limited to):

  • The operating system
  • The underlying compute hardware
  • The network stack
  • Interactions with other middleware in the MPI process

In some cases, Open MPI will determine that it is not safe to fork(). In these cases, Open MPI will register a pthread_atfork() callback to print a warning when the process forks.

This warning is helpful for legacy MPI applications where the current maintainers are unaware SYSTEM is being invoked from an obscure subroutine nestled deep in millions of lines of Fortran code (we've seen this kind of scenario many times).

However, this atfork handler can be dangerous because there is no way to unregister an atfork handler. Hence, packages that dynamically open Open MPI's libraries (e.g., Python bindings for Open MPI) may fail if they finalize and unload libmpi, but later call fork. The atfork system will try to invoke Open MPI's atfork handler; nothing good can come of that.

For such scenarios, or if you simply want to disable printing the warning, Open MPI's atfork handler can be disabled with the mpi_warn_on_fork MCA parameter. For example:

shell$ mpirun --mca mpi_warn_on_fork 0 ...

Of course, systems that dlopen libmpi may not use Open MPI's mpirun, and therefore may need to use a different mechanism to set MCA parameters.
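One such mechanism is the OMPI_MCA_<param_name> environment variable form described earlier in this FAQ. The Python sketch below sets it from inside a host process; the helper name `set_mca_env` is hypothetical, and the sketch assumes the variable only needs to be in the process environment before the MPI library is loaded and initialized.

```python
import os


def set_mca_env(param, value, environ=os.environ):
    """Set OMPI_MCA_<param> in the given environment mapping."""
    name = "OMPI_MCA_" + param
    environ[name] = str(value)
    return name


# Must run before the MPI library is loaded and initialized, so that the
# value is already in the environment when Open MPI looks for it.
set_mca_env("mpi_warn_on_fork", 0)

# A host program would load its MPI bindings only after this point.
print(os.environ["OMPI_MCA_mpi_warn_on_fork"])
```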


16. I want to run some performance benchmarks with Open MPI. How do I do that?

Running benchmarks correctly is an extremely difficult task. There are many, many factors to take into account; it is not as simple as just compiling and running a stock benchmark application. This FAQ entry is by no means a definitive guide, but it does try to offer some suggestions for generating accurate, meaningful benchmarks.

  1. Decide exactly what you are benchmarking and set up your system accordingly. For example, if you are trying to benchmark maximum performance, then many of the suggestions listed below are extremely relevant (be the only user on the systems and network in question, be the only software running, use processor affinity, etc.). If you're trying to benchmark average performance, some of the suggestions below may be less relevant. Regardless, it is critical to know exactly what you're trying to benchmark, and know (not guess) both your system and the benchmark application itself well enough to understand what the results mean.

    To be specific, many benchmark applications are not well understood for exactly what they are testing. There have been many cases where users run a given benchmark application and wrongfully conclude that their system's performance is bad -- solely on the basis of a single benchmark that they did not understand. Read the documentation of the benchmark carefully, and possibly even look into the code itself to see exactly what it is testing.

    Case in point: not all ping-pong benchmarks are created equal. Most users assume that a ping-pong benchmark is a ping-pong benchmark is a ping-pong benchmark. But this is not true; the common ping-pong benchmarks tend to test subtly different things (e.g., NetPIPE, TCP bench, IMB, OSU, etc.). Make sure you understand what your benchmark is actually testing.

  2. Make sure that you are the only user on the systems where you are running the benchmark to eliminate contention from other processes.
  3. Make sure that you are the only user on the entire network / interconnect to eliminate network traffic contention from other processes. This is usually somewhat difficult to do, especially in larger, shared systems. But your most accurate, repeatable results will be achieved when you are the only user on the entire network.
  4. Disable all services and daemons that are not being used. Even "harmless" daemons consume system resources (such as RAM) and cause "jitter" by occasionally waking up, consuming CPU cycles, reading or writing to disk, etc. The optimum benchmark system has an absolute minimum number of system services running.
  5. Use processor affinity on multi-processor/core machines to prevent the operating system from moving MPI processes between processors (and causing unnecessary cache thrashing, for example).

    On NUMA architectures, having the processes getting bumped from one socket to another is more expensive in terms of cache locality (with all of the cache coherency overhead that comes with the lack of it) than in terms of hypertransport routing (see below).

    Non-NUMA architectures such as the Intel Woodcrest have a flat access time to the South Bridge, but cache locality is still important so CPU affinity is always a good thing to do.

  6. Be sure to understand your system's architecture, particularly with respect to the memory, disk, and network characteristics, and test accordingly. For example, on NUMA architectures, the most common being Opteron, the South Bridge is connected through a hypertransport link to one CPU on one socket. Which socket depends on the motherboard, but it should be described in the motherboard documentation (it's not always socket 0!). If a process on the other socket needs to write something to a NIC on a PCIE bus behind the South Bridge, it needs to first hop through the first socket. On modern machines (circa late 2006), this hop usually costs something like 100ns (i.e., 0.1 us). If the socket is further away, like in a 4 or 8-socket configuration, there could potentially be more hops, leading to more latency.
  7. Compile your benchmark with the appropriate compiler optimization flags. With some MPI implementations, the compiler wrappers (like mpicc, mpif90, etc.) add optimization flags automatically. Open MPI does not. Add -O or other flags explicitly.

  8. Make sure your benchmark runs for a sufficient amount of time. Short-running benchmarks are generally less accurate because they take fewer samples; longer-running jobs tend to take more samples.

  9. If your benchmark is trying to benchmark extremely short events (such as the time required for a single ping-pong of messages):

    • Perform some "warmup" events first. Many MPI implementations (including Open MPI) -- and other subsystems upon which the MPI relies -- may use "lazy" semantics to set up and maintain streams of communications. Hence, the first event (or first few events) may well take significantly longer than subsequent events.
    • Use a high-resolution timer if possible -- gettimeofday() reports time in microseconds, but its effective resolution is sometimes on the order of several microseconds.
    • Run the event many, many times (hundreds or thousands, depending on the event and the time it takes). Not only does this provide more samples, it may also be necessary, especially when the timer you're using is several orders of magnitude less precise than the event you're trying to benchmark.

  10. Decide whether you are reporting minimum, average, or maximum numbers, and have good reasons why.
  11. Accurately label and report all results. Reproducibility is a major goal of benchmarking; benchmark results are effectively useless if they are not precisely labeled as to exactly what they are reporting. Keep a log and detailed notes about the exact system configuration that you are benchmarking. Note, for example, all hardware and software characteristics (to include hardware, firmware, and software versions as appropriate).
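Several of the suggestions above (warmup events, a high-resolution timer, many repetitions, and explicit min/avg/max reporting) can be combined in a small timing harness. The following Python sketch is one possible shape for it, not a definitive benchmark; `benchmark` and `fake_ping_pong` are hypothetical names, and the placeholder event does no MPI communication.

```python
import time


def benchmark(event, warmup=10, iterations=1000):
    """Time event() repeatedly and return min/avg/max in nanoseconds."""
    for _ in range(warmup):
        event()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        event()
        samples.append(time.perf_counter_ns() - start)
    return {
        "samples": len(samples),
        "min_ns": min(samples),
        "avg_ns": sum(samples) / len(samples),
        "max_ns": max(samples),
    }


def fake_ping_pong():
    """Placeholder for the event under test (not real MPI traffic)."""
    sum(range(100))


stats = benchmark(fake_ping_pong, warmup=50, iterations=2000)
print(stats)
```

Reporting all three statistics along with the sample count leaves the choice of headline number, and the reasons for it, to whoever writes up the results.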