This chapter provides information about tuning applications to improve performance. It covers general tuning and SPP-UX tuning.
General tuning information applies to all applications running on
HP-UX and SPP-UX platforms. SPP-UX tuning information applies only
to applications running on SPP-UX.
Note: The tuning information in this chapter will improve application performance in most but not all cases. You should use the output from counter instrumentation or XMPI to determine which tuning changes are appropriate.
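When you are developing HP MPI applications, several factors can affect performance. These factors include message latency and bandwidth, the use of multiple network interfaces, processor subscription, and the choice of MPI routines.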
Latency is the time between the initiation of the data transfer in the sending process and the arrival of the first byte in the receiving process.
Latency often depends on the length of the messages being sent. An application's messaging behavior can vary greatly depending on whether it sends a large number of small messages or a few large messages.
Message bandwidth is the reciprocal of the time needed to transfer a byte. Bandwidth is normally expressed in megabytes per second. Bandwidth becomes important when message sizes are large.
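The way latency and bandwidth combine can be illustrated with a simple first-order cost model. This model is an assumption made here for illustration, not something HP MPI defines: the time to deliver a message is taken to be the latency plus the message size divided by the bandwidth. The function names below are hypothetical.

```c
/* First-order cost model, assumed for illustration only (not an HP MPI API):
 *   time = latency + bytes / bandwidth
 * With bandwidth in MB/s (1 MB = 10^6 bytes), 1 MB/s equals 1 byte per
 * microsecond, so bytes / bandwidth is already in microseconds. */
double transfer_time_us(double latency_us, double bandwidth_mb_per_s,
                        double bytes)
{
    return latency_us + bytes / bandwidth_mb_per_s;
}

/* Estimated time to move total_bytes when it is split into n_messages
 * equally sized messages sent one after another. */
double total_time_us(double latency_us, double bandwidth_mb_per_s,
                     double total_bytes, int n_messages)
{
    double per_message = total_bytes / n_messages;
    return n_messages *
           transfer_time_us(latency_us, bandwidth_mb_per_s, per_message);
}
```

With an assumed latency of 10 microseconds and a bandwidth of 100 MB/s, this model estimates about 20000 microseconds to send 1 MB as 1000 messages of 1000 bytes each, and about 10010 microseconds to send it as a single message. Under these assumptions, latency dominates for many small messages and bandwidth dominates for a few large ones.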
To improve latency or bandwidth or both:
- Use derived data types instead of MPI_Pack and MPI_Unpack if possible. HP MPI optimizes noncontiguous transfers of derived data types.
- Use collective operations where possible instead of calling MPI_Send and MPI_Recv each time one process communicates with others. Also, use the HP MPI collectives rather than customizing your own. HP MPI collectives have three-level optimizations when used in a NUMA environment.
- Specify the source rank where possible. Using MPI_ANY_SOURCE may increase latency.
- Use MPI_Recv_init and MPI_Startall instead of a loop of MPI_Irecv calls in cases where requests may not complete immediately.
For example, suppose you write an application with the following code section:

    j = 0;
    for (i = 0; i < size; i++) {
        if (i == rank) continue;   /* skip self */
        MPI_Irecv(buf[i], count, dtype, i, 0, comm, &requests[j++]);
    }
    MPI_Waitall(size-1, requests, statuses);

Suppose that one of the MPI_Irecv calls in this loop does not
complete before the next iteration of the loop. In this case, HP MPI
tries to progress both requests. This progression effort could continue
to grow if succeeding iterations also do not complete immediately,
resulting in higher latency.
However, you could rewrite the code section as follows:

    j = 0;
    for (i = 0; i < size; i++) {
        if (i == rank) continue;   /* skip self */
        MPI_Recv_init(buf[i], count, dtype, i, 0, comm, &requests[j++]);
    }
    MPI_Startall(size-1, requests);
    MPI_Waitall(size-1, requests, statuses);

In this case, all requests created by MPI_Recv_init are progressed
just once when MPI_Startall is called. This approach avoids the
additional progression overhead of using MPI_Irecv and can
reduce application latency.
You can use multiple network interfaces for interhost communication while still having intrahost exchanges. In this case, the intrahost exchanges use shared memory between processes mapped to different same-host IP addresses.
To use multiple network interfaces, you must specify which MPI
processes are associated with each IP address in your appfile. On
SPP-UX, you can improve performance by using the MPI_TOPOLOGY
environment variable to associate each network interface with the
hypernode where it physically resides.
For example, suppose you have two hosts called host0 and host1 that each communicate using the two HIPPI cards hippi0 and hippi1. Assume the network interfaces are named host0-hippi0, host0-hippi1, host1-hippi0, and host1-hippi1.
If your executable is called beavis.exe and uses 64 processes, your appfile should contain the following entries:

    -h host0-hippi0 -e MPI_TOPOLOGY=/0:16,0 -np 16 beavis.exe
    -h host0-hippi1 -e MPI_TOPOLOGY=/1:0,16 -np 16 beavis.exe
    -h host1-hippi0 -e MPI_TOPOLOGY=/0:16,0 -np 16 beavis.exe
    -h host1-hippi1 -e MPI_TOPOLOGY=/1:0,16 -np 16 beavis.exe

Now, when the appfile is run, 32 processes are run on host0 and 32 processes are run on host1 as shown in Figure 1.
Figure 1 Multiple network interfaces
Host0 processes with ranks 0-15 communicate with processes with
ranks 16-31 through shared memory (shmem). Host0 processes also
communicate with host1 processes through the host0-hippi0 network
interface.
Subscription refers to the match of processors and active processes on a host or subcomplex. Table 1 lists the possible subscription types.
| Subscription type | Description |
|---|---|
| Under subscribed | More processors than active processes |
| Fully subscribed | Equal number of processors and active processes |
| Over subscribed | More active processes than processors |
When a host or subcomplex is over subscribed, application performance decreases because of increased context switching.
Context switching can degrade application performance by slowing the computation phase, increasing message latency, and lowering message bandwidth. Simulations that use timing-sensitive algorithms can produce unexpected or erroneous results when run on an over-subscribed system.
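The subscription types in Table 1 reduce to a comparison of two counts. The following is a minimal sketch of that comparison. The type and function names are hypothetical and are not part of HP MPI.

```c
/* Hypothetical helper, not part of HP MPI: maps processor and active
 * process counts onto the subscription types listed in Table 1. */
typedef enum {
    UNDER_SUBSCRIBED,   /* more processors than active processes */
    FULLY_SUBSCRIBED,   /* equal number of processors and active processes */
    OVER_SUBSCRIBED     /* more active processes than processors */
} subscription_t;

subscription_t classify_subscription(int processors, int active_processes)
{
    if (active_processes < processors)
        return UNDER_SUBSCRIBED;
    if (active_processes == processors)
        return FULLY_SUBSCRIBED;
    return OVER_SUBSCRIBED;
}
```

A check like this could be run before launching a job to warn when the requested process count would over subscribe the host or subcomplex.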
To achieve the lowest message latencies and highest message
bandwidths for point-to-point synchronous communications, use the MPI
blocking routines MPI_Send and MPI_Recv. For asynchronous
communications, use the MPI nonblocking routines MPI_Isend and
MPI_Irecv.
When using the blocking routines, try to avoid pending requests. MPI must advance nonblocking messages, so calls to blocking receives must also advance any pending requests, which occasionally results in lower application performance.
For tasks that require collective operations, use the appropriate MPI collective routine. HP MPI takes advantage of shared memory to perform efficient data movement and maximize your application's communication performance.
Two factors affect application performance when working with HP MPI applications on the SPP-UX platform: multilevel parallelism and process placement.
There are several ways to improve the performance of applications that use multilevel parallelism:
Because messaging bandwidth and latency are better within a
hypernode than between hypernodes, you can improve performance by
placing HP MPI processes that communicate heavily on the same
hypernode. One way to do this is to use the MPI_TOPOLOGY environment
variable to tell an application the number of processes to run on each
available hypernode.
For example, suppose you want to run an application on an X-Class server using a subcomplex called System. This subcomplex spans four hypernodes and contains the 20 processors listed below:
Suppose the application you want to run contains the 16 processes listed below:
Ideally, you should use a process placement that allows each set of processes to run on a single hypernode to maximize message-passing performance.
By default, HP MPI places processes by fully subscribing each hypernode before moving on to the next. If the processes in your application are placed using this approach, you get the placement shown in Figure 2.
Figure 2 Default process placement
While this distribution prevents processor oversubscription, it does not provide optimum message-passing performance because the processes from sets two and three are split across hypernodes. Communications within these process groups may become a bottleneck when running the application.
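The default policy described above (fully subscribe each hypernode, then move to the next) can be sketched as a simple fill loop. This is an illustrative assumption about the behavior, not HP MPI source code, and the function name and parameters are hypothetical.

```c
/* Hypothetical sketch, not HP MPI source: place processes by filling each
 * hypernode up to its processor count before moving on to the next one.
 *   capacity[h] - number of processors on hypernode h
 *   placed[h]   - output: number of processes placed on hypernode h
 * Returns the number of processes that could not be placed without
 * over subscribing a hypernode. */
int default_placement(const int *capacity, int n_hypernodes,
                      int n_processes, int *placed)
{
    int h;
    int remaining = n_processes;

    for (h = 0; h < n_hypernodes; h++) {
        placed[h] = remaining < capacity[h] ? remaining : capacity[h];
        remaining -= placed[h];
    }
    return remaining;
}
```

Because this fill order ignores which processes communicate with each other, a group of heavily communicating processes can end up straddling a hypernode boundary whenever the group does not fit in the processors left on the current hypernode.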
You can solve this problem by specifying the number of processes you want to run on each hypernode as shown below:
This distribution results in the placement shown in Figure 3.
Figure 3 Optimal process placement
To specify this process placement, set MPI_TOPOLOGY by entering:

    % setenv MPI_TOPOLOGY 4,0,4,8

For more information, see "MPI_TOPOLOGY".
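Before launching, it can be useful to confirm that the per-hypernode counts add up to the number of processes you intend to run. The sketch below handles only the plain comma-separated form used in the setenv example (such as 4,0,4,8). It does not attempt other forms of the MPI_TOPOLOGY syntax, and the function name is hypothetical rather than part of HP MPI.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of HP MPI. Parses a plain comma-separated
 * list of per-hypernode process counts, for example "4,0,4,8".
 *   counts     - output array receiving one count per hypernode
 *   max_counts - capacity of the counts array
 *   total      - output: sum of all counts
 * Returns the number of entries parsed, or -1 if the string is malformed
 * (empty entry, negative value, stray characters, or too many entries). */
int parse_topology(const char *spec, int *counts, int max_counts, int *total)
{
    int n = 0;

    *total = 0;
    while (*spec != '\0') {
        char *end;
        long value = strtol(spec, &end, 10);

        if (end == spec || value < 0 || n >= max_counts)
            return -1;
        counts[n++] = (int)value;
        *total += (int)value;

        if (*end == ',') {
            end++;
            if (*end == '\0')      /* trailing comma */
                return -1;
        } else if (*end != '\0') { /* unexpected character */
            return -1;
        }
        spec = end;
    }
    return n;
}
```

For the value 4,0,4,8 this yields four entries with a total of 16, which matches the 16 processes in the example application.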
Note: Make sure you use MPI_TOPOLOGY to place processes doing I/O on the hypernodes hosting the appropriate I/O controller. Placing these processes on noncontroller nodes results in lower I/O performance.