Last modified on: Wednesday, October 15 1997 at 11:11am

5 Tuning

This chapter provides information about tuning applications to improve performance. The topics covered are general tuning and SPP-UX platform tuning.

General tuning information applies to all applications running on HP-UX and SPP-UX platforms. SPP-UX tuning information applies only to applications running on that platform.

Note: The tuning information in this chapter will improve application performance in most but not all cases. You should use the output from counter instrumentation or XMPI to determine which tuning changes are appropriate.

General tuning

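When you are developing HP MPI applications, several factors can affect performance. These factors include message latency and bandwidth, multiple network interfaces, processor subscription, and MPI routine selection.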

Message latency and bandwidth

Latency is the time between the initiation of the data transfer in the sending process and the arrival of the first byte in the receiving process.

Latency is often dependent upon the length of messages being sent. An application's messaging behavior can vary greatly based upon whether a large number of small messages or a few large messages are sent.

Message bandwidth is the reciprocal of the time needed to transfer a byte. Bandwidth is normally expressed in megabytes per second. Bandwidth becomes important when message sizes are large.
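
As a rough illustration of how latency and bandwidth combine, the following sketch uses a simple linear cost model, time = latency + bytes / bandwidth. The model, the function names, and the parameter names are assumptions made for this example and are not part of HP MPI. In particular, the model treats latency as a constant, although latency often depends on message length.

```c
#include <assert.h>

/* Estimated time in seconds to deliver one message, under the assumed
   linear model: time = latency + bytes / bandwidth. */
double transfer_time(double latency_s, double bandwidth_bytes_per_s,
                     double bytes)
{
    assert(bandwidth_bytes_per_s > 0.0);
    return latency_s + bytes / bandwidth_bytes_per_s;
}

/* Estimated time to move total_bytes split evenly into n_messages
   messages sent one after another. The latency cost is paid once per
   message; the bandwidth cost depends only on the total size. */
double total_time(double latency_s, double bandwidth_bytes_per_s,
                  double total_bytes, int n_messages)
{
    assert(n_messages > 0);
    return n_messages * transfer_time(latency_s, bandwidth_bytes_per_s,
                                      total_bytes / n_messages);
}
```

Under this model, splitting a fixed amount of data into more messages always raises the estimated total time, because each additional message adds one more latency term. This is why latency dominates for many small messages and bandwidth dominates for a few large ones.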

To improve latency or bandwidth or both:

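/* A loop of MPI_Irecv calls: a new request is created for every receive. */
j = 0;
for (i=0; i<size; i++) {
   if (i==rank) continue;
   MPI_Irecv(buf[i], count, dtype, i, 0, comm, &requests[j++]);
}
MPI_Waitall(size-1, requests, statuses);

/* The same receives using persistent requests: each request is set up
   once with MPI_Recv_init, then all are started with MPI_Startall. */
j = 0;
for (i=0; i<size; i++) {
   if (i==rank) continue;
   MPI_Recv_init(buf[i], count, dtype, i, 0, comm, &requests[j++]);
}
MPI_Startall(size-1, requests);
MPI_Waitall(size-1, requests, statuses);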

Multiple network interfaces

You can use multiple network interfaces for interhost communication while still having intrahost exchanges. In this case, the intrahost exchanges use shared memory, even between processes that are mapped to different IP addresses on the same host.

To use multiple network interfaces, you must specify which MPI processes are associated with each IP address in your appfile. To improve performance on SPP-UX, you should use the MPI_TOPOLOGY environment variable to associate each network interface with the hypernode where it physically resides.

For example, suppose you have two hosts called host0 and host1, each of which communicates using two HIPPI cards, hippi0 and hippi1. Assume the network interfaces are named host0-hippi0, host0-hippi1, host1-hippi0, and host1-hippi1.

If your executable is called beavis.exe and uses 64 processes, your appfile should contain the following entries:

-h host0-hippi0 -e MPI_TOPOLOGY=/0:16,0 -np 16 beavis.exe
-h host0-hippi1 -e MPI_TOPOLOGY=/1:0,16 -np 16 beavis.exe
-h host1-hippi0 -e MPI_TOPOLOGY=/0:16,0 -np 16 beavis.exe
-h host1-hippi1 -e MPI_TOPOLOGY=/1:0,16 -np 16 beavis.exe

Now, when the appfile is run, 32 processes are run on host0 and 32 processes are run on host1 as shown in Figure 1.

Figure 1 Multiple network interfaces

(Graphic)

Host0 processes with ranks 0-15 communicate with processes with ranks 16-31 through shared memory (shmem). Host0 processes also communicate through the host0-hippi0 network interface with host1 processes.
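
The rank ranges in this example follow from the -np values if ranks are handed out in appfile order, which is what the example suggests (ranks 0-15 for the first entry, 16-31 for the second). The sketch below makes that bookkeeping explicit. The function name and the in-order assumption are illustrative only and are not taken from HP MPI.

```c
/* Given the -np value of each appfile entry, return the index of the
   entry that owns `rank`, assuming ranks are assigned in appfile order.
   Returns -1 if the rank is outside the range covered by the appfile. */
int appfile_entry_for_rank(const int *np, int n_entries, int rank)
{
    int first = 0;   /* lowest rank owned by the current entry */
    int i;

    if (rank < 0)
        return -1;
    for (i = 0; i < n_entries; i++) {
        if (rank < first + np[i])
            return i;
        first += np[i];
    }
    return -1;
}
```

With four entries of 16 processes each, ranks 0-15 map to entry 0 (host0-hippi0), ranks 16-31 to entry 1 (host0-hippi1), and ranks 32-63 to the two host1 entries.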

Processor subscription

Subscription refers to the match of processors and active processes on a host or subcomplex. Table 1 lists the possible subscription types.

Table 1 Subscription types

Subscription type     Description

Under subscribed      More processors than active processes

Fully subscribed      Equal number of processors and active processes

Over subscribed       More active processes than processors

When a host or subcomplex is over subscribed, application performance decreases because of increased context switching.

Context switching can degrade application performance by slowing the computation phase, increasing message latency, and lowering message bandwidth. Simulations that use timing-sensitive algorithms can produce unexpected or erroneous results when run on an over-subscribed system.
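
The three subscription types in Table 1 reduce to a comparison of two counts. A minimal sketch follows; the type and function names are hypothetical, and obtaining the processor and process counts for a host or subcomplex is left to the caller.

```c
/* Subscription types from Table 1. */
typedef enum {
    UNDER_SUBSCRIBED,   /* more processors than active processes */
    FULLY_SUBSCRIBED,   /* equal number of processors and active processes */
    OVER_SUBSCRIBED     /* more active processes than processors */
} subscription_t;

/* Classify a host or subcomplex by comparing its processor count with
   the number of active processes running on it. */
subscription_t classify_subscription(int processors, int active_processes)
{
    if (active_processes < processors)
        return UNDER_SUBSCRIBED;
    if (active_processes == processors)
        return FULLY_SUBSCRIBED;
    return OVER_SUBSCRIBED;
}
```

A check like this could be run before launching a job to warn when the requested process count would over subscribe the target.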

MPI routine selection

To achieve the lowest message latencies and highest message bandwidths for point-to-point synchronous communications, use the MPI blocking routines MPI_Send and MPI_Recv. For asynchronous communications, use the MPI nonblocking routines MPI_Isend and MPI_Irecv.

When using the blocking routines, try to avoid pending requests. MPI must advance nonblocking messages, so calls to blocking receives must also advance any pending requests, which occasionally results in lower application performance.

For tasks that require collective operations, use the appropriate MPI collective routine. HP MPI takes advantage of shared memory to perform efficient data movement and maximize your application's communication performance.

SPP-UX platform tuning

Two factors affect application performance when working with HP MPI applications on the SPP-UX platform: multilevel parallelism and process placement.

Multilevel parallelism

There are several ways to improve the performance of applications that use multilevel parallelism:

Process placement

Because messaging bandwidth and latency are better within a hypernode than between hypernodes, you can improve performance by placing HP MPI processes that communicate heavily on the same hypernode. One way to do this is to use the MPI_TOPOLOGY environment variable to tell an application the number of processes to run on each available hypernode.

For example, suppose you want to run an application on an X-Class server using a subcomplex called System. This subcomplex spans four hypernodes and contains the 20 processors listed below:

Suppose the application you want to run contains the 16 processes listed below:

Ideally, you should use a process placement that allows each set of processes to run on a single hypernode to maximize message-passing performance.

By default, HP MPI places processes by fully subscribing each hypernode before moving on to the next. If the processes in your application are placed using this approach, you get the placement shown in Figure 2.

Figure 2 Default process placement

(Graphic)

While this distribution prevents processor oversubscription, it does not provide optimum message-passing performance because the processes from sets two and three are split across hypernodes. Communications within these process groups may become a bottleneck when running the application.
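
The default policy described above (fully subscribe each hypernode before moving on to the next) can be sketched as a simple fill loop. This is an illustration of the policy as described, not HP MPI's implementation, and the function name is hypothetical.

```c
#include <assert.h>

/* Fill hypernodes in order, fully subscribing each one before moving on.
   processors[i] is the number of processors on hypernode i; on return,
   placed[i] is the number of processes assigned to hypernode i.
   Returns the number of processes that did not fit on any hypernode
   (0 when the subcomplex is not over subscribed). */
int default_placement(const int *processors, int n_hypernodes,
                      int n_processes, int *placed)
{
    int remaining = n_processes;
    int i;

    assert(n_processes >= 0);
    for (i = 0; i < n_hypernodes; i++) {
        int take = remaining < processors[i] ? remaining : processors[i];
        placed[i] = take;
        remaining -= take;
    }
    return remaining;
}
```

For instance, with hypothetical per-hypernode processor counts of 6, 2, 4, and 8 (chosen here only because they total 20) and 16 processes, the loop yields 6, 2, 4, and 4 processes per hypernode. The fill ignores which processes communicate with each other, which is how a group of cooperating processes can end up split across hypernodes.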

You can solve this problem by specifying the number of processes you want to run on each hypernode as shown below:

This distribution results in the placement shown in Figure 3.

Figure 3 Optimal process placement

(Graphic)

To specify this process placement, set MPI_TOPOLOGY by entering:

% setenv MPI_TOPOLOGY 4,0,4,8

For more information, see "MPI_TOPOLOGY".
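
To sanity-check a value such as 4,0,4,8 before launching a job, a script or helper can split it into per-hypernode counts. The sketch below handles only the plain comma-separated form used in the setenv example above. It does not handle prefixed forms such as /0:16,0, it is not HP MPI's own parser, and the function name is hypothetical.

```c
#include <stdlib.h>

/* Parse a comma-separated list of non-negative integers, such as
   "4,0,4,8", into counts[]. Returns the number of entries parsed, or -1
   if the string is malformed or holds more than max_counts entries. */
int parse_topology_counts(const char *spec, int *counts, int max_counts)
{
    const char *p = spec;
    int n = 0;

    while (*p != '\0') {
        char *end;
        long v = strtol(p, &end, 10);

        if (end == p || v < 0 || n == max_counts)
            return -1;              /* no digits, negative, or overflow */
        counts[n++] = (int)v;

        if (*end == ',') {
            p = end + 1;
            if (*p == '\0')
                return -1;          /* trailing comma */
        } else if (*end == '\0') {
            p = end;                /* end of string: loop terminates */
        } else {
            return -1;              /* unexpected character */
        }
    }
    return n;
}
```

Parsing "4,0,4,8" gives four counts that sum to 16, matching the number of processes in the example application. Comparing that sum against the intended process count is a cheap way to catch a mistyped value.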

Note: Make sure you use MPI_TOPOLOGY to place processes doing I/O on the hypernodes hosting the appropriate I/O controller. Placing these processes on noncontroller nodes results in lower I/O performance.
