| ACADEMIC COMPUTING and COMMUNICATIONS CENTER | |||||||||
MPICH2 on Argo | ||||||||||||
| Overview | ||||||||||||
|
MPICH2 is a library from Argonne National Laboratory (http://www.anl.gov) that implements the MPI-2 standard. MPI (Message Passing Interface) is a library specification: it defines a group of functions, callable from either Fortran or C, that a program uses to achieve parallelism. An MPI function lets one process communicate with another by transmitting data (messages). |
||||||||||||
| Configuring your environment | ||||||||||||
|
You will make two changes: one to the PATH variable and the other to LD_LIBRARY_PATH. The former is mandatory; the latter is optional, depending on the compiler you use. If your login shell is the C shell, make the changes in the .cshrc file. If you use bash, make them in the .bash_profile file. To see which shell is your default, enter:
If the output is /bin/bash, you are using the bash shell; if /bin/csh, then the C shell. For a bash shell user, append the following in .bash_profile:
For a C shell user, append the following in your .cshrc:
Note: If you do not use the GNU compilers (that is, you use the supported Portland Group compilers instead), then the change to LD_LIBRARY_PATH is unnecessary because you will specify the appropriate library path with the -L option in the compile statement. See example below. |
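As a rough illustration of the two changes for a bash user, the sketch below detects the login shell and then prepends the MPICH2 directories. The install prefix /usr/common/mpich2-1.0.1 is inferred from the paths used elsewhere on this page, and the bin and lib subdirectory layout is an assumption; check both against your own account before copying anything into a startup file.

```shell
# Assumed install prefix, inferred from the paths used elsewhere on this page.
MPICH2_HOME=/usr/common/mpich2-1.0.1

# Pick the startup file that matches the login shell.
case "${SHELL:-unknown}" in
  */bash) startup_file=".bash_profile" ;;
  */csh)  startup_file=".cshrc" ;;
  *)      startup_file="(unrecognized shell)" ;;
esac
echo "Login shell: ${SHELL:-unknown}; file to edit: $startup_file"

# bash syntax for the two changes (a C shell user would use setenv instead).
export PATH="$MPICH2_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$MPICH2_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Prepending rather than appending means the MPICH2 version 2 tools are found before any other MPI installation already on the path.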
||||||||||||
| Compiling and Linking | ||||||||||||
|
The following example illustrates compiling using the Portland Group C compiler:
The -L/usr/common/mpich2-1.0.1/lib is required when using Portland
Group compilers; without it, the compiler will default to MPICH version 1 libraries and not version 2.
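A compile line of the kind described above might look like the following sketch. Only the -L option comes from this page; pgcc is the Portland Group C compiler, while the source file name, the executable name, the -I include path, and the -lmpich library name are assumptions made for illustration.

```shell
MPICH2_HOME=/usr/common/mpich2-1.0.1   # install prefix used on this page
src=hello_mpi.c                        # hypothetical source file
exe=hello_mpi                          # hypothetical executable name

# -I and -lmpich are assumptions; -L is the option the text requires.
compile_cmd="pgcc -I$MPICH2_HOME/include -L$MPICH2_HOME/lib -o $exe $src -lmpich"
echo "$compile_cmd"

# Compile only when the compiler and the source file are actually present.
if command -v pgcc >/dev/null 2>&1 && [ -f "$src" ]; then
  $compile_cmd
fi
```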
The following scripts are another way to compile and link your program:
Each script invokes the appropriate GNU compiler. |
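MPICH2 distributions normally ship compiler wrapper scripts named mpicc, mpicxx, mpif77, and mpif90. Whether all four are installed on argo is an assumption here. The sketch below picks a wrapper from the source file extension; the function name wrapper_for and the file names are hypothetical.

```shell
# Map a source file to the MPICH2 wrapper script that would compile it.
wrapper_for() {
  case "$1" in
    *.c)              echo mpicc ;;
    *.cc|*.cpp|*.cxx) echo mpicxx ;;
    *.f|*.f77)        echo mpif77 ;;
    *.f90)            echo mpif90 ;;
    *)                return 1 ;;
  esac
}

src=merge_mpi.c   # hypothetical source file
echo "$(wrapper_for "$src") -o ${src%.*} $src"
```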
||||||||||||
| Running | ||||||||||||
|
The three basic steps to execute an MPICH2 job:
Steps one and three are executed from the argo command line; step two is executed from a script. An mpd ring is a group of MPI daemons that constitutes the environment in which your MPI job runs. More on this in a moment, but start by creating two files in your home directory:

The configuration file is required and must be named .mpd.conf. It contains a secretword known only to the daemons in the ring. That word lets the daemons recognize one another and thus communicate. Don't forget the leading period in the name. To create the file:

    cd
    touch .mpd.conf
    chmod 600 .mpd.conf

Next, put the following single line in it:

    secretword=XXXXXXXXXX

The keyword secretword= is required. Don't enter the string of Xs after the keyword; replace them with your secretword (DON'T use your common password). Examples:

    secretword=ihatecomputers
    secretword=chicagocubs

The second file is a host file; it identifies the nodes to use in your mpd ring. There is no required name for the file; the name you give it is up to you. If you need only a single host file (more on having multiple host files later), then name it mpd.hosts. To create it:

    cd
    touch mpd.hosts
    chmod 644 mpd.hosts

The host file contains the nodes that will be used by BOTH the batch system and MPI to execute your job. For example, to use nodes argo1-1 to argo1-4, put the following in your mpd.hosts file:

    argo1-1
    argo1-2
    argo1-3
    argo1-4

You are not required to use the four nodes listed; they are used for example purposes. Nor are you limited to four nodes; you may use fewer or more, up to a maximum of sixteen nodes. Put each node on a separate line, with no blank lines.

Very important: NEVER have ALL 64 nodes in a host file. Why? The batch system will allocate all 64 compute nodes to your job even if you limit the number of MPI processes to a subset of the sixty-four nodes by means of the -n option on mpiexec.

Step 1) Bring up an mpd ring:

The following command, executed at the command line and not from a script, starts your ring (each job must have its own ring). What follows the rsh command, the xxxxxx, MUST be one of the nodes in your host file:

    rsh xxxxxx "/usr/common/mpich2-1.0.1/bin/mpdboot -r rsh -n 4 -f $HOME/mpd.hosts -v"
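The file setup and the mpdboot command can be tied together as in the sketch below. It writes the two files into a temporary directory that stands in for your home directory (so it is safe to run anywhere), checks the host file against the rules stated above, and prints an mpdboot command that uses the first node in the file. The secretword value is a placeholder and the variable names are hypothetical.

```shell
demo_home=$(mktemp -d)   # stand-in for $HOME so the sketch has no side effects

# .mpd.conf: a single secretword line, readable only by its owner.
printf 'secretword=%s\n' 'replace-with-your-own-word' > "$demo_home/.mpd.conf"
chmod 600 "$demo_home/.mpd.conf"

# mpd.hosts: one node per line, no blank lines.
printf '%s\n' argo1-1 argo1-2 argo1-3 argo1-4 > "$demo_home/mpd.hosts"
chmod 644 "$demo_home/mpd.hosts"

node_count=$(grep -c . "$demo_home/mpd.hosts")               # non-empty lines
blank_lines=$(grep -c '^$' "$demo_home/mpd.hosts" || true)   # empty lines
first_node=$(head -n 1 "$demo_home/mpd.hosts")

if [ "$blank_lines" -ne 0 ]; then
  echo "host file contains blank lines" >&2
fi
if [ "$node_count" -gt 16 ]; then
  echo "host file lists more than sixteen nodes" >&2
fi

# The value given to -n equals the number of nodes in the host file.
echo "rsh $first_node \"/usr/common/mpich2-1.0.1/bin/mpdboot -r rsh -n $node_count -f \$HOME/mpd.hosts -v\""
```

Deriving the -n value from the host file, rather than typing it by hand, keeps the two from drifting apart when you add or remove nodes.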
There is no requirement that the node listed first be used; any of the nodes in the file would have been acceptable. If you elect to call your host file something other than mpd.hosts, you must use that name in the command instead of mpd.hosts. If your host file is named my_hosts, then:

    rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdboot -r rsh -n 4 -f $HOME/my_hosts -v"

One other key point regarding the rsh command: the -n X option tells mpdboot the number of mpd daemons to start in the ring. Always specify a value for X that is equal to the number of nodes in your host file:

- If your host file has two nodes listed, then use -n 2.
- If your host file has four nodes listed, then use -n 4.
- If your host file has six nodes listed, then use -n 6.
- If your host file has eight nodes listed, then use -n 8.

Three key rules:
If the value is greater, the batch system will assign those nodes to your job but MPI will not use them. Suppose your host file has the following six nodes in it:

Now that you have started your ring, how do you check its status?

    rsh xxxxxx "/usr/common/mpich2-1.0.1/bin/mpdtrace"

where xxxxxx is replaced by one of the nodes in the host file. Example:

    rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdtrace"

Sample output (these are the four nodes in the host file):

    argo1-1

If your output from mpdtrace matches the nodes in your host file, then ignore any errors mpdboot might have generated.

The node named in the rsh command must be one of the nodes in the host file. The following is incorrect because node argo2-1 is not in the sample host file:

    rsh argo2-1 "/usr/common/mpich2-1.0.1/bin/mpdtrace"

Each of the following is correct:

    rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdtrace"
    rsh argo1-2 "/usr/common/mpich2-1.0.1/bin/mpdtrace"
    rsh argo1-3 "/usr/common/mpich2-1.0.1/bin/mpdtrace"
    rsh argo1-4 "/usr/common/mpich2-1.0.1/bin/mpdtrace"

Step 2) Start your program using mpiexec

Mpiexec SHOULD NOT be run from the command line; put it into a script file. Below are two example script files, called x1 and x. They show how to use scripts to run your program. The two scripts work together; one will not work without the other. The first one, x1 (highlighted in blue), contains three lines; the second, x (highlighted in green), two:
The first script does three things, one per line:
Now look at the second script. It does two things:
Some IMPORTANT things to remember about the scripts:
mpd.hosts1: mpd.hosts2 mpd.hosts3 argo1-1 argo2-2 argo3-3 argo2-1 argo3-2 argo1-4 argo3-1 argo1-3 argo2-4 argo1-2 argo2-3 argo3-4 argo4-1 argo4-2
To run the script, type:

    ./x

Step 3) Stop the mpd ring

Once your scripts have completed executing, you should stop the mpd ring. Execute the following command from the argo command prompt:

DO NOT leave your ring running after your job completes:
The process on the first node in the host file is the coordinating process. Suppose that, for whatever reason, the p1 process (on the coordinating node) dies but the other processes (p2, p3, and p4) continue running, now as orphans. Because your coordinating process has ended, your MPI job is removed from batch and will not appear when you run the batch qstat command. Your environment appears to be clean but it is not; you still have running processes, and there are consequences.

If you bring down your ring, the orphans will be terminated. But suppose you elect not to bring down the ring because stopping and starting your ring involves two extra steps. Then you submit another job (job #2). The coordinating process for the job will, like its predecessor, run on argo1-1. But because there are processes ALREADY running on the other nodes in the ring, the coordinating process will not assign work to them; it will assign work only to free nodes. Since the coordinating node is the only free node, all four processes run on it, which will have major (NEGATIVE) performance consequences. Always bring down your ring. |
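One way to confirm what the ring looks like before you tear it down is to compare saved mpdtrace output with your host file, ignoring order. In the sketch below the function name same_nodes, the file names, and the sample trace contents are hypothetical. The final line prints a shutdown command that assumes mpdallexit, the usual MPICH2 command for stopping a ring, is installed alongside mpdboot; this page does not confirm that.

```shell
# Succeeds when both files list the same nodes, ignoring order.
same_nodes() {
  [ "$(sort "$1")" = "$(sort "$2")" ]
}

ring_dir=$(mktemp -d)
printf '%s\n' argo1-1 argo1-2 argo1-3 argo1-4 > "$ring_dir/mpd.hosts"

# Hypothetical saved mpdtrace output. On argo it would come from:
#   rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdtrace" > trace.out
printf '%s\n' argo1-3 argo1-1 argo1-4 argo1-2 > "$ring_dir/trace.out"

if same_nodes "$ring_dir/mpd.hosts" "$ring_dir/trace.out"; then
  ring_status="ring matches host file"
else
  ring_status="ring differs from host file"
fi
echo "$ring_status"

# Assumed shutdown command; printed here, not executed.
echo 'rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdallexit"'
```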
||||||||||||
| Advanced Job Control | ||||||||||||
|
The examples presented assume:
However, argo is not a homogeneous environment. Some nodes have a single CPU while others have two CPUs; future nodes will have dual core, dual CPUs. Click here to see the architecture of each compute node.
There is no restriction that each node in a host file have the same number of
processors as the others. For example, your host file could have one node
having a single CPU and three others having two CPUs.
What follows is a presentation on how to run your MPI job on nodes having multiple CPUs or multiple cores. There are two approaches to using multiple CPUs:
The Basic Method

The basic method involves just a slight change to what has already been presented. It lets you decide which nodes are used by means of the number and type of nodes you place in your host file as well as how many processes you elect to spawn. Some examples will clarify. Assume for the sake of the examples that each of the nodes you elected to place in your host file has two CPUs. That means that there are a total of eight physical CPUs available, a key fact to remember. For purposes of continuity, the host file will contain four nodes, argo1-1 through argo1-4. Those are the same nodes used in all previous examples, though those examples made it appear as if they were single processor machines.

Example 1 - fewer processes than CPUs
If you spawn four processes by specifying -n 4 on the mpiexec command, then each of the processes is started on ONE of the two CPUs on the four nodes. The MPI daemon will put a process on each node since the number of processes matches the number of available nodes. But because the total of available CPUs on the four nodes exceeds the number of processes (eight CPUs and four processes), which CPU is selected is not something you can dictate; that decision rests with the MPI daemon and the scheduler. The four processes are identified as p1, p2, p3, and p4:
Process p2 starts its execution on CPU2 on argo1-2
Process p3 starts its execution on CPU2 on argo1-3
Process p4 starts its execution on CPU1 on argo1-4

Example 2 - processes equal the number of CPUs

If you spawn eight processes by specifying -n 8 on the mpiexec command, then a single process is started on each CPU on the four nodes. That's because the number of available CPUs (8) equals the number of processes (8).
For example purposes only, the processes and the nodes were ordered:
process p2 on the second CPU on argo1-1
process p3 on the first CPU on argo1-3
... and so on
Example 3 - more processes than CPUs

If you spawn ten processes by specifying -n 10 on the mpiexec command, then the number of processes exceeds the number of available CPUs (ten processes and eight CPUs). Six of the eight CPUs will have a single process and the other two will each have two processes.
The two overlapping processes, in this case p9 and p10, execute on the same node (argo1-1). That is not a requirement; the scheduler and the MPI daemon could have selected different nodes for the two (for example, p9 executing on CPU1 on node argo1-1 while p10 executes on CPU2 on argo1-3).
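The arithmetic behind the three examples can be sketched as follows. The function name spread is hypothetical, and the sketch assumes processes are spread as evenly as possible across CPUs; it counts how many CPUs carry how many processes and does not predict which CPU or node a given process lands on.

```shell
# Print how many CPUs carry how many processes, assuming an even spread.
# $1 = number of processes (the mpiexec -n value), $2 = total CPUs in the ring.
spread() {
  procs=$1
  cpus=$2
  base=$((procs / cpus))    # processes every CPU gets
  extra=$((procs % cpus))   # CPUs that get one more than the base
  echo "$((cpus - extra)) CPUs run $base process(es), $extra CPUs run $((base + 1))"
}

spread 4 8    # Example 1: fewer processes than CPUs
spread 8 8    # Example 2: processes equal the number of CPUs
spread 10 8   # Example 3: more processes than CPUs
```

For Example 3 this reproduces the count given above: six CPUs with a single process and two CPUs with two processes each.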
The Advanced Method

The advanced method permits you to control how many processes run on each node. In order to have that level of control, a new file is introduced. The name of the new file is left to you but, for example purposes herein, the file will be called mpich2.config. Entries identify the node to use, the number of processes to run on the node, and the program to execute. The config file does not replace the host file; it is in addition to it. The generic syntax of an entry in the config file is:
A sample file:
-host argo1-2 -np 2 $HOME/mpich2_test/merge_mpi
-host argo1-3 -np 2 $HOME/mpich2_test/merge_mpi
-host argo1-4 -np 1 $HOME/mpich2_test/merge_mpi

Now, the sample config file should be easy to understand. Each line indicates the number of processes to execute on each node. However, the syntax of the mpiexec command, as presented in all previous examples, is no longer applicable:
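A minimal sketch of working with such a config file follows. The argo1-1 entry with -np 1 is an inference from the six-process total discussed in this section, not something shown in the sample. The last line assumes the MPICH2 mpiexec option -configfile is how the file is passed on argo; treat it as an assumption and confirm it locally. The file is written to a temporary directory so the sketch can run anywhere.

```shell
cfg_dir=$(mktemp -d)

# Quoted heredoc: $HOME is written literally, as in the sample entries.
cat > "$cfg_dir/mpich2.config" <<'EOF'
-host argo1-1 -np 1 $HOME/mpich2_test/merge_mpi
-host argo1-2 -np 2 $HOME/mpich2_test/merge_mpi
-host argo1-3 -np 2 $HOME/mpich2_test/merge_mpi
-host argo1-4 -np 1 $HOME/mpich2_test/merge_mpi
EOF

# Sum the value that follows each -np to get the total process count.
total_procs=$(awk '{ for (i = 1; i < NF; i++) if ($i == "-np") sum += $(i + 1) }
                   END { print sum + 0 }' "$cfg_dir/mpich2.config")
echo "total processes: $total_procs"

# Assumed invocation; printed here, not executed.
echo "mpiexec -configfile \$HOME/mpich2.config"
```

Summing the -np values is a quick check that the config file asks for the number of processes you intend before the job is submitted.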
The config results in six processes (labelled p1, p2, p3, p4, p5, and p6). If the nodes in the host file are single-CPU machines (they are not but, for example purposes, assume they are), then the following shows which nodes execute one process and which execute two:
If the four nodes did have two CPUs, then the following occurs:
The p1 process on argo1-1 is shown as executing on CPU1 and the p6 process is shown to be on CPU2, argo1-4. There is no way to know to which CPU the process will be initially assigned. And, as discussed in the Basic Section, processes are constantly swapped out and swapped back in; there is no guarantee that a process swapped back in will return to the same CPU it was executing on before being swapped out (though a process must return to the same node).

Some things to note

Some things to remember when using either the Basic or Advanced Method:
|
||||||||||||
| Help | ||||||||||||
|
||||||||||||
| 2007-6-5 ACCC Systems Group |
|