ACADEMIC COMPUTING and COMMUNICATIONS CENTER
 

MPICH2 on Argo

   
 
     
Overview
 

MPICH2 is a library from Argonne National Laboratory (http://www.anl.gov) that implements the MPI-2 standard. MPI (Message Passing Interface) is a library specification: it defines a set of functions, callable from Fortran or C, that programs use to achieve parallelism. An MPI function lets one process communicate with another by sending data (messages).

 
     
Configuring your environment
 

You will make two changes: one to the PATH variable, the other to the LD_LIBRARY_PATH variable. The former is mandatory; the latter is optional, depending on the compiler you use. If your login shell is the C shell, make the changes in the .cshrc file. If you use bash, make them in the .bash_profile file. To see which shell is your default, enter:

echo $SHELL

If the output is /bin/bash, you are using the bash shell; if /bin/csh, then the C shell.

For a bash shell user, append the following in .bash_profile:

export MPICH2_HOME=/usr/common/mpich2-1.0.1
export PATH=$MPICH2_HOME/bin:$PATH

export LD_LIBRARY_PATH=/usr/common/mpich2-1.0.1/lib:$LD_LIBRARY_PATH

For a C shell user, append the following in your .cshrc:

setenv MPICH2_HOME /usr/common/mpich2-1.0.1
setenv PATH /usr/common/mpich2-1.0.1/bin:$PATH

setenv LD_LIBRARY_PATH /usr/common/mpich2-1.0.1/lib:$LD_LIBRARY_PATH

Note: If you use the supported Portland Group compilers instead of the GNU compilers, then the change to the LD_LIBRARY_PATH is unnecessary because you will specify the appropriate library path with the -L option in the compile statement. See the example under Compiling and Linking.
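The bash settings above can also be written so that logging in or re-sourcing .bash_profile several times does not keep lengthening PATH. This is a minimal sketch, not part of MPICH2: the helper name prepend_path is hypothetical, and the install prefix is the one used on this page.

```shell
# Hypothetical helper: prepend DIR to the colon-separated variable named VAR,
# unless DIR is already one of its entries.
prepend_path() {
    var=$1 dir=$2
    eval "current=\${$var:-}"            # read the variable whose name is in $var
    case ":$current:" in
        *":$dir:"*) ;;                   # already present: nothing to do
        *) eval "$var=\$dir\${current:+:\$current}"
           export "$var" ;;
    esac
}

export MPICH2_HOME=/usr/common/mpich2-1.0.1
prepend_path PATH "$MPICH2_HOME/bin"
prepend_path LD_LIBRARY_PATH "$MPICH2_HOME/lib"   # needed for the GNU compilers only
```

Calling prepend_path a second time with the same directory leaves the variable unchanged.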

 
     
Compiling and Linking
 

The following example illustrates compiling using the Portland Group C compiler:

pgcc -o mpihello testmpi.c -D REENTRANT -L/usr/common/mpich2-1.0.1/lib -I/usr/common/mpich2-1.0.1/include/ -lmpich -Wl -t

The -L/usr/common/mpich2-1.0.1/lib is required when using Portland Group compilers; without it, the compiler will default to MPICH version 1 libraries and not version 2.

The -I/usr/common/mpich2-1.0.1/include/ identifies the location of the header files.

The optional -Wl -t displays the libraries used (warning: the list is long):

/usr/bin/ld: mode elf_i386
/usr/lib/crt1.o
/usr/lib/crti.o
/usr/lib/gcc-lib/i386-redhat-linux/2.96/crtbegin.o
/tmp/pgccbaaaanksab.o
(/usr/common/mpich2-1.0.1/lib/libmpich.a)comm_rank.o
(/usr/common/mpich2-1.0.1/lib/libmpich.a)comm_size.o
(/usr/common/mpich2-1.0.1/lib/libmpich.a)commutil.o
(/usr/common/mpich2-1.0.1/lib/libmpich.a)errutil.o
(/usr/common/mpich2-1.0.1/lib/libmpich.a)grouputil.o
(/usr/common/mpich2-1.0.1/lib/libmpich.a)init.o
(/usr/common/mpich2-1.0.1/lib/libmpich.a)initthread.o
...

The following scripts are another way to compile and link your program:

Script    Language
mpicc     C
mpicxx    C++
mpif77    Fortran 77
mpif90    Fortran 90

Each script invokes the appropriate GNU compiler.
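To compare the two build routes side by side, the sketch below prints (without running) the Portland Group command from this section and the equivalent wrapper-script command. The helper name show_build_cmds is hypothetical, and the file names are only examples.

```shell
# Hypothetical helper: print (do not run) the two ways of building one C file.
# MPICH2_HOME defaults to the install prefix used on this page.
show_build_cmds() {
    src=$1 out=$2
    home=${MPICH2_HOME:-/usr/common/mpich2-1.0.1}
    # Portland Group compiler: library and header paths are given explicitly.
    echo "pgcc -o $out $src -D REENTRANT -L$home/lib -I$home/include -lmpich"
    # Wrapper script: it supplies the MPICH2 header and library options itself.
    echo "$home/bin/mpicc -o $out $src"
}

show_build_cmds testmpi.c mpihello
```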

 
     
Running
 

Executing an MPICH2 job takes three basic steps:

  1. Bring up an mpd ring,
  2. Start your MPI program (e.g., hello world) using mpiexec, and
  3. Bring down the mpd ring after your program terminates.

Steps one and three are executed from the argo command line; step two is executed from a script.

An mpd ring is a group of MPI daemons that constitutes the environment in which your MPI job runs. More on this in a moment; first, create two files in your home directory:

  1. The daemon configuration file called .mpd.conf, and
  2. A host file.

The configuration file is required and must be named .mpd.conf. It contains a secretword, known only to the daemons in the ring. That word permits the daemons to recognize one another and thus communicate. Don't forget the leading period in the name.

To create the file:

cd
touch .mpd.conf
chmod 600 .mpd.conf

Next, put the following single line in it:

secretword=XXXXXXXXXX

The keyword secretword= is required. Don't enter the string of Xs after the keyword; replace them with your own secret word (DON'T use your usual password). Examples:

secretword=ihatecomputers
secretword=chicagocubs
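The three commands and the one-line edit can be combined. The sketch below assumes a hypothetical helper named make_mpd_conf; it writes the secretword line and restricts the file to its owner in one call.

```shell
# Hypothetical helper: create DIR/.mpd.conf containing one secretword line,
# readable and writable by the owner only (mode 600).
make_mpd_conf() {
    dir=$1 secret=$2
    conf="$dir/.mpd.conf"
    # umask 077 keeps a newly created file private from the moment it exists.
    ( umask 077 && printf 'secretword=%s\n' "$secret" > "$conf" ) || return 1
    chmod 600 "$conf"                    # also fixes a file that already existed
}

# Usage (pick your own secret word):
#   make_mpd_conf "$HOME" chicagocubs
```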

The second file is a host file and it identifies the nodes to use in your mpd ring. There is no required name for the file; the name you give it is up to you. If you need only a single host file (more on having multiple host files later), then name it mpd.hosts. To create it:

cd
touch mpd.hosts
chmod 644 mpd.hosts

The host file contains those nodes that will be used by BOTH the batch system and MPI to execute your job. For example, to use nodes argo1-1 through argo1-4, put the following in your mpd.hosts file:

argo1-1
argo1-2
argo1-3
argo1-4

You are not required to use the four nodes listed; they are used for example purposes. And, you are not limited to just four nodes; you may use fewer than four or more than four, up to a maximum of sixteen nodes. Each node should be on a separate line with no blank lines included.

Very important: NEVER have ALL 64 nodes in a host file. Why? The batch system will allocate all 64 compute nodes to your job even if you limit the number of MPI processes to a subset of the sixty four nodes by means of the -n option on mpiexec.
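The host file rules above (one node per line, no blank lines, at most sixteen nodes) are easy to check before starting a ring. This is a sketch only; check_hosts is a hypothetical helper, not an MPICH2 command.

```shell
# Hypothetical helper: validate a host file against the rules above.
# Prints the node count on success; prints a reason to stderr and returns 1 otherwise.
check_hosts() {
    file=$1 max=${2:-16}
    if grep -q '^[[:space:]]*$' "$file"; then
        echo "blank line in $file" >&2
        return 1
    fi
    count=$(grep -c '[^[:space:]]' "$file")   # number of non-blank lines
    if [ "$count" -gt "$max" ]; then
        echo "$count nodes in $file exceeds the maximum of $max" >&2
        return 1
    fi
    echo "$count"
}

# Usage:
#   check_hosts "$HOME/mpd.hosts"
```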

Step 1) Bring up an mpd ring:

The following command, executed at the command line and not from a script, starts your ring (each job must have its own ring):

rsh xxxxxx "/usr/common/mpich2-1.0.1/bin/mpdboot -r rsh -n 4 -f $HOME/mpd.hosts -v"
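
What follows the rsh command, the xxxxxx, MUST be one of the nodes in your host file. For example, using the first node in the sample host file:

rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdboot -r rsh -n 4 -f $HOME/mpd.hosts -v"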

There is no requirement that the node listed first be used; any of the nodes in the file would have been acceptable.

If you elect to call your host file something other than mpd.hosts, you must use that name in the command instead of mpd.hosts. If your host file is named my_hosts, then:

rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdboot -r rsh -n 4 -f $HOME/my_hosts -v"

One other key point regarding the rsh command: the -n X option tells mpdboot the number of mpd daemons to start in the ring. Always specify a value for X that is equal to the number of nodes in your host file:

  • If your host file has two nodes listed, then use -n 2.
  • If your host file has four nodes listed, then use -n 4.
  • If your host file has six nodes listed, then use -n 6.
  • If your host file has eight nodes listed, then use -n 8.

Three key rules:

  • The maximum number for -n is 16.
  • Leaving your ring running after the completion of your job may negatively impact the performance of subsequent jobs (step three shows you how to bring down your ring).
  • DO NOT USE A VALUE FOR -n that is less than or greater than the number of nodes in your host file.

    If the value is less, the batch system will still assign every node in the host file to your job, but MPI will not use the extra ones. Suppose your host file has the following six nodes in it:

       argo1-1
       argo1-2
       argo1-3
       argo1-4
       argo2-1
       argo2-2

    If you specify 4 as the value to -n, MPI will use the first four nodes in your hosts file (argo1-1, argo1-2, argo1-3, and argo1-4). The remaining two nodes (argo2-1 and argo2-2) will be assigned to your job and will remain allocated to your job until your job completes, but no MPI process will run on them. A waste of two resources.

    If the value is greater, then your job will not run and you will get the message:

       there are not enough hosts on which to start all processes
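One way to keep -n and the host file in agreement is to derive the number from the file. The sketch below only prints the command so you can inspect it before running it; mpdboot_cmd is a hypothetical helper, and it picks the first node in the file as the rsh target although any listed node would do.

```shell
# Hypothetical helper: print (do not run) the mpdboot command for a host file,
# with -n set to the number of nodes listed and the first node as the rsh target.
mpdboot_cmd() {
    hostfile=$1
    n=$(grep -c '[^[:space:]]' "$hostfile")   # count non-blank lines
    first=$(head -n 1 "$hostfile")
    printf 'rsh %s "/usr/common/mpich2-1.0.1/bin/mpdboot -r rsh -n %s -f %s -v"\n' \
        "$first" "$n" "$hostfile"
}

# Usage:
#   mpdboot_cmd "$HOME/mpd.hosts"
```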

Now that you have started your ring, check its status with:

rsh xxxxxx "/usr/common/mpich2-1.0.1/bin/mpdtrace"

where xxxxxx is replaced by one of the nodes in the host file. Example:

rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdtrace"

Sample output (these are the four nodes in the host file):

argo1-1
argo1-4
argo1-2
argo1-3

If your output from mpdtrace matches the nodes in your host file, then ignore any errors mpdboot might have generated.

A common mistake is to check the status of the ring via a node not in your host file. For example:

rsh argo2-1 "/usr/common/mpich2-1.0.1/bin/mpdtrace"

Node argo2-1 is not in the sample host file. Each of the following is correct:

rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdtrace"
rsh argo1-2 "/usr/common/mpich2-1.0.1/bin/mpdtrace"
rsh argo1-3 "/usr/common/mpich2-1.0.1/bin/mpdtrace"
rsh argo1-4 "/usr/common/mpich2-1.0.1/bin/mpdtrace"
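Because mpdtrace lists the nodes in no particular order, comparing its output with the host file by eye is error-prone once the ring is large. The sketch below compares the two after sorting. It assumes mpdtrace prints node names in the same form as the host file, as in the sample output above; ring_matches is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the node names in TRACEFILE (saved mpdtrace
# output) are exactly the nodes in HOSTFILE, in any order.
ring_matches() {
    hostfile=$1 tracefile=$2
    [ "$(sort "$hostfile")" = "$(sort "$tracefile")" ]
}

# Usage:
#   rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdtrace" > trace.out
#   ring_matches "$HOME/mpd.hosts" trace.out && echo "ring matches host file"
```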

Step 2) Start your program using mpiexec

Mpiexec SHOULD NOT BE run from the command line; put it into a script file. Below are two example script files, called x1 and x. They will show how to use scripts to run your program. The two scripts work together; having one without the other will not work.

The first one, x1, contains three lines; the second, x, two.

Script x1:

#!/bin/csh
set nodes = `perl -e 'while (<STDIN>){chop;$a.="$_+"}chop($a);print $a;' < $HOME/mpd.hosts`
qsub -l nodes=$nodes <./x

Script x:

#!/bin/csh
mpiexec -n 4 $HOME/mpi/mpihello_pgcc_mpi2libs

The first script does three things, one per line:

  • The first line (#!/bin/csh) starts a separate shell, in this case a C Shell, under which the next two commands in the script are executed.
  • The second line (set nodes = ...) parses the contents of your host file and sets the variable, named nodes, to the results of the parsing.
  • The third line (qsub -l ...) invokes the qsub command to run the second script, the one called x.

Now look at the second script. It does two things:

  • The first line (#!/bin/csh) starts a separate shell under which the next command, the mpiexec, is executed.
  • The second line (mpiexec -n 4 ...) executes your program in the MPI environment.
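For bash users, the host-file parsing done by the perl one-liner in x1 can be sketched with the standard paste utility. The helper name join_hosts is hypothetical; the qsub line is shown only as a comment because running it submits a job.

```shell
# Hypothetical helper: join the lines of a host file with "+",
# producing the node list that qsub -l nodes= expects.
join_hosts() {
    paste -s -d+ "$1"
}

# Bash equivalent of the body of x1 (commented out because it submits a job):
#   nodes=$(join_hosts "$HOME/mpd.hosts")
#   qsub -l nodes="$nodes" < ./x
```

With the sample host file, join_hosts prints argo1-1+argo1-2+argo1-3+argo1-4.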

Some IMPORTANT things to remember about the scripts:

  • You are not required to use the names x and x1. They were selected just for example purposes; you may call your scripts anything you want. If you rename the second script to something other than x, then change the qsub statement in your first script to reference the new name. For example, if you rename the first script to run1 and the second script to runit, then change the qsub statement to:

      qsub -l nodes=$nodes <./runit
  • Some VERY IMPORTANT things:
    • Note the relationship among the number of nodes in host file, the number of mpd daemons in the ring, the "set nodes =" statement in the x1 script, and the operand to the -n option in mpiexec:

      • There are four nodes in the example host file (argo1-1, argo1-2, argo1-3, and argo1-4).
      • The number of mpd daemons started in the ring is four, indicated by the -n 4 in the rsh command and equal to the number of nodes in the host file.
      • The "set nodes =" statement, which tells the batch system how many nodes to allocate to the job (see the sample x1 script), will be set to:

          set nodes = argo1-1+argo1-2+argo1-3+argo1-4

      • The number of MPI processes, as indicated by the -n 4 in the mpiexec, is four.

      The "set nodes" statement parses the contents of the host file. If you have all 64 compute nodes in the file, then the statement is constructed with all 64 nodes. It is the "set nodes" statement that tells batch how many nodes to assign to your job. That is why you should never have more than 16 nodes, the MAXIMUM, in the file.
    • Do not use the same host file for multiple concurrent jobs. If you want to run a second MPI job, use a different host file and specify hosts that are not in the first host file. The same is true for a third host file - place nodes in it not listed in the first two host files. For example, the following are three separate host files, the first two with four nodes each and the third with six:

       mpd.hosts1    mpd.hosts2    mpd.hosts3
       argo1-1       argo2-2       argo3-3
       argo2-1       argo3-2       argo1-4
       argo3-1       argo1-3       argo2-4
       argo1-2       argo2-3       argo3-4
                                   argo4-1
                                   argo4-2
    

    If you use multiple host files for concurrent MPI jobs, then rings must be started on each:

    rsh argo1-1 "/usr/common/mpich2-1.0.1/bin/mpdboot -r rsh -n 4 -f $HOME/mpd.hosts1 -v"
    rsh argo2-1 "/usr/common/mpich2-1.0.1/bin/mpdboot -r rsh -n 4 -f $HOME/mpd.hosts2 -v"
    rsh argo3-1 "/usr/common/mpich2-1.0.1/bin/mpdboot -r rsh -n 6 -f $HOME/mpd.hosts3 -v"
    

    Notice that the first two commands use all the nodes from their respective host files, mpd.hosts1 and mpd.hosts2; hence the 4 after the -n option in both commands. The third ring uses the six nodes in mpd.hosts3, thus the -n 6 in the last command. As was stated, do not have the same node or nodes in multiple host files; the following is NOT acceptable:

    mpd.hosts1    mpd.hosts2    mpd.hosts3
    argo1-1       argo1-1       argo1-1
    argo1-2       argo1-2       argo1-2
    argo1-3       argo1-3       argo3-1
    argo1-4       argo1-4       argo3-2
                                argo3-3
                                argo3-4
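A quick way to catch a node that appears in more than one host file is to concatenate the files and look for repeated lines. This is a sketch using standard utilities; shared_nodes is a hypothetical helper.

```shell
# Hypothetical helper: print every node that appears more than once across the
# given host files; prints nothing when no node is shared.
shared_nodes() {
    cat "$@" | sort | uniq -d
}

# Usage:
#   shared_nodes "$HOME/mpd.hosts1" "$HOME/mpd.hosts2" "$HOME/mpd.hosts3"
```

Empty output means the host files are safe to use for concurrent jobs.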
    

    • You must restart your ring for any changes to your host file to be reflected in the ring. And, you can't dynamically change the nodes in a currently-running ring. Two examples will clarify.

        Example 1: You elect to leave your ring up between job submissions. Your host file, mpd.hosts1 (see above), had four nodes in it when you started the ring prior to submitting job number one. The job completes but you leave the ring up. You make changes to the mpd.hosts1 file, adding nodes argo16-1 and argo16-2 to it. You submit job number two. Your ring will contain and use ONLY the original four nodes and not six. Changes to a host file are NOT made to a ring that is UP. You must bounce the ring - bring it down and bring it up again - for those changes to be part of your ring. Wait until your job terminates and stop the ring. Then, before submitting your next job, restart the ring.

        Example 2: Your ring is up and your MPI job is running. The ring has the four nodes from mpd.hosts1 in it. While your job is running, you want to add two additional nodes to the running ring. You can't do it. You must bring the ring down. Bringing a ring down while a job is running terminates the job.

    • Do not use a mixed-mode environment. The examples assume your login shell is the C shell. If you are using bash, then you must make some changes to both scripts. In your two scripts, use the same shell as your login shell.

    To run the scripts, type: ./x1

    Step 3) Stop the mpd ring

    Once your scripts have completed executing, you should stop the mpd ring. Execute the following command from the argo command prompt:

    rsh argo1-1 /usr/common/mpich2-1.0.1/bin/mpdallexit

    DO NOT leave your ring running after your job completes:
    • A running ring not used by a currently-running MPI job wastes resources that could be used by others.
    • Any changes to your hosts file will not be reflected in a running ring.
    • Rings left running after your job completes may negatively impact the performance of your future MPI jobs.

    Suppose you have a running job (call it job #1) with a single process on each of the four nodes listed in the example host file: p1 on argo1-1, p2 on argo1-2, p3 on argo1-3, and p4 on argo1-4.

    The process on the first node in the host file is the coordinating process. For whatever reason, the p1 process (on the coordinating node) dies but the other processes (p2, p3, and p4) continue running, now as orphans. As a result of your coordinating process ending, your MPI job is removed from batch and will not appear when you run the batch qstat command. Your environment appears to be clean but it is not; you still have running processes and there are consequences. If you bring down your ring, the orphans will be terminated.

    But suppose you elect not to bring down the ring because stopping and starting your ring involves two extra steps. Then, you submit another job (job #2). The coordinating process for the job will, like its predecessor, run on argo1-1. But, because there are processes ALREADY running on the other nodes in the ring, the coordinating process will not assign work to them; it will assign work only to free nodes. Since the coordinating node is the only free node, all four processes run on it, an action which will have major (NEGATIVE) performance consequences. Always bring down your ring.
    Advanced Job Control
     
    The examples presented assume:
    • Argo is a homogeneous environment - each machine has the same number of processors as the others, and
    • Each node listed in a host file has the same number of processors as the other nodes.
    And, as discussed, an MPI job is split into instances (processes) where each instance runs on ONE node. The sample host file had four nodes (argo1-1, argo1-2, argo1-3, and argo1-4) and a job was initiated with -n 4 on the mpiexec command. The configuration resulted in four processes (labelled p1, p2, p3, and p4), each running on a single CPU (labelled cpu1) on one of the four nodes.

    However, argo is not a homogeneous environment. Some nodes have a single CPU while others have two CPUs; future nodes will have dual-core, dual CPUs.

    There is no restriction that each node in a host file have the same number of processors as the others. For example, your host file could have one node having a single CPU and three others having two CPUs.

    The actual argo1-1 machine is, in fact, a dual-CPU machine, as are argo1-2, argo1-3, and argo1-4.

    What follows is a presentation on how to run your MPI job on nodes having multiple CPUs or multiple cores. There are two approaches to using multiple CPUs:

    • a basic way which is easy to use but gives you less control, and
    • a more advanced method, giving you more control but requiring more work.

    The Basic Method

    The basic method involves just a slight change to what has already been presented. It permits you to decide which nodes are used by means of the number and type of nodes you place in your host file as well as how many processes you elect to spawn. Some examples will clarify. Assume for the sake of the examples that each of the nodes you elected to place in your host file has two CPUs. That means that there are a total of eight physical CPUs available, a key fact to remember. For purposes of continuity, the host file will contain four nodes, argo1-1 through argo1-4. Those are the same nodes used in all previous examples though those examples made it appear as if they were single processor machines.

    Example 1 - fewer processes than CPUs

    If you spawn four processes by specifying -n 4 on the mpiexec command, then each of the processes is started on ONE of the two CPUs on the four nodes. The MPI daemon will put a process on each node since the number of processes matches the number of available nodes. But, because the total of available CPUs on the four nodes exceeds the number of processes (eight CPUs and four processes), which CPU is selected is not something you can dictate; that decision rests with the MPI daemon and the scheduler. The four processes are identified as p1, p2, p3, and p4. For example:

      Process p1 starts its execution on CPU1 on argo1-1
      Process p2 starts its execution on CPU2 on argo1-2
      Process p3 starts its execution on CPU2 on argo1-3
      Process p4 starts its execution on CPU1 on argo1-4

    During program execution, your processes will be constantly swapped out and swapped back in; that's just the way it is in a multitasking, multiprogramming environment. However, there is no guarantee that a process swapped back in will return to the same CPU it was executing on before being swapped out (though a process must return to the same node).

    Example 2 - processes equal the number of CPUs

    If you spawn eight processes by specifying -n 8 on the mpiexec command, then a single process is started on each CPU on the four nodes. That's because the number of available CPUs (8) equals the number of processes (8).

    For example purposes only, the processes and the nodes were ordered:

      process p1 on the first CPU on argo1-1
      process p2 on the second CPU on argo1-1
      process p3 on the first CPU on argo1-2
      ... and so on

    The ordering is not guaranteed; which process goes to which node is not something you can dictate. It doesn't follow that the first process MUST go to the first CPU on the first node, the second process to the second CPU on the first node, and so on. The processes could just as easily have been placed in any other order.

    Example 3 - more processes than CPUs

    If you spawn ten processes by specifying -n 10 on the mpiexec command, then the number of processes exceeds the number of available CPUs (ten processes and eight CPUs). Six of the eight CPUs will have a single process and the other two will each have two processes.

    The two overlapping processes, say p9 and p10, might execute on the same node (for example, argo1-1). That is not a requirement; the scheduler and the MPI daemon could select different nodes for the two (for example, p9 executes on CPU1 on node argo1-1 while p10 executes on CPU2 on argo1-3).
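The counts in the three examples follow from simple division. The sketch below reproduces them under the assumption that processes are spread as evenly as possible over the CPUs; it only counts, and says nothing about which CPU or node a given process lands on, since that is decided by the MPI daemon and the scheduler. The helper name spread is hypothetical.

```shell
# Hypothetical helper: given a process count, a node count, and CPUs per node,
# report how an even spread of processes over the CPUs would look.
spread() {
    procs=$1 nodes=$2 cpus_per_node=$3
    cpus=$((nodes * cpus_per_node))
    base=$((procs / cpus))        # processes that every CPU gets
    extra=$((procs % cpus))       # CPUs that get one more than base
    echo "cpus=$cpus: $((cpus - extra)) with $base process(es), $extra with $((base + 1))"
}

spread 4 4 2     # Example 1: cpus=8: 4 with 0 process(es), 4 with 1
spread 8 4 2     # Example 2: cpus=8: 8 with 1 process(es), 0 with 2
spread 10 4 2    # Example 3: cpus=8: 6 with 1 process(es), 2 with 2
```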

    The Advanced Method

    The advanced method permits you to control how many processes run on nodes. In order to have that level of control, a new file is introduced. The name of the new file is left to you but, for example purposes herein, the file will be called mpich2.config. Entries identify the node to use, the number of processes to run on the node, and the program to execute. The config file does not replace the host file; it's in addition to it. The generic syntax of an entry in the config file is:

      -host node_name -np number_of_processes path_and_name_of_program

    A sample file:
      -host argo1-1 -np 1 $HOME/mpich2_test/merge_mpi
      -host argo1-2 -np 2 $HOME/mpich2_test/merge_mpi
      -host argo1-3 -np 2 $HOME/mpich2_test/merge_mpi
      -host argo1-4 -np 1 $HOME/mpich2_test/merge_mpi
    Let's take a closer look at the syntax. The first field is a keyword, -host. The second field is the operand to the keyword -host and it is the name of one of the nodes in your host file. The third field is a keyword, -np. The fourth field is the number of processes to execute on the host named in the second field. The fifth field is the path to and name of the program to execute.

    Now, the sample config file should be easy to understand. Each line indicates the number of processes to execute on each node. However, the syntax of the mpiexec command, as presented in all previous examples, is no longer applicable:

      WRONG: mpiexec -n X path_and_name_of_my_program
        For example:
          mpiexec -n 4 $HOME/mpich2_test/merge_mpi
    Instead, the following generic format is used:
      RIGHT: mpiexec -configfile $HOME/path_to_config_file/name_of_config_file
        For example:
          mpiexec -configfile $HOME/mpich2_test/mpich2.config
    Notice the two differences:
    • The -n X is no longer used and is replaced by the keyword -configfile, and
    • the path and name of the program to execute is replaced by the path and name of the config file.
    The sample config results in six processes (labelled p1, p2, p3, p4, p5, and p6). If the nodes in the host file were single-CPU machines (they are not but, for example purposes, assume they are), then argo1-1 and argo1-4 would each execute one process while argo1-2 and argo1-3 would each execute two.

    If the four nodes did have two CPUs, then the same six processes would be spread over eight CPUs. For example, the p1 process on argo1-1 might execute on CPU1 and the p6 process on argo1-4 on CPU2. There is no way to know to which CPU a process will be initially assigned. And, as discussed in the Basic Method section, processes are constantly swapped out and swapped back in; there is no guarantee that a process swapped back in will return to the same CPU it was executing on before being swapped out (though a process must return to the same node).
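Two quick checks on a config file are sketched below: the total number of processes it requests, and any node that is in the host file but has no -host entry (such a node would be allocated to the job but left idle). Both helper names are hypothetical, and the parsing assumes the one-entry-per-line layout of the sample file.

```shell
# Hypothetical helper: sum the -np values in an mpiexec config file.
config_total_np() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "-np") total += $(i + 1) }
         END { print total + 0 }' "$1"
}

# Hypothetical helper: print nodes listed in HOSTFILE that no -host entry
# in CONFIGFILE uses.
unused_hosts() {
    hostfile=$1 configfile=$2
    awk 'FILENAME == ARGV[1] {
             for (i = 1; i < NF; i++) if ($i == "-host") used[$(i + 1)] = 1
             next
         }
         NF && !($1 in used) { print $1 }' "$configfile" "$hostfile"
}

# Usage:
#   config_total_np "$HOME/mpich2_test/mpich2.config"
#   unused_hosts "$HOME/mpd.hosts" "$HOME/mpich2_test/mpich2.config"
```

For the sample config file, config_total_np prints 6.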

    Some things to note

    Some things to remember when using either the Basic or Advanced Method:
    • If you cut and paste the config file from the Advanced Method section or the mpiexec command in the Basic Method, replace the sample program (mpich2_test/merge_mpi) with your own program;
    • You are not limited to the sample nodes (argo1-1 ... argo1-4) in the host and the config files. The four nodes were randomly selected; there is no material reason why they were selected for the examples. You may use them or you may use others. Don't limit yourself to them because they are in the examples (a VERY common mistake resulting in poor program performance; the four sample nodes will be heavily used while other nodes go unused);
    • Your ring must be brought up before you submit the mpiexec command and the ring must be brought down upon program termination (NEVER leave your ring up between program submissions);
    • Do not include nodes in your host file that you don't include in the config file (Advanced Method); it's a waste of resources, resources that could be used by somebody else; and
    • Never have more than 16 nodes in your host file.
     
         
    Help
     

    General MPICH2 information can be found at:

    http://www-unix.mcs.anl.gov/mpi/mpich2/
    http://www-unix.mcs.anl.gov/mpi/mpich2/developer.htm

    Message Passing Tutorial (VERY GOOD):

    http://www.llnl.gov/computing/tutorials/mpi/

    Message Passing Interface Forum:

    http://www.mpi-forum.org/

    MPI-2: Extensions to the Message-Passing Interface:

    http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html

     


    2007-6-5  ACCC Systems Group