ACADEMIC COMPUTING and COMMUNICATIONS CENTER
High Performance Computing: The ACCC Cluster Argo-new

ARGO-NEW: Running Jobs

   
   
 
     
Overview
 

How one runs a program on a cluster is VERY DIFFERENT from how one runs a job on a single machine with one or multiple CPUs (for example, tigger).

To begin with, you do not run your executable on the machine (the master) where you create the executable. The following point cannot be emphasized enough:

    The master node is ONLY for job creation and not job execution.
Compute nodes RUN user jobs - THAT'S ALL THEY DO. When you want to run a job, you submit it to torque using the qsub command (more on qsub below). Torque then runs your program on one or more compute nodes.

There are monitors that alert the systems staff to user programs running on the master. Running a program on the master is a violation of ACCC policy and can result in suspension and termination of your argo account.

There are two types of programs that may be executed on the cluster:

  • Sequential (also known as serial)
  • Parallel

For the purposes of the ACCC cluster, a sequential job is a single instance program that runs on one and only one node.

A parallel job is composed of:

  • Multiple instances of the same program running on different nodes with no internodal communication among the program instances (unfortunately, this model of execution is sometimes derisively referred to as embarrassingly parallel), or
  • One or multiple instances of different programs running on different nodes with internodal communication among the instances.

Serial version of the classic hello_world program - source in C

#include <stdio.h>
/* main must return int in standard C; 0 reports success. */
int main(int argc, char **argv) {
    printf("Hello-world\n");
    return 0;
}

Parallel version of the classic hello_world program using MPI - source in C

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv) {
    int rank;   /* this process's id within MPI_COMM_WORLD */
    int size;   /* total number of processes in the job */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello-world, I'm rank %d; Size is %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
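
To show how the parallel program above might be handed to torque, here is a rough sketch that writes a small job script. Treat the details as assumptions, not as the ACCC's recommended recipe: the executable name hello_mpi and the file name hello_mpi.pbs are made up for illustration, and the sketch assumes the cluster's MPICH mpirun accepts the -np and -machinefile options.

```shell
#!/bin/sh
# Write an illustrative torque job script for the MPI hello_world program.
# Assumptions: the executable is named hello_mpi, and the cluster's MPICH
# mpirun accepts -np and -machinefile.
cat > hello_mpi.pbs <<'EOF'
#!/bin/sh
#PBS -N hello_mpi
#PBS -l nodes=4
cd "$PBS_O_WORKDIR"
# Start four MPI processes on the nodes that torque assigned to this job.
mpirun -np 4 -machinefile "$PBS_NODEFILE" ./hello_mpi
EOF

# Submit it from the master node (do not run it there) with:
#   qsub hello_mpi.pbs
cat hello_mpi.pbs
```

The script only creates and displays the file; the qsub line is left as a comment so nothing is executed on the master.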

 
     
Torque
 

Torque is a networked subsystem for submitting, monitoring, and controlling a workload of jobs on the cluster. ALL USER JOBS MUST be run via torque.

Years ago, only batch jobs could execute on the cluster. THAT IS NOT THE CASE NOW: torque does not restrict jobs to batch execution; interactive jobs, including programs with GUIs that users interact with, may also be run.

 
     
Queues
 

Jobs (programs) are submitted to queues for execution. There is one available queue (others may be added if the need arises):

  • batch
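
A job is usually described in a short shell script whose #PBS comment lines carry the qsub options, including the queue. The following is a minimal sketch for a sequential job; the executable name hello and the file name hello.pbs are hypothetical.

```shell
#!/bin/sh
# Write a minimal torque job script for a sequential program.
# The names hello and hello.pbs are illustrative only.
cat > hello.pbs <<'EOF'
#!/bin/sh
#PBS -N hello
#PBS -q batch
#PBS -l nodes=1
# Jobs normally start in your home directory, so move to where qsub was run.
cd "$PBS_O_WORKDIR"
./hello
EOF

# Submit it from the master node with:
#   qsub hello.pbs
cat hello.pbs
```

Because batch is the only queue at present, the -q line is probably redundant, but it shows where the queue name would go if other queues are added.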
 
     
Environmental Variables
 

There are two environments, each with its own defined variables, available to you:

  • Your shell environment variables, and
  • torque environmental variables

Shell environmental variables

To see a list of your shell environmental variables, type env | more at your shell prompt. To pass ALL the variables (not just a subset) to your job, include the -V option on the qsub command.

Torque environmental variables

Every user job has the following torque environmental variables available to it:

Variable         Description
PBS_ENVIRONMENT  Set to PBS_BATCH to indicate that the job is a batch job; otherwise, set to PBS_INTERACTIVE to indicate that the job is a PBS interactive job.
PBS_JOBID        The job identifier assigned to the job by the batch system.
PBS_JOBNAME      The jobname supplied by the user.
PBS_NODEFILE     The name of the file that contains the list of the nodes assigned to the job.
PBS_QUEUE        The name of the queue from which the job is executed.
PBS_O_HOME       The value of the HOME variable in the environment in which qsub was executed.
PBS_O_LANG       The value of the LANG variable in the environment in which qsub was executed.
PBS_O_LOGNAME    The value of the LOGNAME variable in the environment in which qsub was executed.
PBS_O_PATH       The value of the PATH variable in the environment in which qsub was executed.
PBS_O_MAIL       The value of the MAIL variable in the environment in which qsub was executed.
PBS_O_SHELL      The value of the SHELL variable in the environment in which qsub was executed.
PBS_O_TZ         The value of the TZ (time zone) variable in the environment in which qsub was executed.
PBS_O_HOST       The name of the host upon which the qsub command is running.
PBS_O_QUEUE      The name of the queue to which the job was submitted.
PBS_O_WORKDIR    The absolute path of the current working directory of the qsub command.
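
As a rough illustration of how a job script might use these variables, the sketch below prints a few of them and counts the entries in the node file. The function name job_report is made up for this example, and the fallbacks are there only so the script can also be tried outside torque, where the PBS_* variables are unset.

```shell
#!/bin/sh
# Print a short report of the torque variables visible to a job.
# Outside torque the PBS_* variables are unset, so each one gets a fallback.
job_report() {
    echo "job id:    ${PBS_JOBID:-none}"
    echo "job name:  ${PBS_JOBNAME:-none}"
    echo "queue:     ${PBS_QUEUE:-none}"
    echo "workdir:   ${PBS_O_WORKDIR:-$PWD}"
    # Count the lines in the node file; a node requested with more than one
    # processor is normally listed once per processor.
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        echo "slots:     $(wc -l < "$PBS_NODEFILE" | tr -d ' ')"
    else
        echo "slots:     unknown"
    fi
}

job_report
```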

 
     
Commands
 

The following five commands are important and you will use them often:

Command    What it does                                      Man page available  Examples
qsub       Run my job/program                                Yes                 Yes
qstat      Show me the status of my running job(s)           Yes                 Yes
tracejob   Show me information about my running job          No                  Yes
qdel       Cancel my job                                     Yes                 No
qnodes     Tell me what compute nodes are available for use  No                  Yes
 
     
Job Output and Management
 

After you submit a job, it is assigned a job id of the form xxx.argo-new.cc.uic.edu, where xxx is the job number.

To see the status of your job, use: qstat job-id

Assume your job is assigned job id 338. (You don't need the stuff after the number.) Then you'd use: qstat 338

For stdout and stderr, the batch system creates two files. The names of the files are constructed from the job name, the letter e (for stderr) or o (for stdout), and the job number. So for your hello world run that had job-id 338, you would have the following files:

hello.o338 <-- this is stdout
hello.e338 <-- this is stderr

Let's take a look:

    -rw------- 1 mhoma sys   0 Jan 17 13:43 hello.e338
    -rw------- 1 mhoma sys 110 Jan 17 13:43 hello.o338

Well, the error file is empty, so that's a good sign. Let's see what we have:

cat hello.o338

Gives:

My process rank ==> 4
My process rank ==> 3
My process rank ==> 2
My process rank ==> 1

And, that's what we should have.
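
The naming rule for the output files can be sketched in a few lines of shell. The helper names job_number and output_files are invented for this illustration; they are not torque commands.

```shell
#!/bin/sh
# Strip the host part from a full job id: 338.argo-new.cc.uic.edu -> 338
job_number() {
    echo "${1%%.*}"
}

# Build the stdout and stderr file names from a job name and a job id.
output_files() {
    num=$(job_number "$2")
    echo "$1.o$num $1.e$num"
}

job_number 338.argo-new.cc.uic.edu           # prints 338
output_files hello 338.argo-new.cc.uic.edu   # prints hello.o338 hello.e338
```

The same short number is what you would pass to qstat or qdel.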

 
     
Node Selection and Properties
 

Every node has multiple properties associated with it. The property that clients are most familiar with is the node name, and it serves as the most commonly used criterion for selecting a node. Other properties may be used to identify nodes. The following table lists all the properties associated with the compute nodes:

Nodes                 Properties
argo16-4 => argo9-1   argoX-X,Linux2.i86pc,MPICH,cpu.xeon,smp
argo8-4 => argo5-1    argoX-X,Linux2.i86pc,MPICH,cpu.xeon,no.smp
argo1-1 => argo4-4    argoX-X,Linux2.i86pc,MPICH,cpu.amd,smp

The property cpu.XXXXXX gives the type of processor on the machine. Currently, there are two types of processors available: cpu.amd (for AMD Opteron) and cpu.xeon (for an Intel Xeon).

The property smp identifies machines that are dual processors whereas the property no.smp means a uniprocessor.

The generic syntax of the qsub command is:

qsub -l nodes=node_spec[+node_spec...]

where node_spec is:

number | property[:property...] | number:property[:property...]

A series of examples follows:

What I want to do                                                    Format of the qsub command
Run my program on any four nodes                                     qsub -l nodes=4 my_program
Run my program on any four nodes that have AMD Opteron processors    qsub -l nodes=4:cpu.amd my_program
Run my program on any four nodes that have dual processors           qsub -l nodes=4:smp my_program
Run my program on any four nodes that have a single Xeon processor   qsub -l nodes=4:cpu.xeon:no.smp my_program
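
Following the node_spec syntax given above, properties are attached to a count with ":" while "+" joins separate node_specs. Here is a small sketch that assembles such strings; the helper name node_spec is made up for this example, and the script prints the qsub command lines instead of running them.

```shell
#!/bin/sh
# Join a node count and any number of properties with ':' into one node_spec.
node_spec() {
    spec=$1
    shift
    for prop in "$@"; do
        spec="$spec:$prop"
    done
    echo "$spec"
}

# Print (rather than run) the resulting qsub command lines.
echo "qsub -l nodes=$(node_spec 4) my_program"
echo "qsub -l nodes=$(node_spec 4 cpu.xeon no.smp) my_program"
# '+' joins separate node_specs: two AMD nodes plus two Xeon nodes.
echo "qsub -l nodes=$(node_spec 2 cpu.amd)+$(node_spec 2 cpu.xeon) my_program"
```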

Multiple virtual processors per node can be requested by adding the term ppn=# (for processors per node) to a node expression. For example, to request two virtual processors on each of three nodes:

         qsub -l nodes=3:ppn=2
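
The total number of virtual processors a request like this asks for is simply nodes times ppn, which is typically the process count you would give to an MPI launcher. A small sketch of that arithmetic, with a made-up helper name total_procs:

```shell
#!/bin/sh
# A request of nodes=N:ppn=P asks for N * P virtual processors in total.
total_procs() {
    echo $(( $1 * $2 ))
}

nodes=3
ppn=2
echo "qsub -l nodes=${nodes}:ppn=${ppn}"
echo "total processors: $(total_procs "$nodes" "$ppn")"   # prints total processors: 6
```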
 
     
How Much Is Argo Being Used?
 

Want to check how much work argo has done? There are Web pages that summarize usage on argo-new, including links to personalized info for each user. Info from previous months is also available, with URLs of the form:

    http://www.accc.uic.edu/hardware/argo/YYYYMM.html


For example, the July 2005 document is available at www.uic.edu/depts/accc/hardware/argo-new/200507.html

The information on the current month's page is updated every four hours.

 
 



2007-3-7  ACCC Systems Group