ACADEMIC COMPUTING and COMMUNICATIONS CENTER
 

About the New Argo Cluster

 

Argo has been upgraded, and the new system has some features you may not be familiar with. Please read this web page. Doing so will help you understand many of the changes.

 
   
 
     
Some background
  Argo is not a single machine; it is composed of many machines. There are three classifications of machines in the argo system:
  • a master computer
  • over fifty servers called compute nodes, and
  • two filesystem servers.
The master is the machine you login to. When you ssh to argo, the master prompts you for your netid and password. Once authenticated, you're logged into the master where you may:
  • process your input and output files,
  • compile/link (create) executables,
  • upload and download files
as well as a host of other activities. What you don't do on the master is run your programs. This bears repeating:
    YOU DON'T RUN SOFTWARE OR YOUR PROGRAMS on the MASTER.
If you do something like this:
    ./a.out
the system will send email to the system administrator who will, in turn, send you email saying "DON'T DO THAT!!!!!!!" If you persist in violating the policy, your account will be suspended or terminated.

Instead, what you do on the master is submit (schedule) a program for execution on one or more of the compute nodes. Of course, you may use editors and compilers. But the master is where you prepare and submit your program (or software) for execution and not where your program (software) actually runs.

Currently, there are fifty-six compute nodes. The sole purpose of a compute node is to run stuff for clients. However, you can't login to compute nodes and you have no need to do so. Compute nodes just run programs; that's all they do.

To review, logins are on the master. Executing a program is done on compute nodes. If you run your program on the master, it defeats the point of having compute nodes. When user programs execute on the master, things get sluggish for other argo users on the master. Logins are slower. Editing sessions are slower. Command execution is slower.

The third and final classification is filesystem server. There are two filesystem servers: one that gives you your home space; the other, your scratch space. Your home space is not on a disk local to the master. Rather, it's on disks local to another machine and those disks are made available to the master. Your home space is made available to every compute node. The same is true for scratch space.

To see the location of your home directory, enter the following command from the argo command line:

    echo $HOME
Besides your home directory, you have a directory in the /scratch filesystem:
    /scratch/netid
To access your scratch directory, execute the following command on argo:
    cd /scratch/netid

Delete old and unnecessary files from both your home and scratch spaces. Doing so will improve your interaction with the system. Many a user has complained about the poor performance of the ls (list) command. Those users had accumulated hundreds of old and unnecessary files, slowing the performance of ls and other shell commands. If a file is important, download it to your desktop or laptop and erase it from your home space. To find and delete files last modified more than a certain number of days ago, use the following:

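    find . -mtime +xxxx -type f -print0 | xargs -0 rm
where xxxx is the number of days ago. The -print0 and -0 options keep file names that contain spaces intact. Examples:
  • Delete files last modified more than seven days ago
      find . -mtime +7 -type f -print0 | xargs -0 rm

  • Delete files last modified more than a year ago
      find . -mtime +365 -type f -print0 | xargs -0 rm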
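
Before deleting anything, it can help to preview what find will match. The following is a minimal sketch rather than part of the official instructions: it builds a throwaway directory (demo_dir, a hypothetical name) holding one old and one new file, lists the matches, and only then deletes them. On argo you would run the two find commands from your own home or scratch directory instead.

```shell
# Throwaway directory so the demo cannot touch real files (hypothetical name).
demo_dir=$(mktemp -d)
touch -t 200001010000 "$demo_dir/old_result.dat"   # last modified in the year 2000
touch "$demo_dir/new_result.dat"                   # last modified just now

# Step 1: dry run. List the files older than seven days without deleting them.
old_files=$(find "$demo_dir" -type f -mtime +7 -print)
echo "Would delete: $old_files"

# Step 2: delete the same set once the list looks right.
find "$demo_dir" -type f -mtime +7 -print0 | xargs -0 rm
```

Only old_result.dat is removed; new_result.dat is left alone because it was modified less than seven days ago.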

Quotas are deployed in your home space. You are limited to how much space you can use.

 
     
How do I get to argo?
 

An ssh client with or without a VPN is the only way to get to argo (telnet is not supported). If you use the UIC/ACCC-supplied VPN, you may ssh directly to argo from on-campus or off-campus locations. The ACCC VPN software is extremely easy both to install and to use. Installation onto either your desktop or your laptop can be completed in under a minute and it's already configured for you. Or, you may install some other VPN software if you are so inclined. The ACCC VPN can be downloaded from the following URL:

You will be prompted for your netid and your common password. There are versions available for Windows, Linux, and the Mac.

If you elect not to use a VPN, you may still login to argo. If you are on campus, you may ssh directly to the system. If you are off campus, then there is a double login. First, you must ssh to tigger or icarus. Faculty and staff login to tigger and students login to icarus. Then, from one of those machines, you ssh to argo using the following command:

    ssh -l netid argo
 
     
How do I get an ssh client?
 

There are a number of ssh clients. Two of the more popular are putty and SecureCRT. If you use your favorite search engine and enter the words SSH, putty and download, you will find the appropriate web site from which to get it. SecureCRT may be downloaded from the University of Illinois Software Webstore:

 
     
Login information
 

Login id: netid Password: (see following paragraph)

Your common password is your argo password. To change your common password, you use the Common Password Utility at:

Changing your common password changes it for all ACCC servers and services and not just for argo. That's why it's called your common password.
 
     
How do I run a program?
 

You submit your executable program (user-written or vendor-supplied software) to the torque system and it will run the program on one of the compute nodes. Torque is a networked subsystem for submitting, monitoring, and controlling a workload of jobs on the cluster. It is not only a batch system; interactive jobs such as ANSYS and GaussView may also be run in it.

Below is a sample text file, called my_script, that contains some basic directives (commands) to torque:

    #!/bin/csh
    #PBS -m be
    #PBS -e ***homedir***/a.out.error
    #PBS -o ***homedir***/a.out.output
    #PBS -N a.out
    ***homedir***/a.out
In the above example, the ***homedir*** is replaced with the fully-qualified path to your home directory. More about this in a moment.

You submit the text file to torque using the qsub command:
    qsub my_script
Immediately upon submitting a job for execution, the job is assigned a number, a jobid, which identifies the job and is used with other commands to, among other things, monitor the job status (is my job running, is it awaiting execution, etc).

You are not limited to just what you see in the sample script. Other commands are available including but not limited to shell commands like cd, ls, grep, and rm. The sample has some basic directives -- to get you started.

Each line that has a torque directive must start with #PBS. Shell commands, on the other hand, do not.

More about the qsub command later. Let's look at the script.

The first line, #!/bin/csh, tells torque to run the script under the C-shell. You are not required to use the C-shell. If you want to run your script under the bash shell, replace the line with #!/bin/bash. The line is not a torque directive, unlike the next four lines, and does not start with #PBS.

The second line, #PBS -m be, is a directive (it starts with #PBS) and it directs torque to inform you via email when your job starts and when it ends. This line is not required; it may be altered or removed entirely. Here are two other permutations:

    #PBS -m b
    #PBS -m e
The first informs you when the job starts but not when it ends; the second informs you when the job ends but not when it starts.

Just because you've submitted your job for execution doesn't mean it will run immediately. Your job may wait until requested resources become available. Or, it may never run because you requested too many resources and/or resources that will never be available to you. That's why the email, letting you know when your program begins executing, is useful. If you submit a job that will run for an extended period of time (days or weeks), it's convenient to be informed of its completion. The email is not sent to a maildrop on argo but rather to your UIC maildrop (netid@uic.edu). To change the email destination, modify the content of the .forward file in your home directory. If you don't have a .forward file, create one with your maildrop on a single line (netid@uic.edu). Example:

    jsmith@uic.edu

The third line, #PBS -e ***homedir***/a.out.error, is not required. It tells torque to use the a.out.error file as the standard error. If you remove this line, then a default naming scheme is used. The name of the default is constructed from the following information:

    jobname
    .e
    jobid
For example, if you submit the a.out job and the job is assigned the job number 3603, then the error file is a.out.e3603. The advantage of the default is that each new job submission creates a new error file and does not replace an existing one. The downside - each new submission creates a new error file, cluttering your home directory with hundreds of them. As you accumulate too many, your response time when you use the shell commands like ls or rm is slow at best. You are STRONGLY encouraged to erase error files after reviewing the content.

The ***homedir*** in the script must be replaced with the fully-qualified path to your home directory. To determine the path, do the following from the argo command line:
    echo $HOME
The output from the echo statement is what you should use in lieu of ***homedir***. For example:
    echo $HOME
    /home/homes51/jsmith
So, for user jsmith, the line should be:
    #PBS -e /home/homes51/jsmith/a.out.error
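
The substitution of ***homedir*** can also be done by the shell when the script file is created. The sketch below is an illustration under stated assumptions, not the official procedure: it writes the sample my_script into a temporary directory (work_dir is a hypothetical name) and fills in the current value of $HOME. The fallback path /home/homes51/jsmith is used only if HOME happens to be unset.

```shell
# Use the real home directory; fall back to the sample path from the text.
home_dir=${HOME:-/home/homes51/jsmith}

# Write the script somewhere harmless for this demonstration (hypothetical name).
work_dir=$(mktemp -d)
script_file="$work_dir/my_script"

# The here-document expands $home_dir, so the file contains the full path.
cat > "$script_file" <<EOF
#!/bin/csh
#PBS -m be
#PBS -e $home_dir/a.out.error
#PBS -o $home_dir/a.out.output
#PBS -N a.out
$home_dir/a.out
EOF

cat "$script_file"
```

The resulting file contains fully-qualified paths, so it would be submitted with qsub in the usual way.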

The fourth line, #PBS -o ***homedir***/a.out.output, is for standard out. The same logic that applied to standard error applies to standard out. If you use the default, then a new file is created with each submission. The standard output file has the following format:

    jobname
    .o
    jobid
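
The default naming scheme for both files can be written out as a small shell function. This is only an illustration of the jobname, letter, jobid pattern described above; default_job_file is a hypothetical helper, not a torque command.

```shell
# Build the default file name from the job name, the stream letter
# (e for standard error, o for standard output), and the jobid.
default_job_file() {
    printf '%s.%s%s\n' "$1" "$2" "$3"
}

default_job_file a.out e 3603   # prints a.out.e3603
default_job_file a.out o 3603   # prints a.out.o3603
```

Because the jobid is part of the name, every submission produces a new pair of files, which is why regular cleanup is encouraged.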

The fifth line, #PBS -N a.out, assigns the name a.out to the job. Assigning a name is not required and may be removed.

The sixth line, ***homedir***/a.out, is the program to execute. Obviously, this is required. Without the line, your script does nothing. You are strongly encouraged to use the full path to a file. Samples:

    /home/homes51/jsmith/a_program
    /home/homes52/tjones/my_second_program
    /home/homes53/bgarvin/another_program

The following is a BAD way to identify the location of your executable and will, most likely, cause you problems:

    ./a.out

If you write your own programs, it is best to include the full path to your files in open statements. Or, include the cd command in your script. Example:

    #!/bin/csh
    cd ***homedir***/tmp
    ***homedir***/tmp/a.out
Files opened by the a.out program will be written to the current directory which is ***homedir***/tmp.

The command to submit your script file to torque is qsub. If you use the executable a.out as the operand to qsub, you'll get something like the following message in your standard error file:

    -bash: /var/spool/PBS/mom_priv/jobs/3629.argo.c.SC: cannot execute binary file
The operand to the qsub command must always be a text file and not a binary, executable program:
    Wrong: qsub a.out
The text file may be complex with hundreds of directives or it may be nothing more than the path and name of your executable:
    /home/homes51/jsmith/my_executable
Your argo account has what are called environment variables. To get a list of them, type the env command at the argo command prompt:
    env
You may use environment variables in your script. For example, the path to your home directory is contained in the variable $HOME. Instead of hardcoding the path to your home directory, you may substitute the variable $HOME:
    $HOME/my_executable
      instead of
    /home/homes51/jsmith/my_executable
However, if you use environment variables, you must inform torque that you are doing so. The way to do that is use the -V option:
    qsub -V script
Best practice: always include the -V option.
 
     
Queues
  Queues are locations for jobs, are composed of compute nodes, and have rules regarding who may use them and for how long. Jobs are submitted to queues (via the qsub command) and the system decides which node or nodes assigned to the queue run the job. Queues are aptly named; some jobs execute immediately (running) while others wait to execute (queued) until the requested resource(s) in the queue become available.

There are five queues: three only for jobs submitted by students and two only for the jobs submitted by staff/faculty.

  • student_short
  • student_medium
  • student_long
  • staff
  • dedicated
There are different resources (compute nodes are one type of resource) and policies for each queue: who may use the queue, the number of jobs a user may run in a queue, how long the job may run (another type of resource), etc.

The queues whose names include the word student are for students ONLY. Conversely, the staff and dedicated queues are restricted to staff and faculty.

A job submitted by a student to the staff queue will be denied access and will instead be routed to the student_short queue. Conversely, a job submitted by staff/faculty will not run on any of the student queues and will be re-routed to the staff queue.
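
The re-routing rules can be summarized as a small lookup. The function below is a hypothetical model for illustration only (effective_queue is not a torque command, and the real routing is done by the scheduler). Two points are assumptions of the sketch rather than statements from this page: a student request for the dedicated queue is treated like a request for the staff queue, and an empty queue name falls back to student_short for students and staff for staff/faculty.

```shell
# Model of which queue a job ends up in, given who submits it
# ("student" or "staff", where staff includes faculty) and the queue requested.
effective_queue() {
    role=$1
    requested=$2
    case "$role:$requested" in
        student:student_short|student:student_medium|student:student_long)
            echo "$requested" ;;      # students may use the student queues
        student:*)
            echo student_short ;;     # anything else is routed to student_short
        staff:staff|staff:dedicated)
            echo "$requested" ;;      # staff/faculty may use staff and dedicated
        staff:*)
            echo staff ;;             # student queues are re-routed to staff
    esac
}

effective_queue student staff          # prints student_short
effective_queue staff student_long     # prints staff
```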

Regarding how long a job may execute in a queue, a job submitted to the student_short queue will be terminated by the system after four hours whereas a job submitted to the staff queue has a default runtime of 72 hours. Currently when you login to argo, a chart with both the default times and maximum times is displayed. Later, you will be given commands to extract the information from the system.

 
     
Submitting a job for execution
  The command for submitting a job on the cluster is:
    qsub -V my_script
Since you didn't specify a queue, the system will route the job to a default queue. For a staff member, the default is the staff queue. For students, the default is student_short. To specify a queue, you have two options:
  • Include the queue name as an operand to the qsub command:
      qsub -V -q staff my_script
    OR
  • Put the queue name in the script file (my_script):
      #PBS -q staff

    and submit the script without the queue name on the qsub line:
      qsub -V my_script
More examples:
    qsub -V -q student_short my_script
    qsub -V -q student_medium my_script
    qsub -V -q student_long my_script
    qsub -V -q dedicated my_script
There is a second queue for staff (not for students); it is called dedicated. The advantage of dedicated is that you will have exclusive use of a single dual-core, dual-CPU machine. Your job will be the only one running on the node assigned to the queue (unlike nodes in any other queue where your running job shares the node with other running jobs). The downside: you get it for only 30 minutes. To use the dedicated queue, just substitute the word dedicated for staff:
    qsub -V -q dedicated my_script
or
    #PBS -q dedicated

    qsub -V my_script
 
     
Using more than one machine/processor
  The number of nodes you request to run your job is specified, like the destination, either on the qsub line or in the script. (FYI: the number of requested nodes is a resource.) If you are running a serial job (a serial job uses one node and one core on it), which is the default, you don't have to request one node. The following request for a single node is unnecessary:
    qsub -V -l nodes=1 -q staff my_script
It's the same as doing:
    qsub -V -q staff my_script
If you will run a parallel job using multiple nodes, then you must request (hardcode) the number of nodes you want allocated:
  • Put the number of nodes on the qsub line:
      qsub -V -l nodes=4 -q staff my_script
    OR
  • Put the number of nodes in the script file:
      #PBS -l nodes=4

    and submit the script without the number of nodes on the qsub line:
      qsub -V -q staff my_script
In both cases, the user is requesting four nodes.

You may also want multiple cores. Cores are indicated by the ppn operand that follows the request for the number of nodes. Same as before: you may use one of the two formats to make the request. For the purposes of brevity, I will show only the first method; you should be able to construct the second method from previous examples.

  • Use 3 nodes but only one core per node:
      qsub -V -l nodes=3:ppn=1 -q staff my_script
  • Use 3 nodes and two cores per node:
      qsub -V -l nodes=3:ppn=2 -q staff my_script
  • Use 1 node and two cores on it:
      qsub -V -l nodes=1:ppn=2 -q staff my_script
As stated before, one core is the default. If you don't specify the number of cores, then only one is assigned. The following two command invocations do the same thing; both request one core (the first includes the request; the second, lets the system default to one core):
    qsub -V -l nodes=1:ppn=1 -q staff my_script
is the same as
    qsub -V -l nodes=1 -q staff my_script
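
To make the arithmetic explicit, the total number of cores requested is the number of nodes multiplied by ppn, with ppn taken as 1 when it is omitted. The helper below (total_cores, a hypothetical name) is a sketch that only understands the simple nodes=N and nodes=N:ppn=M forms shown in the examples; it is not how torque itself parses the request.

```shell
# Compute nodes * ppn for a request such as "nodes=3:ppn=2" or "nodes=1".
total_cores() {
    spec=$1
    nodes=${spec#nodes=}      # drop the leading "nodes="
    nodes=${nodes%%:*}        # drop ":ppn=..." if present
    case "$spec" in
        *:ppn=*) ppn=${spec##*:ppn=} ;;
        *)       ppn=1 ;;     # one core per node is the default
    esac
    echo $((nodes * ppn))
}

total_cores nodes=3:ppn=1   # prints 3
total_cores nodes=3:ppn=2   # prints 6
total_cores nodes=1         # prints 1
```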
The requested cores will be on the same machine. If you request one machine with four cores, the system will look for a single machine having four cores. If it fails to find a matching machine, it will NOT attempt to allocate two machines and two cores on each of the two machines. Instead, your job does not run (it queues) because the system is unable to satisfy your request. The resources (not just cores but other requested resources including but not limited to nodes) may be owned by another job and, when that job completes and the resource(s) become available, they will be assigned to your job. However, if you request resources that will never be available ("I want eight cores on one node" is an example of a request for a nonexistent resource), your job will never run.

A VERY IMPORTANT POINT

If your program or software is not parallel or distributed, specifying more than one node and/or more than one processor DOES ABSOLUTELY NOTHING other than waste resources that could be used by some other client.

How do I know if my program or software is parallel or distributed?
    If you've written your own program using a programming language like C or Fortran (or any other language) and you haven't used a parallel programming model such as MPI, your program is neither parallel nor distributed. If you are not familiar with the terms MPI, OpenMPI, MPICH2, or LAM, or if you don't know what parallel or distributed mean, then your program is not parallel/distributed. If your software doesn't specifically mention that it is parallel/distributed, it is not. Gaussian is an example of a parallel/distributed package; GaussView is not. If your program is not parallel/distributed, then it is SERIAL. That means it executes on a single processor on a single node.
If the software or program is SERIAL, do not include the nodes and ppn components on the qsub command. Including them will NOT (REPEAT, WILL NOT) make your program run faster.
 
     
What are the limits regarding multiple processors and machines?
  Currently, there is a chart that is displayed when you login. If you failed to notice the chart or if it scrolled off the screen, you may display it along with other login messages by using the following argo command:
    cat /etc/motd
At some point, that chart will be removed from the message of the day file (motd). There are commands that you can use to display queue limits:
  • Maximum number of nodes a student may use in the student_short queue:
      qmgr -c "list queue student_short" | grep resources_max.nodect
  • Maximum number of processors a student may request for the job:
      qmgr -c "list queue student_short" | grep resources_max.ncpus
  • To see other policies (rules) of the student_short queue:
      qmgr -c "list queue student_short"
The samples use the student_short queue. You may substitute the name of other queues (student_medium, student_long, staff, dedicated) to get the corresponding settings for them:
  • qmgr -c "list queue staff" | grep resources_max.nodect
  • qmgr -c "list queue staff" | grep resources_max.ncpus
  • qmgr -c "list queue staff"
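
If you want just the number rather than the whole line, the value can be pulled out with awk. The sketch below assumes the qmgr output contains lines of the form "resources_max.ncpus = 8", as in the samples on this page; limit_value is a hypothetical helper name. It is demonstrated on canned text so that it can be tried without access to qmgr.

```shell
# Read qmgr-style "name = value" lines on standard input and print the
# value for the requested setting.
limit_value() {
    awk -v key="$1" '$1 == key { print $3 }'
}

# Canned sample lines standing in for real qmgr output.
sample_limits='    resources_max.ncpus = 8
    resources_max.nodect = 4'

printf '%s\n' "$sample_limits" | limit_value resources_max.ncpus    # prints 8
printf '%s\n' "$sample_limits" | limit_value resources_max.nodect   # prints 4
```

On argo, the same helper would be fed from the real command, for example: qmgr -c "list queue staff" | limit_value resources_max.ncpus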
 
     
How long can my job run?
 

Queues have two values regarding how long a job may run: a default and a maximum. The default is used when a user does not specify how long to run the job, called walltime (more about this in a moment). The maximum is exactly that: regardless of how much time a user requests, a job is permitted to run no longer than the maximum. The maximums (per queue) are shown in the output of the qstat -q command (the walltime column):

    qstat -q
    Queue            Memory CPU Time Walltime Node  Run Que Lm  State
    ---------------- ------ -------- -------- ----  --- --- --  -----
    batch              --      --       --      --    0   0 --   E R
    staff              --      --    720:00:0    12   6   0 --   E R
    student_long       --      --    240:00:0     4   1   0 16   E R
    student_short      --      --    04:00:00     4   0   0 10   E R
    dedicated          --      --    00:30:00     1   0   0  1   E R
    student_medium     --      --    24:00:00     4   0   0 10   E R
    
The defaults are shown using the qmgr command:
    qmgr -c "list queue student_short" | grep resources_default.walltime
    qmgr -c "list queue student_medium" | grep resources_default.walltime
    qmgr -c "list queue student_long" | grep resources_default.walltime
    qmgr -c "list queue staff" | grep resources_default.walltime
    qmgr -c "list queue dedicated" | grep resources_default.walltime
Walltime is specified, like the queue, as either an operand on the qsub command or as a value in the text file. In your first run, let the system use the default. If you are running an interactive job, specify a value considerably less since you know how long you plan to interact with the software. Walltime MUST be requested using the format HHH:MM:SS and the system interprets the values from right to left. Examples:
  • Put the walltime on the qsub line:
      qsub -V -l nodes=1,walltime=720:00:00 -q staff my_script
  • 2nd way to put walltime on the qsub line:
      qsub -V -l nodes=1 -l walltime=720:00:00 -q staff my_script

    In the first example, a comma separates the two options following the -l flag (a lowercase L). In the second example, there is no comma; each of the two options gets its own -l.

  • Put the walltime in the script file:
      #PBS -l walltime=720:00:00

    and submit the script without the walltime on the qsub invocation:
      qsub -V -l nodes=1 -q staff my_script
There is a gotcha regarding walltime. As was stated, the system reads the walltime specification from right to left and all three fields must be included even if just to contain zeroes. As an example, the following format does not request 720 hours; instead, it asks for only 12 hours:
    walltime=720:00
Reading from the right, the first field (00) is the number of seconds requested and the second field (720) is the number of minutes. 720 minutes (divided by 60 minutes per hour) translates to 12 hours. To request 720 hours:
    walltime=720:00:00
There is no command to tell you how long a job will run (in other words, what to enter as the walltime). You just have to estimate.
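
The right-to-left rule is easy to check with a little arithmetic. The helper below (walltime_to_seconds, a hypothetical name, not a torque command) is a sketch that treats the last field as seconds, the one before it as minutes, and the one before that as hours, which is the interpretation described above.

```shell
# Convert a walltime string to seconds. Fields are weighted from the right:
# seconds, then minutes, then hours.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

walltime_to_seconds 720:00      # prints 43200   (720 minutes, i.e. 12 hours)
walltime_to_seconds 720:00:00   # prints 2592000 (720 hours)
```

Dividing the result by 3600 gives the number of hours actually requested, which makes a missing field obvious before the job is submitted.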
 
     
Monitoring my job
  Clients assume that a submitted job immediately starts running. As stated earlier, a job may not run and will queue (wait) if the requested resources are unavailable or if the request exceeds a defined limit for the user or for the queue. The qstat command tells you if the job is running, where it is running, and for how long. It will also tell you if the job is not running.

Sample output #1:
   Job id  Name      User      Time Used  S  Queue
   ------  -------   ------    ---------  -  ------------
   1232    my_job1   jsmith    192:34:2   R  staff
   1233    my_job2   jsmith    192:31:5   R  staff
   1252    my_job3   bob1      0          Q  student_long
The two jobs owned by user jsmith are running, so indicated by the R (for running) in the S (Status) column. The job, my_job3 owned by the student bob1, is queued, so indicated by the Q in the status column. Sample output #2 (edited somewhat to fit the document):
JobID Username  Queue    Jobname  SessID NDS TSK Time  S Time
----- -------   -------- -------- ------ --- --- ----- - -----
1232  jsmith    staff    my_job1  31047  1   1   720:0 R 193:0
1233  jsmith    staff    my_job2  28986  1   1   720:0 R 192:5
1252  bob1      student_ my_job3     --  4   1   240:0 Q   --  
Sample output #3:
Job ID  Username Queue  Jobname  SessID NDS TSK Memory Time  S Time
------  -------- ------ -------- ------ --- --- ------ ---- -- ----
1232    jsmith   staff  my_job1  31047   1   1    --   720:0 R 193:1
argo2-4/0
Some permutations of the qstat command:
  • Show me all jobs in the system (a) and the nodes allocated to the job (n):
      qstat -an
  • Show me all of my jobs (u netid) submitted to the system (running or queued):
      qstat -u netid
  • Show me information regarding a particular job (jobid):
      qstat -f jobid
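
When there are many jobs, it can be handy to filter the listing for the ones that are still waiting. The sketch below assumes the six-column layout of sample output #1 (status in the second-to-last column, two header lines); queued_jobs is a hypothetical helper, and the other qstat formats shown above would need a different column. It is demonstrated on the sample text rather than on live qstat output.

```shell
# Print the job ids whose status column is Q (queued).
# Skips the two header lines and any line too short to have a status column.
queued_jobs() {
    awk 'NR > 2 && NF >= 2 && $(NF - 1) == "Q" { print $1 }'
}

sample_output='Job id  Name      User      Time Used S Queue
------  ----   ---------    --------  - -----
1232    my_job1   jsmith    192:34:2  R staff
1233    my_job2   jsmith    192:31:5  R staff
1252    my_job3   bob1             0  Q student_long'

printf '%s\n' "$sample_output" | queued_jobs   # prints 1252
```

On argo the helper would be fed from the real command, for example: qstat | queued_jobs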
 
     
Killing jobs
  To kill a single job (running or queued), use the qdel command. The operand to the command is the jobid. Sample:
    qdel 12345
To kill all of your jobs, use the following command:
    qselect -u netid | xargs qdel
The system prevents you from killing jobs that don't belong to you.
 
     
My job is not running and remains in a "queued state"
  The two most likely causes of a job not running are:
  1. Requesting too many resources, and
  2. Requesting resources that are not available.
Example 1:
    qsub -V -l nodes=5:ppn=2 -q student_short my_script
The request violates two policies. One, the user is requesting a total of ten processors - two (ppn=2) on five nodes (nodes=5). The job is headed to the student_short queue. As was explained previously, the maximum number of processors (ncpus) a job may use on the student_short queue is eight (the following command with the resulting answer tells you that) but the request is for ten:
    qmgr -c "list queue student_short" | grep resources_max.ncpus
      resources_max.ncpus = 8
Two, the job is also requesting more nodes (max.nodect) than is permitted (the user wants five nodes when four is the maximum):
    qmgr -c "list queue student_short" | grep resources_max.nodect
      resources_max.nodect = 4
The following message should have appeared after issuing the qsub command:
    qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max nodect requirement
The words "max nodect" are key. Translation: you exceeded the maximum number of nodes permitted.
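
The two checks in Example 1 can be rehearsed before submitting. The function below is a hypothetical pre-flight check, not something torque provides: it takes the requested nodes and ppn along with the two queue limits (as reported by qmgr) and reports which limit, if any, would be exceeded. The message wording is only modeled on the examples on this page.

```shell
# Usage: check_request NODES PPN MAX_NODECT MAX_NCPUS
# Prints one line per violated limit and returns non-zero if any is violated.
check_request() {
    nodes=$1; ppn=$2; max_nodect=$3; max_ncpus=$4
    procs=$((nodes * ppn))
    rc=0
    if [ "$nodes" -gt "$max_nodect" ]; then
        echo "nodes too high ($nodes > $max_nodect)"
        rc=1
    fi
    if [ "$procs" -gt "$max_ncpus" ]; then
        echo "procs too high ($procs > $max_ncpus)"
        rc=1
    fi
    return $rc
}

# Example 1: nodes=5:ppn=2 against the student_short limits (4 nodes, 8 CPUs).
check_request 5 2 4 8 || echo "request would be rejected"

# A request that fits: nodes=4:ppn=2.
check_request 4 2 4 8 && echo "request fits the queue limits"
```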

Example 2 (a student issues the following command):
    qsub -V -l nodes=argo1-1 my_script
The user requests a particular node to run a job. Since the user is a student and did not identify a queue, the job is routed, by default, to the student_short queue. But argo1-1 is not assigned to the student_short queue. The job is requesting a resource that the queue does not own and that is, therefore, unavailable. The job does not run. Users should NEVER request a node by name. Ask for a number of nodes and a number of processors per node, not for a particular node.

  • WRONG: qsub -V -l nodes=argo1-1+argo1-2 my_script
  • RIGHT:    qsub -V -l nodes=2 my_script

  • WRONG: qsub -V -l nodes=argo1-1:ppn=2+argo1-2:ppn=2 my_script
  • RIGHT:    qsub -V -l nodes=2:ppn=2 my_script
Example 3:

It is important to note that the maximum number of nodes and CPUs is cumulative across all your submitted jobs.

    qstat -u jsmith1
    JobID Username Queue    Jobname    SessID NDS TSK Memory Time S Time
    ----- -------- -----    ---------  ------ --- --- ------ ---- - ----
    1234  jsmith1  student_ my_script  12345  3   1    --  04:00  R 03:41
    argo13-2/1+argo13-2/0+argo7-4/1+argo7-4/0+argo7-3/1+argo7-3 
    1235  jsmith1  student_ my_script    --   3   1    --  04:00  Q  --
    --
    
Why is the first job (id 1234) "running" - indicated by a R (for running) in the status (S) column as well as the names of the nodes assigned to the job - and the second job (id 1235) is queued, awaiting execution?

Both jobs were submitted by the student jsmith1 for execution on the student_short queue, the second job soon after the first. And, both were submitted using the following qsub command invocation:

    qsub -l nodes=3:ppn=2 -q student_short my_script
The first job requests six CPUs: two CPUs (ppn=2) on three nodes (nodes=3). The maximum number of CPUs a student may use on the student_short queue is eight:
    qmgr -c "list queue student_short" | grep resources_max.ncpus
      resources_max.ncpus = 8
Since jsmith1 had no other resources, the first job is assigned the six CPUs and begins execution. The second job also requests six CPUs. The user already holds six CPUs and requests an additional six for a total of twelve, four over the limit of eight. The second job will not be assigned the requested resources and will sit, queued, awaiting the release of the four CPUs from the first job. However, the release of CPUs is an all or nothing proposition. Therefore, the first job would have to end before the second job begins.
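
The cumulative rule comes down to one comparison: CPUs already held plus CPUs requested must not exceed the queue limit. The function below (job_state, a hypothetical name) is a simplified model of that rule only; the real scheduler considers other resources as well.

```shell
# Usage: job_state CPUS_ALREADY_HELD CPUS_REQUESTED CPU_LIMIT
# Prints R if the job could start now, Q if it would have to wait.
job_state() {
    held=$1; requested=$2; limit=$3
    if [ $((held + requested)) -le "$limit" ]; then
        echo R
    else
        echo Q
    fi
}

job_state 0 6 8   # first job:  0 + 6 = 6,  within the limit of 8, prints R
job_state 6 6 8   # second job: 6 + 6 = 12, over the limit of 8,   prints Q
```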

Unlike the previous argo system, users should not request a particular node. A node may be re-assigned from one queue to another depending on system load. There is no guarantee that a particular node remains in a particular queue. Users should direct a job to a queue and not to nodes:

    Wrong: qsub -V -l nodes=argo1-1 my_script
    Right:   qsub -V -q student_long my_script

    Wrong: qsub -V -l nodes=argo1-1+argo1-2 my_script
    Right:   qsub -V -l nodes=2 -q student_long my_script

    Wrong: qsub -V -l nodes=argo1-1:ppn=2+argo1-2:ppn=2 my_script
    Right:   qsub -V -l nodes=2:ppn=2 -q student_long my_script
The two commands that are most useful for diagnosing why a job is not running are:
  1. tracejob
  2. checkjob
The operand for both commands is a jobid. The output of checkjob can be very cryptic, but the reason why the job is not running is there. For example, suppose a student issues the following command:
    qsub -V -l nodes=4:ppn=3 -q student_short my_script
The job is assigned jobid 1277 but remains queued which is indicated by the capital Q in the S column (status) in the output of the qstat 1277 command:
    Job id  Name       User    Time Use S Queue
    ------  ---------- ----    ---- --- - ------- 
    1277    my_script  jsmith     0     Q student_short
    
If the student issues the command:
    checkjob 1277
the output will include the following (the output has been abbreviated):
    Holds: Batch (hold reason: PolicyViolation)
    Messages: procs too high (12 > 8)
    PE: 12.00 StartPriority: 7
    cannot select job 1277 for partition DEFAULT (job hold active)
Look closely: PolicyViolation and procs too high (12 > 8). The student is asking for twelve processors. Go back and take a look at the qsub command:
    qsub -V -l nodes=4:ppn=3 -q student_short my_script
Four nodes multiplied by three processors per node results in twelve processors. But, the student is limited to a total of eight processors on the student_short queue:
    qmgr -c "list queue student_short" | grep resources_max.ncpus
      resources_max.ncpus = 8
Issuing a new qsub command with a slight change (ppn from 3 to 2) will result in a running job:
    qsub -V -l nodes=4:ppn=2 -q student_short my_script
Remember to delete the queued job: qdel 1277
 
     
Requesting multiple nodes/processors for a serial job
  This is an incredible waste of resources. Suppose, for example, the user has an executable (user written or vendor-supplied) that is serial (not a parallel program). If the user issues the following command, the job will execute (immediately if four nodes are available):
    qsub -V -l nodes=4 -q staff my_script
The job will be assigned four nodes but only the first one (in this case, argo4-4) has the single, serial process running.
1228.argo.cc.uic  jsmith staff  my_script 29424 4 1 -- 200:0 R 170:0 
argo4-4/0+argo3-4/0+argo3-3/0+argo3-1/0
The other three nodes (argo3-4, argo3-3, and argo3-1) will be assigned to the job but do nothing other than fill a job slot that some other job might have used.
 
     
How do I see what's going on on a compute node?
  Even though you can't login to a compute node, you will have access to it via the rsh command. For example, suppose you want to see what processes you are running on a particular node. On the master, it's the basic ps command:
    ps -ef | grep netid
For a compute node, you would do the following:
    rsh -l netid compute_node-name "ps -ef | grep netid"
So, for node argo1-1, user jsmith would enter:
    rsh -l jsmith argo1-1 "ps -ef | grep jsmith"
Make sure you enclose the command (everything that comes after the node name) in double quotes. Sample:
    rsh -l jsmith argo1-1 "ps -ef | grep jsmith"
All the basic commands are available. Suppose user jsmith had used the /tmp filesystem as a scratch area for a temporary file and, upon completion of the job wants to erase it:
rsh -l jsmith argo1-1 ls -al /tmp/junk1
   -rw-r--r--  1 jsmith users 0 May 19 13:31 /tmp/junk1

rsh -l jsmith argo1-1 rm -f  /tmp/junk1
rsh -l jsmith argo1-1 ls -al /tmp/junk1
   ls: /tmp/junk1: No such file or directory
There is also a very nice web-based tool to view the system. To access it, point your browser to the following URL: Notice that you must use https and not http. You will be prompted for your netid and password.
 
     
Processing mail
  Argo SHOULD NOT be used to send or receive mail; that's what your mail account is for. Do not install your own copy of a mail user agent (such as pine or elm). Doing so will result in loss of your account.  
     
Man pages
  All UNIX and Linux-based operating systems have a reference manual with information on each command. To access the manual for a particular command, enter the word man followed by the command. For example, the most commonly-used UNIX command is "ls", which lists your files. To reference the manual for ls, enter:
    man ls
For more information about man, type:
    man man
To exit the online manual hit either the letter q or the Esc key.

That should get you started. If you have any questions, please email systems@uic.edu and include in the subject line the word argo.

There are times (infrequent but it does happen) when one or more nodes crash and you will have to restart your jobs. It's the nature of the system and we are working to reduce such events.

The Management
Academic Computing and Communication Center
University of Illinois at Chicago

 


2009-10-7  ACCC Systems Group