| ACADEMIC COMPUTING and COMMUNICATIONS CENTER | |||||||||
About the New Argo Cluster | ||
|
Argo has been upgraded, and the new system has features that may be unfamiliar to you. Please read this web page; it will help you understand many of the changes. |
||
| ||
| Some background | ||
Argo is not a single machine; it is composed of many machines. There are three classifications of machines in the argo system: the master, the compute nodes, and the filesystem servers.

The master is where you log in. You do not run programs on the master. Instead, what you do on the master is submit (schedule) a program for execution on one or more of the compute nodes. Of course, you may use editors and compilers. But the master is where you prepare and submit your program (or software) for execution, not where your program (software) actually runs.

Currently, there are fifty-six compute nodes. The sole purpose of a compute node is to run programs for clients. You can't log in to compute nodes, and you have no need to do so. Compute nodes just run programs; that's all they do.

To review: logins are on the master, and programs execute on the compute nodes. Running your program on the master defeats the point of having compute nodes. When user programs execute on the master, things get sluggish for other argo users on the master: logins are slower, editing sessions are slower, and command execution is slower.

The third and final classification is the filesystem server. There are two filesystem servers: one provides your home space; the other, your scratch space. Your home space is not on a disk local to the master. Rather, it's on disks local to another machine, and those disks are made available to the master and to every compute node. The same is true for scratch space. To see the location of your home directory, enter the following command from the argo command line:
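One common way to print that path is sketched below. It assumes your login shell sets the standard HOME environment variable; the variable name home_dir is only illustrative.

```shell
# Print the full path of your home directory.
# Assumption: the login shell sets the standard HOME variable.
home_dir="$HOME"
echo "$home_dir"
```

On argo the output would be a path of the form /home/homesNN/netid.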
Delete old and unnecessary files from both your home and scratch spaces. Doing so will improve your interaction with the system. Many users have complained about the poor performance of the ls (list) command. Those users had accumulated hundreds of old and unnecessary files, which slowed ls and other shell commands. If a file is important, download it to your desktop or laptop and erase it from your home space. To find and delete files older (last modified) than a certain date, use the following:
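A minimal sketch of such a cleanup with the standard find command follows. It selects files by age in days rather than by a calendar date, and the function names and the example directory are hypothetical. Always review the list before deleting anything.

```shell
# list_old_files DIR DAYS: print regular files under DIR that were last
# modified more than DAYS days ago (find counts age in whole 24-hour periods).
list_old_files() {
    find "$1" -type f -mtime +"$2" -print
}

# delete_old_files DIR DAYS: remove the same set of files.
delete_old_files() {
    find "$1" -type f -mtime +"$2" -exec rm -f {} +
}

# Example (hypothetical directory): review first, then delete.
#   list_old_files "$HOME/old_results" 180
#   delete_old_files "$HOME/old_results" 180
```

Keeping the listing and the deletion as separate steps makes it harder to remove a file you still need.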
Quotas are enforced on your home space, so the amount of space you can use there is limited. |
||
| How do I get to argo? | ||
|
An ssh client, with or without a VPN, is the only way to get to argo (telnet is not supported).

If you use the UIC/ACCC-supplied VPN, you may ssh directly to argo from on- or off-campus locations. The ACCC VPN software is easy both to install and to use. Installation on either your desktop or your laptop can be completed in under a minute, and the software is already configured for you. Or, you may install some other VPN software if you are so inclined. The ACCC VPN can be downloaded from the following URL:

You will be prompted for your netid and your common password. There are versions available for Windows, Linux, and the Mac.

If you elect not to use a VPN, you may still log in to argo. If you are on campus, you may ssh directly to the system. If you are off campus, there is a double login. First, you must ssh to tigger or icarus: faculty and staff log in to tigger, and students log in to icarus. Then, from one of those machines, you ssh to argo using the following command:
|
||
| How do I get an ssh client? | ||
|
There are a number of ssh clients. Two of the more popular are PuTTY and SecureCRT. To find PuTTY, enter the words SSH, putty, and download into your favorite search engine; the results will lead you to the appropriate web site. SecureCRT may be downloaded from the University of Illinois Software Webstore: |
||
| Login information | ||
|
Login id: netid
Password: (see following paragraph)

Your common password is your argo password. To change your common password, use the Common Password Utility at:

Changing your common password changes it for all ACCC servers and services, not just for argo. That's why it's called your common password. |
||
| How do I run a program? | ||
|
You submit your executable program (user-written or vendor-supplied software) to the torque system, and it runs the program on one of the compute nodes. Torque is a networked subsystem for submitting, monitoring, and controlling a workload of jobs on the cluster. It is not only a batch system; interactive jobs such as ANSYS and GaussView may also be run in it. Below is a sample text file, called my_script, that contains some basic directives (commands) to torque:
#!/bin/csh
#PBS -m be
#PBS -e ***homedir***/a.out.error
#PBS -o ***homedir***/a.out.output
#PBS -N a.out
***homedir***/a.out

You submit the text file to torque using the qsub command:
You are not limited to just what you see in the sample script. Other commands are available, including but not limited to shell commands like cd, ls, grep, and rm. The sample has some basic directives to get you started. Each line that has a torque directive must start with #PBS. Shell commands, on the other hand, do not. More about the qsub command later.

Let's look at the script. The first line, #!/bin/csh, tells torque to run the script under the C-shell. You are not required to use the C-shell. If you want to run your script under the bash shell, replace the line with #!/bin/bash. The line is not a torque directive, unlike the next four lines, and does not start with #PBS.

The second line, #PBS -m be, is a directive (it starts with #PBS), and it directs torque to inform you via email when your job starts and when it ends. This line is not required; it may be altered or removed entirely. Here are two other permutations:
#PBS -m e

Just because you've submitted your job for execution doesn't mean it will run immediately. Your job may wait until requested resources become available. Or, it may never run because you requested too many resources and/or resources that will never be available to you. That's why the email letting you know when your program begins executing is useful. If you submit a job that will run for an extended period of time (days or weeks), it's convenient to be informed of its completion.

The email is not sent to a maildrop on argo but rather to your UIC maildrop (netid@uic.edu). To change the email destination, modify the content of the .forward file in your home directory. If you don't have a .forward file, create one with your maildrop on a single line (netid@uic.edu). Example:
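One way to create such a file from the command line is sketched here. The function name write_forward is hypothetical, and netid@uic.edu stands in for your own address.

```shell
# write_forward ADDRESS [FILE]: write a single forwarding address to FILE.
# FILE defaults to the .forward file in your home directory.
# Note: this overwrites any existing content of FILE.
write_forward() {
    printf '%s\n' "$1" > "${2:-$HOME/.forward}"
}

# Example (replace netid with your own netid):
#   write_forward netid@uic.edu
```

The resulting file holds exactly one line, the address that should receive the mail.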
The third line, #PBS -e ***homedir***/a.out.error, is not required. It tells torque to use the a.out.error file as the standard error. If you remove this line, then a default naming scheme is used. The name of the default is constructed from the following information:
.e jobid

The ***homedir*** in the script must be replaced with the fully qualified path to your home directory. To determine the path, do the following from the argo command line:
/home/homes51/jsmith
The fourth line, #PBS -o ***homedir***/a.out.output, is for standard out. The same logic that applied to standard error applies to standard out. If you use the default, then a new file is created with each submission. The standard output file has the following format:
.o jobid

The fifth line, #PBS -N a.out, assigns the name a.out to the job. Assigning a name is not required; the line may be removed.

The sixth line, ***homedir***/a.out, is the program to execute. Obviously, this is required. Without the line, your script does nothing. You are strongly encouraged to use the full path to a file. Samples:
/home/homes52/tjones/my_second_program
/home/homes53/bgarvin/another_program

The following is a BAD way to identify the location of your executable and will most likely cause you problems:
If you write your own programs, it is best to include the full path to your files in open statements. Or, include the cd command in your script. Example:
cd ***homedir***/tmp
***homedir***/tmp/a.out

The command to submit your script file to torque is qsub. The operand to qsub must be the script file, not the executable. If you use the executable a.out as the operand to qsub, you'll get something like the following message in your standard error file:
|
||
| Queues | ||
|
Queues are locations for jobs, are composed of compute nodes,
and have rules regarding who may use them and for how long. Jobs
are submitted to queues (via the qsub command), and the system
decides which node or nodes assigned to the queue run the job.
Queues are aptly named: some jobs execute immediately
(running) while others wait to execute (queued) until the
requested resource(s) in the queue become available.
There are five queues: three only for jobs submitted by students and two only for the jobs submitted by staff/faculty.
The queues whose names include the word student are for students ONLY. Conversely, the staff and dedicated queues are restricted to staff and faculty. A job submitted by a student to the staff queue will be denied access and instead be routed to the student_short queue. Likewise, a job submitted by staff/faculty will not run on any of the student queues and will be re-routed to the staff queue.

Regarding how long a job may execute in a queue: a job submitted to the student_short queue will be terminated by the system after four hours, whereas a job submitted to the staff queue has a default runtime of 72 hours. Currently, when you log in to argo, a chart with both the default times and maximum times is displayed. Later, you will be given commands to extract the information from the system. |
||
| Submitting a job for execution | ||
The command for submitting a job on the cluster is qsub. The destination queue may be named on the qsub line:
qsub -V -q student_medium my_script
qsub -V -q student_long my_script
qsub -V -q dedicated my_script
OR
you may name the queue in the script and submit the script without the queue name on the qsub line:
qsub -V my_script |
||
| Using more than one machine/processor | ||
The number of nodes you request to run your job is specified,
like the destination, either on the qsub line or in the script.
(FYI: the number of requested nodes is a resource.) If you are
running a serial job (a serial job uses one node and one core
on it), which is the default, you don't have to request one
node. The following request for a single node is unnecessary:
OR
and submit the script without the number of nodes on the qsub line:
Suppose you want multiple cores. Cores are requested with the ppn operand, which follows the request for the number of nodes. As before, you may use either of the two formats to make the request. For brevity, I will show only the first method; you should be able to construct the second from the previous examples.
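As a small illustration of how a nodes and ppn request is put together, and how many cores it reserves in total, consider the sketch below. The helper names node_request and total_cores are hypothetical; only the nodes=N:ppn=M form comes from the qsub examples in this document.

```shell
# node_request NODES PPN: build the value passed to qsub's -l option.
node_request() {
    printf 'nodes=%s:ppn=%s\n' "$1" "$2"
}

# total_cores NODES PPN: number of cores the request reserves.
total_cores() {
    echo $(( $1 * $2 ))
}

node_request 2 2     # prints nodes=2:ppn=2
total_cores 2 2      # prints 4

# On argo the request would then be handed to qsub, for example:
#   qsub -V -l "$(node_request 2 2)" -q student_long my_script
```

Two nodes with two cores each reserve four cores, so the program must actually be able to use four cores for the request to make sense.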
A VERY IMPORTANT POINT

If your program or software is not parallel or distributed, specifying more than one node and/or more than one processor DOES ABSOLUTELY NOTHING other than waste resources that could be used by some other client.

How do I know if my program or software is parallel or distributed?
|
||
| What are the limits regarding multiple processors and machines? | ||
Currently, there is a chart that is displayed when you login.
If you failed to notice the chart or if it scrolled off the
screen, you may display it along with other login messages
by using the following argo command:
|
||
| How long can my job run? | ||
|
Queues have two values regarding how long a job may run: a default and a maximum. The default is used when a user does not specify how long to run the job, called walltime (more about this in a moment). The maximum is exactly that: regardless of how much time a user requests, a job is permitted to run no longer than the maximum. The maximums (per queue) are shown in the output of the qstat -q command (the walltime column):
Queue            Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
batch            --     --       --       --   0   0   -- E R
staff            --     --       720:00:0 12   6   0   -- E R
student_long     --     --       240:00:0 4    1   0   16 E R
student_short    --     --       04:00:00 4    0   0   10 E R
dedicated        --     --       00:30:00 1    0   0   1  E R
student_medium   --     --       24:00:00 4    0   0   10 E R

The defaults are shown using the qmgr command:
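A sketch of one way to pull a queue's default walltime out of qmgr follows. It assumes that qmgr -c "list queue NAME" prints attribute lines of the form resources_default.walltime = HH:MM:SS, as Torque normally does; the function name queue_default_walltime is hypothetical.

```shell
# queue_default_walltime QUEUE: print the default walltime of QUEUE,
# or nothing if the queue has no default set.
# Assumption: qmgr prints lines such as
#   resources_default.walltime = 72:00:00
queue_default_walltime() {
    qmgr -c "list queue $1" |
        awk -F' = ' '/resources_default\.walltime/ { print $2 }'
}

# Example on argo:
#   queue_default_walltime staff
#
# To ask for a specific walltime instead of the default, Torque accepts
# a walltime resource on the qsub line, for example:
#   qsub -V -l walltime=10:00:00 my_script
```

A requested walltime still cannot exceed the queue maximum shown in the Walltime column of qstat -q.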
|
||
| Monitoring my job | ||
|
Clients often assume that a submitted job starts running immediately.
As stated earlier, a job may not run and will queue (wait)
if the requested resources are unavailable or if the request
exceeds a defined limit for the user or for the queue. The
qstat command tells you if the job is running, where it is
running, and for how long. It will also tell you if the job
is not running.

Sample output #1:

Job id Name    User   Time Used S Queue
------ ------- ------ --------- - ------------
1232   my_job1 jsmith 192:34:2  R staff
1233   my_job2 jsmith 192:31:5  R staff
1252   my_job3 bob1   0         Q student_long

The two jobs owned by user jsmith are running, as indicated by the R (for running) in the S (Status) column. The job my_job3, owned by the student bob1, is queued, as indicated by the Q in the status column.

Sample output #2 (edited somewhat to fit the document):

JobID Username Queue    Jobname SessID NDS TSK Time  S Time
----- -------- -------- ------- ------ --- --- ----- - -----
1232  jsmith   staff    my_job1 31047  1   1   720:0 R 193:0
1233  jsmith   staff    my_job2 28986  1   1   720:0 R 192:5
1252  bob1     student_ my_job3 --     4   1   240:0 Q --

Sample output #3:

Job ID Username Queue Jobname SessID NDS TSK Memory Time  S Time
------ -------- ----- ------- ------ --- --- ------ ----- - -----
1232   jsmith   staff my_job1 31047  1   1   --     720:0 R 193:1
   argo2-4/0

Some permutations of the qstat command:
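As an illustration, the sketch below lists a few standard Torque qstat options in comments (assumed to behave the same way on argo) and defines a small filter, with the hypothetical name queued_jobs, that picks the queued jobs out of default-format qstat output.

```shell
# Common Torque qstat options (assumed to be available on argo):
#   qstat             all jobs, default format (as in sample output #1)
#   qstat -a          all jobs, wider alternate format
#   qstat -n          alternate format plus the nodes assigned to each job
#   qstat -u jsmith   only the jobs owned by user jsmith
#   qstat -f 1232     full details of job 1232

# queued_jobs: read default-format qstat output on stdin and print the
# ids of jobs whose status (S) column, the next-to-last field, is Q.
# The first two lines (header and dashes) are skipped.
queued_jobs() {
    awk 'NR > 2 && $(NF-1) == "Q" { print $1 }'
}

# Example on argo:
#   qstat | queued_jobs
```

Applied to sample output #1, the filter would print only job 1252.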
|
||
| Killing jobs | ||
To kill a single job (running or queued), use the qdel command. The operand to the command is the jobid.
Sample:
|
||
| My job is not running and remains in a "queued state" | ||
The two most likely causes of a job not running are:
Example 2 (a student issues the following command):
It is important to note that the maximum number of nodes and CPUs is cumulative across all your submitted jobs.
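The arithmetic behind that rule can be sketched as follows. The function name can_start and the 4-node limit are illustrative assumptions (4 is the value in the Node column for student_short in the qstat -q output shown earlier); the scheduler performs the real check.

```shell
# can_start IN_USE REQUESTED LIMIT: succeed if the new request still fits
# under the node limit once the nodes already in use are counted.
can_start() {
    [ $(( $1 + $2 )) -le "$3" ]
}

# A user already running a 3-node job asks for 3 more nodes under an
# assumed 4-node limit: 3 + 3 = 6, which exceeds 4.
if can_start 3 3 4; then
    echo "job can start"
else
    echo "job stays queued"     # this branch is taken
fi
```

The second job starts only after enough nodes from the first job are released.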
JobID Username Queue    Jobname   SessID NDS TSK Memory Time  S Time
----- -------- -------- --------- ------ --- --- ------ ----- - -----
1234  jsmith1  student_ my_script 12345  3   1   --     04:00 R 03:41
   argo13-2/1+argo13-2/0+argo7-4/1+argo7-4/0+argo7-3/1+argo7-3
1235  jsmith1  student_ my_script --     3   1   --     04:00 Q --
   --

Why is the first job (id 1234) running, as indicated by the R (for running) in the status (S) column and by the names of the nodes assigned to the job, while the second job (id 1235) is queued, awaiting execution? Both jobs were submitted by the student jsmith1 for execution on the student_short queue, the second job soon after the first. And both were submitted using the following qsub command invocation:
Unlike the previous argo system, users should not request a particular node. A node may be re-assigned from one queue to another depending on system load. There is no guarantee that a particular node remains in a particular queue. Users should direct a job to a queue and not to nodes:
Right: qsub -V -q student_long my_script

Wrong: qsub -V -l nodes=argo1-1+argo1-2 my_script
Right: qsub -V -l nodes=2 -q student_long my_script

Wrong: qsub -V -l nodes=argo1-1:ppn=2+argo1-2:ppn=2 my_script
Right: qsub -V -l nodes=2:ppn=2 -q student_long my_script
Job id Name      User   Time Use S Queue
------ --------- ------ -------- - -------------
1277   my_script jsmith 0        Q student_short

If the student issues the command:
Messages: procs too high (12 > 8)
PE: 12.00 StartPriority: 7
cannot select job 1277 for partition DEFAULT (job hold active)
|
||
| Requesting multiple nodes/processors for a serial job | ||
This is an incredible waste of resources. Suppose, for
example, the user has an executable (user written or
vendor-supplied) that is serial (not a parallel program).
If the user issues the following command, the job will
execute (immediately if four nodes are available):
1228.argo.cc.uic jsmith staff my_script 29424 4 1 -- 200:0 R 170:0
   argo4-4/0+argo3-4/0+argo3-3/0+argo3-1/0

The other three nodes (argo3-4, argo3-3, and argo3-1) will be assigned to the job but will do nothing other than fill job slots that other jobs might have used. |
||
| How do I see what's going on on a compute node? | ||
Even though you can't login to a compute node, you will have
access to it via the rsh command. For example, suppose you want
to see what processes you are running on a particular node. On
the master, it's the basic ps command:
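A sketch of how the same ps command might be run on a compute node through rsh is shown below. The wrapper name node_ps is hypothetical; jsmith and argo1-1 are the sample user and node used elsewhere on this page.

```shell
# node_ps USER NODE: list USER's processes on a compute node via rsh.
node_ps() {
    rsh -l "$1" "$2" ps -u "$1"
}

# Example on argo:
#   node_ps jsmith argo1-1
```

The function only builds the rsh command line; the output is whatever ps prints on the remote node.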
rsh -l jsmith argo1-1 ls -al /tmp/junk1
-rw-r--r-- 1 jsmith users 0 May 19 13:31 /tmp/junk1
rsh -l jsmith argo1-1 rm -f /tmp/junk1
rsh -l jsmith argo1-1 ls -al /tmp/junk1
ls: /tmp/junk1: No such file or directory

There is also a very nice web-based tool to view the system. To access it, point your browser to the following URL:

Notice that you must use https and not http. You will be prompted for your netid and password. |
||
| Processing mail | ||
| Argo SHOULD NOT be used to send or receive mail; that's what your mail account is for. Do not install your own copy of a mail user agent (such as pine or elm). Doing so will result in loss of your account. | ||
| Man pages | ||
All UNIX and Linux-based operating systems have a reference manual
that documents each command. To access the manual for a particular
command, enter the word man followed by the command name. For example,
the most commonly used UNIX command is ls, which lists your
files. To read the manual page for ls, enter:
man ls
That should get you started. If you have any questions, please email systems@uic.edu and include in the subject line the word argo.
There are times (infrequent but it does happen) when one or
more nodes crash and you will have to restart your jobs. It's
the nature of the system and we are working to reduce such
events. |
||
| 2009-10-7 ACCC Systems Group |
|