| ACADEMIC COMPUTING and COMMUNICATIONS CENTER | |||||||||
The qsub command | ||
| Overview | ||
|
All user jobs are run by
torque. |
||
| Syntax | ||
|
The format of the qsub command is:
    qsub [options] script_file
where script_file is a text file containing, among other things, directives to torque and the commands that run your program.
Repeat: the operand to the qsub command MAY NOT BE an executable program (a binary file).
Passing a binary will result in your program not running. Your job will be assigned a job id and will appear if you immediately do a qstat command:

    Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
    --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
    2256.argo-new.c jsmith   staff    a.out        6427  64  --     -- 336:0 R   --
       argo4-4/0

It looks like it's running: the status column (S) contains an R for running. But, after a minute or so, the job will cancel and an error message will be written to your standard error file.
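One way to guard against this mistake is a quick check before submitting. The sketch below is only a rough heuristic: the function name looks_like_script is made up for illustration, and it assumes your script files begin with a #! line, as the sample script on this page does.

```shell
# looks_like_script is a hypothetical helper, not part of torque.
# It succeeds when the first two bytes of the file are "#!".
looks_like_script() {
    [ "$(head -c 2 "$1" 2>/dev/null)" = "#!" ]
}

# Demonstration with a throwaway file.
demo_dir=$(mktemp -d)
printf '#!/bin/csh\n#PBS -N a.out\n' > "$demo_dir/my_script"

if looks_like_script "$demo_dir/my_script"; then
    echo "my_script looks like a script file"
else
    echo "my_script does not start with #!; do not pass it to qsub"
fi
```

A compiled program such as a.out fails this check, which is the case the warning above is about.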
|
||
| More Detailed information about script files and qsub | ||
Below is a sample text file, called my_script, that contains some basic directives (commands) to torque:
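Going by the line-by-line walk-through that follows, the sample script has six lines. The sketch below writes it to a file named my_script with a shell here-document. ***homedir*** is a placeholder for the full path to your home directory, not literal text.

```shell
# Create my_script. The quoted 'EOF' keeps the shell from expanding anything.
cat > my_script <<'EOF'
#!/bin/csh
#PBS -m be
#PBS -e ***homedir***/a.out.error
#PBS -o ***homedir***/a.out.output
#PBS -N a.out
***homedir***/a.out
EOF

# Show what was written.
cat my_script
```

You could just as well create the file with a text editor; the here-document is used only to keep the example self-contained.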
You submit the text file to torque using the qsub command, giving the file name as the operand (for example, qsub my_script).

Immediately upon submitting a job for execution, the job is assigned a number, a jobid, which identifies the job and is used with other commands to, among other things, monitor the job status (is my job running, is it awaiting execution, etc.).

You are not limited to just what you see in the sample script. Other commands are available, including but not limited to shell commands like cd, ls, grep, and rm. The sample has some basic directives to get you started. Each line that has a torque directive must start with #PBS. Shell commands, on the other hand, do not.

Let's take a detailed look at the script.

The first line, #!/bin/csh, tells torque to run the script under the C-shell. You are not required to use the C-shell. If you want to run your script under the bash shell, replace the line with #!/bin/bash. The line is not a torque directive, unlike the next four lines, and does not start with #PBS.

The second line, #PBS -m be, is a directive (it starts with #PBS) and it directs torque to inform you via email when your job starts and when it ends. This line is not required; it may be altered or removed entirely. Here are two other permutations:
    #PBS -m b
    #PBS -m e
The first informs you when the job starts but not when it ends;
the second informs you when the job ends but not when it starts.
Just because you've submitted your job for execution doesn't mean
it will run immediately. Your job may wait until requested
resources become available. Or, it may never run because you
requested too many resources and/or resources that will never be
available to you. That's why the email, letting you know when your
program begins executing, is useful. If you submit a job that will
run for an extended period of time (days or weeks), it's
convenient to be informed of its completion. The email is not
sent to a maildrop on argo but rather to your UIC maildrop
(***netid***@uic.edu). To change the email destination, modify
the content of the .forward file in your home directory. If you
don't have a .forward file, create one with your maildrop on a
single line (netid@uic.edu). Example:
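A minimal sketch of creating the file from the shell. The address netid@uic.edu is a placeholder; substitute your own netid or another destination. Note that this overwrites any .forward file you already have.

```shell
# Write a one-line .forward file in the home directory.
# netid@uic.edu is a placeholder address, not a real maildrop.
forward_file="$HOME/.forward"
echo "netid@uic.edu" > "$forward_file"

# Confirm the content.
cat "$forward_file"
```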
The third line, #PBS -e ***homedir***/a.out.error, is not required. It tells torque to use the a.out.error file as the standard error. If you remove this line, then a default naming scheme is used. The name of the default is constructed from the job name, the letter e (for error), and the job number. For example, if you submit the a.out job and the job is assigned the job number 3603, then the error file is a.out.e3603.

The advantage of the default is that each new job submission creates a new error file and does not replace an existing one. The downside: each new submission creates a new error file, cluttering your home directory with hundreds of them. Once too many accumulate, shell commands like ls or rm respond slowly at best. You are STRONGLY encouraged to erase error files after reviewing the content.

The fourth line, #PBS -o ***homedir***/a.out.output, is for standard out. The same logic that applied to standard error applies to standard out. If you use the default, then a new file is created with each submission. The default standard output file is named the same way, with o in place of e; for job number 3603 it is a.out.o3603.
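The default names can be worked out by hand. The sketch below builds both names for the job in the example (job name a.out, job number 3603); the variable names are chosen only for illustration.

```shell
job_name="a.out"     # the name given with #PBS -N
job_number=3603      # the number torque assigned at submission

error_file="${job_name}.e${job_number}"
output_file="${job_name}.o${job_number}"

echo "$error_file"     # a.out.e3603
echo "$output_file"    # a.out.o3603
```

Knowing the pattern makes it easy to find, review, and then remove the files that belong to one particular job.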
The fifth line, #PBS -N a.out, assigns the name a.out to the job. Assigning a name is not required and the line may be removed.

The sixth line, ***homedir***/a.out, is the program to
execute. Obviously, this is required. Without the line, your
script does nothing. You are strongly encouraged to use the
full path to a file, as the sample does with ***homedir***/a.out.
Anything less than the full path is not a good way to identify
the location of your executable and will, most likely, cause you problems.
If you write your own programs, it is best to include the
full path to your files in open statements. Or, include
the cd command in your script ahead of the line that runs the
program, for example cd ***homedir***/tmp. Files opened by the
a.out program will then be written to the current directory,
which is ***homedir***/tmp.

You may use environment variables in your script. For example, the path to your home directory is contained in the variable $HOME. Instead of hardcoding the path to your home, you may substitute the variable $HOME, writing $HOME/a.out in place of ***homedir***/a.out.
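Both ideas, the cd command and the $HOME variable, can be combined in one script. The sketch below writes such a script to my_script; it assumes that a tmp directory exists under your home directory and that a.out sits directly in your home directory.

```shell
# Create a script that uses $HOME instead of a hardcoded path.
# The quoted 'EOF' writes $HOME literally; it is expanded when the job runs.
cat > my_script <<'EOF'
#!/bin/csh
#PBS -N a.out
# Work in $HOME/tmp so files opened by relative name land there.
cd $HOME/tmp
$HOME/a.out
EOF

cat my_script
```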
However, if you use environment variables, you must inform
torque that you are doing so. The way to do that is to use the
-V option with qsub, which exports the environment variables of
your login session to the job:
    qsub -V my_script
Best practice: always include the -V option. |
||
| Identifying a queue to run your job | ||
The command for submitting a job on the cluster is qsub.
To specify a queue, you have two options. Name the queue with the -q option on the qsub line:
    qsub -V -q student_medium my_script
    qsub -V -q student_long my_script
    qsub -V -q dedicated my_script
OR
put a #PBS -q directive naming the queue in the script and submit the script without the queue name on the qsub line:
    qsub -V my_script |
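For the in-script form, the queue goes on a #PBS -q line, the directive counterpart of the -q option. The sketch below writes a script that asks for the student_long queue; ***homedir*** is a placeholder for your home directory, and the choice of queue is only an example.

```shell
# Create a script that names its own queue.
cat > my_script <<'EOF'
#!/bin/csh
#PBS -q student_long
#PBS -N a.out
***homedir***/a.out
EOF

# Show the queue directive that torque will read.
grep '^#PBS -q' my_script
```

With the directive in place, the script is submitted with qsub -V my_script and no -q option.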
||
| Using more than one machine/processor | ||
The number of nodes you request to run your job is specified,
like the destination, either on the qsub line or in the script.
(FYI: the number of requested nodes is a resource.) If you are
running a serial job (a serial job uses one node and one core
on it), which is the default, you don't have to request one
node. The following request for a single node is unnecessary:
    qsub -V -l nodes=1 my_script
It's the same as doing:
    qsub -V my_script
If you will be running a parallel job using multiple nodes,
then you must request (hardcode) the number of nodes you want
allocated, either on the qsub line:
    qsub -V -l nodes=4 my_script
or in the script:
    #PBS -l nodes=4
In both cases, the user is requesting four nodes.

You should not request a particular node. A node may be moved from one queue to another depending on system load. There is no guarantee that a particular node remains in a particular queue. Users should direct a job to a queue and not to nodes:

    Wrong: qsub -V -l nodes=argo1-1 my_script
    Right: qsub -V -l nodes=1 -q student_long my_script

    Wrong: qsub -V -l nodes=argo1-1+argo1-2 my_script
    Right: qsub -V -l nodes=2 -q student_long my_script

    Wrong: qsub -V -l nodes=argo1-1:ppn=2+argo1-2:ppn=2 my_script
    Right: qsub -V -l nodes=2:ppn=2 -q student_long my_script

You want multiple cores. Cores are indicated by the ppn operand that follows the request for the number of nodes. Same as before: you may use one of the two formats to make the request. For the purposes of brevity, I will show only the first method; you should be able to construct the second method from previous examples.
    #PBS -l nodes=2:ppn=2
    #PBS -l nodes=3:ppn=2
    #PBS -l nodes=4:ppn=2
Obviously, you specify the ppn only when it is greater than one, one being the default. If you don't
specify the number of cores, then only one is assigned.
Asking for ppn=1 and leaving ppn off do the same thing;
both request one core (the first states the request;
the second lets the system default to one core).
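The rule that ppn matters only when it is greater than one can be sketched as a small shell function. build_request is a made-up helper for illustration, not a torque command; it only assembles the text that follows -l.

```shell
# build_request NODES [PPN]
# Prints a resource string for -l. ppn is left off when it is 1,
# since one core per node is the default.
build_request() {
    nodes=$1
    ppn=${2:-1}
    if [ "$ppn" -gt 1 ]; then
        echo "nodes=${nodes}:ppn=${ppn}"
    else
        echo "nodes=${nodes}"
    fi
}

build_request 2 2    # nodes=2:ppn=2
build_request 1 1    # nodes=1
build_request 3      # nodes=3

# Total cores requested is nodes times ppn.
echo "4 nodes x 2 ppn = $((4 * 2)) cores"
```

The printed string would be used as in qsub -V -l nodes=2:ppn=2 -q student_long my_script.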
The requested cores will be on the same machine. If you request one machine with four cores, the system will look for a single machine having four cores. If it fails to find a matching machine, it will NOT attempt to allocate two machines and two cores on each of the two machines. Instead, your job does not run (it queues) because the system is unable to satisfy your request.

The resources (not just cores but other requested resources including but not limited to nodes) may be owned by another job and, when that job completes and the resources become available, they will be assigned to your job. However, if you request resources that will never be available ("I want eight cores on one node" is an example of a nonexistent resource), your job will never run.

A VERY IMPORTANT POINT

If your program or software is neither parallel nor distributed, specifying more than one node and/or more than one processor DOES ABSOLUTELY NOTHING other than waste resources that could be used by some other client.
Gaussian is an example of a parallel/distributed package; GaussView, produced by the same company that wrote Gaussian, is not. If your program is not parallel/distributed, then it is SERIAL. That means it executes on a single processor on a single node. If the software or program is SERIAL, do not include the nodes and ppn components on the qsub command. Including them will NOT (REPEAT, WILL NOT) make your program run faster. |
||
| What are the limits regarding multiple processors and machines | ||
|
Currently, there is a chart that is displayed when you log in.
If you failed to notice the chart or if it scrolled off the
screen, you may display it along with other login messages
by using the following argo command:
At some point, that chart will be removed from the "message of the day" file (motd). There are commands that you can use to display queue limits, such as qstat -q, which lists each queue with its limits.
To see other policies (rules) of the student_short queue:
    qmgr -c "list queue student_short"
The samples use the student_short queue. You may substitute the
name of other queues (student_medium, student_long, staff,
dedicated) to get the corresponding settings for them.
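To avoid retyping the command for every queue, the substitution can be done in a loop. The sketch below only prints the qmgr commands, so it runs anywhere; queue_command is a made-up helper name, and the queue list is the one given on this page.

```shell
# queue_command QUEUE
# Prints the qmgr command that lists the settings of one queue.
queue_command() {
    echo "qmgr -c \"list queue $1\""
}

for queue in student_short student_medium student_long staff dedicated; do
    queue_command "$queue"
done
```

On the cluster, each printed line can be pasted at the prompt to see that queue's settings.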
|
||
| How long can my job run | ||
|
Queues have two values regarding how long a job may run: a
default and a maximum. The default is used when a user does not
specify how long to run the job, called walltime (more about
this in a moment). The maximum is exactly that: regardless of
how much time a user requests, a job is permitted to run no
longer than the maximum. The maximums (per queue) are shown
in the output of the qstat -q command (the walltime column).

The defaults are shown using the qmgr command:
    qmgr -c "list queue student_short" | grep resources_default.walltime
    qmgr -c "list queue student_medium" | grep resources_default.walltime
    qmgr -c "list queue student_long" | grep resources_default.walltime
    qmgr -c "list queue staff" | grep resources_default.walltime
    qmgr -c "list queue dedicated" | grep resources_default.walltime

Walltime is specified, like the queue, as either an operand on the
qsub command or as a value in the text file. In your first run, let
the system use the default. If you are running an interactive job,
specify a value considerably less since you know how long you plan to
interact with the software. Walltime MUST be requested using the
format HHH:MM:SS and the system interprets the values from right to
left. Examples: -l walltime=12:00:00 requests 12 hours, and
-l walltime=00:30:00 requests 30 minutes.
There is a gotcha regarding walltime. As was stated, the
system reads the walltime specification from right to left, and all three
fields must be included even if just to contain zeroes. As an
example, the following format does not request 720 hours; instead,
it asks for only 12 hours:
    walltime=720:00
Reading from the right, the first field (00) is the number of seconds
requested and the second field (720) is the number of minutes. 720 minutes
(divided by 60 minutes per hour) translates to 12 hours. To request 720
hours:
    walltime=720:00:00
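The right-to-left reading can be checked with a little arithmetic. The sketch below converts a walltime string to seconds the way the text describes: the rightmost field is seconds, the next is minutes, the next is hours. walltime_seconds is a made-up helper for illustration, not a torque command.

```shell
# walltime_seconds SPEC
# Reads the colon-separated fields from right to left:
# seconds, then minutes, then hours.
walltime_seconds() {
    echo "$1" | awk -F: '{
        secs = 0
        mult = 1
        for (i = NF; i >= 1; i--) {
            secs += $i * mult
            mult *= 60
        }
        print secs
    }'
}

walltime_seconds 720:00       # 43200 seconds, which is 12 hours
walltime_seconds 720:00:00    # 2592000 seconds, which is 720 hours
```

Dividing the result by 3600 gives hours, which makes it plain that 720:00 falls well short of a 720-hour request.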
There is no command to tell you how long a job will run (in other words, what to enter as the walltime). You just have to estimate. |
||
| Documentation | ||
|
||
| 2010-1-28 ACCC Systems Group |
|