ACADEMIC COMPUTING and COMMUNICATIONS CENTER
 

Gaussian 03 on argo-new

   
 
     
To use Gaussian 03, you need:
 
  • The C shell (csh) as your login shell
  • A .cshrc file in your home directory
  • A Default.Route file in your home directory
  • A .tsnet.config file in your home directory
  • The subg03 script
 
     
C Login Shell
 

Your login shell must be the C shell (csh). By default, all newly created accounts use the bash shell (/bin/bash). To find out which shell you currently use:

echo $SHELL

If you see /bin/bash, you are using bash. If you see /bin/csh, you are using the C shell.

To change your login shell to the C shell, enter the following command:

chsh -s /bin/csh

The result should be

Changing shell for jsmith.

If you include your netid in the command:

chsh -s /bin/csh jsmith

you will be prompted to enter your argo login password for security reasons:

Changing shell for xxxxxx.
Password:

The C shell will be available to you the next time you log in.

 
     
.cshrc file
 

You will need a .cshrc file in your home directory (the name has a leading period and is all lowercase). If it doesn't exist, create it. The file must include the following lines, copied exactly as they appear below:

setenv g03root "/usr/common"
setenv GAUSS_SCRDIR "/tmp"
source $g03root/g03/bsd/g03.login
if (! ($?LD_LIBRARY_PATH)) then
setenv LD_LIBRARY_PATH "/usr/common/g03"
else
setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH":$g03root/g03"
endif

Warning: avoid using Windows/DOS utilities to cut and paste the above into your .cshrc file. They may add DOS-style carriage return/line feed line endings, which differ from the line endings used on Linux and cause an error when you log in.
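
If you suspect that DOS line endings have already crept into a file, the check and repair can be sketched as below. This is only an illustration: the function names has_cr and strip_cr are hypothetical helpers, not part of the Gaussian or ACCC setup, and the code is Bourne shell (save it to a file and run it with sh), not C shell.

```shell
#!/bin/sh
# Hypothetical helpers (assumptions, not part of the ACCC or Gaussian setup).

# has_cr FILE: succeeds if FILE contains at least one carriage return.
has_cr() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# strip_cr FILE: rewrites FILE without carriage returns.
# The original is kept as FILE.bak.
strip_cr() {
    cp -p "$1" "$1.bak" && tr -d '\r' < "$1.bak" > "$1"
}

# Example use, left commented out so nothing is changed by accident:
# if has_cr "$HOME/.cshrc"; then strip_cr "$HOME/.cshrc"; fi
```

The backup copy makes it easy to compare the file before and after the carriage returns are removed.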

The permissions for the file should be 755:

cd ; chmod 755 .cshrc

 
     
Default.Route file
 

We recommend that you keep your own copy of the Gaussian 03 system defaults file, Default.Route. Changes made to your copy override the system defaults. To get a copy:

cp -p /usr/common/g03/Default.Route $HOME

The system defaults are:

-S- UIC
-#- MaxDisk=2GB
-M- 64MB

The permissions for the file should be 755:

cd ; chmod 755 Default.Route

 
     
.tsnet.config file
 

The .tsnet.config file is required (the name has a leading period and is all lowercase). It is your local copy of the Linda global system configuration file, and it allows you to customize the Linda/g03 environment. To place a copy in your home directory:

cp -p /usr/common/g03/.tsnet.config.model $HOME/.tsnet.config

The permissions for the file should be 644:

cd; chmod 644 .tsnet.config

To understand the changes you must make to the file, you will first need to know more about the subg03 script.

 
     
Subg03 script
 

The subg03 script (also referred to as g03sub) starts a Gaussian job. To get a copy:

cp -p /usr/common/g03/subg03 $HOME

Change two lines in your copy; the easier of the two to explain is the second one:

set input = my_g03_input_file.in


The filename must end with .in. If it does not, use the Linux mv command to rename the file. For example, if your input file is named mydata:

mv mydata mydata.in

The name is placed after the equal sign in the set input line:

set input = mydata.in

The location of your input file is defined in the third line:

set WORKDIR = "/scratch/$USER"

If you prefer to have your input files in your home directory, then:

set WORKDIR = "$HOME"

The first line in the subg03 script indicates the number of worker nodes to use to run your job. There are two mutually exclusive formats for the number of nodes line. Use one or the other but not both in the same run. The two generic formats:

  1. @ N = number_of_nodes
  2. set NODES = node+node+node...+node

The first format:

@ N = 4

Here, you are telling the batch system to execute your g03 job on four nodes. Since you identify only the number of nodes and not which ones, the batch system will use system load averages as the basis for deciding which four nodes.

The second format:

set NODES = node+node+node...+node

Here, you identify the particular nodes to be used as workers. Examples:

set NODES = argo1-3

  • Use one node: argo1-3

set NODES = argo4-4+argo4-3

  • Use two nodes: argo4-4 and argo4-3

set NODES = argo1-4+argo1-3+argo1-2

  • Use three nodes: argo1-2, argo1-4, and argo1-3

set NODES = argo1-4+argo1-3+argo1-2+argo1-1

  • Use four nodes: argo1-1, argo1-2, argo1-3, and argo1-4

The order of the nodes in the set NODES statement is not important; the nodes do not need to be in ascending or descending order. Also, it is not required that the nodes be in the same group though it is STRONGLY recommended. The following is an example of mixing nodes from different groups as well as having them in a random order:

set NODES = argo1-3+argo1-2+argo3-4+argo1-4

Three of the nodes are from group one (argo1-x); the other is from group three (argo3-x). The first node in the set NODES statement has a special significance. More on that later.

Back to the .tsnet.config file. The first line in the file is

Tsnet.Appl.nodelist: xxxxxx xxxxxx xxxxxx xxxxxx

where xxxxxx xxxxxx xxxxxx xxxxxx are the names of the nodes listed in the set NODES statement. For example, if you have:

set NODES = argo1-4+argo1-3+argo1-2+argo1-1

then the Tsnet.Appl.nodelist must have those same nodes:

Tsnet.Appl.nodelist: argo1-4.cc.uic.edu argo1-3.cc.uic.edu argo1-2.cc.uic.edu argo1-1.cc.uic.edu

However, it is not necessary for the order of nodes in the two statements to be the same. The following is permitted:

set NODES = argo1-4+argo1-3+argo1-2+argo1-1

Tsnet.Appl.nodelist: argo1-3.cc.uic.edu argo1-4.cc.uic.edu argo1-1.cc.uic.edu argo1-2.cc.uic.edu

Notice that in the Tsnet.Appl.nodelist statement the first node is argo1-3, while the first node in the set NODES statement is argo1-4.
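
Keeping the two statements in step by hand is error-prone, so the nodelist line can be generated from the set NODES value. The sketch below assumes the .cc.uic.edu suffix used in the example above. The function name nodes_to_nodelist is a hypothetical helper, and the code is Bourne shell (run it with sh), not C shell.

```shell
#!/bin/sh
# nodes_to_nodelist NODES: hypothetical helper (an assumption, not an ACCC tool).
# Prints a Tsnet.Appl.nodelist line for a "+"-separated set NODES value.
nodes_to_nodelist() {
    line="Tsnet.Appl.nodelist:"
    # Split the argument on "+" and append the domain used in the examples.
    for node in $(printf '%s\n' "$1" | tr '+' ' '); do
        line="$line $node.cc.uic.edu"
    done
    printf '%s\n' "$line"
}

nodes_to_nodelist "argo1-4+argo1-3+argo1-2+argo1-1"
```

The printed line can then be pasted in as the first line of your .tsnet.config file.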

 
     
Work files - Handling and Location
 

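Several work files are used in the course of a g03 computation: the checkpoint file (.chk), the Read-Write file (.rwf), the integral file (.int), the derivative file (.d2e), and a scratch file (.scr).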

The files all have the same name; it is the file extension that identifies the type. By default, these files are given a name that appends the job process ID to the characters Gau-. For example, the following all have the name Gau-13089:

    -rw-r--r-- 1 jsmith student 21037056 Feb 14 11:48 Gau-13089.rwf
    -rw-r--r-- 1 jsmith student  5107056 Feb 14 11:48 Gau-13089.chk
    -rw-r--r-- 1 jsmith student 0        Feb 14 11:47 Gau-13089.int
    -rw-r--r-- 1 jsmith student 0        Feb 14 11:47 Gau-13089.d2e
    -rw-r--r-- 1 jsmith student 524288   Feb 14 11:48 Gau-13089.scr

If your job completes successfully, the subg03 script deletes these files. If you want to retain them, comment out the following lines in your subg03 file:

     if ( "$exits" == "Normal termination" ) then
        echo "Normal termination - erasing: "%JOB%.rwf  %JOB%.chk  %JOB%.int  %JOB%.d2e  %JOB%.scr
        rm -f %JOB%.rwf; rm -f %JOB%.chk; rm -f %JOB%.int; rm -f %JOB%.d2e; rm -f %JOB%.scr
     endif

One reason to keep the files is that they may be used with GaussView. However, they can take up a great deal of disk space, so retain only the files you need.

You may override the default location and place all of the files in some other directory via the environment variable GAUSS_SCRDIR. For example, to place them in your scratch directory, enter the following command at the Linux prompt:

setenv GAUSS_SCRDIR /scratch/$USER

As a result, all subsequent runs during the current login session will place the files in your scratch directory. To make this change permanent, put the command in your .cshrc file.

If, instead, you wish to relocate one or more individual files, you do so with a statement in your input file. For example, to place the rwf file in your scratch directory but leave the remaining files in your home directory, do two things. First, undo any global change made via the setenv statement. Second, include the following in your input file:

%rwf=/scratch/jsmith

As a result, the Read-Write file is written to scratch while the other files are written to your home directory. Notice that you must hardcode your netid into the statement; using the environment variable $USER will not work.

If you want to relocate a file and then rename it, include the file name after the path:

%rwf=/scratch/jsmith/my_example

The system will append the appropriate extension, in this example, rwf.

Argo is a 32-bit machine, so no file may be larger than 2GB. If any of your scratch files will exceed 2GB, you must break the file up into multiple pieces, each no larger than 2GB. The generic syntax is:

%rwf=loc1,size1,loc2,size2,...

The format also applies to the Integral file and the Derivative file; for those files, replace rwf with the appropriate extension. The following example partitions the rwf file into two separate pieces in the scratch filesystem:

%rwf=/scratch/jsmith/piece1,245mw,/scratch/jsmith/piece2,245mw,-1

Several things need to be addressed:

  • Change the example netid, jsmith, to your netid.
  • You may select some other name for the component pieces; the names piece1 and piece2 were selected for illustration purposes.
  • Leave the figure 245mw as is; it translates to 1.9GB.
  • Information is not written equally to the partitions. The first 1.9GB of I/O is done to the first partition. When that piece is filled, then I/O occurs on the second partition.

You are limited to a maximum of 8 partitions.
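
The arithmetic behind the 245mw figure can be checked with a few lines of Bourne shell (run with sh, not csh). The sketch assumes that one megaword is 1024 × 1024 words of 8 bytes each and that the 2GB limit means 2^31 bytes; both are assumptions, chosen because they reproduce the 1.9GB quoted above. The function name mw_to_bytes is a hypothetical helper.

```shell
#!/bin/sh
# mw_to_bytes N: bytes in N megawords.
# ASSUMPTION: 1 mw = 1024 * 1024 words, and each word is 8 bytes.
mw_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 8 ))
}

# ASSUMPTION: the 2GB per-file limit is 2^31 bytes.
limit=$(( 2 * 1024 * 1024 * 1024 ))
piece=$(mw_to_bytes 245)

echo "245mw = $piece bytes; limit = $limit bytes"
if [ "$piece" -lt "$limit" ]; then
    echo "a 245mw piece fits under the limit"
fi
```

Under these assumptions 245mw comes to 2,055,208,960 bytes, just under the 2,147,483,648-byte limit, which is why the figure should be left as is.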

One of the advantages of using the scratch system is that it has a substantial amount of space. However, there is a performance penalty for using scratch, which is an NFS-mounted filesystem. If your rwf file is less than 5GB, you may use the /tmp filesystem on one of the worker nodes running your job. To do that (the rwf file will again be used for example purposes):

%rwf=/tmp/piece1,245mw,/tmp/piece2,245mw,-1

This will partition the file into two 1.9GB pieces. Each worker node in your job has its own /tmp directory, so which node's /tmp is used? This is where the first node in your set NODES statement comes into play: the node listed first is the one whose /tmp filesystem is used. Examples:

set NODES = argo1-4+argo1-3+argo1-2+argo1-1

  • place both partitions (piece1 and piece2) in the /tmp filesystem on argo1-4

set NODES = argo1-1+argo1-3+argo1-2+argo1-4

  • place both partitions in the /tmp filesystem on argo1-1

Since each of the /tmp filesystems is only 5GB, it is best to select a node that is currently not in use. To do that, use the qstat command:

qstat -an

In the following example output, only the Gaussian jobs are shown; the nodes assigned to each job are listed on the line below it:

    501.argo.cc.uic user1 scali_ex job1.pbs  2218 2 -- -- -- R 43:38
      argo2-4/0+argo2-3/0
    505.argo.cc.uic user2 scali_ex job2.pbs   897 2 -- -- -- R 43:27
      argo2-2/0+argo2-1/0
    509.argo.cc.uic user3 scali_ex job3.pbs 26061 4 -- -- -- R 28:10
      argo1-4/0+argo1-3/0+argo1-2/0+argo1-1/0
    520.argo.cc.uic user4 scali_ex job4.pbs 8512  4 -- -- -- R 19:50
      argo1-3/1+argo1-4/1+argo1-2/1+argo1-1/1

Notice the first node of each job (argo2-4, argo2-2, argo1-4, argo1-3). Gaussian users are not required to place their scratch files in /tmp, but assume that each user does, and choose the first node in your set NODES statement accordingly. If you opt to use four worker nodes in group one (a group that is already being used by two other Gaussian jobs, #509 and #520), make the first node in your set NODES statement either argo1-2 or argo1-1:

set NODES = argo1-2+argo1-3+argo1-4+argo1-1

As a result, there is a greater likelihood that space will be available in the /tmp filesystem because you will not be sharing it.
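
Picking out the first node of every running job can be automated. The sketch below is Bourne shell (run with sh, not csh). It assumes that qstat -an prints each job's node list on its own line beginning with a node name such as argo2-4/0, as in the sample output above. The function name first_nodes is a hypothetical helper.

```shell
#!/bin/sh
# first_nodes: hypothetical helper (an assumption, not an ACCC tool).
# Reads "qstat -an" style output on standard input and prints the first
# node of each job's node list, one per line.
first_nodes() {
    # Node-list lines are those whose first field looks like "argoN-M/...".
    # The first node is the part of that field before the first "/".
    awk '$1 ~ /^argo[0-9]+-[0-9]+\// { split($1, part, "/"); print part[1] }'
}

# Example use on a live system (commented out; requires qstat):
# qstat -an | first_nodes | sort -u
```

Any node that does not appear in the resulting list is a reasonable candidate for the first position in your set NODES statement.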

You may have partitions on different types of filesystems: combining a local filesystem (/tmp) with an NFS-mounted filesystem (/scratch). For example:

%rwf=/tmp/piece1,245mw,/scratch/jsmith/piece2,245mw,-1

Notice that the first partition, piece1, is written to the /tmp filesystem on the local node, whereas the second partition, piece2, is written to /scratch. This has two advantages:

  • I/O to the first partition is enhanced because it is on the disk local to the node; and
  • You will circumvent the 5GB limit that the size of the /tmp filesystem imposes on your entire file. Instead, you will be able to use up to eight partitions:

    %rwf=/tmp/piece1,245mw,/scratch/jsmith/piece2,245mw,/scratch/jsmith/piece3,245mw, and so on ...

 
     
Warnings and Problems
 
  • I want to run Gaussian jobs but I get the following message when I login to argo:

/usr/common/g03/bsd/g03.login: Permission denied..

It means that you are not in the g03 group. If, when you requested your argo account, you failed to mention that you would use Gaussian, then your netid was not added to the appropriate group. Send email to systems@uic.edu requesting that your netid be added to the g03 group.

  • If you see the following message in your output file:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.

Disregard it; it has to do with a bug in shell scripts and has no bearing on your output.

  • I submitted my subg03 script for execution and received the following message:

Permission denied..

It means that you don't have user execute permission on the file. To fix it, execute the following command:

chmod u+x subg03

  • My job did not run and I received the following message:

qsub: Job exceeds queue resource limits.

There are several possible causes for this but the most likely is that you identified a node in the set NODES statement (the subg03 file) that does not exist. For example:

set NODES = argo94-4+argo9-3+argo9-2+argo9-1

There is no node argo94-4. This is a typing mistake, corrected by specifying node argo9-4.

  • If you see:

ntsnet: unable to schedule the minimum 3 workers.

the likely cause is an incompatibility between the number given on %NProcl and the number of nodes specified in either the .tsnet.config file or the subg03 script. For example:

.tsnet.config: Tsnet.Appl.nodelist: argo.cc.uic.edu argo3-4.cc.uic.edu argo3-3.cc.uic.edu argo3-2.cc.uic.edu

subg03: set NODES = argo3-4+argo3-3+argo3-2

input file: %NProcl=4

The problem is that %NProcl is set to four, but only three compute nodes (3-4, 3-3, and 3-2) are listed in both the .tsnet.config and the subg03.

  • If you see something like the following in your output file:

    Erroneous write. write xxxxxx instead of xxxxxx.
    fd = x
    Erroneous write. write xxxxxx instead of xxxxxx.
    fd = x
    writwa
    writwa: Resource temporarily unavailable

    the problem is that your job is writing scratch files to a filesystem that is unable to keep up with the I/O. Most likely, it is an NFS-mounted filesystem like /scratch.

  • If you see something like the following in your output file:

    Erroneous write during file extend. write -1 instead of 4096
    Probably out of disk space.
    Write error in NtrExt1

    the problem is that the location for your temporary files is out of space. The default location is on one of the nodes used by your job. Which node depends on how you selected the nodes to use. Recall that there are two ways: you select the nodes, or you let the system decide for you. The second method, the one where you tell the system which nodes to use, is the simpler to explain. You identify the nodes with the set NODES statement in your subg03 script, and your files are in the /tmp directory on the first node. For example, suppose you had the following:

      set NODES = argo1-4+argo1-3+argo1-2+argo1-1

    then your files are on argo1-4 in its /tmp directory. On the other hand, if you used the first format, @ N = number_of_nodes, and let the system pick for you, then you must do a little investigative work to find which node. Start by accessing the How Much is Argo Being Used option. On the screen, click on the first item, the one containing "current month." On the next screen, click on your userid (it's on the left side) to display information about the jobs you've run. Find the appropriate job and scroll the screen horizontally until the information under the "Hosts" column is visible. The first node listed there is the one you want; it contains your files. For example:
      Start Date/Time    Job Name    Job ID Status  Wall-Time   CPU-Time Memory(kb) Memory(kb) Nodes PPN Hosts
      ---------- -------- ---------- ------ ------ ---------- ---------- ---------- ---------- ----- --- -------
      2007-12-05 18:05:31 myjob.pbs   12345      0   00:03:42   00:02:19      80060     148152     4   1 argo1-1 argo1-2 argo1-3 argo1-4

    Here, your files are located in /tmp on argo1-1.

    Once you know where to find the files, you can use the remote shell (rsh) to list the /tmp directory of that node and then delete the offending file:

    rsh argox-x ls -alrt /tmp
    rsh argox-x rm -f /tmp/user-file-too-large

    Assume, for example, your netid is jsmith and you've run out of space in /tmp on argo1-1. To list what files are there (all files - Gaussian as well as anything else):

      rsh argo1-1 ls -al /tmp

    Sample output:
        srwx------ 1 root        99        0 Sep 12 2005 .fam_socket
        drwxrwxrwt 2 xfs        xfs     4096 Mar 20 2007 .font-unix
        drwx------ 2 root      root    16384 Sep 12  2005 lost+found
        -rw-r--r-- 1 jsmith student 21037056 Feb 14 11:48 Gau-13089.rwf
        -rw-r--r-- 1 jsmith student  5107056 Feb 14 11:48 Gau-13089.chk
        -rw-r--r-- 1 jsmith student        0 Feb 14 11:47 Gau-13089.int
        -rw-r--r-- 1 jsmith student        0 Feb 14 11:47 Gau-13089.d2e
        -rw-r--r-- 1 jsmith student    23411 Feb 14 11:48 Gau-13089.scr
        -rw-r--r-- 1  mhoma student 31037056 Feb 11 11:48 Gau-13089.rwf
        -rw-r--r-- 1  mhoma student   104686 Feb 11 11:48 Gau-13089.chk
        -rw-r--r-- 1  mhoma student   234666 Feb 11 11:47 Gau-13089.int
        -rw-r--r-- 1  mhoma student        0 Feb 11 11:47 Gau-13089.d2e
        -rw-r--r-- 1  mhoma student        0 Feb 11 11:48 Gau-13089.scr
        srwxrwxrwx 1  mhoma student   163840 Oct 23 21:42 mpd2.console_bwang9
        srwxrwxrwx 1  mhoma student  2673333 Apr 25  2007 mpd2.console_ksengu1
        srwxrwxrwx 1  mhoma student    67122 Mar 19  2007 mpd2.console_mlahir2
        srwxrwxrwx 1  root     root        0 Mar 20  2007 mpd2.console_root
    
    
    The output includes ALL files, yours (jsmith) as well as those of other users. The files are not just those used by Gaussian; other non-Gaussian jobs may be putting files in /tmp. To restrict the listing to only your Gaussian files (the wildcard is quoted so that it is expanded on argo1-1 rather than on the machine where you type the command):

      rsh argo1-1 'ls -al /tmp/Gau-*' | grep jsmith

    The system permits you to delete only files you own. For example, to delete just your rwf file:

      rsh argo1-1 rm -f /tmp/Gau-13089.rwf

    To delete all of your Gaussian files (the wildcard is quoted so that it is expanded on argo1-1):

      rsh argo1-1 'rm -f /tmp/Gau-*'

    Note that deleting your Gaussian files doesn't ensure that your next job submission (using argo1-1 for temporary files) will complete successfully; you may encounter the same error. Erasing your files frees space, but there may be other files in /tmp that you don't own and can't delete. Those files may be using just enough space to cause a repeat of your error, only later. If that's the case, you will need to use a different node.

  • If you see something like:

    node argo.cc.uic.edu(1): port xxxxx: keepalive failed:
    node argox-x.cc.uic.edu(2), port xxxx did not respond, aborting
    Linda Error: node argo.cc.uic.edu(1): keepalive failure
    ntsnet: unable to stop all processes
    ntsnet: process on node argox-x.cc.uic.edu running

    the problem is that one of the nodes on which you were running your job has crashed.

  • If you see something like:

    Linda // error: [0,17516]: send_recv.c 304:
    tcpread failed for 64 reading from 64:
    no route to host.
    Signal #15 received

    the problem is that the job has lost communication with one of the nodes on which you were running your job. Select other nodes on which to run your job and resubmit it.

  • If you see something like:

    /usr/common/g03/linda-exe/l302.exel: error while loading shared libraries:
    util.so: cannot load shared object file: No such file or directory

    one possible cause (there are many) is a permission problem with one or more of the nodes specified in the set NODES statement (in the script subg03). For example, if your set NODES statement is:

    set NODES=argo9-4+argo9-3+argo9-2+argo9-1

then the permission problem is on one or more of the machines in the nine group. Execute the following command for each of the nodes, replacing X-X with the node name and id with your netid:

rsh -l id X-X uptime

For example,

rsh -l jsmith argo9-4 uptime

If the output from any of the commands is

Permission Denied

then send a message to systems@uic.edu explaining the problem and naming the node or nodes that produce Permission Denied.
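
For the ntsnet "unable to schedule the minimum 3 workers" message described above, the mismatch can be caught before you submit. The sketch below is Bourne shell (run with sh, not csh). It assumes, as that example suggests, that the %NProcl value should equal the number of nodes in the set NODES statement. The function names count_nodes, nprocl_of and check_nprocl are hypothetical helpers.

```shell
#!/bin/sh
# Hypothetical helpers (assumptions, not part of subg03 or Gaussian).

# count_nodes NODES: number of "+"-separated names in a set NODES value.
count_nodes() {
    printf '%s' "$1" | tr '+' '\n' | grep -c .
}

# nprocl_of FILE: value of the first %NProcl= line in a Gaussian input file.
# The keyword is matched without regard to case.
nprocl_of() {
    awk -F= 'tolower($1) == "%nprocl" { print $2; exit }' "$1"
}

# check_nprocl NODES FILE: succeeds only if the two numbers agree.
# ASSUMPTION: %NProcl should equal the node count, as in the example above.
check_nprocl() {
    n=$(count_nodes "$1")
    p=$(nprocl_of "$2")
    if [ "$n" = "$p" ]; then
        echo "ok: $n nodes, %NProcl=$p"
    else
        echo "mismatch: $n nodes, %NProcl=$p"
        return 1
    fi
}

# Example use (commented out; mydata.in is the example input file name):
# check_nprocl "argo3-4+argo3-3+argo3-2" mydata.in
```

With the values from that example, three nodes and %NProcl=4, the check reports a mismatch.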

 
     
Starting your Gaussian job
 

The commands to execute a Gaussian job are in the file subg03. To run the job, enter:

    text

DO NOT do the following:

    text
 
     
Recommendations (Important - Please Read)
 

G03 divides the work as evenly as possible among the processors you select for your job (in your set NODES statement). If the processors perform unequally, the faster ones finish their share first and then sit idle while the slower ones catch up.

To avoid lost cycles, choose a set of processors that are very similar. Currently, there are three groups of machines (machines within the same group have the same type of processor):

    group one:    argo1-1 to argo4-4
    group two:    argo5-1 to argo8-4
    group three:  argo9-1 to argo16-4

Select processors within the same group, and avoid mixing groups that have different processor types. For example:

These are OK:
    set NODES = argo4-4+argo4-3+argo4-2+argo4-1
    set NODES = argo6-1+argo6-2+argo6-3+argo6-4

Why are they okay? In the first line, each of the four nodes is from one of the sixteen nodes in group one. The same is true for the second line - each of the four nodes is from one of the sixteen nodes in the second group.

Don't do the following:
    set NODES = argo4-4+argo4-3+argo6-1+argo6-2

Why is this bad? You are mixing nodes from different groups: argo6-1 and argo6-2 are from group two (Xeon processors) while the other two machines are from group one (Opteron processors).
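
The group rule can be checked mechanically before you submit. The sketch below is Bourne shell (run with sh, not csh) and uses the three group ranges listed above. The function names node_group and same_group are hypothetical helpers, not ACCC tools.

```shell
#!/bin/sh
# node_group NODE: prints 1, 2 or 3 for a node name such as argo6-2,
# using the group ranges listed above, or "unknown" for anything else.
node_group() {
    n=${1#argo}      # drop the "argo" prefix   -> "6-2"
    n=${n%%-*}       # keep the leading number  -> "6"
    case $n in
        ''|*[!0-9]*) echo unknown; return ;;
    esac
    if [ "$n" -ge 1 ] && [ "$n" -le 4 ]; then echo 1
    elif [ "$n" -ge 5 ] && [ "$n" -le 8 ]; then echo 2
    elif [ "$n" -ge 9 ] && [ "$n" -le 16 ]; then echo 3
    else echo unknown
    fi
}

# same_group NODES: succeeds if every node in a "+"-separated list
# belongs to the same known group.
same_group() {
    found=$(for node in $(printf '%s\n' "$1" | tr '+' ' '); do
                node_group "$node"
            done | sort -u)
    [ "$found" = "1" ] || [ "$found" = "2" ] || [ "$found" = "3" ]
}

if same_group "argo4-4+argo4-3+argo6-1+argo6-2"; then
    echo "all nodes are in one group"
else
    echo "nodes come from different groups"
fi
```

For the mixed list in the last example the script prints "nodes come from different groups".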
 
     
Additional Help
 

The Gaussian web site

 

 


2008-4-24  ACCC Systems Group