A.      Cluster Overview
        ----------------
        o Purpose
          * Make a group of computers perform as a single machine.
          * Clustering has been going on for quite a while. In 1994, Donald
            Becker, a NASA researcher, invented a way to connect a group
            of inexpensive off-the-shelf PCs with special software to
            create a single system that could be scaled up to deliver
            supercomputer performance. Cluster name:  Beowulf.
          * Beowulf clusters can have hundreds or thousands of computers.

        o Advantages
          * High performance computing at a fraction of the cost
          * Scale up (add, subtract, modify nodes)
          * Have a heterogeneous environment.

        o Parallel Programming
          * Parallel computation on a Beowulf-type cluster is
            accomplished by dividing a computation into parts and
            making use of multiple processes.
            % Sometimes a single processor can be used for all the
              processes.
            % Most complex problems involve processes executing on
              separate processors. (In our case, they will be
              separate virtual processors. Each node has one physical
              processor but will have eight virtual processors. More
              on this during the PBS talk.)
          * Processes coordinate their activities by sending and
            receiving messages (called message passing).
          * MPI (Message Passing Interface) is a library specification
            (separate talk).
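
        A minimal sketch of the idea above: a computation is divided
        into parts, and workers coordinate with a master purely by
        sending and receiving messages. It is an illustration only.
        On the cluster this would be done with MPI across processes;
        here Python threads and queues stand in for processes and the
        interconnect, and the names worker and parallel_sum are
        hypothetical.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """Receive a chunk of numbers, sum it, and send the result back."""
    chunk = inbox.get()              # blocking receive
    outbox.put((rank, sum(chunk)))   # send the partial result to the master


def parallel_sum(data, num_workers=4):
    """Split data into num_workers parts and combine the partial sums."""
    outbox = queue.Queue()
    inboxes = [queue.Queue() for _ in range(num_workers)]
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], outbox))
        for rank in range(num_workers)
    ]
    for t in threads:
        t.start()
    # Divide the computation: worker i gets every num_workers-th element.
    for rank in range(num_workers):
        inboxes[rank].put(data[rank::num_workers])
    # Collect one message per worker, in whatever order they arrive.
    total = sum(outbox.get()[1] for _ in range(num_workers))
    for t in threads:
        t.join()
    return total
```

        For example, parallel_sum(list(range(1, 101))) returns 5050,
        the same answer a single process would compute.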

B.      Hardware (ours)
        ---------------

B.1     Overview
        --------
        o 17 PCs
        o 2 networks (NICs, cards, hubs, switches)
        o 1 KVM
        o 1 Console/Keyboard/Mouse

B.2     Some details
        ------------
        o Cluster composition:  one master and sixteen compute ("slave")
          nodes.
        o Clients log into master, compile programs, submit them for
          execution on the compute nodes. Clients do not log into
          compute nodes.
        o The ACCC cluster is not:
          * SMP: symmetric multi-processing (more than one CPU in a 
            single box) with each CPU having access to memory and the
            attached devices
          * Heterogeneous
          * Diskless (each compute node has an internal disk and boots
            from it instead of from a server over the network).
        o Four zones (named 4, 3, 2, 1) in the cluster. Within
          each zone there are four compute nodes (4, 3, 2, 1). A zone
          is nothing more than a grouping.

                               Zones
             +---------+---------+---------+---------+
             | argo4-4 | argo3-4 | argo2-4 | argo1-4 |
             +---------+---------+---------+---------+
             | argo4-3 | argo3-3 | argo2-3 | argo1-3 |
      Nodes  +---------+---------+---------+---------+
             | argo4-2 | argo3-2 | argo2-2 | argo1-2 |
             +---------+---------+---------+---------+
             | argo4-1 | argo3-1 | argo2-1 | argo1-1 |
             +---------+---------+---------+---------+

        o Hostname of the compute nodes includes zone and node number.
          Syntax: argoZONE-NODE. Example:  argo3-1 is zone three, node 
          one. Sixteen compute nodes = four zones * four computers 
          (nodes) per zone.
        o Hostnames are shared via NIS.
        o The above chart is a logical presentation of the compute nodes
          but it does not represent how the nodes are cabled together.
          For example, argo4-1 is not directly linked to argo4-2 and
          argo4-2 is not directly linked to 4-3, and so on. More on
          this later.
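
        The argoZONE-NODE naming convention described above can be
        sketched in a few lines. This is only an illustration; the
        function names are hypothetical and not part of any cluster
        software.

```python
import re

ZONES = (1, 2, 3, 4)
NODES = (1, 2, 3, 4)
HOSTNAME_RE = re.compile(r"argo([1-4])-([1-4])")


def hostname(zone, node):
    """Build a compute-node hostname from its zone and node number."""
    if zone not in ZONES or node not in NODES:
        raise ValueError(f"no such compute node: zone={zone}, node={node}")
    return f"argo{zone}-{node}"


def parse_hostname(name):
    """Return (zone, node) for a compute-node hostname such as 'argo3-1'."""
    match = HOSTNAME_RE.fullmatch(name.lower())
    if match is None:
        raise ValueError(f"not a compute-node hostname: {name!r}")
    return int(match.group(1)), int(match.group(2))


def all_compute_nodes():
    """List all sixteen compute-node hostnames, zone by zone."""
    return [hostname(zone, node) for zone in ZONES for node in NODES]
```

        For example, parse_hostname("argo3-1") returns (3, 1), and
        all_compute_nodes() has sixteen entries.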

B.3     Physical layout - front view (Rack)
        -----------------------------------
+------------------------------------------------------------+
|        +--------------------------------------+            |
|        |  KVM switch                          |            |
|        +--------------------------------------+            |
|                                                            |
|        +--------------------------------------+            |
|        | Master node                          |            |
|        +--------------------------------------+            |
|                                                            |
|        +--------------------------------------+            |
|        | Switch                               |            |
|        +--------------------------------------+            |
|        +--------------------------------------+            |
|        | Compute node:    argo4-4             |            |
|        +--------------------------------------+            |
|        | Compute node:    argo4-3             |            |
|        +--------------------------------------+  Zone 4    |
|        | Compute node:    argo4-2             |            |
|        +--------------------------------------+            |
|        | Compute node:    argo4-1             |            |
|        +--------------------------------------+            |
|                                                            |
|        +--------------------------------------+            |
|        | Compute node:    argo3-4             |            |
|        +--------------------------------------+            |
|        | Compute node:    argo3-3             |            |
|        +--------------------------------------+  Zone 3    |
|        | Compute node:    argo3-2             |            |
|        +--------------------------------------+            |
|        | Compute node:    argo3-1             |            |
|        +--------------------------------------+            |
|                                                            |
|        +--------------------------------------+            |
|        | Console                              |            |
|        +--------------------------------------+            |
|                                                            |
|        +--------------------------------------+            |
|        | Ethernet hub                         |            |
|        +--------------------------------------+            |
|        +--------------------------------------+            |
|        | Compute node:    argo2-4             |            |
|        +--------------------------------------+            |
|        | Compute node:    argo2-3             |            |
|        +--------------------------------------+  Zone 2    |
|        | Compute node:    argo2-2             |            |
|        +--------------------------------------+            |
|        | Compute node:    argo2-1             |            |
|        +--------------------------------------+            |
|                                                            |
|        +--------------------------------------+            |
|        | Compute node:    argo1-4             |            |
|        +--------------------------------------+            |
|        | Compute node:    argo1-3             |            |
|        +--------------------------------------+  Zone 1    |
|        | Compute node:    argo1-2             |            |
|        +--------------------------------------+            |
|        | Compute node:    argo1-1             |            |
|        +--------------------------------------+            |
+------------------------------------------------------------+

B.4     Nodes
        -----
        o Two types:
          * Master node   (one)
          * Compute nodes (sixteen)

B.4.1   Details about the master node
        -----------------------------
        o Clients log into master, compile programs, submit them for
          execution on the compute nodes. Clients cannot log onto
          compute nodes.
        o Software on the master.
        o Has access to both the outside world and the private Ethernet
          network, which is one of the two networks connecting the
          compute nodes.

        o Hardware particulars
          * 2U system box (1U = 1.75 inches)
          * GA-7VTX motherboard (used with Athlon processors)
          * AMD 1600 XP processor (1.4GHz Athlon). Not SMP, just one CPU
            within.
            % 128K L1 (cache memory on the chip)
            % 256K L2 (cache memory, also on the chip)
          * 768MB DDR DRAM Memory (three 256MB DIMMs)
            % DDR is double data rate. Twice as much can be transferred
              because a transfer occurs when the clock signal goes from
              LOW to HIGH and also from HIGH to LOW. (SDRAM transfers
              only from LOW to HIGH.)
          * Dual NIC (Intel Etherexpress) (eth0/eth1)
            % eth0: 172.16.0.2      (access to private network)
            % eth1: 128.248.121.64  (access to the world)
          * Two 80GB Maxtor IDE 5400 drives (hda/hdb)
            % hda:  master
            % hdb:  backup of hda
          * CDROM (Sony 48X IDE)
          * Floppy (Sony 1.44)
          * Video card (Trident 9750 with 4MB of Synchronous Graphics RAM
            (SGRAM))
          * Adaptec 29160 external SCSI port
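
        The "double data rate" point in the memory entry above is
        simple arithmetic: the same clock with two transfers per cycle
        gives twice the peak rate. The sketch below illustrates this.
        The clock speed and bus width are assumed values chosen for
        illustration, not figures taken from this cluster's
        specifications.

```python
def peak_bandwidth_bytes_per_sec(clock_hz, bus_width_bytes, transfers_per_clock):
    """Theoretical peak: clock ticks x transfers per tick x bytes per transfer."""
    return clock_hz * transfers_per_clock * bus_width_bytes


# Assumed values for illustration only (not from the cluster's specs).
CLOCK_HZ = 133_000_000     # assumed 133 MHz memory clock
BUS_WIDTH_BYTES = 8        # assumed 64-bit memory bus

# SDRAM transfers on the rising edge only; DDR transfers on both edges.
sdram = peak_bandwidth_bytes_per_sec(CLOCK_HZ, BUS_WIDTH_BYTES, 1)
ddr = peak_bandwidth_bytes_per_sec(CLOCK_HZ, BUS_WIDTH_BYTES, 2)
```

        Under these assumptions ddr is exactly twice sdram. Real
        throughput is lower than either theoretical peak.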

B.4.1.1 Rear view (master node)
        -----------------------


          +--------------------------------------+
          |                                      |
To KVM <- | M          C               E0   E1 ----> To switch
To KVM <- | K          |               |    S    |
          +----------- | ------------- | --------+
                       |               |
                       +--> To KVM     +--> To ethernet hub

           M:     Mouse
           K:     Keyboard
           C:     Console
           E0/E1: ports on the dual NIC
           S:     SCSI interface

B.4.2   Details about the compute nodes
        -------------------------------
        o Sixteen (all configured the same).
        o Clients do not log in.
        o Firewalled by master node.

        o Hardware particulars
          * 1U system box
          * GA-7VTX motherboard
          * AMD 1600 XP processor (1.4GHz). Not SMP, just one CPU within.
            % 128K L1 (cache memory on the chip)
            % 256K L2 (cache memory, also on the chip)
          * 768MB DDR DRAM Memory (three 256MB DIMMs)
          * Single port NIC (Intel Etherexpress) (eth0)
            % eth0: 172.16.X.X      (access to private network)
          * 40GB Western Digital internal hard drive (hda)
          * Video card (Trident, same as master)
          * Dolphin D335 Mother/Daughter combo interconnect cards

          ***  No floppy or CDROM drives (mistake)  ***

          Bob Hyman is working on a procedure to connect an external
          floppy to the USB port for booting purposes (also for
          rescue mode). Or, we will stick a floppy into each. Or,
          we will do a network boot.

B.4.2.1 Rear view (same for each compute node)
        --------------------------------------

          Back - compute node
          +--------------------------------------+
To KVM <- | M          C     E0         | D1  D2 | <--> Dolphin 
To KVM <- | K          |     |          | D3  D4 | <--> Dolphin 
          +----------- | --- | ------------------+
                       |     |
            To KVM <---+     +--> To ethernet hub

           M:     Mouse
           K:     Keyboard
           C:     Console
           E0:    NIC
           D1/D2: Dolphin card (mother)
           D3/D4: Dolphin card (daughter piggybacked onto mother)

B.5     Console and KVM switch (Keyboard/Video/Mouse)
        ---------------------------------------------
        o Master and compute nodes share a single console/mouse/keyboard
        o Console
          * Situated in the middle of the rack. Pulls out and screen
            flips up.
          * Screen size: 15.1 inches
        o KVM located at the top of the rack.
        o KVM is a single bank Belkin 16 channel device. Bank number 0 
          (since we have only one bank).
        o Argo1-1 is not attached to the KVM. The KVM has 16 channels
          but we have seventeen computers (one master and sixteen
          compute nodes), so one compute node had to be left off.

B.5.1   Front view
        ----------
                 +--------------------------------------+
                 |                   +----------------+ |
                 |  +--+--+          |123456789ABCDEFG| <--- Channels
 Bank/Channel ----> |0 |X |          +----------------+ |
 Indicator       |  +--+--+                +----+       |
                 |                         |BS|C|  <------ Toggle Buttons
                 |                         +----+       |
                 +--------------------------------------+

                 BS: Bank scan (does nothing since we have only one bank)
                 
        o To go to the next channel (node) in the channel list, strike the C 
          button.
        o To jump to a particular machine, enter the following sequence:
            Scroll Lock, Scroll Lock, bank number, channel
          For example, to jump to argo4-3, strike scroll lock twice,
          followed by a zero (the bank), followed by the letter a (the
          channel; capitalization is not required).
        o When you switch channels, the name of the new node is
          displayed at the top of the console for a couple of seconds.
        o The current channel is displayed in two places
          * In the box on the left, front side (contains an X in the above
            diagram)
          * Highlighted in the channel box, the one on the right front
            side.
        o Channel/Node Correspondence
          * No logic to the correspondence.
          * Channel   Node            Channel   Node
               1     argo1-2             9     argo4-2
               2     argo2-2             A     argo4-3
               3     Master              B     argo2-3
               4     argo4-4             C     argo1-4
               5     argo3-1             D     argo2-1
               6     argo3-2             E     argo4-1
               7     argo2-4             F     argo1-3
               8     argo3-4             G     argo3-3

            As you can see, no argo1-1.
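
        Since the channel assignment has no pattern, a lookup table is
        the practical way to find a node. The sketch below copies the
        chart above and builds the key sequence from the argo4-3
        example (scroll lock twice, bank 0, then the channel). The
        names used are hypothetical, and "master" is simply the key
        chosen here for the master node.

```python
# Channel-to-node table copied from the chart above.
CHANNEL_TO_NODE = {
    "1": "argo1-2", "2": "argo2-2", "3": "master",  "4": "argo4-4",
    "5": "argo3-1", "6": "argo3-2", "7": "argo2-4", "8": "argo3-4",
    "9": "argo4-2", "A": "argo4-3", "B": "argo2-3", "C": "argo1-4",
    "D": "argo2-1", "E": "argo4-1", "F": "argo1-3", "G": "argo3-3",
}
NODE_TO_CHANNEL = {node: ch for ch, node in CHANNEL_TO_NODE.items()}
BANK = "0"  # the KVM has a single bank, numbered 0


def kvm_keys(node):
    """Return the keystrokes that switch the console to the given node."""
    channel = NODE_TO_CHANNEL.get(node.lower())
    if channel is None:
        raise KeyError(f"{node} is not attached to the KVM")
    return ["Scroll Lock", "Scroll Lock", BANK, channel]
```

        For example, kvm_keys("argo4-3") returns
        ["Scroll Lock", "Scroll Lock", "0", "A"], and kvm_keys("argo1-1")
        raises KeyError because that node has no channel.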

B.6     Networks
        --------

B.6.1   Overview
        --------
        o The most demanding communication requirements are not with the
          external environment but with other nodes on the SAN (System
          Area Network: a network optimized for use as a dedicated
          communication medium within a commodity cluster).
        o Every node may need to interact with every other node,
          independently or together, to move a wide range of data types
          between processors.
          * Data may be large blocks of contiguous information representing
            subunits of very large global data (need bandwidth).
          * Data may be small packets containing single values or
            synchronization signals to support collective operations
            (need low latency).

B.6.1   Some Details
        ------------
        o Two networks
          * Fast Ethernet (thru SMC hub)
          * Dolphin Interconnect

B.6.1.1 Fast Ethernet
        -------------
        Overview
        o For out-of-band management (basically anything but
          process communication:  NFS, NIS, pvfs, etc).
        o All machines, master and compute nodes, are connected 
          together thru it.
        o Advantages
          * Inexpensive
          * Ubiquitous (drivers integrated into Linux and well tested)
          * Easy to support
          * Gig E is backward compatible (mixed-mode)
        o Disadvantages
          * FE with TCP/IP provides 90-95 Mbps with latencies in the
            hundreds of microseconds (very bad latency).
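
        Why latency matters for small messages and bandwidth for large
        ones can be seen with a simple model: transfer time is roughly
        latency plus message size divided by bandwidth. The sketch
        below uses the 95 Mbps figure quoted above; the 200 microsecond
        latency is an assumed stand-in for "hundreds of microseconds",
        and the model ignores protocol overhead.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mbps):
    """Estimate one-way transfer time in microseconds.

    time = latency + size / bandwidth; 1 Mbps is one bit per microsecond.
    """
    return latency_us + (message_bytes * 8) / bandwidth_mbps


FE_BANDWIDTH_MBPS = 95   # upper end of the 90-95 Mbps quoted above
FE_LATENCY_US = 200      # assumed value for "hundreds of microseconds"

# A single 8-byte value versus a 1 MB block of contiguous data.
small = transfer_time_us(8, FE_LATENCY_US, FE_BANDWIDTH_MBPS)
large = transfer_time_us(1_000_000, FE_LATENCY_US, FE_BANDWIDTH_MBPS)
```

        Under these assumptions the 8-byte message spends nearly all of
        its time in latency, while the 1 MB block spends nearly all of
        its time limited by bandwidth.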

        Some details about the FE in the argo cluster
        o One NIC in a compute node
        o Since UIC uses a class B address structure, I continued
          that convention for the private network:
             172.16             16 bit network address
               X.X              16 bit host address
        o Regarding compute nodes there is a relationship from hostname 
          to host address:
              argoZONE-NODE     172.16.ZONE.NODE
          * Examples:
             argo4-4  172.16.4.4  argo4-3 172.16.4.3  argo4-2  172.16.4.2
             argo3-4  172.16.3.4  argo3-3 172.16.3.3  argo3-2  172.16.3.2
             ...
             So, for argo4-2, the zone is 4 and the node is 2. This allows
             one to look at the rack and know the IP address.
        o The exception is the master node (two NICs):
            eth0: 172.16.0.2      (access to private network)
            eth1: 128.248.121.64  (access to the world)

          There is no particular reason why I chose 0.2 as the host
          portion of the master's IP address.

          JackG gave me the IP address for eth1 (master). The 64 was
          wishful thinking: we would eventually have 64 nodes.

          If traffic becomes too much, JackG will replace the SMC hub 
          (broadcast device) with a switch (multi-port learning device).
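
        The hostname-to-address convention described above
        (argoZONE-NODE maps to 172.16.ZONE.NODE) can be sketched with
        Python's standard ipaddress module. The function names are
        hypothetical. The master (172.16.0.2) is the stated exception
        and is deliberately rejected by the reverse lookup.

```python
import ipaddress

PRIVATE_NET = ipaddress.ip_network("172.16.0.0/16")


def compute_node_ip(zone, node):
    """Map zone and node numbers (1-4 each) to 172.16.ZONE.NODE."""
    if not (1 <= zone <= 4 and 1 <= node <= 4):
        raise ValueError(f"no such compute node: zone={zone}, node={node}")
    return ipaddress.ip_address(f"172.16.{zone}.{node}")


def ip_to_hostname(addr):
    """Map a compute node's private address back to argoZONE-NODE."""
    ip = ipaddress.ip_address(addr)
    if ip not in PRIVATE_NET:
        raise ValueError(f"{ip} is not on the private network")
    zone, node = ip.packed[2], ip.packed[3]   # third and fourth octets
    compute_node_ip(zone, node)               # raises ValueError if out of range
    return f"argo{zone}-{node}"
```

        For example, compute_node_ip(4, 2) gives 172.16.4.2 and
        ip_to_hostname("172.16.3.1") gives "argo3-1".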

B.6.1.2 Dolphin Wulfkit
        ---------------
        Overview
        o Dolphin is an SCI-based interconnect for Beowulf systems.
        o Dolphin is hardware. Refers to a card (mother) and
          a piggybacked daughter.
        o Includes closed-source binary drivers and an
          implementation of MPI tuned for the SCI network. This
          is the software. It is referred to as Scali.
        o SCI (Scalable Coherent Interface) is an IEEE standard
          originally designed to provide an interconnect for
          cache-coherent shared-memory systems. (Cache coherency
          protocol: uniform view of the values in memory.)
        o Used for parallel process communication among compute nodes.

+-----------------------------------------------------------------------+
|                                                                       |
|        Some node                               Some other node        |
|   +----------------------+                  +----------------------+  |
|   |    +-----------+     |                  |    +-----------+     |  |
|   |    | Process 1 | --msg-> Interconnect ---->  | Process 5 |     |  |
|   |    +-----------+     |                  |    +-----------+     |  |
|   |                      |                  |                      |  |
|   |    +-----------+     |                  |    +-----------+     |  |
|   |    | Process 2 --------> Interconnect ---->  | Process 8 |     |  |
|   |    +-----------+     |                  |    +-----------+     |  |
|   +----------------------+                  +----------------------+  |
+-----------------------------------------------------------------------+

        Advantages
        o High performance including latency
          * Latency (delay): .25 - .5 microseconds.
            % Gigabit Ethernet: 24 - 30 microseconds.
            % Don't know if these vendor supplied latency numbers are
              round-trip or one-way
          * Bandwidth:  1 Megabyte per second.

        Disadvantages
        o Current PC motherboard chip sets do not support coherency
          systems required to construct an SCI-based shared memory 
          Beowulf.

        Some details
        o Dolphin cards are only in compute nodes.
        o Model D335 (mother with a piggyback daughter)
        o PCI card:  32 bit/33 MHz
        o Technical specification
          * www.dolphinics.com/products/pci64_adapter_card.html
        o Pictures
          * www.dolphinics.com/placed/subpages/photos/top_D333_low.jpg
          * www.dolphinics.com/placed/subpages/photos/2000WukfkitAbove.jpg
        o Additional information
          * www.dolphinics.com
          * www.scali.com

        Rear view (same for each compute node)
        --------------------------------------

          +--------------------------------------+
          | M          C     E0         | D1  D2 | <-- Dolphin mother
          | K                           | D3  D4 | <-- Dolphin daughter
          +--------------------------------------+

Connectors D1 and D2 are on the mother; D3 and D4 are on the daughter
D1:  In connection  (mother)     D3:  In connection  (daughter)
D2:  Out connection (mother)     D4:  Out connection (daughter)

Mother card:  Intraring communication (connects nodes in the same zone/
              ring).
Daughter card: Interring communication (connects nodes in different zones/
              rings).
Cables:
Blue label is out.   
Yellow label is in.


Ring (zone)                   Ring 2/Ring 3    Ring 4
+-------------------------+                   +-------------------------+
| Node     M        D     |                   | Node     M        D     |
| +---+ +------+ +------+ |                   | +---+ +------+ +------+ |
| |1.4| |In|Out| |In|Out| |   ...     ...     | |4.4| |In|Out| |In|Out| |
| +---| +------+ +------+ |                   | +---| +------+ +------+ |
| |1.3| |In|Out| |In|Out| |                   | |4.3| |In|Out| |In|Out| |
| +---| +------- +------+ |                   | +---| +------- +------+ |
| |1.2| |In|Out| |In|Out| |                   | |4.2| |In|Out| |In|Out| |
| +---| +------- +------+ |                   | +---| +------- +------+ |
| |1.1| |In|Out| |In|Out| |                   | |4.1| |In|Out| |In|Out| |
| +---+ +------+ +------+ |                   | +---+ +------+ +------+ |
+-------------------------+                   +-------------------------+

How each compute node connects to every other compute node through the 
Dolphin cards ("Tim Eisler SNA job security diagram")

                     Two-dimensional SCI torus topology
                     ----------------------------------

        Ring 1              Ring 2              Ring 3              Ring 4
+-----------------+ +-----------------+ +-----------------+ +-----------------+
|Mother card-L0   | |Mother card-L0   | |Mother card-L0   | |Mother card-L0   |
|1-1 out -> 1-3 in| |2-1 out -> 2-3 in| |3-1 out -> 3-3 in| |4-1 out -> 4-3 in|
|1-3 out -> 1-4 in| |2-3 out -> 2-4 in| |3-3 out -> 3-4 in| |4-3 out -> 4-4 in|
|1-4 out -> 1-2 in| |2-4 out -> 2-2 in| |3-4 out -> 3-2 in| |4-4 out -> 4-2 in|
|1-2 out -> 1-1 in| |2-2 out -> 2-1 in| |3-2 out -> 3-1 in| |4-2 out -> 4-1 in|
+-----------------+ +-----------------+ +-----------------+ +-----------------+
|Daughter card-L1 | |Daughter card-L1 | |Daughter card-L1 | |Daughter card-L1 |
|1-1 out -> 3-1 in| |2-1 out -> 1-1 in| |3-1 out -> 4-1 in| |4-1 out -> 2-1 in|
|1-2 out -> 3-2 in| |2-2 out -> 1-2 in| |3-2 out -> 4-2 in| |4-2 out -> 2-2 in|
|1-3 out -> 3-3 in| |2-3 out -> 1-3 in| |3-3 out -> 4-3 in| |4-3 out -> 2-3 in|
|1-4 out -> 3-4 in| |2-4 out -> 1-4 in| |3-4 out -> 4-4 in| |4-4 out -> 2-4 in|
+-----------------+ +-----------------+ +-----------------+ +-----------------+

Pattern (intrazone - mother card connections)
---------------------------------------------
Node 1 connects to node 3
 "   3 connects to  "   4
 "   4 connects to  "   2
 "   2 connects to  "   1

Pattern (interzone - daughter card connections)
-----------------------------------------------
Zone 1 connects to zone 3 
 "   3 connects to  "   4
 "   4 connects to  "   2
 "   2 connects to  "   1

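Both patterns above follow the same ring order, 1 -> 3 -> 4 -> 2 -> 1.
The sketch below generates the full cable list from that order so it
can be checked against the table. The function names are hypothetical.
Each cable is written as an (out, in) pair of (zone, node) tuples.

```python
RING_ORDER = [1, 3, 4, 2]  # each entry's "out" cable goes to the next entry's "in"


def successor(value):
    """Next stop on a ring: 1 -> 3 -> 4 -> 2 -> 1."""
    return RING_ORDER[(RING_ORDER.index(value) + 1) % len(RING_ORDER)]


def mother_cables():
    """Intrazone (L0) cables: same zone, next node number on the ring."""
    return [((z, n), (z, successor(n))) for z in range(1, 5) for n in RING_ORDER]


def daughter_cables():
    """Interzone (L1) cables: same node number, next zone on the ring."""
    return [((z, n), (successor(z), n)) for z in RING_ORDER for n in range(1, 5)]
```

For example, mother_cables() contains ((1, 1), (1, 3)), matching
"1-1 out -> 1-3 in", and daughter_cables() contains ((4, 4), (2, 4)),
matching "4-4 out -> 2-4 in".
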
        Chart of the nodes and cabling (constructed from the above
        information). Notice the differences between this chart and
        the logical one given in section B.2.

     +-----------------------------------------------------+
     |     +------+                                        |
     |     |      |                                        |
     |     | +---------+---------+---------+---------+     |
     |     | | argo4-2 | argo2-2 | argo1-2 | argo3-2 |     |
     |     | +---------+---------+---------+---------+     |
     |     | | argo4-4 | argo2-4 | argo1-4 | argo3-4 |     |
     +--->   +---------+---------+---------+---------+  ---+
           | | argo4-3 | argo2-3 | argo1-3 | argo3-3 |
           | +---------+---------+---------+---------+
           | | argo4-1 | argo2-1 | argo1-1 | argo3-1 |
           | +---------+---------+---------+---------+
           |      ^
           |      |
           +------+

             Get rid of the leading word argo and the dash

             +----+----+----+----+
             | 42 | 22 | 12 | 32 |
             +----+----+----+----+
             | 44 | 24 | 14 | 34 |
             +----+----+----+----+
             | 43 | 23 | 13 | 33 |
             +----+----+----+----+
             | 41 | 21 | 11 | 31 |
             +----+----+----+----+

             Now look at the above diagram with rows and columns
             (xy coordinates)

             +----+----+----+----+
         x2  | 42 | 22 | 12 | 32 |
             +----+----+----+----+
         x4  | 44 | 24 | 14 | 34 |
 Rows        +----+----+----+----+
         x3  | 43 | 23 | 13 | 33 |
             +----+----+----+----+
         x1  | 41 | 21 | 11 | 31 |
             +----+----+----+----+
               4y   2y   1y   3y

                     Columns

             So there are four rows and four columns (in the Scali
             documentation, they are referred to as rings). Later,
             I will refer to this as the "golden list".
               x2  4y
               x4  2y
               x3  1y
               x1  3y

             y dimension: nodes within the same zone
             x dimension: nodes in different zones

          All the above was to help explain routing

        o Two types of routing in Scali
          * Scali routing
          * Dimensional
        o Scali routing
          * Fault tolerant algorithm
          * Capable of maintaining full connectivity among the
            remaining nodes even if more than one node has failed.
          * When all nodes are working, it is equal to dimensional
            routing (XY or YX).
          * This is what we're using.
        o Dimensional routing
          * All motion in the first dimension must be done before
            routing in the next dimension
          * Two types:  XY or YX
            % XY
              + X is the first dimension and Y is the second dimension.
              + Will go to other nodes in a different zone.
              + Default
            % YX
              + Y is the first dimension and X is the second dimension
              + Will go to other nodes in the same zone.
          * Not fault tolerant - you will lose all nodes in the
            X or Y dimension if one node goes down. That's not the
            case for Scali routing.
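
        A minimal sketch of dimensional routing on this grid, to make
        the XY versus YX distinction concrete. It is not Scali's
        implementation. It assumes traffic follows the cabling
        direction (out to in) around each ring in the order
        1 -> 3 -> 4 -> 2, that an X move changes the zone (daughter
        ring), and that a Y move changes the node number (mother
        ring). The function names are hypothetical.

```python
RING_ORDER = [1, 3, 4, 2]  # cabling order shared by the zone and node rings


def successor(value):
    """Next stop on a ring: 1 -> 3 -> 4 -> 2 -> 1."""
    return RING_ORDER[(RING_ORDER.index(value) + 1) % len(RING_ORDER)]


def dimensional_route(src, dst, order="XY"):
    """Return the list of (zone, node) hops from src to dst, inclusive.

    All motion in the first dimension finishes before the second begins.
    """
    if order not in ("XY", "YX"):
        raise ValueError("order must be 'XY' or 'YX'")
    for value in (*src, *dst):
        if value not in RING_ORDER:
            raise ValueError(f"zone and node numbers must be 1-4, got {value}")
    zone, node = src
    path = [(zone, node)]
    for dim in order:
        if dim == "X":                      # move between zones
            while zone != dst[0]:
                zone = successor(zone)
                path.append((zone, node))
        else:                               # move within the zone
            while node != dst[1]:
                node = successor(node)
                path.append((zone, node))
    return path
```

        For example, from argo1-1 to argo4-3, XY routing visits
        (1, 1), (3, 1), (4, 1), (4, 3), while YX routing visits
        (1, 1), (1, 3), (3, 3), (4, 3).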

        Procedure to determine how many nodes remain available in
        an "up" state on the interconnect if we lose a node (using
        Scali routing which is fault-tolerant).

        Step 1:  Identify the bad node on the cabling grid.
                   Example: argo4-1.
                   +----+----+----+----+
               x2  | 42 | 22 | 12 | 32 |
                   +----+----+----+----+
               x4  | 44 | 24 | 14 | 34 |
 Rows              +----+----+----+----+
               x3  | 43 | 23 | 13 | 33 |
                   +----+----+----+----+
               x1  | BAD| 21 | 11 | 31 |
                   +----+----+----+----+
                     4y   2y   1y   3y

                          Columns
        Step 2:  Identify in the cabling grid the row and column of 
                 which the bad node is a member.
                   row:     x1
                   column:  4y
        Step 3:  Identify in the cabling grid the other node members 
                 in the row and column containing the bad node.
                   Row:     argo2-1, argo1-1, and argo3-1 are in
                            the same row (x1) as the bad node (argo4-1)
                   Column:  argo4-3, argo4-4, argo4-2 are in the
                            same column (4y) as the bad node.
        Step 4:  Remove from the "golden" list of columns and rows the 
                 row and column containing the bad node.
                   Golden list      Afterwards list
                     x2  4y           x2  --
                     x4  2y           x4  2y
                     x3  1y           x3  1y
                     x1  3y           --  3y
        Step 5:  Regarding the other node members in the row and column
                 containing the bad node, see if each member has access
                 to at least one column or one row in the "Afterwards
                 list". If so, the node remains in the "up" state in 
                 the interconnect.

                 Nodes in row x1:  argo2-1, argo1-1, and argo3-1
                   Nodes       Member  One of them    Status
                                       still in list
                   argo2-1     x1 2y       Yes-2y       up
                   argo1-1     x1 1y       Yes-1y       up
                   argo3-1     x1 3y       Yes-3y       up

                 Nodes in column 4y:  argo4-3, argo4-4, argo4-2
                   argo4-3     x3 4y       Yes-x3       up
                   argo4-4     x4 4y       Yes-x4       up
                   argo4-2     x2 4y       Yes-x2       up

                 The other nine nodes are not impacted by the loss
                 of argo4-1 and are not routed around.
        Step 6:  Total the nodes in "up" status (15) and redraw the
                 interconnect diagram.

                   +---------+---------+---------+---------+
                   | argo4-2 | argo2-2 | argo1-2 | argo3-2 |
                   +---------+---------+---------+---------+
                   | argo4-4 | argo2-4 | argo1-4 | argo3-4 |
                   +---------+---------+---------+---------+
                   | argo4-3 | argo2-3 | argo1-3 | argo3-3 |
                   +---------+---------+---------+---------+
                             | argo2-1 | argo1-1 | argo3-1 |
                             +---------+---------+---------+
                 The result is the loss of only one node. Scali will
                 route around node argo4-1.

        What happens if you lose two nodes? Use the same procedure
        as outlined above.

        Step 1:  Identify the bad nodes on the cabling grid.
                   Example: argo4-1 and argo3-2.
                   +----+----+----+----+
               x2  | 42 | 22 | 12 | BAD|
                   +----+----+----+----+
               x4  | 44 | 24 | 14 | 34 |
 Rows              +----+----+----+----+
               x3  | 43 | 23 | 13 | 33 |
                   +----+----+----+----+
               x1  | BAD| 21 | 11 | 31 |
                   +----+----+----+----+
                     4y   2y   1y   3y

                          Columns
        Step 2:  Identify in the cabling grid the rows and columns of
                 which the bad nodes are members.
                   Bad Node   Row        Column
                   argo4-1    x1         4y
                   argo3-2    x2         3y
        Step 3:  Identify in the cabling grid the other node members 
                 in the rows and columns containing the bad nodes.
                   Argo4-1  Row:     argo2-1, argo1-1, argo3-1
                            Column:  argo4-3, argo4-4, argo4-2
                   Argo3-2  Row:     argo4-2, argo2-2, argo1-2
                            Column:  argo3-4, argo3-3, argo3-1
        Step 4:  Remove from the "golden" list of columns and rows the 
                 rows and columns containing the bad nodes.
                   Golden list      Afterwards list
                     x2  4y           --  --
                     x4  2y           x4  2y
                     x3  1y           x3  1y
                     x1  3y           --  --
        Step 5:  Regarding the other node members in the rows and
                 columns containing the bad nodes, see if each member
                 has access to at least one column or one row in the
                 "Afterwards list". If so, the node remains "up" and
                 in the interconnect.

                 For argo4-1
                   Nodes in row:  argo2-1, argo1-1, and argo3-1
                     Nodes       Member  One of them    Status
                                         still in list
                     argo2-1     x1 2y       Yes-2y       up
                     argo1-1     x1 1y       Yes-1y       up
                     argo3-1     x1 3y       No           down
                   Nodes in column:  argo4-3, argo4-4, argo4-2
                     argo4-3     x3 4y       Yes-x3       up
                     argo4-4     x4 4y       Yes-x4       up
                     argo4-2     x2 4y       No           down
                 For argo3-2
                   Nodes in row:  argo4-2, argo2-2, and argo1-2
                     Nodes       Member  One of them    Status
                     argo4-2     Already down - see above
                     argo2-2     x2 2y       Yes-2y       up
                     argo1-2     x2 1y       Yes-1y       up
                   Nodes in column:  argo3-4, argo3-3, argo3-1
                     argo3-4     x4 3y       Yes-x4       up
                     argo3-3     x3 3y       Yes-x3       up
                     argo3-1     Already down - see above
                 The other four nodes are not impacted by the loss
                 of argo4-1 and argo3-2; they remain up.
        Step 6:  Total the nodes in "up" status (12) and redraw the
                 interconnect diagram ("it's the Chevrolet logo")

                             +---------+---------+
                             | argo2-2 | argo1-2 |
                   +---------+---------+---------+---------+
                   | argo4-4 | argo2-4 | argo1-4 | argo3-4 |
                   +---------+---------+---------+---------+
                   | argo4-3 | argo2-3 | argo1-3 | argo3-3 |
                   +---------+---------+---------+---------+
                             | argo2-1 | argo1-1 |
                             +---------+---------+

                 The result is the loss of four nodes. Scali will
                 route around them. Nodes 4-1 and 3-2 are bad. Nodes
                 4-2 and 3-1 are collateral damage.

        Here's the resulting working interconnect if we lose three 
        nodes: argo4-1, argo3-2, and argo2-3:
                                       +---------+
                                       | argo1-2 |
                   +---------+---------+---------+---------+
                   | argo4-4 | argo2-4 | argo1-4 | argo3-4 |
                   +---------+---------+---------+---------+
                                       | argo1-3 |              
                                       +---------+
                                       | argo1-1 |
                                       +---------+

        Here's the resulting working interconnect if we lose four 
        nodes: argo4-1, argo3-2, argo2-3, and argo3-4:
                                       +---------+
                                       | argo1-2 |
                                       +---------+
                                       | argo1-4 |
                                       +---------+
                                       | argo1-3 |              
                                       +---------+
                                       | argo1-1 |
                                       +---------+
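
        The row/column elimination in Steps 1-6 can be sketched as a
        short shell function. This is only an illustration of the
        procedure, not Scali's routing code. It assumes, from the
        cabling grid, that node argoZ-N sits in column Zy and row xN;
        the function name surviving_nodes is made up for this example.

```shell
#!/bin/sh
# Sketch of the row/column elimination procedure (not Scali's code).
# Assumption: node argoZ-N sits in column Zy and row xN of the 4x4 grid.

# surviving_nodes "4-1 3-2" prints the nodes that stay in the interconnect.
surviving_nodes() {
    bad="$1"
    bad_rows=""
    bad_cols=""
    # Steps 2 and 4: collect the rows and columns that contain a bad node.
    for b in $bad; do
        bad_cols="$bad_cols ${b%-*}"
        bad_rows="$bad_rows ${b#*-}"
    done
    up=""
    # Step 5: a node stays up if its row or its column is still in the list.
    for z in 4 3 2 1; do
        for n in 4 3 2 1; do
            row_ok=yes
            col_ok=yes
            case " $bad_rows " in *" $n "*) row_ok=no ;; esac
            case " $bad_cols " in *" $z "*) col_ok=no ;; esac
            if [ "$row_ok" = yes ] || [ "$col_ok" = yes ]; then
                up="$up argo$z-$n"
            fi
        done
    done
    # Step 6: print the surviving nodes on one line, space separated.
    echo $up
}

surviving_nodes "4-1 3-2"
```

        Passing "4-1 3-2" lists the twelve nodes of Step 6. Passing
        "4-1 3-2 2-3" or "4-1 3-2 2-3 3-4" lists the seven and four
        nodes of the last two diagrams.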

Status (more on this in a separate presentation)
------
Two ways to get the status of the interconnect:
  1) GUI (scadesktop)
  2) Command mode

Because of the current hardware problem, the interconnect status via the GUI
is not available. Here is command mode:

On node argo4-4:
/opt/scali/bin/scinfo -l

ScaSCI (Scali SCI driver) v.2.4.9 (m), adapter 0, type D335, nodeid 0x1000
Current number of unterminated open calls:      4

Link hardware info for adapter 0
-------------------------------------------------------------------------
LC[0]: up   68166.536 s, enabled  sr 0xa5ab fw 0x1a07 hwid 0xffee
LC[1]: up   68166.236 s, enabled  sr 0xa5ab fw 0x1c07 hwid 0xffed

Link event counters, adapter 0
-------------------------------------------------------------------------
LC[0] LC interrupts:                            3
LC[0] tot. sw.initiated resets:                 1
LC[0] upstream error detections:                3
LC[0] software reinitializations:               4
LC[1] LC interrupts:                            3
LC[1] tot. sw.initiated resets:                 1
LC[1] upstream error detections:                3
LC[1] software reinitializations:               4


On node argo2-1 (the one causing the fuss)
  /opt/scali/sbin/scinfo -l

ScaSCI (Scali SCI driver) v.2.4.9 (m), adapter 0, type D335, nodeid 0x0000
Current number of unterminated open calls:      4

Link hardware info for adapter 0
-------------------------------------------------------------------------
LC[0]: down enabled  (hardware problem)
LC[1]: down enabled  (hardware problem)
LC[2]: down enabled  (hardware problem)

Link event counters, adapter 0
-------------------------------------------------------------------------
LC[0] tot. sw.initiated resets:              3425
LC[0] sw reinit failure count:               3424
LC[0] sw reinit failure resets:               342
LC[1] tot. sw.initiated resets:              3425
LC[1] sw reinit failure count:               3424
LC[1] sw reinit failure resets:               342
LC[2] tot. sw.initiated resets:              3425
LC[2] sw reinit failure count:               3424
LC[2] sw reinit failure resets:               342
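
The link state can be pulled out of the scinfo -l output with a small
filter. This is a minimal sketch that assumes the line format shown in
the two listings above ("LC[n]:" followed by "up" or "down"); the
function name down_links is made up for this example.

```shell
#!/bin/sh
# Sketch: list the link controllers that "scinfo -l" reports as down.
# Assumes the "LC[n]: up ..." / "LC[n]: down ..." line format shown above.
# Reads the scinfo output on standard input.
down_links() {
    while read -r name state rest; do
        case "$name" in
            "LC["*"]:")
                if [ "$state" = "down" ]; then
                    echo "${name%:}"
                fi
                ;;
        esac
    done
}

# Example with a fragment of the argo2-1 output.
down_links <<'EOF'
LC[0]: down enabled  (hardware problem)
LC[1]: down enabled  (hardware problem)
LC[0] tot. sw.initiated resets:              3425
EOF
```

On a node, the filter would be fed from the command itself, for
example "scinfo -l | down_links", using whichever path scinfo is
installed under. The event-counter lines have no colon after "LC[n]"
and are ignored.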

Here is the actual error message that no one can figure out:

scaconscaconfsd: ScaliRouting.cpp:663: bool Mesh::SetFailedRing(const class ScaString &, int): Assertion `dim < 2' failed.

More on Scali in a different talk.

C.      Turning on/off the cluster
        --------------------------
C.1     Power is on - rebooting the entire cluster
        o Run the "/sbin/shutdown -r now" command on the master node.
          The master reboots the compute nodes as part of its startup
          procedure (see the following section of the /etc/rc.local
          file):

          #
          # Prevent users from logging in to the master node since we're about to
          # reboot the compute nodes
          #
          /bin/cp /etc/nologin.msg /etc/nologin
          #
          # Reboot the compute nodes
          #
          for loop in 4 3 2 1 ; do
             for loop1 in 4 3 2 1 ; do
                rsh -l root argo$loop-$loop1 /sbin/shutdown -r now
             done
          done
          echo "Sleeping for 6 minutes while the cluster nodes boot"
          /bin/sleep 6m
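
          To see the order in which the nested loop visits the nodes
          without rebooting anything, the same loop can be run as a
          dry run. A minimal sketch; the function name node_order is
          made up for this example and no rsh is issued.

```shell
#!/bin/sh
# Dry run: print the compute nodes in the order the rc.local loop visits
# them (zone 4 first, node 4 first within each zone). Nothing is rebooted.
node_order() {
    for zone in 4 3 2 1; do
        for node in 4 3 2 1; do
            echo "argo$zone-$node"
        done
    done
}

node_order
```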

C.2     Powering on the entire cluster.
        o Start with master and then power on each node (4 3 2 1) in
          each of the four zones.

C.3     Shutting down and Powering off the cluster
        o Two steps:
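          + Run the "/sbin/shutoff h" command on the master node to halt
            the compute nodes. The script (below) tests for the argument
            "h" or "H" without a leading dash; any other argument, or
            none, reboots the nodes instead.

            #! /bin/sh
            # Default action is reboot (-r); "h" or "H" selects halt (-h).
            act="r"
            if [ "$1" = "h" ] || [ "$1" = "H" ]
              then act="h"
            fi
            for loop in 4 3 2 1        # zones
            do
              for loop1 in 4 3 2 1     # nodes
              do
                rsh -l root argo$loop-$loop1 /sbin/shutdown -"$act" now
              done
            done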

          + Run the "/sbin/shutdown -h now" command on the master to
            shut it down.

          The reverse order will not work: once the master is down it
          can no longer shut down the compute nodes. Unless there is no
          other option, do not use the power or reset buttons.

C.4     Rebooting a single compute node
        -------------------------------
        Not a good idea. May impact the PVFS.

D.      Operating System
        ----------------
D.1     Overview
        o Came from RackSaver with Red Hat Linux 7.1 (kernel 2.4.2-2)
        o Kernel on all machines upgraded to 2.4.9-31
          * Two reasons for upgrade
            % Had to patch routine i387.c with fix from Scali.
            % Incorporate Athlon enhancements 
        o General procedure
          * Used gcc_2.96-85
          * RPM to get the source
          * Patch to modify the i387.c routine

          * make mrproper
          * make menuconfig
          * make dep
          * make clean
          * make bzImage
          * make modules
          * make install
          * make modules_install
          * Modified /etc/lilo.conf
          * lilo -v

        We did this for all seventeen machines. There is a handy product
        called SystemImager that allows one to create a golden image of
        a kernel and then export it to other nodes on the network. I've
        yet to learn it.

        o The /etc/lilo.conf on the master:
            boot=/dev/hda
            map=/boot/map
            install=/boot/boot.b
            prompt
            timeout=50
            message=/boot/message
            linear
            default=linux-2.4.9-31c

            image=/boot/vmlinuz
                    label=linux-2.4.9-31c
                    read-only
                    root=/dev/hda5

            image=/boot/vmlinux-new
                    label=linux-new
                    read-only
                    root=/dev/hda5

            image=/boot/vmlinuz-2.4.2-2
                    label=linux
                    read-only
                    root=/dev/hda5
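
          Before running lilo, it is worth confirming that the
          default= entry names one of the label= entries. A minimal
          sketch of such a check; the function name check_lilo_default
          is made up for this example, and it only looks at the
          default= and label= lines.

```shell
#!/bin/sh
# Sketch: confirm that the default= entry of a lilo.conf names one of the
# label= entries. Reads the file on standard input.
check_lilo_default() {
    default=""
    labels=""
    while read -r line; do
        case "$line" in
            default=*) default=${line#default=} ;;
            label=*)   labels="$labels ${line#label=}" ;;
        esac
    done
    if [ -z "$default" ]; then
        echo "no default= line"
        return 1
    fi
    case " $labels " in
        *" $default "*) echo "ok: default=$default" ;;
        *) echo "no image is labelled $default"; return 1 ;;
    esac
}

# Example with two of the entries from the master's file.
check_lilo_default <<'EOF'
default=linux-2.4.9-31c
image=/boot/vmlinuz
        label=linux-2.4.9-31c
image=/boot/vmlinuz-2.4.2-2
        label=linux
EOF
```

          Against the real file this would be run as
          "check_lilo_default < /etc/lilo.conf".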

E.      Partitions and filesystems
        --------------------------
E.1     Master
        o Overview
          Two types of filesystems
            * Linux ext2
            * Parallel Virtual File System (PVFS; separate talk)

        o Partition Layout
          Disk /dev/hda: 255 heads, 63 sectors, 9729 cylinders
          Units = cylinders of 16065 * 512 bytes

            Device Boot    Start       End    Blocks   Id  System
            /dev/hda1   *         1         7     56196   83  Linux (boot)
            /dev/hda2             8      9729  78091965    5  Extended
            /dev/hda5             8        40    265041   83  Linux (root)
            /dev/hda6            41       564   4208998+  82  Linux swap
            /dev/hda7          1089      1742   5253255   83  Linux (tmp)
            /dev/hda8          2006      3051   8401995   83  Linux (opt)
            /dev/hda9          3576      4621   8401995   83  Linux (usr)
            /dev/hda10         5146      5669   4209030   83  Linux (var)
            /dev/hda11         6718      9729  24193890   83  Linux (home)

          No partitions on the master for the parallel filesystem.

        o The above layout is not how the machine came from RackSaver.
          Bob Hyman and I developed and executed a procedure to 
          repartition (expand and contract partition sizes) as well
          as to add partitions without reinstalling the OS.
        o Size information
          * The size of hda6 was based on having at least four times
            physical memory for swap (768 MB * 4 = 3072 MB), plus some
            wiggle room.
          * Gap between hda6 (swap) and hda7 (tmp) for swap expansion or
            new partitions
          * Gap between hda7 and hda8 to allow for tmp expansion.
          * Gap between hda8 and hda9 to allow for opt expansion.
          * Gap between hda9 and hda10 to allow for usr expansion.
          * Gap between hda10 and hda11 to allow for var expansion.
          Even though some of these partitions are unnecessary for
          a compute node, we elected to include them (for example,
          opt).
        o If we get the raidzone or some such device for home, then
          hda11 (home) becomes available for other use.
        o Filesystem particulars

          * Filesystem           1k-blocks      Used Available Use% Mounted on
            /dev/hda5               256667    176745     66670  73% /
            /dev/hda1                54416      7687     43920  15% /boot
            /dev/hda9              8270100   1516604   6333400  20% /usr
            /dev/hda10             4142832     52412   3879972   2% /var
            /dev/hda7              5170696       284   4907752   1% /tmp
            /dev/hda8              8270100    167800   7682204   3% /opt
            /dev/hda11            23814136     55600  22548844   1% /home
            argo4-4:/pvfs-meta    96393600   4898304  91495296   6% /scratch4
            argo3-4:/pvfs-meta    96393600   4898304  91495296   6% /scratch3
            argo2-4:/pvfs-meta    96393600   4898304  91495296   6% /scratch2
            argo1-4:/pvfs-meta    96393600   4898304  91495296   6% /scratch1

            The four scratch areas are the parallel filesystems (separate
            lecture).
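
          The Blocks column of the master's partition table and the
          swap sizing rule can be checked with shell arithmetic. A
          minimal sketch, assuming the geometry reported by fdisk
          (cylinders of 16065 sectors of 512 bytes) and 1 KB blocks;
          the function name cyl_to_kb is made up for this example.
          fdisk reports slightly less for some partitions (4208998+
          for hda6), presumably because the partition does not begin
          at the very start of its first cylinder.

```shell
#!/bin/sh
# Sketch: size of a cylinder range in 1 KB blocks, using the geometry
# above (255 heads * 63 sectors = 16065 sectors of 512 bytes per cylinder,
# that is, two sectors per KB).
cyl_to_kb() {
    start=$1
    end=$2
    echo $(( ($end - $start + 1) * 16065 / 2 ))
}

swap_kb=$(cyl_to_kb 41 564)       # hda6 (swap) on the master
want_kb=$(( 768 * 4 * 1024 ))     # four times 768 MB of memory, in KB
echo "hda6 spans $swap_kb KB; 4 x memory is $want_kb KB"
```

          The same function reproduces the Blocks figures for hda7
          (cylinders 1089-1742) and hda10 (cylinders 5146-5669).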
 
E.2     Compute nodes
        -------------
        o Overview
          Three types of filesystems
          * Linux ext2
          * PVFS (/scratchX)
          * NFS  (home)
        o Partition Layout
          Disk /dev/hda: 255 heads, 63 sectors, 4865 cylinders
          Units = cylinders of 16065 * 512 bytes

            Device Boot    Start       End    Blocks   Id  System
            /dev/hda1   *         1         7     56196   83  Linux (boot)
            /dev/hda2             8      4865  39021885    5  Extended
            /dev/hda5             8        40    265041   83  Linux (root)
            /dev/hda6            41       406   2939863+  83  Linux (opt)
            /dev/hda7           407      1015   4891761   83  Linux (tmp)
            /dev/hda8          1016      1381   2939863+  83  Linux (usr)
            /dev/hda9          1382      1578   1582371   82  Linux swap
            /dev/hda10         1579      1775   1582371   83  Linux (var)
            /dev/hda11         1776      1782     56196   83  Linux (pvfs-meta)
            /dev/hda12         1783      4830  24483028+  83  Linux (pvfs-data)
        o Filesystem particulars
          * Filesystem           1k-blocks      Used Available Use% Mounted on
            /dev/hda5               256667    142535    100880  59% /
            /dev/hda1                54416      9094     42513  18% /boot
            /dev/hda8              2893628   1517968   1228668  56% /usr
            /dev/hda10             1557464     30032   1448316   3% /var
            /dev/hda6              2893628     76852   2669784   3% /opt
            /dev/hda7              4814936       172   4570176   1% /tmp
            /dev/hda11               54416         4     51603   1% /pvfs-meta
            /dev/hda12            24098424       416  22873860   1% /pvfs-data
            argo.cc.uic.edu:/home
                                  23814136     55600  22548848   1% /home
            argo4-4:/pvfs-meta    96393600   4898304  91495296   6% /scratch4
            argo3-4:/pvfs-meta    96393600   4898304  91495296   6% /scratch3
            argo2-4:/pvfs-meta    96393600   4898304  91495296   6% /scratch2
            argo1-4:/pvfs-meta    96393600   4898304  91495296   6% /scratch1

        o Differences between master and compute nodes regarding the layout
          of partitions and filesystems

                   +-----------+---------+---------+
                   | Partition | Master  | Compute |
                   +-----------+---------+---------+
                   | hda1      |  boot   |  boot   |
                   | hda5      |  root   |  root   |
                   | hda6      |  swap   |  opt    |  *
                   | hda7      |  tmp    |  tmp    |
                   | hda8      |  opt    |  usr    |  *
                   | hda9      |  usr    |  swap   |  *
                   | hda10     |  var    |  var    |
                   | hda11     |  home   |  pvfs   |  *
                   | hda12     |  ----   |  pvfs   |  *
                   +-----------+---------+---------+
                   * = use differs between master and compute nodes

          On the compute nodes, home is an NFS-mounted filesystem and does not
          need a partition.

          For reasons to be discussed in the PVFS lecture, no partitions
          are required on the master for the parallel file system.
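
          The differing rows of the comparison table can also be
          produced mechanically. A minimal sketch; the two lists are
          copied by hand from the table ("-" stands for "no
          partition"), and the function name layout_diff is made up
          for this example.

```shell
#!/bin/sh
# Sketch: print the partitions whose use differs between master and compute.
# The two lists are copied from the table above ("-" means no partition).
master="hda1:boot hda5:root hda6:swap hda7:tmp hda8:opt hda9:usr hda10:var hda11:home hda12:-"
compute="hda1:boot hda5:root hda6:opt hda7:tmp hda8:usr hda9:swap hda10:var hda11:pvfs hda12:pvfs"

layout_diff() {
    for m in $master; do
        part=${m%%:*}
        for c in $compute; do
            # Same partition name, different use: report it.
            if [ "${c%%:*}" = "$part" ] && [ "${c#*:}" != "${m#*:}" ]; then
                echo "$part master=${m#*:} compute=${c#*:}"
            fi
        done
    done
}

layout_diff
```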