A. Cluster Overview
----------------
o Purpose
* Make a group of computers perform as a single machine.
* Clustering has been going on for quite a while. In 1994, Donald
Becker, a NASA researcher, invented a way to connect a group
of inexpensive off-the-shelf PCs with special software to
create a single system that could be scaled up to deliver
supercomputer performance. The cluster was named Beowulf.
* Beowulf clusters can have hundreds or thousands of computers.
o Advantages
* High performance computing at a fraction of the cost
* Scale up (add, subtract, modify nodes)
* Have a heterogeneous environment.
o Parallel Programming
o Parallel computation on a Beowulf-type cluster is
accomplished by dividing a computation into parts and
making use of multiple processes.
* Sometimes a single processor can be used for all the
processes.
* Most complex problems involve processes executing on
separate processors. (In our case, it will be
separate virtual processors. Each node has one physical
processor but will have eight virtual processors. More
on this during the PBS talk.)
o Processes coordinate their activities by sending and
receiving messages (called message passing).
o MPI (Message Passing Interface) is a library specification
(separate talk).
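A minimal sketch of the idea above: divide a computation into parts,
give each part to its own worker, and coordinate by sending and
receiving messages. Threads and queues stand in for the cluster's
processes and for MPI here, and the names worker and parallel_sum are
made up for illustration.

```python
import queue
import threading


def worker(inbox, outbox):
    """Receive one chunk of numbers and send back its partial sum."""
    chunk = inbox.get()        # blocking receive
    outbox.put(sum(chunk))     # send the result to the coordinator


def parallel_sum(data, nparts):
    """Divide data into nparts chunks and hand one chunk to each worker."""
    outbox = queue.Queue()
    threads = []
    for i in range(nparts):
        inbox = queue.Queue()
        t = threading.Thread(target=worker, args=(inbox, outbox))
        t.start()
        inbox.put(data[i::nparts])   # send this worker its share
        threads.append(t)
    for t in threads:
        t.join()
    # Collect one partial result per worker and combine them.
    return sum(outbox.get() for _ in range(nparts))


total = parallel_sum(list(range(1, 101)), 8)
```

On the cluster the workers would be separate processes on separate
nodes, and the send/receive calls would go through the MPI library
rather than an in-memory queue.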
B. Hardware (ours)
---------------
B.1 Overview
--------
o 17 PCs
o 2 networks (NICs, cards, hubs, switches)
o 1 KVM
o 1 Console/Keyboard/Mouse
B.2 Some details
------------
o Cluster composition: one master and sixteen compute ("slave")
nodes.
o Clients log into master, compile programs, submit them for
execution on the compute nodes. Clients do not log into
compute nodes.
o The ACCC cluster is not:
* SMP: symmetric multi-processing (more than one CPU in a
single box) with each CPU having access to memory and the
attached devices
* Heterogeneous
* Diskless (each compute node has an internal disk and boots
from it instead of from a server over the network).
o Four zones (named 4, 3, 2, 1) in the cluster. Within
each zone there are four compute nodes (4, 3, 2, 1). A zone
is nothing more than a grouping.
Zones
+---------+---------+---------+---------+
| argo4-4 | argo3-4 | argo2-4 | argo1-4 |
+---------+---------+---------+---------+
| argo4-3 | argo3-3 | argo2-3 | argo1-3 |
Nodes +---------+---------+---------+---------+
| argo4-2 | argo3-2 | argo2-2 | argo1-2 |
+---------+---------+---------+---------+
| argo4-1 | argo3-1 | argo2-1 | argo1-1 |
+---------+---------+---------+---------+
o Hostname of the compute nodes includes zone and node number.
Syntax: argoZONE-NODE. Example: argo3-1 is zone three, node
one. Sixteen compute nodes = four zones * four computers
(nodes) per zone.
o Hostnames are shared via NIS.
o The above chart is a logical presentation of the compute nodes
but it does not represent how the nodes are cabled together.
For example, argo4-1 is not directly linked to argo4-4 and
argo4-2 is not directly linked to argo4-3, and so on. More on
this later.
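The argoZONE-NODE naming convention described above can be sketched
in a few lines. This is only an illustration; the function names
compute_hostnames and parse_hostname are hypothetical and are not
part of any cluster tool.

```python
import re

ZONES = (4, 3, 2, 1)   # same order as the logical chart
NODES = (4, 3, 2, 1)

HOSTNAME_PATTERN = re.compile(r"^argo([1-4])-([1-4])$")


def compute_hostnames():
    """Return the sixteen compute node names, zone by zone."""
    return [f"argo{zone}-{node}" for zone in ZONES for node in NODES]


def parse_hostname(name):
    """Split a compute node name into (zone, node) integers."""
    match = HOSTNAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a compute node name: {name!r}")
    return int(match.group(1)), int(match.group(2))
```

For example, parse_hostname("argo3-1") gives zone three, node one,
matching the example in the text.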
B.3 Physical layout - front view (Rack)
-----------------------------------
+------------------------------------------------------------+
| +--------------------------------------+ |
| | KVM switch | |
| +--------------------------------------+ |
| |
| +--------------------------------------+ |
| | Master node | |
| +--------------------------------------+ |
| |
| +--------------------------------------+ |
| | Switch | |
| +--------------------------------------+ |
| +--------------------------------------+ |
| | Compute node: argo4-4 | |
| +--------------------------------------+ |
| | Compute node: argo4-3 | |
| +--------------------------------------+ Zone 4 |
| | Compute node: argo4-2 | |
| +--------------------------------------+ |
| | Compute node: argo4-1 | |
| +--------------------------------------+ |
| |
| +--------------------------------------+ |
| | Compute node: argo3-4 | |
| +--------------------------------------+ |
| | Compute node: argo3-3 | |
| +--------------------------------------+ Zone 3 |
| | Compute node: argo3-2 | |
| +--------------------------------------+ |
| | Compute node: argo3-1 | |
| +--------------------------------------+ |
| |
| +--------------------------------------+ |
| | Console | |
| +--------------------------------------+ |
| |
| +--------------------------------------+ |
| | Ethernet hub | |
| +--------------------------------------+ |
| +--------------------------------------+ |
| | Compute node: argo2-4 | |
| +--------------------------------------+ |
| | Compute node: argo2-3 | |
| +--------------------------------------+ Zone 2 |
| | Compute node: argo2-2 | |
| +--------------------------------------+ |
| | Compute node: argo2-1 | |
| +--------------------------------------+ |
| |
| +--------------------------------------+ |
| | Compute node: argo1-4 | |
| +--------------------------------------+ |
| | Compute node: argo1-3 | |
| +--------------------------------------+ Zone 1 |
| | Compute node: argo1-2 | |
| +--------------------------------------+ |
| | Compute node: argo1-1 | |
| +--------------------------------------+ |
+------------------------------------------------------------+
B.4 Nodes
-----
o Two types:
* Master node (one)
* Compute nodes (sixteen)
B.4.1 Details about the master node
-----------------------------
o Clients log into master, compile programs, submit them for
execution on the compute nodes. Clients cannot log onto
compute nodes.
o Software on the master.
o Has access to both the outside world and the private Fast
Ethernet network, which is one of the two connections among
the compute nodes (the other is the Dolphin interconnect).
o Hardware particulars
* 2U system box (1U = 1.75 inches)
* GA-7VTX motherboard (used with Athlon processors)
* AMD Athlon XP 1600+ processor (1.4GHz). Not SMP, just one CPU
within.
% 128K L1 (cache memory on the chip)
% 256K L2 (cache memory, also on the chip)
* 768MB DDR DRAM Memory (three 256MB DIMMS)
% DDR is double data rate. Twice as much can be transferred
per clock cycle because a transfer occurs on both edges of
the clock signal: LOW to HIGH and HIGH to LOW. (SDRAM
transfers only from LOW to HIGH.)
* Dual NIC (Intel Etherexpress) (eth0/eth1)
% eth0: 172.16.0.2 (access to private network)
% eth1: 128.248.121.64 (access to the world)
* Two 80GB Maxtor IDE 5400 drives (hda/hdb)
% hda: master
% hdb: backup of hda
* CDROM (Sony 48X IDE)
* Floppy (Sony 1.44)
* Video card (Trident 9750 with 4MB of Synchronous Graphics RAM
(SGRAM))
* Adaptec 29160 external SCSI port
B.4.1.1 Rear view (master node)
-----------------------
+--------------------------------------+
| |
To KVM <- | M C E0 E1 ----> To switch
To KVM <- | K | | S |
+----------- | ------------- | --------+
| |
+--> To KVM +--> To ethernet hub
M: Mouse
K: Keyboard
C: Console
E0/E1: ports on the dual NIC
S: SCSI interface
B.4.2 Details about the compute nodes
-------------------------------
o Sixteen (all configured the same).
o Clients do not log in.
o Firewalled by master node.
o Hardware particulars
* 1U system box
* GA-7VTX motherboard
* AMD Athlon XP 1600+ processor (1.4GHz). Not SMP, just one CPU
within.
% 128K L1 (cache memory on the chip)
% 256K L2 (cache memory, also on the chip)
* 768MB DDR DRAM Memory (three 256MB DIMMS)
* Single port NIC (Intel Etherexpress) (eth0)
% eth0: 172.16.X.X (access to private network)
* 40GB Western Digital internal hard drive (hda)
* Video card (Trident, same as master)
* Dolphin D335 Mother/Daughter combo interconnect cards
*** No floppy or CDROM drives (mistake) ***
Bob Hyman is working on a procedure to connect an external
floppy to the USB port for booting purposes (also for
rescue mode). Or, we will stick a floppy into each. Or,
we will do a network boot.
B.4.2.1 Rear view (same for each compute node)
--------------------------------------
Back - compute node
+--------------------------------------+
To KVM <- | M C E0 | D1 D2 | <--> Dolphin
To KVM <- | K | | | D3 D4 | <--> Dolphin
+----------- | --- | ------------------+
| |
To KVM <---+ +--> To ethernet hub
M: Mouse
K: Keyboard
C: Console
E0: NIC
D1/D2: Dolphin card (mother)
D3/D4: Dolphin card (daughter piggybacked onto mother)
B.5 Console and KVM switch (Keyboard/Video/Mouse)
---------------------------------------------
o Master and compute nodes share a single console/mouse/keyboard
o Console
* Situated in the middle of the rack. Pulls out and screen
flips up.
* Screen size: 15.1 inches
o KVM located at the top of the rack.
o KVM is a single bank Belkin 16 channel device. Bank number 0
(since we have only one bank).
o Argo1-1 is not attached to the KVM. (The KVM has 16 channels but
we have seventeen computers: one master and sixteen compute
nodes. One of the compute nodes had to be left off.)
B.5.1 Front view
----------
+--------------------------------------+
| +----------------+ |
| +--+--+ |123456789ABCDEFG| <--- Channels
Bank/Channel ----> |0 |X | +----------------+ |
Indicator | +--+--+ +----+ |
| |BS|C| <------ Toggle Buttons
| +----+ |
+--------------------------------------+
BS: Bank scan (does nothing since we have only one bank)
o To go to the next channel (node) in the channel list, strike the C
button.
o To jump to a particular machine, strike scroll lock twice,
followed by the bank number, followed by the channel character
(capitalization is not required). For example, to jump to
argo4-3 (bank 0, channel A), strike scroll lock twice, followed
by a zero, followed by the letter a.
o When you switch channels, the name of the new node is
displayed at the top of the console for a couple of seconds.
o The current channel is displayed in two places
* In the box on the left, front side (contains an X in the above
diagram)
* Highlighted in the channel box, the one on the right front
side.
o Channel/Node Correspondence
* No logic to the correspondence.
* Channel Node Channel Node
1 argo1-2 9 argo4-2
2 argo2-2 A argo4-3
3 Master B argo2-3
4 argo4-4 C argo1-4
5 argo3-1 D argo2-1
6 argo3-2 E argo4-1
7 argo2-4 F argo1-3
8 argo3-4 G argo3-3
As you can see, no argo1-1.
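Since there is no logic to the channel/node correspondence, a lookup
table is the practical way to find the key sequence for a node. The
sketch below copies the table above; the function name
hotkey_sequence is hypothetical, and the sequence format follows the
argo4-3 example (scroll lock twice, bank number, channel).

```python
# Channel -> node, copied from the correspondence table.
CHANNEL_TO_NODE = {
    "1": "argo1-2", "2": "argo2-2", "3": "master", "4": "argo4-4",
    "5": "argo3-1", "6": "argo3-2", "7": "argo2-4", "8": "argo3-4",
    "9": "argo4-2", "A": "argo4-3", "B": "argo2-3", "C": "argo1-4",
    "D": "argo2-1", "E": "argo4-1", "F": "argo1-3", "G": "argo3-3",
}
NODE_TO_CHANNEL = {node: ch for ch, node in CHANNEL_TO_NODE.items()}
BANK = "0"   # only one bank on this KVM


def hotkey_sequence(node):
    """Return the keys to strike to bring a node up on the console."""
    channel = NODE_TO_CHANNEL.get(node)
    if channel is None:
        raise KeyError(f"{node} is not attached to the KVM")
    return ["Scroll Lock", "Scroll Lock", BANK, channel]
```

Asking for argo1-1 raises an error, since that node has no channel.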
B.6 Networks
--------
B.6.1 Overview
--------
o The most demanding communication requirements are not with the
external environment but with other nodes on the SAN (System
Area Network: a network optimized for use as a dedicated
communication medium within a commodity cluster).
o Every node may need to interact with every other node,
independently or together, to move a wide range of data types
between processors.
* Data may be large blocks of contiguous information representing
subunits of very large global data (need bandwidth).
* Data may be small packets containing single values or
synchronization signals to support collective operations
(need low latency).
B.6.1 Some Details
------------
o Two networks
* Fast Ethernet (thru SMC hub)
* Dolphin Interconnect
B.6.1.1 Fast Ethernet
-------------
Overview
o For out-of-band management (basically anything but
process communication: NFS, NIS, PVFS, etc.).
o All machines, master and compute nodes, are connected
together thru it.
o Advantages
* Inexpensive
* Ubiquitous (drivers integrated into Linux and well tested)
* Easy to support
* Gig E is backward compatible (mixed-mode)
o Disadvantages
* FE with TCP/IP provides 90-95 Mbps with latencies in the
hundreds of microseconds (very bad latency).
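To see why latency matters for small messages and bandwidth for large
ones, here is a rough back-of-the-envelope model: time = latency +
size / bandwidth. The 200 microsecond figure is an assumption picked
from "hundreds of microseconds"; 90 Mbps is the low end of the range
quoted above. The function name transfer_time_us is hypothetical.

```python
def transfer_time_us(size_bytes, latency_us, bandwidth_mbps):
    """Estimated transfer time in microseconds.

    1 Mbps is one bit per microsecond, so bits / Mbps gives
    microseconds directly.
    """
    return latency_us + (size_bytes * 8) / bandwidth_mbps


FE_LATENCY_US = 200.0      # assumed value, not a measurement
FE_BANDWIDTH_MBPS = 90.0   # low end of the 90-95 Mbps range

small = transfer_time_us(8, FE_LATENCY_US, FE_BANDWIDTH_MBPS)
large = transfer_time_us(1_000_000, FE_LATENCY_US, FE_BANDWIDTH_MBPS)
```

Under these assumptions an 8-byte message spends nearly all of its
time in latency, while a 1 MB block is dominated by the bandwidth
term.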
Some details about the FE in the argo cluster
o One NIC in a compute node
o Since UIC uses a class B address structure, I continued
that convention for the private network:
172.16 16 bit network address
X.X 16 bit host address
o Regarding compute nodes there is a relationship from hostname
to host address:
argoZONE-NODE 172.16.ZONE.NODE
* Examples:
argo4-4 172.16.4.4 argo4-3 172.16.4.3 argo4-2 172.16.4.2
argo3-4 172.16.3.4 argo3-3 172.16.3.3 argo3-2 172.16.3.2
...
So, for argo4-2, the zone is 4 and the node is 2. This allows
one to look at the rack and know the IP address.
o The exception is the master node (two NICs):
eth0: 172.16.0.2 (access to private network)
eth1: 128.248.121.64 (access to the world)
There was no particular logic behind designating 0.2 as the host
portion of the master's private IP address.
JackG gave me the IP address for eth1 (master). The 64 was
wishful thinking: we would eventually have 64 nodes.
If traffic becomes too much, JackG will replace the SMC hub
(broadcast device) with a switch (multi-port learning device).
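The hostname-to-address relationship for the compute nodes
(argoZONE-NODE maps to 172.16.ZONE.NODE) is mechanical enough to
sketch in code. The function name compute_node_ip is hypothetical;
the master is deliberately not handled because its addresses are the
exception noted above.

```python
import ipaddress

PRIVATE_NET = ipaddress.ip_network("172.16.0.0/16")


def compute_node_ip(hostname):
    """Map a compute node name argoZONE-NODE to 172.16.ZONE.NODE."""
    if not hostname.startswith("argo"):
        raise ValueError(f"not a compute node name: {hostname!r}")
    zone, node = hostname[len("argo"):].split("-")
    return ipaddress.ip_address(f"172.16.{int(zone)}.{int(node)}")
```

For example, compute_node_ip("argo4-2") yields 172.16.4.2, which
falls inside the private 172.16 network.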
B.6.1.2 Dolphin Wulfkit
---------------
Overview
o Dolphin is an SCI-based interconnect for Beowulf systems.
o Dolphin is hardware. Refers to a card (mother) and
a piggybacked daughter.
o Includes closed-source binary drivers and an
implementation of MPI tuned for the SCI network. This
is the software. It is referred to as Scali.
o SCI (Scalable Coherent Interface) is an IEEE standard
originally designed to provide an interconnect for
cache-coherent shared-memory systems. (Cache coherency
protocol: uniform view of the values in memory.)
o Used for parallel process communication among compute nodes.
+-----------------------------------------------------------------------+
| |
| Some node Some other node |
| +----------------------+ +----------------------+ |
| | +-----------+ | | +-----------+ | |
| | | Process 1 | --msg-> Interconnect ----> | Process 5 | | |
| | +-----------+ | | +-----------+ | |
| | | | | |
| | +-----------+ | | +-----------+ | |
| | | Process 2 --------> Interconnect ----> | Process 8 | | |
| | +-----------+ | | +-----------+ | |
| +----------------------+ +----------------------+ |
+-----------------------------------------------------------------------+
Advantages
o High performance including latency
* Latency (delay): .25 - .5 microseconds.
% Gigabit Ethernet: 24 - 30 microseconds.
% Don't know if these vendor supplied latency numbers are
round-trip or one-way
* Bandwidth: 1 Megabyte per second.
Disadvantages
o Current PC motherboard chip sets do not support coherency
systems required to construct an SCI-based shared memory
Beowulf.
Some details
o Dolphin cards are only in compute nodes.
o Model D335 (mother with a piggyback daughter)
o PCI card: 32 bit/33 MHz
o Technical specification
* www.dolphinics.com/products/pci64_adapter_card.html
o Pictures
* www.dolphinics.com/placed/subpages/photos/top_D333_low.jpg
* www.dolphinics.com/placed/subpages/photos/2000WukfkitAbove.jpg
o Additional information
* www.dolphinics.com
* www.scali.com
Rear view (same for each compute node)
--------------------------------------
+--------------------------------------+
| M C E0 | D1 D2 | <-- Dolphin mother
| K | D3 D4 | <-- Dolphin daughter
+--------------------------------------+
Connectors D1 and D2 are on the mother; D3 and D4 are on the daughter
D1: In connection (mother) D3: In connection (daughter)
D2: Out connection (mother) D4: Out connection (daughter)
Mother card: Intraring communication (connects nodes in the same zone/
ring).
Daughter card: Interring communication (connects nodes in different zones/
rings).
Cables:
Blue label is out.
Yellow label is in.
      Ring (zone) 1          Ring 2/Ring 3            Ring 4
+-------------------------+               +-------------------------+
| Node     M        D     |               | Node     M        D     |
| +---+ +------+ +------+ |               | +---+ +------+ +------+ |
| |1.4| |In|Out| |In|Out| |   ...   ...   | |4.4| |In|Out| |In|Out| |
| +---+ +------+ +------+ |               | +---+ +------+ +------+ |
| |1.3| |In|Out| |In|Out| |               | |4.3| |In|Out| |In|Out| |
| +---+ +------+ +------+ |               | +---+ +------+ +------+ |
| |1.2| |In|Out| |In|Out| |               | |4.2| |In|Out| |In|Out| |
| +---+ +------+ +------+ |               | +---+ +------+ +------+ |
| |1.1| |In|Out| |In|Out| |               | |4.1| |In|Out| |In|Out| |
| +---+ +------+ +------+ |               | +---+ +------+ +------+ |
+-------------------------+               +-------------------------+
How each compute node connects to every other compute node through the
Dolphin cards ("Tim Eisler SNA job security diagram")
Two-dimensional SCI torus topology
----------------------------------
Ring 1 Ring 2 Ring 3 Ring 4
+-----------------+ +-----------------+ +-----------------+ +-----------------+
|Mother card-L0 | |Mother card-L0 | |Mother card-L0 | |Mother card-L0 |
|1-1 out -> 1-3 in| |2-1 out -> 2-3 in| |3-1 out -> 3-3 in| |4-1 out -> 4-3 in|
|1-3 out -> 1-4 in| |2-3 out -> 2-4 in| |3-3 out -> 3-4 in| |4-3 out -> 4-4 in|
|1-4 out -> 1-2 in| |2-4 out -> 2-2 in| |3-4 out -> 3-2 in| |4-4 out -> 4-2 in|
|1-2 out -> 1-1 in| |2-2 out -> 2-1 in| |3-2 out -> 3-1 in| |4-2 out -> 4-1 in|
+-----------------+ +-----------------+ +-----------------+ +-----------------+
|Daughter card-L1 | |Daughter card-L1 | |Daughter card-L1 | |Daughter card-L1 |
|1-1 out -> 3-1 in| |2-1 out -> 1-1 in| |3-1 out -> 4-1 in| |4-1 out -> 2-1 in|
|1-2 out -> 3-2 in| |2-2 out -> 1-2 in| |3-2 out -> 4-2 in| |4-2 out -> 2-2 in|
|1-3 out -> 3-3 in| |2-3 out -> 1-3 in| |3-3 out -> 4-3 in| |4-3 out -> 2-3 in|
|1-4 out -> 3-4 in| |2-4 out -> 1-4 in| |3-4 out -> 4-4 in| |4-4 out -> 2-4 in|
+-----------------+ +-----------------+ +-----------------+ +-----------------+
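Every entry in the cabling table above follows one ring order,
1 -> 3 -> 4 -> 2 -> back to 1, applied to node numbers on the mother
cards and to zone numbers on the daughter cards. A small sketch that
regenerates the table from that order (the function names are
hypothetical; each pair is "out" host, "in" host):

```python
RING_ORDER = [1, 3, 4, 2]   # 1 -> 3 -> 4 -> 2 -> back to 1


def successor(i):
    """Next position around the ring after i."""
    pos = RING_ORDER.index(i)
    return RING_ORDER[(pos + 1) % len(RING_ORDER)]


def mother_cables():
    """Intrazone cables: same zone, next node in the ring."""
    return [(f"argo{z}-{n}", f"argo{z}-{successor(n)}")
            for z in range(1, 5) for n in RING_ORDER]


def daughter_cables():
    """Interzone cables: same node number, next zone in the ring."""
    return [(f"argo{z}-{n}", f"argo{successor(z)}-{n}")
            for z in RING_ORDER for n in range(1, 5)]
```

Each function returns sixteen cables, one per "out" connector, which
matches the sixteen rows in each half of the table.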
Pattern (intrazone - mother card connections)
---------------------------------------------
Node 1 connects to node 3
" 3 connects to " 4
" 4 connects to " 2
" 2 connects to " 1
Pattern (interzone - daughter card connections)
-----------------------------------------------
Zone 1 connects to zone 3
" 3 connects to " 4
" 4 connects to " 2
" 2 connects to " 1
Chart of the nodes and cabling (constructed from the above
information). Notice the differences between this chart and
the logical one given in section B.2.
+-----------------------------------------------------+
| +------+ |
| | | |
| | +---------+---------+---------+---------+ |
| | | argo4-2 | argo2-2 | argo1-2 | argo3-2 | |
| | +---------+---------+---------+---------+ |
| | | argo4-4 | argo2-4 | argo1-4 | argo3-4 | |
+---> +---------+---------+---------+---------+ ---+
| | argo4-3 | argo2-3 | argo1-3 | argo3-3 |
| +---------+---------+---------+---------+
| | argo4-1 | argo2-1 | argo1-1 | argo3-1 |
| +---------+---------+---------+---------+
| ^
| |
+------+
Get rid of the leading word argo and the dash
+----+----+----+----+
| 42 | 22 | 12 | 32 |
+----+----+----+----+
| 44 | 24 | 14 | 34 |
+----+----+----+----+
| 43 | 23 | 13 | 33 |
+----+----+----+----+
| 41 | 21 | 11 | 31 |
+----+----+----+----+
Now look at the above diagram with rows and columns
(xy coordinates)
+----+----+----+----+
x2 | 42 | 22 | 12 | 32 |
+----+----+----+----+
x4 | 44 | 24 | 14 | 34 |
Rows +----+----+----+----+
x3 | 43 | 23 | 13 | 33 |
+----+----+----+----+
x1 | 41 | 21 | 11 | 31 |
+----+----+----+----+
4y 2y 1y 3y
Columns
So there are four rows and four columns (in the Scali
documentation, they are referred to as rings). Later,
I will refer to this as the "golden list".
x2 4y
x4 2y
x3 1y
x1 3y
y dimension: nodes within the same zone
x dimension: nodes in different zones
All the above was to help explain routing
o Two types of routing in Scali
* Scali routing
* Dimensional
o Scali routing
* Fault tolerant algorithm
* Capable of maintaining full connectivity among the
remaining nodes even if more than one node has failed.
* When all nodes are working, it is equal to dimensional
routing (XY or YX).
* This is what we're using.
o Dimensional routing
* All motion in the first dimension must be done before
routing in the next dimension
* Two types: XY or YX
% XY
+ X is the first dimension and Y is the second dimension.
+ Will go to other nodes in a different zone.
+ Default
% YX
+ Y is the first dimension and X is the second dimension
+ Will go to other nodes in the same zone.
* Not fault tolerant - you will lose all nodes in the
X or Y dimension if one node goes down. That's not the
case for Scali routing.
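A sketch of dimensional routing as described above: finish all motion
in the first dimension before moving in the second. It assumes
traffic travels only in the out -> in direction of each ring
(1 -> 3 -> 4 -> 2 -> 1) and is not Scali's actual implementation;
ring_path and route are hypothetical names. Nodes are (zone, node)
pairs.

```python
RING_ORDER = [1, 3, 4, 2]   # direction of travel around a ring


def ring_path(start, goal):
    """Positions visited going around the ring from start to goal."""
    if goal not in RING_ORDER:
        raise ValueError(f"not on the ring: {goal!r}")
    pos = RING_ORDER.index(start)
    path = [start]
    while path[-1] != goal:
        pos = (pos + 1) % len(RING_ORDER)
        path.append(RING_ORDER[pos])
    return path


def route(src, dst, order="XY"):
    """Hostnames visited from src to dst, one dimension at a time."""
    (src_zone, src_node), (dst_zone, dst_node) = src, dst
    if order == "XY":      # change zone first, then node
        hops = [(z, src_node) for z in ring_path(src_zone, dst_zone)]
        hops += [(dst_zone, n) for n in ring_path(src_node, dst_node)[1:]]
    elif order == "YX":    # change node first, then zone
        hops = [(src_zone, n) for n in ring_path(src_node, dst_node)]
        hops += [(z, dst_node) for z in ring_path(src_zone, dst_zone)[1:]]
    else:
        raise ValueError(f"unknown order: {order!r}")
    return [f"argo{z}-{n}" for z, n in hops]
```

With XY, a message from argo4-1 to argo2-3 first crosses to zone 2
(reaching argo2-1) and only then moves within the zone. With YX it
moves within zone 4 first (reaching argo4-3) and then crosses zones.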
Procedure to determine how many nodes remain available in
an "up" state on the interconnect if we lose a node (using
Scali routing which is fault-tolerant).
Step 1: Identify the bad node on the cabling grid.
Example: argo4-1.
+----+----+----+----+
x2 | 42 | 22 | 12 | 32 |
+----+----+----+----+
x4 | 44 | 24 | 14 | 34 |
Rows +----+----+----+----+
x3 | 43 | 23 | 13 | 33 |
+----+----+----+----+
x1 | BAD| 21 | 11 | 31 |
+----+----+----+----+
4y 2y 1y 3y
Columns
Step 2: Identify in the cabling grid the row and column of
which the bad node is a member.
row: x1
column: 4y
Step 3: Identify in the cabling grid the other node members
in the row and column containing the bad node.
Row: argo2-1, argo1-1, and argo3-1 are in
the same row (x1) as the bad node (argo4-1)
Column: argo4-3, argo4-4, argo4-2 are in the
same column (4y) as the bad node.
Step 4: Remove from the "golden" list of columns and rows the
row and column containing the bad node.
Golden list Afterwards list
x2 4y x2 --
x4 2y x4 2y
x3 1y x3 1y
x1 3y -- 3y
Step 5: Regarding the other node members in the row and column
containing the bad node, see if each member has access
to at least one column or one row in the "Afterwards
list". If so, the node remains in the "up" state in
the interconnect.
Nodes in row x1: argo2-1, argo1-1, and argo3-1
Node      Member of   Still in list   Status
argo2-1   x1  2y      Yes-2y          up
argo1-1   x1  1y      Yes-1y          up
argo3-1   x1  3y      Yes-3y          up
Nodes in column 4y: argo4-3, argo4-4, argo4-2
Node      Member of   Still in list   Status
argo4-3   x3  4y      Yes-x3          up
argo4-4   x4  4y      Yes-x4          up
argo4-2   x2  4y      Yes-x2          up
The other nine nodes are not impacted by the loss
of argo4-1 and are not routed around.
Step 6: Total the nodes in "up" status (15) and redraw the
interconnect diagram.
+---------+---------+---------+---------+
| argo4-2 | argo2-2 | argo1-2 | argo3-2 |
+---------+---------+---------+---------+
| argo4-4 | argo2-4 | argo1-4 | argo3-4 |
+---------+---------+---------+---------+
| argo4-3 | argo2-3 | argo1-3 | argo3-3 |
+---------+---------+---------+---------+
| argo2-1 | argo1-1 | argo3-1 |
+---------+---------+---------+
The result is the loss of only one node. Scali will
route around node argo4-1.
What happens if you lose two nodes? Use the same procedure
as outlined above.
Step 1: Identify the bad nodes on the cabling grid.
Example: argo4-1 and argo3-2.
+----+----+----+----+
x2 | 42 | 22 | 12 | BAD|
+----+----+----+----+
x4 | 44 | 24 | 14 | 34 |
Rows +----+----+----+----+
x3 | 43 | 23 | 13 | 33 |
+----+----+----+----+
x1 | BAD| 21 | 11 | 31 |
+----+----+----+----+
4y 2y 1y 3y
Columns
Step 2: Identify in the cabling grid the rows and columns
of which the bad nodes are members.
Bad Node Row Column
argo4-1 x1 4y
argo3-2 x2 3y
Step 3: Identify in the cabling grid the other node members
in the rows and columns containing the bad nodes.
Argo4-1 Row: argo2-1, argo1-1, argo3-1
Column: argo4-3, argo4-4, argo4-2
Argo3-2 Row: argo4-2, argo2-2, argo1-2
Column: argo3-4, argo3-3, argo3-1
Step 4: Remove from the "golden" list of columns and rows the
rows and columns containing the bad nodes.
Golden list Afterwards list
x2 4y -- --
x4 2y x4 2y
x3 1y x3 1y
x1 3y -- --
Step 5: Regarding the other node members in the rows and
columns containing the bad nodes, see if each member
has access to at least one column or one row in the
"Afterwards list". If so, the node remains "up" and
in the interconnect.
For argo4-1
Nodes in row: argo2-1, argo1-1, and argo3-1
Node      Member of   Still in list   Status
argo2-1   x1  2y      Yes-2y          up
argo1-1   x1  1y      Yes-1y          up
argo3-1   x1  3y      No              down
Nodes in column: argo4-3, argo4-4, argo4-2
Node      Member of   Still in list   Status
argo4-3   x3  4y      Yes-x3          up
argo4-4   x4  4y      Yes-x4          up
argo4-2   x2  4y      No              down
For argo3-2
Nodes in row: argo4-2, argo2-2, and argo1-2
Node      Member of   Still in list   Status
argo4-2   Already down - see above
argo2-2   x2  2y      Yes-2y          up
argo1-2   x2  1y      Yes-1y          up
Nodes in column: argo3-4, argo3-3, argo3-1
Node      Member of   Still in list   Status
argo3-4   x4  3y      Yes-x4          up
argo3-3   x3  3y      Yes-x3          up
argo3-1   Already down - see above
The other four nodes are not impacted by the loss
of argo4-1 and argo3-2; they remain up.
Step 6: Total the nodes in "up" status (12) and redraw the
interconnect diagram ("it's the Chevrolet logo")
+---------+---------+
| argo2-2 | argo1-2 |
+---------+---------+---------+---------+
| argo4-4 | argo2-4 | argo1-4 | argo3-4 |
+---------+---------+---------+---------+
| argo4-3 | argo2-3 | argo1-3 | argo3-3 |
+---------+---------+---------+---------+
| argo2-1 | argo1-1 |
+---------+---------+
The result is the loss of four nodes. Scali will
route around them. Nodes 4-1 and 3-2 are bad. Nodes
4-2 and 3-1 are collateral damage.
Here's the resulting working interconnect if we lose three
nodes: argo4-1, argo3-2, and argo2-3:
+---------+
| argo1-2 |
+---------+---------+---------+---------+
| argo4-4 | argo2-4 | argo1-4 | argo3-4 |
+---------+---------+---------+---------+
| argo1-3 |
+---------+
| argo1-1 |
+---------+
Here's the resulting working interconnect if we lose four
nodes: argo4-1, argo3-2, argo2-3, and argo3-4:
+---------+
| argo1-2 |
+---------+
| argo1-4 |
+---------+
| argo1-3 |
+---------+
| argo1-1 |
+---------+
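The six-step procedure above reduces to one rule: a node stays up if
it is not bad itself and either its row or its column is still in
the "afterwards" list. The sketch below encodes that pencil-and-paper
rule only; it is not Scali's routing code, and surviving_nodes is a
hypothetical name. Nodes are (zone, node) pairs, so argo4-1 is (4, 1).

```python
ZONES = (1, 2, 3, 4)
NODES = (1, 2, 3, 4)


def surviving_nodes(bad):
    """Return the set of (zone, node) pairs still up on the interconnect."""
    bad = set(bad)
    dead_rows = {node for _, node in bad}   # row xN is keyed by node number
    dead_cols = {zone for zone, _ in bad}   # column Zy is keyed by zone
    up = set()
    for zone in ZONES:
        for node in NODES:
            if (zone, node) in bad:
                continue
            # Up if at least one of its two rings survives.
            if node not in dead_rows or zone not in dead_cols:
                up.add((zone, node))
    return up


one = surviving_nodes([(4, 1)])
two = surviving_nodes([(4, 1), (3, 2)])
three = surviving_nodes([(4, 1), (3, 2), (2, 3)])
four = surviving_nodes([(4, 1), (3, 2), (2, 3), (3, 4)])
```

These four calls correspond to the four worked examples above and
leave 15, 12, 7, and 4 nodes up respectively.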
Status (more on this in a separate presentation)
------
Two ways to get the status of the interconnect:
1) GUI (scadesktop)
2) Command mode
Because of the current hardware problem, the interconnect status via the GUI
is not available. Here is command mode:
On node argo4-4:
/opt/scali/bin/scinfo -l
ScaSCI (Scali SCI driver) v.2.4.9 (m), adapter 0, type D335, nodeid 0x1000
Current number of unterminated open calls: 4
Link hardware info for adapter 0
-------------------------------------------------------------------------
LC[0]: up 68166.536 s, enabled sr 0xa5ab fw 0x1a07 hwid 0xffee
LC[1]: up 68166.236 s, enabled sr 0xa5ab fw 0x1c07 hwid 0xffed
Link event counters, adapter 0
-------------------------------------------------------------------------
LC[0] LC interrupts: 3
LC[0] tot. sw.initiated resets: 1
LC[0] upstream error detections: 3
LC[0] software reinitializations: 4
LC[1] LC interrupts: 3
LC[1] tot. sw.initiated resets: 1
LC[1] upstream error detections: 3
LC[1] software reinitializations: 4
On node argo2-1 (the one causing the fuss)
/opt/scali/sbin/scinfo -l
ScaSCI (Scali SCI driver) v.2.4.9 (m), adapter 0, type D335, nodeid 0x0000
Current number of unterminated open calls: 4
Link hardware info for adapter 0
-------------------------------------------------------------------------
LC[0]: down enabled (hardware problem)
LC[1]: down enabled (hardware problem)
LC[2]: down enabled (hardware problem)
Link event counters, adapter 0
-------------------------------------------------------------------------
LC[0] tot. sw.initiated resets: 3425
LC[0] sw reinit failure count: 3424
LC[0] sw reinit failure resets: 342
LC[1] tot. sw.initiated resets: 3425
LC[1] sw reinit failure count: 3424
LC[1] sw reinit failure resets: 342
LC[2] tot. sw.initiated resets: 3425
LC[2] sw reinit failure count: 3424
LC[2] sw reinit failure resets: 342
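When checking many nodes it can help to pull just the link state out
of this output. The sketch below relies only on the shape of the
"LC[n]: up ..." and "LC[n]: down ..." lines in the transcripts above;
that shape is an assumption, not a documented format, and the
function names are hypothetical.

```python
import re

# Matches the link status lines, e.g. "LC[0]: up 68166.536 s, ..."
LINK_LINE = re.compile(r"^LC\[(\d+)\]:\s+(up|down)\b")


def link_states(scinfo_output):
    """Return {link number: "up" or "down"} from scinfo -l text."""
    states = {}
    for line in scinfo_output.splitlines():
        match = LINK_LINE.match(line.strip())
        if match:
            states[int(match.group(1))] = match.group(2)
    return states


def down_links(scinfo_output):
    """Return the sorted link numbers reported as down."""
    states = link_states(scinfo_output)
    return sorted(n for n, state in states.items() if state == "down")


SAMPLE_GOOD = """\
LC[0]: up 68166.536 s, enabled sr 0xa5ab fw 0x1a07 hwid 0xffee
LC[1]: up 68166.236 s, enabled sr 0xa5ab fw 0x1c07 hwid 0xffed
LC[0] LC interrupts: 3
"""

SAMPLE_BAD = """\
LC[0]: down enabled (hardware problem)
LC[1]: down enabled (hardware problem)
LC[2]: down enabled (hardware problem)
LC[0] tot. sw.initiated resets: 3425
"""
```

The counter lines (no colon after the bracket) are ignored, so only
the link status lines contribute to the result.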
Here is the actual error message that no one can figure out:
scaconscaconfsd: ScaliRouting.cpp:663: bool Mesh::SetFailedRing(const class ScaString &, int): Assertion `dim < 2' failed.
More on Scali in a different talk.
C. Turning on/off the cluster
--------------------------
C.1 Power is on - rebooting the entire cluster
o Run the "/sbin/shutdown -r now" command on the master node.
The master will reboot the compute nodes as a part of its
startup procedure (see the following section from the /etc/rc.local
file):
#
# Prevent users from logging in to the master node since we're about to
# reboot the compute nodes
#
/bin/cp /etc/nologin.msg /etc/nologin
#
# Reboot the compute nodes
#
for loop in 4 3 2 1 ; do
for loop1 in 4 3 2 1 ; do
rsh -l root argo$loop-$loop1 /sbin/shutdown -r now
done
done
echo "Sleeping for 6 minutes while the cluster nodes boot"
/bin/sleep 6m
C.2 Powering on the entire cluster.
o Start with master and then power on each node (4 3 2 1) in
each of the four zones.
C.3 Shutting down and Powering off the cluster
o Two steps:
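+ Run the "/sbin/shutoff h" command on the master node to halt the
compute nodes (see below; the script tests its argument for "h"
or "H", and reboots the nodes when given anything else).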
#! /bin/sh
# Default action is reboot ("r"); pass "h" or "H" to halt instead.
act="r"
if [ "$1" = "h" ] || [ "$1" = "H" ]
then
    act="h"
fi
for loop in 4 3 2 1            # zones
do
    for loop1 in 4 3 2 1       # nodes
    do
        rsh -l root argo$loop-$loop1 /sbin/shutdown -"$act" now
    done
done
+ Run the "/sbin/shutdown -h now" command on the master to shut it
down.
Obviously, the reverse procedure will not work. And, unless
there are no other options, don't strike the power or reset
buttons.
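The ordering matters: every compute node first, the master last. A
dry-run sketch that only builds the list of commands, in the same
order as the loops above, without running anything (shutdown_plan is
a hypothetical name, not a cluster tool):

```python
def shutdown_plan(action="h"):
    """Return the commands, in order, to halt ("h") or reboot ("r")."""
    if action not in ("h", "r"):
        raise ValueError(f"action must be 'h' or 'r', got {action!r}")
    plan = [
        f"rsh -l root argo{zone}-{node} /sbin/shutdown -{action} now"
        for zone in (4, 3, 2, 1)
        for node in (4, 3, 2, 1)
    ]
    # The master goes last; once it is down nothing can reach the nodes.
    plan.append(f"/sbin/shutdown -{action} now")
    return plan
```

Printing the plan before a maintenance window is a cheap way to
confirm that all sixteen nodes are covered and that the master is the
final step.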
C.4 Rebooting a single compute node
-------------------------------
Not a good idea. May impact the PVFS.
D. Operating System
----------------
D.1 Overview
o Came from RackSaver with Red Hat Linux 7.1 (kernel 2.4.2-2)
o Kernel on all machines upgraded to 2.4.9-31
* Two reasons for upgrade
% Had to patch routine i387.c with fix from Scali.
% Incorporate Athlon enhancements
o General procedure
* Used gcc_2.96-85
* RPM to get the source
* Patch to modify the i387.c routine
* make mrproper
* make menuconfig
* make dep
* make clean
* make bzImage
* make modules
* make install
* make modules_install
* Modified /etc/lilo.conf
* lilo -v
We did this for all seventeen machines. There is a handy product
called SystemImager that allows one to create a golden image of
a kernel and then export it to other nodes on the network. I've
yet to learn it.
o The /etc/lilo.conf on the master:
boot=/dev/hda
map=/boot/map
install=/boot/boot.b
prompt
timeout=50
message=/boot/message
linear
default=linux-2.4.9-31c
image=/boot/vmlinuz
label=linux-2.4.9-31c
read-only
root=/dev/hda5
image=/boot/vmlinux-new
label=linux-new
read-only
root=/dev/hda5
image=/boot/vmlinuz-2.4.2-2
label=linux
read-only
root=/dev/hda5
E. Partitions and filesystems
--------------------------
E.1 Master
o Overview
Two types of filesystems
* Linux ext2
* Parallel Virtual File System (PVFS; separate talk)
o Partition Layout
Disk /dev/hda: 255 heads, 63 sectors, 9729 cylinders
Units = cylinders of 16065 * 512 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 1 7 56196 83 Linux (boot)
/dev/hda2 8 9729 78091965 5 Extended
/dev/hda5 8 40 265041 83 Linux (root)
/dev/hda6 41 564 4208998+ 82 Linux swap
/dev/hda7 1089 1742 5253255 83 Linux (tmp)
/dev/hda8 2006 3051 8401995 83 Linux (opt)
/dev/hda9 3576 4621 8401995 83 Linux (usr)
/dev/hda10 5146 5669 4209030 83 Linux (var)
/dev/hda11 6718 9729 24193890 83 Linux (home)
No partitions on the master for the parallel filesystem.
o The above layout is not how the machine came from RackSaver.
Bob Hyman and I developed and executed a procedure to
repartition (expand and contract partition sizes) as well
as to add partitions without reinstalling the OS.
o Size information
* The size of hda6 was based on having at least 4X of memory
for swap (768MB * 4). And then some wiggle room.
* Gap between hda6 (swap) and hda7 (tmp) for swap expansion or
new partitions
* Gap between hda7 and hda8 to allow for tmp expansion.
* Gap between hda8 and hda9 to allow for opt expansion.
* Gap between hda9 and hda10 to allow for usr expansion.
* Gap between hda10 and hda11 to allow for var expansion.
Even though some of these partitions are unnecessary for
a compute node, we elected to include them (for example,
opt).
o If we get the raidzone or some such device, then hda11 (home) is
available.
o Filesystem particulars
* Filesystem 1k-blocks Used Available Use% Mounted on
/dev/hda5 256667 176745 66670 73% /
/dev/hda1 54416 7687 43920 15% /boot
/dev/hda9 8270100 1516604 6333400 20% /usr
/dev/hda10 4142832 52412 3879972 2% /var
/dev/hda7 5170696 284 4907752 1% /tmp
/dev/hda8 8270100 167800 7682204 3% /opt
/dev/hda11 23814136 55600 22548844 1% /home
argo4-4:/pvfs-meta 96393600 4898304 91495296 6% /scratch4
argo3-4:/pvfs-meta 96393600 4898304 91495296 6% /scratch3
argo2-4:/pvfs-meta 96393600 4898304 91495296 6% /scratch2
argo1-4:/pvfs-meta 96393600 4898304 91495296 6% /scratch1
The four scratch areas are the parallel filesystems (separate
lecture).
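The size notes for the master's partitions can be checked with a
little arithmetic on the cylinder numbers from the partition layout
(one cylinder = 16065 * 512 bytes). The table below copies the start
and end cylinders from that layout; the function names are
hypothetical.

```python
SECTORS_PER_CYLINDER = 255 * 63   # 16065, from the fdisk header
BYTES_PER_SECTOR = 512

# (name, start cylinder, end cylinder) from the master's layout
MASTER_PARTS = [
    ("hda6 swap", 41, 564),
    ("hda7 tmp", 1089, 1742),
    ("hda8 opt", 2006, 3051),
    ("hda9 usr", 3576, 4621),
    ("hda10 var", 5146, 5669),
    ("hda11 home", 6718, 9729),
]


def cylinders_to_mib(cylinders):
    """Convert a cylinder count to mebibytes."""
    return cylinders * SECTORS_PER_CYLINDER * BYTES_PER_SECTOR / 2**20


def gaps(parts):
    """Free cylinders between each partition and the one after it."""
    return [(a[0], b[0], b[1] - a[2] - 1)
            for a, b in zip(parts, parts[1:])]


swap_mib = cylinders_to_mib(564 - 41 + 1)
swap_target_mib = 4 * 768          # "at least 4X of memory"
master_gaps = gaps(MASTER_PARTS)
```

The swap partition comes out at roughly 4110 MiB against a 3072 MiB
target, which is the "wiggle room", and the first gap (swap to tmp)
is 524 cylinders, the same size as the swap partition itself.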
E.2 Compute nodes
-------------
o Overview
Three types of filesystems
* Linux ext2
* PVFS (/scratchX)
* NFS (home)
o Partition Layout
Disk /dev/hda: 255 heads, 63 sectors, 4865 cylinders
Units = cylinders of 16065 * 512 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 1 7 56196 83 Linux
/dev/hda2 8 4865 39021885 5 Extended
/dev/hda5 8 40 265041 83 Linux
/dev/hda6 41 406 2939863+ 83 Linux
/dev/hda7 407 1015 4891761 83 Linux
/dev/hda8 1016 1381 2939863+ 83 Linux
/dev/hda9 1382 1578 1582371 82 Linux swap
/dev/hda10 1579 1775 1582371 83 Linux
/dev/hda11 1776 1782 56196 83 Linux
/dev/hda12 1783 4830 24483028+ 83 Linux
o Filesystem particulars
* Filesystem 1k-blocks Used Available Use% Mounted on
/dev/hda5 256667 142535 100880 59% /
/dev/hda1 54416 9094 42513 18% /boot
/dev/hda8 2893628 1517968 1228668 56% /usr
/dev/hda10 1557464 30032 1448316 3% /var
/dev/hda6 2893628 76852 2669784 3% /opt
/dev/hda7 4814936 172 4570176 1% /tmp
/dev/hda11 54416 4 51603 1% /pvfs-meta
/dev/hda12 24098424 416 22873860 1% /pvfs-data
argo.cc.uic.edu:/home
23814136 55600 22548848 1% /home
argo4-4:/pvfs-meta 96393600 4898304 91495296 6% /scratch4
argo3-4:/pvfs-meta 96393600 4898304 91495296 6% /scratch3
argo2-4:/pvfs-meta 96393600 4898304 91495296 6% /scratch2
argo1-4:/pvfs-meta 96393600 4898304 91495296 6% /scratch1
o Differences between master and compute nodes regarding the layout
of partitions and filesystems
Partition    Master    Compute
+--------+---------+-----------+
| hda1   | boot    | boot      |
| hda5   | root    | root      |
| hda6   | swap    | opt       | *
| hda7   | tmp     | tmp       |
| hda8   | opt     | usr       | *
| hda9   | usr     | swap      | *
| hda10  | var     | var       |
| hda11  | home    | pvfs-meta | *
| hda12  | ----    | pvfs-data | *
+--------+---------+-----------+
* = differs between master and compute nodes
On the compute nodes, home is an NFS-mounted filesystem and does not
need a partition.
For reasons to be discussed in the PVFS lecture, the parallel file
system requires partitions on the compute nodes but none on the master.