The intent of this document is to provide a starting point for organizations planning the installation of a cluster. The idea is to help you develop and organize your plans so that you cover all the bases, from interconnect technology to backups and from test bed to production. In some cases we have left in information from the OSC cluster to help show how this document could be used. We reference Appendix A for identifying your specific configuration plans and Appendix B for planning your costs (these are site-specific documents and are not included).
The Center Acquisition and Support Plan for a Production Cluster
The Center Cluster Vision
Who will use the cluster, and how.
System Plan
How the system will be selected and purchased.
Production Impact Assessment
Reason for Assessment
For each item listed below, assess what is required before
the cluster is ready for production use. Review staffing support, costs and
other items of note.
Initial Move to Production
- System Hardware Configuration needed for production
-
- The current configuration with 4 nodes (8 CPUs) is not large enough to
support a great number of users. System expansion is needed to be able to
support more than 1 or 2 jobs and to support larger memory jobs.
-
- This expansion will cost $X and will require X man hours of administrator
time to install, configure and test. The complete configuration details are
in Appendix A.
-
- Since the hardware will not be delivered as a complete system, the
Center cannot develop global acceptance criteria. The Center staff
will be the final authority on installation readiness.
-
- Switch / Network Interconnect
-
- We recommend acquiring Myrinet SAN as the network interconnect for the nodes. This network must be expanded to support the new configuration. The number of nodes the network can support is currently 8.
-
- Budget
-
- See Appendix B
-
- Compilers & Libraries
-
- The following compilers need to be installed and available on the
cluster:
-
- Our preference is for the compilers and tools from the Portland Group (PGI).
PGI provides an integrated suite of tools and compilers that makes the
environment extremely easy to use.
-
- There is some cost associated with the compilers, depending on which compilers are used. Although there are GNU compilers available, even these have costs to install and support. To have a common programming environment with other scientific computational platforms at the Center, compilers from the Portland Group and / or Kuck and Associates may also need to be used.
-
- Using compilers from GNU will require the Center to research and update the compiler levels. This is expected to average X man hours a month. This number may be slightly higher during OS upgrades.
-
- Purchased compilers should have support and update information available. This should reduce the support effort, relative to the GNU compilers, to X man hours a month.
-
- Scientific libraries are needed. These should include support for BLAS, IMSL, and NAG. Investigation into the types, availability and costs of these libraries needs to be completed.
-
- 3rd Party Software needed for production
-
- Third party software is not required for production use of the system. The system is being positioned to take over the Gaussian job support, so it is expected that Gaussian should be available. Additional third party software requirements will be driven by the user workload.
-
- We have site licenses for the following 3rd party software. From this site licensed software, the following have versions that will run on the cluster: xxx, yyy, zzz.
-
- The following software should be on the cluster: package AAA, price $$$, package BBB price $$$, etc.
-
- Supporting third party software should not be much different than it is on current systems. It will be an additional platform for the software librarian to maintain.
-
- Space for system
-
- Moving to production levels of 128 CPUs (32 nodes) or even 256 CPUs (64 nodes) would require an additional rack for every 4 nodes (based on 4 processors per node).
-
- Each rack takes up 4 sq. feet of floor space, plus power and cooling. The Center needs to plan for the additional space and environmental support.
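-
- The rack and floor space figures above can be worked through with a short calculation. This is a minimal sketch, assuming the constants stated in the text (4 processors per node, 4 nodes per rack, 4 sq. feet per rack); the function name rack_plan is hypothetical, and power and cooling are not modeled:
```python
import math

CPUS_PER_NODE = 4    # from the text: 4 processors per node
NODES_PER_RACK = 4   # from the text: one rack for every 4 nodes
SQ_FT_PER_RACK = 4   # from the text: 4 sq. feet of floor space per rack


def rack_plan(total_cpus):
    """Return (nodes, racks, floor space in sq. feet) for a CPU count."""
    nodes = math.ceil(total_cpus / CPUS_PER_NODE)
    racks = math.ceil(nodes / NODES_PER_RACK)
    return nodes, racks, racks * SQ_FT_PER_RACK


for cpus in (128, 256):
    nodes, racks, sq_ft = rack_plan(cpus)
    print(f"{cpus} CPUs: {nodes} nodes, {racks} racks, {sq_ft} sq. feet")
```
-
- Under these assumptions a 128 CPU system needs 8 racks (32 sq. feet) and a 256 CPU system needs 16 racks (64 sq. feet). These are totals for the production system, not increments over the racks already installed.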
-
- Administrative Workload
-
- Moving the system into production will increase the system administrative (SA) workload. Currently the system is being supported part time by one SA. Support knowledge needs to be two deep. Until backup SA support is up to speed, this system should be in an 8-5, 5 day a week support mode. No after hours support should be required. Plans should be made to complete SA training within 3 months, at which time the system can move to 24 x 7 support.
-
- There should be some discussion about the level of 24x7 support. Since the Center will be providing all software and administrative support for the cluster, the limits of after hours support should be understood by the Center and users alike. Is travel to the site required to resolve after hours problems? Do other activities take precedence over cluster support?
-
- The primary administrator needs to take these issues and define policies and procedures for the cluster. In addition, the system must be added to the current Center support environments. These activities and others will take 50 to 75% of the primary SA's time in the 2-3 months leading up to production support.
-
- Backups / Recovery
-
- The cluster backup and recovery needs for user data should be handled by current procedures storing a backup file under the Mass Storage System. The backup and recovery procedure needs to be defined, tested and implemented. Backup and recovery procedure information must be made available to SAs and to the help desk. This should only take a few hours, at most, of SA time. No additional costs are expected.
-
- Backup and recovery of system disk areas must be defined. Are all the systems the same configuration? Will it be faster to recover by re-installing the system? Is there a broot available for fast recovery? Once again, the system disk recovery procedures need to be defined and made available to the SAs. This should only take a few hours, at most, of SA time. No additional costs are expected.
-
- Initial Users
-
- Usage of the system should be initially limited to a small set of "friendly" users that want to run on the new platform. These users should be willing to put up with unexpected system outages and changes to the hardware and software configurations.
-
- No expense is expected to be associated with supporting the initial users. There may be more time spent by the help desk supporting these users.
-
- Data / Diskspace
-
- User home directories will be NFS mounted to the cluster systems from a central SAN disk subsystem.
-
- Since each of the cluster systems runs an independent OS, with its own disk space, temporary disk space must be defined and accessible to each node. This can be done by defining /tmp space for each node and NFS exporting it to each system in the cluster. This is slow and administratively awkward. It also limits the size of /tmp and of the files that can be used there.
-
- A common /tmp, using some type of global filesystem, is the preferred alternative. /tmp needs to be large enough to support the many large files generated by Gaussian. The minimum /tmp size for the cluster is 100 GB. To get to this level will cost no more than $10,000.
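-
- The 100 GB minimum above is easy to verify once the filesystem is in place. The following is a minimal sketch, assuming decimal gigabytes and using Python's standard shutil.disk_usage; the function names tmp_shortfall and check_tmp are hypothetical:
```python
import shutil

MIN_TMP_BYTES = 100 * 10**9   # 100 GB minimum from the text (decimal GB assumed)


def tmp_shortfall(total_bytes, minimum=MIN_TMP_BYTES):
    """Return how many bytes short of the minimum a filesystem is (0 if OK)."""
    return max(0, minimum - total_bytes)


def check_tmp(path="/tmp"):
    """Report whether the filesystem holding `path` meets the minimum size."""
    usage = shutil.disk_usage(path)   # named tuple: total, used, free
    short = tmp_shortfall(usage.total)
    if short:
        return f"{path}: {short / 10**9:.1f} GB below the 100 GB minimum"
    return f"{path}: OK ({usage.total / 10**9:.1f} GB total)"
```
-
- Running check_tmp() on each node would confirm that every node sees a /tmp of the required size. The check looks at total capacity only; free space would need a separate threshold.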
-
- Documentation for
-
- Users
-
- User documentation needs to be available for basic use of the machine. This includes job submittal, job cancellation, how to monitor a job, and user and system limits. Documentation for compilers, libraries and 3rd party software should also be available. Documentation for procedures is also needed, including how to restore a file and how to obtain account credits. This documentation should be available on-line.
-
- User notification for system downtimes and maintenance needs to be added to the existing procedures.
-
- Support Staff
-
- In addition to the documentation noted above, documentation for the help desk and SAs is needed for system backup and recovery, account management, maintenance procedures and support procedures.
-
- This documentation will take some time to assemble and develop. It is expected to take X% of an SA for X weeks.
-
- Course Development
-
- Courses taught about using the cluster would be nice but are not necessary for the initial production use of the machine.
-
- Operations Support
-
- Cluster production use will require operations support. This support will be limited to monitoring a system console and notifying the on-call SA of problems with the system.
-
- Perl, awk and Tcl/Tk scripts will need to be developed to monitor the system and notify the operators of problems. Basic monitoring and notification should cover the following areas: nodes/CPUs, backplane network, HIPPI network, Ethernet network, disks, filesystem availability, reboots, backup failures and failure of the queuing system.
-
- This will take some time to develop, probably 1 to 3 weeks. There should not be any additional costs, other than SA time.
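-
- The monitoring and notification scripts described above share a simple shape: run a table of checks, collect the failures, and notify the operators. The following is a minimal sketch of that shape, written in Python rather than the Perl, awk or Tcl/Tk named in the text; all function names and the sample check table are hypothetical, and only a filesystem availability check is shown:
```python
import os


def filesystem_available(path):
    """A filesystem counts as available if its mount point is a directory."""
    return os.path.isdir(path)


def run_checks(checks):
    """Run each (name, function) pair and return the names that failed.

    A check that raises an exception is treated as a failure, so one broken
    probe cannot stop the rest of the monitoring pass.
    """
    failed = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def notify_operators(failed, send=print):
    """Pass one message per failed check to `send` (print by default)."""
    for name in failed:
        send(f"ALERT: check failed: {name}")


# Hypothetical check table; real entries would cover nodes, networks,
# disks, reboots, backups and the queuing system.
CHECKS = [
    ("filesystem /", lambda: filesystem_available("/")),
    ("filesystem /no/such/mount", lambda: filesystem_available("/no/such/mount")),
]

notify_operators(run_checks(CHECKS))
```
-
- Keeping the checks in a table means new areas can be added without changing the loop, and the send argument lets the same code page an on-call SA or write to the operator console.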
-
- User Resource Management
-
- For a small number of users, resource management should not be a big problem. As usage grows we will want to tie in user management to the tool set provided for the user.
-
- Accounting
-
- Accounting will need to run on the cluster. CPU accounting records will need to be processed daily. Account restriction support procedures will need to be developed for the cluster.
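-
- Daily processing of CPU accounting records and account restriction can be illustrated with a small example. This is a sketch only: the (user, cpu_seconds) record format, the allocation table and the function names are assumptions, since the text does not specify how accounting data is stored:
```python
from collections import defaultdict


def daily_cpu_totals(records):
    """Sum CPU seconds per user from (user, cpu_seconds) records."""
    totals = defaultdict(float)
    for user, cpu_seconds in records:
        totals[user] += cpu_seconds
    return dict(totals)


def over_allocation(totals, allocations):
    """Return users, sorted, whose usage exceeds their allocation.

    Users without an entry in `allocations` are treated as having none,
    so any usage at all flags them for review.
    """
    return sorted(u for u, used in totals.items()
                  if used > allocations.get(u, 0))


# Made-up sample data for illustration only.
records = [("alice", 3600.0), ("bob", 120.0), ("alice", 1800.0)]
totals = daily_cpu_totals(records)
print(totals)
print(over_allocation(totals, {"alice": 5000.0, "bob": 1000.0}))
```
-
- In this sample, alice has used 5400 CPU seconds against an allocation of 5000 and is the only user flagged. What happens to a flagged account is a policy decision for the restriction procedures.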
-
- Security
-
- Since the base operating system of the cluster is Linux, an OS that is used in many places, it will be the subject of many probes for security holes and possible exploitation. The SAs for the system will need to keep up with monitoring the system. The system will also need to have security patches applied regularly.
-
- This could (and should) occupy one to four hours of an SA's time per week.
-
- Maintenance
-
- Maintenance procedures for this system will differ greatly from other computational platforms. Since this system will not have one vendor for all support, the Center will be responsible for all system problem reporting, tracking, diagnosis and resolution.
-
- Reporting
-
- A procedure for problem reporting will need to be established. Use of existing tools is possible. The procedures should be developed and documented before production support begins. The information should be given to users and administrators.
-
- Follow-up procedures with users to close out reports need to be established.
-
- Problem Tracking
-
- Problem tracking procedures need to be developed. These procedures should be used to track user and staff reported system problems. They should track each problem to resolution and prioritize problems. They should also allow some method of problem pattern matching.
-
- Finally, the resolutions to problems should be stored with the tracking system to allow staff to research and resolve future problems.
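-
- The tracking requirements above (reporter, priority, resolution, and a way to find similar past problems) map onto a small record structure. The following is a sketch under assumptions: the Problem fields, the priority scale and the keyword search standing in for pattern matching are all hypothetical, and an existing tracking tool may already provide them:
```python
from dataclasses import dataclass


@dataclass
class Problem:
    ident: int
    reported_by: str          # "user" or "staff"
    summary: str
    priority: int             # 1 = most urgent (assumed scale)
    resolution: str = ""      # empty until the problem is resolved

    @property
    def resolved(self):
        return bool(self.resolution)


def open_by_priority(problems):
    """Unresolved problems, most urgent first."""
    return sorted((p for p in problems if not p.resolved),
                  key=lambda p: (p.priority, p.ident))


def similar_resolved(problems, keyword):
    """Resolved problems whose summary mentions `keyword` (case-insensitive)."""
    key = keyword.lower()
    return [p for p in problems if p.resolved and key in p.summary.lower()]


# Made-up entries for illustration only.
log = [
    Problem(1, "user", "Job hangs on node 3", 2, "Replaced network cable"),
    Problem(2, "staff", "Queue daemon not running", 1),
    Problem(3, "user", "Compile fails with missing library", 3),
]
print([p.ident for p in open_by_priority(log)])
print([p.resolution for p in similar_resolved(log, "node")])
```
-
- Storing the resolution on the same record as the report is what lets staff look up how an earlier, similar problem was fixed.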
-
- Hardware
-
- Most Centers are not in the position to do more than basic swap out support. In an effort to reduce the SA load, hardware support could be contracted out.
-
- Until all production issues are resolved, 8-5 support, M-F should be enough support for the cluster. Contracting for this support should cost about $XXX for a 128 node system. This contract for support should be in place before production begins.
-
- Software
-
- The Center should be responsible for all OS and other software support. For purchased packages, such as compilers, libraries and 3rd party software, the Center will need to report the problems to the vendor and track responses until resolution.
-
- This could take a significant amount of time. Plan on 1 to 5 hours a week for support of software on the cluster.
-
- Help Desk
-
- The help desk support procedures will need to be modified for cluster problem tracking. Otherwise, the normal help desk procedures will be followed.
-
Ongoing Support
- Administrative Workload
-
- There will be a workload issue for administrators for this system. Security, upgrades, problem tracking & resolution, as well as hardware and software maintenance support will all draw time from an SA. Since support of components of the system will come from many disparate areas, an SA will need to spend time monitoring product information for the hardware and software.
-
- On-call procedures still need to be maintained and probably adjusted. The Center expects that the equivalent of 1 FTE will be needed for this system to be maintained in proper production support.
-
- Compilers & Libraries
-
- After the initial production installation, normal support is expected. Additional libraries may be added to production use, based on user and system requirements.
-
- 3rd Party Software needed for production
-
- After the initial production installation, normal support is expected. Additional codes may be added to production use, based on user requirements.
-
- Backups / Recovery
-
- No changes are expected after the initial production setup.
-
- Users
-
- User support requirements will change as more users are added. Monitoring this area and making appropriate modifications will take some SA time. This should be planned for.
-
- Data / Diskspace
-
- Permanent disk space support should not change much. Temporary disk space use will need to be evaluated based on utilization. A global filesystem will be needed. Disk quota support and installation should be considered.
-
- Documentation
-
- Documentation will need to be updated and maintained. It is difficult at this point to determine the level of effort needed. Based on support for current systems, this will take approximately 2 hours a week, on average. There will be peaks and valleys of support, depending on changes to the system and support procedures.
-
- Course Development
-
- This will depend on the system utilization, user community and software selected.
-
- Operations Support
-
- Operations support should not differ greatly from other systems, once the appropriate monitoring and reporting procedures are developed and installed.
-
- User Resource Management
-
- Control of batch processing will be necessary. We will also begin to see issues with trying to support a growing interactive workload. We will need procedures to deal with these issues. The Center should consider installing the Portable Batch System (PBS), which is available via the Web. A commercial product, Load Sharing Facility (LSF), is also available.
-
- As many automated procedures as possible should be available to assist the SA.
-
- Security
-
- Monitoring of security issues will need to be done by the SA. Problems will need to be tracked and the appropriate patches applied. Failure to maintain standard SA procedures will affect the production use of this system.
-
- Maintenance
-
- Hardware maintenance will need to be reviewed to see if 24x7 support is required. Otherwise normal hardware support procedures developed earlier should be maintained.
-
- The problem tracking system should be reviewed and updated.
-
- Software maintenance will take up more of the SAs' time. OS, compiler and library upgrades will need to be synchronized across the cluster. Other problem resolution activities will need to be maintained.
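-
- Keeping OS, compiler and library levels synchronized across the cluster is easier to verify with an automated comparison. The following is a minimal sketch, assuming a per-node inventory of package versions has already been collected by some other means; the function name version_drift, the inventory layout and the sample version strings are hypothetical:
```python
from collections import Counter


def version_drift(inventory):
    """Find nodes whose package versions differ from the cluster majority.

    `inventory` maps node name -> {package name: version string}.
    Returns {package: {node: version}} for every node that disagrees with
    the most common version of that package (a missing package is None).
    """
    packages = {pkg for versions in inventory.values() for pkg in versions}
    drift = {}
    for pkg in sorted(packages):
        seen = {node: versions.get(pkg) for node, versions in inventory.items()}
        majority, _ = Counter(seen.values()).most_common(1)[0]
        odd = {node: v for node, v in seen.items() if v != majority}
        if odd:
            drift[pkg] = odd
    return drift


# Made-up inventory for illustration only.
inventory = {
    "node01": {"kernel": "2.2.14", "compiler": "3.1"},
    "node02": {"kernel": "2.2.14", "compiler": "3.1"},
    "node03": {"kernel": "2.2.12", "compiler": "3.1"},
}
print(version_drift(inventory))
```
-
- Here node03 is reported because its kernel level differs from the other two nodes. Comparing against the majority is one possible choice; comparing against a declared target level would also catch the case where most nodes are out of date.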
-
- Help Desk
-
- Normal help desk procedures are expected once production has begun and the cluster has been integrated into the support structure.
-
Updating the system
- System Hardware Configuration needed for production
-
- With the fast changing processor capabilities, keeping the system up to date may prove to be a challenge. Changes to the hardware will need to be done on a periodic basis to keep the system from being dismissed as outdated.
-
- The processors in the cluster will probably have a functional life span of less than 24 months. Upgrades will be a necessary evil for this system. It may also be an expensive necessity. Proper re-allocation of the existing processors will need to be done.
-
- Planning for support and funding of these "fast track" upgrades will need to be done.
-
- Space for system
-
- With the current trends in node and processor development, additional floor space requirements will be minimal unless a major system expansion is planned. The expectation of more CPUs per node should reduce the rack space requirements.
-
- Administrative Workload
-
- The administrative workload during system updates could be quite high. Determining the hardware components to upgrade to and trying to ensure they will support the network, current OS and software environment will take some time. The actual hardware upgrades will also require SA time.
-
- Compilers & Libraries
-
- Compiler and library upgrades will need to be matched to system hardware and OS levels. Some testing and evaluation will need to be done.
-
- Backups / Recovery
-
- Minor additional impact on backups. Recovery from old versions could prove troublesome.
-
- Data / Diskspace
-
- No problems expected.
-
- Documentation
-
- Documentation will need to be updated to reflect all system upgrades.
-
- Course Development
-
- Difficult to gauge impact at this point.
-
- Operations Support
-
- Minimal changes expected.
-
- User Resource Management
-
- No impact expected.
-
- Security
-
- No additional impact expected.
-
- Maintenance
-
- Hardware changes may affect the maintenance support contracts. All changes to hardware will require notification of the support vendor.
-
- Help Desk
-
- Problem tracking procedures must be closely followed during and immediately after any system upgrades.
-