ACCC Home Page ACADEMIC COMPUTING and COMMUNICATIONS CENTER
 

News - Sep 2006

   
 
     
router reload problem
 

Sep 27 2006 - We're having a router reload problem that is affecting several servers and services. Working on it.

 
     
Mailserv - problem on one of the back-end servers
 

Sep 22 2006 - We have a disk array problem on one of the mailserv back-end servers. Working on it. Some mailserv users may not be able to access their mail.

 
     
Tigger is Back
 

Sep 20 2006 - Tigger has been up since roughly 1AM, and so far appears normal. Even if it is now stable, and the signs are good, there will still be a bit of fallout to clean up. (Not to mention, deferred sleep compensation for the programmers working on this for 2 days.)

We now believe the original problem was due to a broken index on some key system files, which caused the system to slow to a crawl. A subsequent crash managed to corrupt some other files, and that corruption caused hard-to-diagnose problems on Tue.

We've restored an overall system configuration from Sept 11, and will adjust the differences manually. (User files are unaffected by this restore. They should be up to date.) So there may be a little fallout from this adjustment, but things are looking good right now.

Please report any further problems to consult@uic.edu

 
     
Tigger -- downtime Tue night
 

Sep 19 2006 - 1:30pm We now anticipate tigger being down Tue night, starting about 5pm for more troubleshooting.

 
     
Tigger -- continuing issues
 

Sep 19 2006 - 12:30 I may have spoken too soon. The fixes we applied yesterday helped a lot, and when tigger is up most things do seem to work. But we still have trouble with the database, and with overall stability. There may be more unscheduled reboots during the day, as we continue troubleshooting.

 
     
Tigger mostly back
 

Sep 19 2006 - Tigger is mostly back and available. Mail seems to be flowing. However, we still have some problems with our database, and this will affect the ability of the CSO to fix account issues, as well as various web pages including password changes, account creation, and so on.

 
     
tigger up again - we're very sorry for the lost day
 

Sep 18 2006 - Tigger is up again as of 9:45pm.

We're very tired, so just a brief summary for now.

Sometime Sunday afternoon, something (we still do not know what) caused tigger logins to be very slow and/or hang. We spent the night running diagnostics on the hardware because of an error message that eventually was found to be a red herring. The bottom line is that the problem was caused by a very obscure bug in an IBM program that makes indexes that are supposed to improve the speed of password authentication on very large systems (tigger has 10,000 user accounts).

The bug has apparently been present since we upgraded to AIX 5.3 last summer (this may also explain why tigger performance has not been good since the semester start). After spending an entire day on the phone with IBM level 1 support, we discovered and applied a patch to bring us from AIX 5.3.0.4 to 5.3.0.5. It turns out IBM discovered the bug in August and came up with a patch only about 2 weeks ago.

We sincerely apologize for the long outage and any related problems caused by this event.

 
     
tigger still unstable - logins almost impossible
 

Sep 18 2006 - 11:30AM We're working with IBM to resolve problems with tigger. A few aspects seem to work, but most do not.

 
     
tigger up but still unstable - logins and web pages slow
 

Sep 18 2006 - NOTE: Please be patient while your ssh or telnet sessions start - you should get in eventually.

Tigger was down much of the night and we are still experiencing various performance problems, primarily slow logins and web pages. We're still trying to track down the root cause and will post here when we know more. Here is a posting from the sysadmin who stayed up all night with the problem:

---

Service Express (our hardware maintenance provider) was here, and we fiddled around in diag for quite a long time. What he saw was a duplicate entry for one particular memory simm. He deleted it, and marked the memory "repaired" in diag. Then we rebooted, and it came up normally!

There had been a message upon some of the earlier reboots that made me suspicious of microcode, so I went to IBM's microcode site, and I think I found the smoking gun, in the description for a new microcode version 3K060626 for the model 7038-6M2 computer which was just released a couple weeks ago on August 21, 2006:

"A problem was fixed that was causing enhanced error handling (EEH) error codes to be erroneously generated when certain adapter card configurations were heavily stressed by the application code."

(I wish I could find out what those "certain" adapter card configurations are!)

So, our conclusion is that we hit a rare microcode bug, and that what he did to clear it did indeed solve the problem, but that we need to schedule a microcode update for tigger real soon or it could happen again.

This was one of the most difficult-to-solve problems I have ever encountered, and I'm still not confident we identified what was broken or how what we did fixed it. I'm not even very sure whether it was hardware, software, or microcode.

 
     
tigger down
 

Sep 17 2006 - Tigger crashed Sunday evening. Unfortunately, it failed to fully reboot. We have no diagnosis or prognosis yet.

 
     
blackboard down for upgrade
 

Sep 16 2006 - Blackboard will be down from 8 PM Sat. Sep. 16 to Noon Sun. Sep. 17 for a hardware upgrade. We hope to be done before Noon but we are allowing extra time in case problems occur. This upgrade will move the Blackboard application to a server which is much faster than the current server.

 
     
Small DNS outage
 

Sep 15 2006 - One of our DNS servers lost a disk. There may be some apparent outage, although most programs should fail over to redundant servers. We'll have the disabled server replaced in a few minutes.

 
     
BLACKBOARD DOWN FOR MAINTENANCE
 

Sep 04 2006 - Blackboard is down for emergency maintenance to implement recently suggested configuration changes from Blackboard Inc. support and to add additional server hardware.

We apologize for the lack of warning and hope to be back up around 2am.

 
     
Blackboard's performance
 

Sep 01 2006 - We have significantly improved Blackboard's performance, but it is not back to normal yet. It is an intermittent problem that only manifests itself, sometimes, when lots of users are on the system. Several people from ACCC and from Blackboard Inc. continue to work on its resolution.

 


   JGS