It's now 1:29 AM, Jan 12, 2008. Approximately 48 hours ago, I was running a RAID resync on the main file server for the company. It was late and the process was going to take about 250 minutes (~4 hours). That's a long time to wait, so I decided to go home. I returned 8 hours later to find the entire system unusable. I panicked. This was the first time that I truly panicked... I tried my best to get it working, but it seemed futile. The system had stopped working and I was s.o.l. From what I gather, the RAID was already in degraded mode, hence the need to resync... but before the resync could finish, another drive failed, so the whole RAID set was gone. RAID5 can only survive 1 failed disk. 2 failed disks are catastrophic!
I have never been up as long as I was in the 28 hours that followed! The trip to Italy was 23, and even then I got to sleep on the plane... I brought up a new server using CentOS, installed all the necessary daemons, restored backups, and copied anything else that was needed. It was a marathon of file transfers totaling about 350GB! After 28 hours of non-stop file transferring, configuring, and tweaking, the system seemed ready for use. It had to be; the company could not afford any more downtime... People were back to using the system, and I got to go home to sleep.
As I look back on this episode, I get the feeling that the days of scrambling will be no more. No doubt management will now pay a premium to have high-availability (or even clustered!) systems. The days of free rides are over. It saddens me a bit that the skills I possess will now be passed over in favor of outsourced solutions. My days as help-desk / tech support / network engineer / system admin / programmer / systems analyst seem numbered... Only time will tell.
But in the meantime, I can prepare myself by learning new skills. I have to adapt to the changes that are taking place. Hmm... now, where is that Cal State catalog?...