As anyone who is on a bikelist.org/phred.org mailing list knows, the mail server has been down for two days. The same server hosts this blog, my personal email, personal email for most of my family, and nearly a dozen other random websites. It runs Windows Server 2003 along with Exchange and SQL Server.
On Thursday (11/17/2005) I was doing some routine stuff (installing service packs, etc). After one reboot the system came back up, but many key services wouldn’t start and were reporting that important system files were corrupt. I started a full chkdsk (fsck for unix folks) and there were tons of errors, including some in the drivers for the RAID controller. After the chkdsk the system no longer booted.
My first step was to boot into the Recovery Console to see whether the most important stuff (the Exchange and SQL databases) was intact. Sadly, by default the Recovery Console only lets you look at the files in the system directory (normally c:\windows). There is a KB article on changing this, but you need to be able to boot the system to do that.
I tried using the repair installation mode of Windows Server. This half worked, but when it went to reboot the system would blue screen. The error was “INACCESSIBLE_BOOT_DEVICE”. My boot volume was on a disk controlled by a 3Ware 6410 controller. This is a RAID controller which uses standard IDE (desktop/consumer) disks but can combine several of them into a single volume. I ran it in RAID 1 (mirroring), which means that two disks were exact duplicates of each other, so that if one failed the other could keep running. Windows Server doesn’t include drivers for the 3Ware controller, the drivers from 3Ware aren’t signed (so they haven’t been tested by Microsoft), and the repair installation didn’t seem to restore them. So the system couldn’t get very far into the boot process because it couldn’t talk to the disks.
At this point it was about midnight, and I knew I had work the next day, so I tried to go to bed. Before going to bed I started installing another copy of Windows Server (this is really handy to have around during disk crashes) so that I could get to my data.
At 1:45am I still hadn’t gone to sleep, so I got up to look at what was going on. The install had worked and I could see that the important data was in good shape. I tried copying the 3Ware drivers from the new install over the corrupted ones on the original install, but it still wouldn’t boot. I spent about an hour poking around the registry trying to make it work, with no luck, and went back to bed (at around 3am).
At 7:15am I woke up (I can’t sleep in for the life of me) and thought about it some more. The 3Ware controller isn’t really that helpful for me since Windows Server can do the mirroring in software just fine. Since the drivers aren’t signed and the logical disk corruption could have been caused by the 3Ware controller drivers (or many other things), I decided to stop using it for my boot volume. I drove into work (bringing the server with me) and once I got there I started to copy the entire volume to a normal drive plugged into one of the motherboard controllers. This took a very long time, mostly because the archives consist of hundreds of thousands of tiny little text files that take a while to copy. I used XXClone to do the copy because it is supposed to leave the disk in a bootable state when it is done.
When the copy was done I removed the RAID controller and tried to boot off the new disk. The computer’s BIOS told me that it couldn’t find a disk to boot from. When I went to check the boot order it said “hard disk (not installed)”, but when I checked the list of devices it did show a hard disk. So I ran Windows Repair mode again to see if this could make the disk bootable. It still didn’t boot, and at this point I was starting to think that the computer itself was having issues. I was also at work, so I put everything aside and worked for a few hours.
On my way home I was thinking about going to Fry’s to buy a new computer. Luckily for me traffic was horrible and so I took the direct route home instead of going to Fry’s and back which would take me through some of the worst traffic areas around. On my way home I realized that I could just run the server on my wife’s computer and wouldn’t need to buy a new one this weekend. I got home and started heading down that path.
This is when I discovered that the disk still wasn’t bootable. I put the disk from her computer into the server machine and it was bootable, so I knew the server machine was still okay. I installed another fresh copy of Windows onto the disk and finally it became bootable. At this point it was 6pm on Friday and I had been working on this on and off for about 20 hours. I finally booted the machine back into the original Windows installation and everything seemed to be working. Yay!
I spent the next 5 hours fixing some remaining corrupt files, installing service packs, and putting the machine back together. At 11:30pm I put the server back into its normal home in my basement, announced to the world that everything was working, and went to bed.
I woke up at 9ish (yay for finally getting some sleep) and things weren’t working. I had forgotten to set a critical networking variable on the machine before going to bed: the address of my router (the default gateway). The router could send packets to the server, but the server didn’t know how to send them back out. This meant that I could see it from my FreeBSD box sitting right next to the computer, but nothing else on the network (including computers on a different network in my house) could see the server. I figured this out and got my first piece of spam. Always a good indication that you are back online.
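For anyone wondering why a missing router address has exactly that effect, here is a rough sketch in Python of the decision a machine makes for each outgoing packet. It is only an illustration: the function name `next_hop` and all of the addresses are made up for the example and are not my real network, and a real IP stack consults a full routing table rather than a single rule.

```python
import ipaddress

def next_hop(local_iface, destination, default_gateway=None):
    """Decide how a host would send a packet to `destination`.

    local_iface is the host's own address with prefix, e.g. "192.168.1.10/24".
    All addresses used with this function are hypothetical examples.
    """
    iface = ipaddress.ip_interface(local_iface)
    dest = ipaddress.ip_address(destination)

    # Destinations on the same subnet are delivered directly, no router needed.
    if dest in iface.network:
        return "direct"

    # Anything else has to be handed to the default gateway (the router).
    if default_gateway is not None:
        return f"via {default_gateway}"

    # No gateway configured: the host has nowhere to send the reply.
    return "unreachable"


if __name__ == "__main__":
    server = "192.168.1.10/24"
    print(next_hop(server, "192.168.1.20"))                # neighbor on the same subnet
    print(next_hop(server, "192.168.2.5"))                 # other subnet, no gateway set
    print(next_hop(server, "192.168.2.5", "192.168.1.1"))  # other subnet, gateway set
```

The box sitting next to the server falls into the first case, which is why it could still see the server, while replies to everything else fall into the second case until the gateway is set.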
There are still some things to do:
- We’re running on a single unmirrored disk right now. It is also a couple of years old, so I don’t really trust it. I’m going to buy another disk or two today or tomorrow to mirror with it. This should be my only expense in this huge mess; mostly the whole thing just took a lot of time.
- I need to get backups going again. The disks that I normally back up to are sitting in a pile on the floor and need to be reinstalled.
- I need to make it so that the Recovery Console lets me get to my whole disk in case I need that in the future.
I’m glad to be back online.
alex