Tuesday, May 26, 2009

Mac Boot Mysteries

This is a long, digressive story about diagnosing a hardware problem on a Mac; if you dislike such stories, feel free to leave now.

About a week ago, my wife Candy complained to me that her Mac wouldn't boot up. This is my hand-me-down Mac (we have a new policy in our house: Candy gets my hand-me-down computers, and I get her hand-me-down cameras), which means that it's about 2 years old, but it has a relatively new hard drive that I installed last November. A long time ago, I had set it to always start up in verbose mode. You can get this on demand by holding COMMAND-V at startup, or permanently by issuing the following command:

sudo nvram boot-args="-v"
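One caveat with the command above: it replaces whatever was already stored in boot-args. The sketch below shows one way to add or remove the -v flag while keeping any other arguments. The helper names add_verbose and remove_verbose are made up for illustration and are not part of any Apple tool; the nvram calls are left as comments because they need a Mac and admin rights.

```shell
#!/bin/sh
# Hypothetical helpers for editing a boot-args string; not an Apple tool.

# Print the given boot-args string with -v appended, unless it is already there.
add_verbose() {
  case " $1 " in
    *" -v "*) printf '%s\n' "$1" ;;
    *)
      if [ -n "$1" ]; then
        printf '%s -v\n' "$1"
      else
        printf '%s\n' "-v"
      fi
      ;;
  esac
}

# Print the given boot-args string with every standalone -v removed.
remove_verbose() {
  out=""
  # $1 is intentionally unquoted so it splits into individual arguments.
  for arg in $1; do
    [ "$arg" = "-v" ] && continue
    out="${out:+$out }$arg"
  done
  printf '%s\n' "$out"
}

# On a Mac, usage might look like this (assumes boot-args is already set):
#   current=$(nvram boot-args 2>/dev/null | cut -f2-)
#   sudo nvram boot-args="$(add_verbose "$current")"
#   sudo nvram -d boot-args    # or delete the variable entirely
```

If verbose mode is the only boot argument you ever set, the one-liner is all you need and deleting the variable undoes it.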

Anyway, I could see from the startup porn that she was having a kernel panic with 2 likely suspects: the fan control daemon and something about Cisco VPN. Now, Candy doesn't have a Cisco VPN, but given that this was my hand-me-down machine, that explains why some of that stuff is there. Candy hadn't installed anything in the last week or so, leading me to think that one of these two was the culprit. She had been complaining that her machine was getting slower and slower, including things like window resizing, which had me puzzled. Perhaps a dying fan was causing the processor to overheat and thus slow down?

I tried safe boot (no joy), and at this point I suspected the fan. I'm certainly not afraid to crack open a Mac (with proper respect), but replacing a fan isn't high on my list of fun things to do, so we made an appointment at the Genius Bar. To Candy's credit, she had a SuperDuper! backup that was just a couple of days old, so virtually everything was safe.

We went to the Genius Bar, where the GenX slacker (this is a compliment) booted the Mac from an external drive. I hadn't tried this (even though I have several bootable drives lying around) because I was fixated on the fan problem. After booting it up, his suspicion shifted to the VPN stuff, and I reluctantly concurred (especially after he ran some fan diagnostics). Now, though, the question remained: why did this problem suddenly occur? What was his (depressing) advice to fix this problem? Reinstall Leopard and all your applications. What?!? Is this a freakin' Windows machine? I couldn't believe that was real Genius advice. I've never yet had to do a ground-up reinstall of everything, but if that's the only way...hmmmm. He was very knowledgeable, but obviously he doesn't tread in the realm of VPN stuff. He also correctly pointed out that a bad fan shouldn't cause slowness: redrawing windows is mostly handled by the GPU on the Mac. The slowness was, as far as I can tell, a red herring.

When I got home, the first thing I did was boot Mac OS X from an external drive and take a fresh SuperDuper! snapshot of the internal drive, so that I had a truly current copy. Once I had that, I could play. Candy had already agreed to the pain and degradation of reinstalling everything, but I had to think there was a better way. Then, I had a brainstorm: I took the SuperDuper! snapshot I had just made and booted the machine from the external drive. Success! That suggests that some part of the internal hard drive that houses the VPN stuff had somehow gotten corrupted, while the same image still boots fine from an external drive. Because I had the SuperDuper! safety net, I decided on an experiment. I reformatted the internal hard drive and ran Drive Genius on it to scan for bad sectors. Nothing of note came from that, but then I overlaid my most recent SuperDuper! snapshot back onto the internal drive.

Success! The internal drive now boots, and everything appears back to normal. I'm guessing that my theory about a corrupted patch of the internal drive was correct, even though the scan itself turned up no bad sectors.

Lessons:

  • Don't reinstall everything! My record is still clean on that account: I have never had to do that on a Mac yet (and it was a once-a-year chore on Windows because of bit rot).

  • Always have good backups. This would have been a tragedy rather than a comedy if Candy hadn't been using SuperDuper!. It has yet to let me down, and it has saved my bacon on several occasions.

  • I immediately latched onto the fan because it seemed to support other observed phenomena. I should have booted it myself from an external drive and run Drive Genius, but I thought I had it figured out.

  • Stop and think. It was a good thing that we had dinner plans with a neighbor when we got back from the Genius Bar. It was over dinner that I had the idea of just overlaying the snapshot again. If I had started on it as soon as we got back, I would have been creating a lot of movement without a lot of forward progress. Sitting and thinking about it opened my mind to alternative options.

  • SuperDuper! rocks. I can't imagine life without it.

1 comment:

Unknown said...

Neal, the problem may have been due to a corrupt HFS+ partition. At one point, I had a problematic MacBook Pro; it wouldn't reboot. I thought to myself, oh heck, it's a Unix running a journaled FS, I'll just shut down hard when I need to reboot and fsck will take care of it. WRONG! I ended up with a partition that couldn't boot up the day before a talk I was giving at OSCON.

After surviving the week by using a SuperDuper! backup from an external drive, I was able to format/restore and all was well. That was how I learned that HFS+ can't handle unclean shutdown, journaled or not.