blarg?

Cascading Hardware Failure

So, we’ve got these two clustered Oracle servers, right? And since they’re running Oracle, they must be for something big and expensive, so we’ve got to test them with our application to make sure that if some part of them fails, our applications either handle that nicely or die as quickly and cleanly as possible. Because having bad data hit the disks as a result of some hardware failure would be bad, you know?

Bear in mind, of course, that this is end-of-day Friday, and I’ve got to catch me a plane to get back home soon.

So, clustered servers have this “heartbeat” cable between them, so they can both feel warm and fuzzy knowing that the other one is right there next to them being supportive in a load-balanced, atomic-transaction, servery way. So, we reckon, what happens when you pull that cable while you’re running our super-important application?

Holy crap, it’s exciting.

So, I pull the cable, and one server very promptly dies. Over the course of twenty seconds or so the Oracle app locks up, then the UI locks up, and then the screen turns blue and tells me that 0x0000FFFF means something horrible has happened.

Then shortly after that, our cab doesn’t show up to get us to the airport, so we have to figure out how to cross six lanes of highway and flag one down on the other side. When we do, it’s already got a trunk full of luggage and a passenger in the front seat who looks extremely serious and doesn’t say anything. We pile into the back with our suitcases, the meter is never turned on, and we drive to the airport in silence, paying the guy the fifteen-dollar airport minimum.

Then we go to the self-checkin, and the ticket-printer widget tells us it’s printing our tickets and then it crashes too. On our second attempt, the machine tells us we’re troublemakers, and to go see the counter, and now we’re a little behind, so we hustle.

Then we get to the plane on time, and I have just enough time to find out that my iPod and Gameboy are both out of juice before we get sent out to sit in the plane on the runway for an hour while we wait to be de-iced.

And the flight turned out to be one of the jitteriest, shakiest rides I’ve ever been on, bar none. Plane, boat, carnival, Canada’s Wonderland, mechanical bull, whatever. The landing was, in particular, the most exciting I’ve ever been a part of, and everyone on board was visibly a little shaken.

Then we almost got trapped in an elevator in Union Station, as it dropped half a foot and jerked us around for a few exciting moments before it consented to let us out at our floor.

So, anyway, I’m back home now, and with a bit of luck nothing else wi##.
@.!!%@
$#&#@
$#@$#@+++NO CARRIER+++

2 Comments

  1. Quotation

    It’s a good thing that you went directly to your blog as soon as you got home from your trip. If you’d gone to your wife, you’d likely have broken her, too.

  2. Adam

    Am I the only one who finds it blatantly obvious that these problems begin and end with the fact that you were using Oracle instead of, say, DB2?

    Oracle is brutal at scheduling cabs and flying airplanes (and don’t even get me started on their elevator algorithms). DB2, on the other hand, would have got you home in less than half the time without any de-icing, surly cab drivers, or elevator “irregularities”.

    It’s just that good.