blarg?

DeathRaceCondition 2000


I had a race condition eat my week. Stick around, I’ll tell you what that’s about.

Here’s some IRC, with some comments.

10:29 < mhoye> Oh, man. I think I've finally fixed this bug.
10:30 < mjschranz> mhoye: What was the bug?
10:30 < mhoye> Multi-user, multi-stack web services are the most horrible thing.
10:30 < mjschranz> mhoye: Sounds like a nightmare.
10:32 < mhoye> mjschranz: When we built this Firefox customizing thing, we built a pretty clever way to scan a directory tree and construct an .MSI file out of the result.
10:33 < mhoye> The software stack is a trail of tears. web -> php -> python -> wine -> mono (the .net emulator) -> emulated DOSSHELL commands  and then all the way back.

I promise you it’s necessary. I’ve stared at this stuff for a long time, and there’s no other way. I overlooked bash and MySQL in that list, too. Go, me.

10:34 < mhoye> And at one point, we decided to hardcode in some work directory names.
10:36 < mhoye> the MSI-building tools have this awful bug in them such that the absolute path of a Linux filename, as munged through that stack and turned into a dosshell filename, has to be less than 96 characters.
10:37 < mhoye> So we put everything in /tmp/m - Nice and short, and Firefox's tree is never so deep that we'll get close, so we should be OK.
10:37 < mhoye> But it's a multiuser system.
10:37 < mhoye> What happens if two people happen to try building stuff in /tmp/m at the same time?
10:38 < mhoye> I was staring at that code the other day, on the verge of giving accounts to clients when it dawned on me.

If there’s one thing I’m proud of here, it’s that I saw this in the code before a customer saw it on their screen.

Imagine you’re wrapping gifts for a department store. The process is simple: you take the gift, you get a box, you put it in the box. You wrap the box, you hand it to the client. Nothing to it.

But imagine if two people are using the space to do that, at the same time; you’re both racing to see who gets to which step first, to see who gets to use that space. And because these two people aren’t people at all, but processes on a computer, they don’t just nod, wait and take turns of their own initiative; they just follow the process. And whoever “wins” the race, getting there first, gets their workspace flattened as the other process moves their own work into it.

You take a gift. The other guy takes a gift. Both of you find a box that fits your gift, and one of you puts it on the table first. And then the second guy puts theirs on the table.

What happens then? Does the second box get put inside the first one, if the first is larger, or on top of it, if the first is smaller? Does it just get brushed aside?

And then it gets worse, because then the gifts get dropped into the mix. Maybe a too-large gift smushes the smaller box, maybe both get put in the same box, maybe the small one gets put in the larger box before being crushed under the larger gift. Maybe one of them got wrapped already, if one of the processes thinks it’s that far ahead, and then gets a new, unwrapped gift dropped on top of it.

Shortly, whatever mess is left on that table gets handed by one of the processes to one of the clients. You can never be sure which one, or what state that package is in, but you’re virtually guaranteed that it’s not something the client wanted. And better yet, the other process can then turn around and maybe wrap up a bunch of empty space where the gift used to be, and hand that empty wad of wrapper off to their customer. Assuming some third client hasn’t come along to use that space at the same time. Or maybe forty more clients.

That class of problem is what’s called a “race condition”, and they’re some of the most subtle, pernicious and difficult problems in any kind of process management, software or otherwise.
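Here is a minimal sketch of the gift-wrapping table in Python. It is not the real build code: the client names, the `gift.txt` file and the two-step process are all made up for illustration, and the steps are interleaved by hand so the bad outcome shows up every time rather than only once in a while.

```python
import os
import tempfile


def stage(work_dir, client):
    """Step 1: put this client's gift on the table, flattening what was there."""
    with open(os.path.join(work_dir, "gift.txt"), "w") as f:
        f.write(client)


def package(work_dir):
    """Step 2: wrap up whatever happens to be on the table right now."""
    with open(os.path.join(work_dir, "gift.txt")) as f:
        return f.read()


def interleaved_builds(work_dirs):
    """Stage every client's build first, then package each one.

    work_dirs maps a (hypothetical) client name to the directory its build uses.
    """
    for client, work_dir in work_dirs.items():
        stage(work_dir, client)
    return {client: package(work_dir) for client, work_dir in work_dirs.items()}


# One shared table, like a hardcoded work directory.
shared = tempfile.mkdtemp()
shared_result = interleaved_builds({"alice": shared, "bob": shared})
print(shared_result)   # {'alice': 'bob', 'bob': 'bob'}

# One table per client.
private_result = interleaved_builds(
    {"alice": tempfile.mkdtemp(), "bob": tempfile.mkdtemp()}
)
print(private_result)  # {'alice': 'alice', 'bob': 'bob'}
```

With the shared directory, the first client walks away with the second client's package. With real concurrent processes the interleaving would vary from run to run, which is a big part of why this kind of bug is so hard to spot.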

10:39 < mhoye> the _best case_ scenario there is that it all just goes to hell.
10:39 < mhoye> Everything breaks for no obvious reason.
10:39 < mhoye> That's the best case.
10:39 < mhoye> The worst case is that one of my clients is given _somebody else's_ .MSI.
10:43 < mhoye> The real problem turned out to be "finding everywhere we've hardcoded that directory name, and figuring out how to pass that variable around".
10:44 < mhoye> And it was hidden in some places that, in retrospect, seem obviously nuts.

I’m sure all of this would have been easier if I was smarter.

One of the worst was finding that we’d put “-om” in the arguments we pass to a decompression tool, which actually means “output to directory m”, not “using optional feature om”. We did that instead of taking a few extra minutes to define a variable somewhere and build the command string. Stupid, stupid. We did it months ago, too; I looked right at it when we did, with both eyes. I know better than to do that, and I did it anyway. My eyes skittered over that line for hours without latching onto it, either; embarrassing.
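For what it’s worth, the shape of the fix looks roughly like the Python sketch below. This is not the actual code from the project. The function names, the `unpack-tool` command and its argument order are assumptions for illustration; the only details taken from the story are the short directory prefix, the `-o<dir>` style of output switch, and the 96-character limit.

```python
import tempfile

# Limit quoted in the chat log. It applies to the munged DOS-style path, so
# checking the length of the Linux path is only an approximation.
MAX_PATH_LEN = 96


def make_work_dir(base=None):
    """Create a fresh, private work directory with a short name.

    mkdtemp() picks a unique name and creates the directory in one step, so
    two builds are never handed the same one. base=None means the system
    default temporary directory.
    """
    return tempfile.mkdtemp(prefix="m", dir=base)


def unpack_command(archive, work_dir, tool="unpack-tool"):
    """Build the argument list for a hypothetical decompression tool.

    Assumes the tool takes its output directory as "-o<dir>", which is what
    the hardcoded "-om" was doing with the directory "m".
    """
    return [tool, archive, "-o" + work_dir]


def fits_path_budget(work_dir, deepest_relative_path):
    """Rough check that the longest path in the tree stays under the limit."""
    return len(work_dir) + 1 + len(deepest_relative_path) < MAX_PATH_LEN
```

Each build calls `make_work_dir()` once and passes the result to everything downstream, so the directory name lives in exactly one variable instead of being scattered through command strings. The caller is responsible for deleting the directory afterwards, for example with `shutil.rmtree`.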

10:47 < mhoye> Anyway, rookie mistake solved.
10:47 < mhoye> But the lesson here is that knowing whether or not 2+ people will be using a program at the same time is a profoundly important design decision.
10:48 <@humph> so true
10:50 < jbuck> mhoye: 0, 1 or infinity users :)
10:51 < mhoye> And nobody gets paid for zero, and it's rare to get paid for one, yeah.
10:58 < mhoye> Man.
10:59 < mhoye> Git commit and git push have never felt this good.

One more reminder that “works for me” isn’t worth all that much. I’m going to chalk this one up to experience, and man, I’m glad this week is nearly done.

8 Comments

  1. Alex Rootham

    _it’s rare to get paid for one,_ – I wonder if the rabid adoption of virtualization will change this.

  2. Alex Rootham

    … ok that comment probably needs explanation: if virtual servers are easy-come easy-go, then why would you ever have to share virtual servers? Only if there’s something you fundamentally have to share on the server, which means only in cases where it _has_ to be a multi-user paradigm in the first place.
    …. and now I’m scaring myself by extrapolating to the idea that every virtual server should be single-tasked (task in the sense of general thing to do, not those things that block on I/O)

  3. mhoye

    When I say only one user, I really mean “only one client”. But yes, it’s abundantly clear now that servers should be virtualized and single-purpose whenever it’s sanely possible to do so.

    Servers are the new services.

  4. Mike Beltzner

    Good analogy on the race condition.

  5. Mike Kozlowski

    If that’s the software stack you need to use to solve a problem, I have to think it might be better just to leave it unsolved. I mean, at least that way you only have one problem.

  6. Mike Hoye

    If it was just my problem, absolutely. But if it’s a problem lots of people will pay me to solve for them, that’s different.

  7. Oration

    My poor never-actually-fix-anything brain just started working out ways to work around this with chroot()ing. Damn you.

    And there’s very, very good money to be made as a conslutant in taking works-for-one and turning it into works-for-two. And if you do a good job at that, you’ll make better money when they call you back to make it works-for-three and works-for-ten.

    Trust me, I know.

    I’ve seen consultants refuse to go beyond a certain number of little-bit-more contracts because they just got too bored.

  8. Rich Y.

    Everybody loves a race.

    War story with JavaScript and a Rails point update:
    http://www.kalzumeus.com/2011/11/17/i-saw-an-extremely-subtle-bug-today-and-i-just-have-to-tell-someone/

    Even if you invest the time to do it the right way, they will still get you, you just get better odds. I think it’s one of the laws of thermodynamics.