Brown M&Ms

May 27, 2023

I gave a short presentation yesterday about understanding the general health of a software engineering organization. It’s a longish presentation – a thirty-second version might read “Technology last. Start with risk management, change management and disaster recovery, look at staff (training, process, culture, turnover) next, and audit the actual tech situation last”. But I wanted to share the “Brown M&M” questions with you.

If you haven’t heard the Brown M&Ms story, it’s detailed here, but the short version is that Van Halen had one seemingly-throwaway line in their stadium contracts saying there had to be a bowl of M&Ms backstage, but all the brown ones needed to be removed. Failure to meet the arbitrary “no brown M&Ms” rule of these spoiled, self-indulgent rock stars was supposedly grounds to cancel the whole concert and, the story went, you could expect them to trash the venue on their way out the door. But, of course, that wasn’t the whole story.

That one seemingly-spurious demand was a contract canary. Van Halen was the first band to bring a huge show to second- and third-tier markets, specifically second- and third-tier facilities that might not be able to handle the electrical demands or pyrotechnics safety standards or even the weight of the stage. The Brown M&Ms clause was an easy way for the band to check if the venue was paying attention to the details of the contract; if they found no M&Ms, or one brown M&M, they’d know for sure that they had to re-check everything to make sure nobody got injured or killed.

The Brown M&Ms I mentioned, when I’m sizing up a software shop, are pretty straightforward.

  • What’s the most interesting thing you’ve discovered in a retro recently, and what did you do about it?
  • How do you manage pager duties?
  • How do you validate a rollback?
  • What would you fix next if you had the resources?
  • Do you use agile processes, and can you describe them?

You might not have great answers to all of these, but organizations that don’t have decent answers to any of them are in trouble.

The logic behind the “interesting retro” question is: does your organization do retrospectives at all? Do managers and executives – does anyone – read them, and what happens then? Which is another way of saying, do you have a culture that champions learning and continuous improvement as a matter of course, or do you have … that other thing?

The point of the pager question is really a management question. Functionally, obviously, a pager is a device that activates a human when a piece of software can’t solve its own problems. As a question, it’s a proxy for “Do you really know who owns all your services, and what the real degree of urgency is of supporting them when they fail? Do you understand – do you really in-your-bones understand – that those people need sleep, rest, and time off?”

I’ve worked at places that hand out pagers without mature processes around on-call rotation, hand-off, sleep planning and time-off-for-real. Those companies destroy people.

Real talk here: if you’re running a service that requires a human to carry a pager, and the person who wrote that software isn’t carrying that pager, you’re not “doing DevOps”, you’re doing service contract management in a terminal window. If you want to drive service reliability, the people who write the service need to be the same people getting dragged out of bed at Oh My God O’Clock on Christmas morning when it falls over. People write very different code when they know they’ll be the ones responsible for all those sharp edge cases, and aligned incentives are magic.

The “how do you validate a rollback” question is a fast way to do a deep dive into change management practices. As in: do you have them at all, how much do you trust them, and how do you gain confidence that you can recover in the event of process failure? Most software orgs now have something approximating mature change management, or at least had that forced on them by the popularity of GitHub in the forward-motion sense. Fewer than you’d hope have actually planned for, much less tested, partial-completion or abortive-recovery scenarios.
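For the sake of illustration, here is roughly what “validating a rollback” can look like at its very smallest. This is a sketch, not a recommendation of any particular tool: it assumes an in-memory SQLite database, and the two-step change, the `user_prefs` table and every function name here are hypothetical. The idea is just to snapshot the state, apply the change (optionally killing it halfway through), roll back, and check that you landed where you started.

```python
import sqlite3


def snapshot(conn):
    """Capture schema plus user rows so two states can be compared."""
    schema = conn.execute(
        "SELECT type, name, sql FROM sqlite_master ORDER BY name"
    ).fetchall()
    users = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()
    return schema, users


def apply_change(conn, fail_midway=False):
    """Hypothetical two-step change that can be made to die between steps."""
    conn.execute("CREATE TABLE user_prefs (user_id INTEGER, theme TEXT)")
    if fail_midway:
        raise RuntimeError("simulated failure between steps")
    conn.execute("INSERT INTO user_prefs SELECT id, 'default' FROM users")


def roll_back(conn):
    """Rollback written to be safe from a complete or a partial change."""
    conn.execute("DROP TABLE IF EXISTS user_prefs")


def rollback_restores_state(fail_midway):
    """Return True if rolling back leaves the database as it was before."""
    # isolation_level=None means autocommit, so nothing is hidden in an
    # open transaction; the rollback has to do all the work itself.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    before = snapshot(conn)
    try:
        apply_change(conn, fail_midway=fail_midway)
    except RuntimeError:
        pass  # the change died partway; the database is now in a partial state
    assert snapshot(conn) != before  # sanity check: the change did something
    roll_back(conn)
    after = snapshot(conn)
    conn.close()
    return after == before


if __name__ == "__main__":
    print("clean rollback ok:", rollback_restores_state(fail_midway=False))
    print("partial rollback ok:", rollback_restores_state(fail_midway=True))
```

The `fail_midway=True` case is the one that matters: it is the partial-completion scenario, and it only passes because the rollback was written to tolerate a half-applied change.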

“What would you fix next if you had the resources” might as well read “does your org have a shared understanding of its backlog and priorities”, but nobody will say “lol no” to that question. But if you get a bunch of different answers to this from different people or teams, it’s very likely your organization has leadership and communications gaps that need urgent attention. Even “how would you make that decision” is informative; can you draw a straight line from organizational goals to the top of your backlog? What would you need to have, for you to have that?

Finally, the agile question is the one I’m most proud of, because agile hasn’t meant anything concrete for a long time beyond “who needs a plan when you can get interrupted and reorged at any time at the whim of managerial fiat.” Does this org actually have an ethos, much less a plan, for managing people, projects and resources at a high level? Maybe, God help us all, one that will let people focus on hard problems and get them all the way solved? Or are we just playing management-shibboleth buzzword bingo with our people, and we don’t really know how or why anything happens?

Anyway, I hope you find some of that useful. Thanks for coming to my TED talk.

(Remember when we thought TED talks were important? Wasn’t that weird?)