November 26, 2020
As always, I am inexplicably carrying a deep-seated personal grudge against anyone incurious enough to start with “because in C” when you ask them why computers do anything, but bear with me here. I know that a surprising amount of modern computing is definitely Dennis Ritchie’s fault, I get it, but even he existed in a context and these are machines made out of people’s decisions, not capricious logic-engine fae that spring to life when you rub a box of magnets with a copy of SICP. Quit repeating the myths you’ve been spoon-fed and do the research. A few hours in a library can save you a few decades in the dark.
Anyway, on a completely unrelated note: Today in wildly-unforeseeable-consequences news, I have learned where null-terminated strings come from. Not only are they older than Unix, they’re older than transistors and might be older than IBM. It turns out at least one of the roads to hell is paved with yesterday’s infrastructure.
“Coded Character Sets, History and Development” by Charles E. Mackenzie is an astonishing technical artifact, approximately 75% mechanical tedium and 25% the most amazing deep-cut nerd history lesson I’ve seen in a while. I’ve gone on about how important Herman Hollerith and his card-readers were in the history of computing, but this document takes that to a much lower level, walking meticulously through the decision-making processes by which each character’s bit sequence was determined, what constraints and decisions arose and how they were resolved, and where it isn’t boring as hell it is absolutely fascinating.
As an aside, it continues to be really unfortunate how much historical information about the decisions that have led up to modern computing is either hidden, lost or (I suspect most commonly) just ignored. There is a lot to learn from the Whys of so many of these decisions, lessons about process that transcend the implementation details, lost in favour of easily-retellable falsehoods. And there’s an entire lost generation of programmers out there who found ESR’s version of the Jargon File before they found their own critical faculties, never quite recovered from that, and will never learn those lessons.
(I mention it because I’m going to be talking about EBCDIC, and if your gut reaction to seeing that acronym is a snide dismissal for reasons you can’t really elaborate, I’m talking about you. Take it personally. Like a lot of Raymond’s work and indeed the man himself, his self-interested bastardization of the Jargon File is superficially clever, largely wrong and aging very badly. Fortunately the Steele-1983 version is still out there and true to its moment in history, you’re still here to read it and a better future is still possible. I believe in all but one of you.)
On a less sarcastic note, there is a lot in here.
I didn’t realize quite how old ASCII is, for one. I mean, I knew the dates involved but I didn’t grasp the context, how completely disconnected the concerns of modern computing were from the creation of the ASCII standard. The idea of software, of decision-making implemented in software as a relevant consumer of these encodings, is almost completely a non-issue, mentioned in passing and quickly brushed off. Far and away the most important concerns – as with EBCDIC, BCDIC and PTTC before it and dating back to the turn of the century – were about the efficiency of collating punch cards and fast line printing on existing tooling.
Printing, collating and backwards compatibility. In terms of importance, nothing else even came close. The idea of “code” as an information control-flow mechanism barely enters into it; that compilers exist at all is given a brief nod at the start of chapter 25 of 27, but otherwise it’s holes-in-cardboard all the way down. You can draw a straight line back in time from Unicode through a century of evolving punchcard standards all the way to the Hollerith Census Tabulator of 1890; Hollerith has cast an impossibly long shadow over this industry, and backwards compatibility with the form, machinery and practices of punch cards is entirely the name of this game and has been forever.
It’s also amazing how many glyphs in various degrees of common use across various languages and systems were used, reconsidered and discarded for some wild variety of reasons as encodings evolved; the “cent” symbol giving way to a square bracket, various useful symbols like logical-not getting cut without any obvious replacements. Weird glyphs I’ve never seen on any keyboard in my life getting adopted then abandoned because they would have caused a specific model of long-established tape storage system to crash. The strangely durable importance of the lozenge character, and the time a late revision of BCDIC just … forgot “+”. Oops?
I had no idea that for a while there we were flirting with lowercase numbers. We were seriously debating whether or not computers needed a lowercase zero. That was a real thing.
Another thing I didn’t realize is how much of a dead end ASCII is, not just as a character set but as a set of practices that character set enables:
*char++, I’m looking at you and all your footgun friends. I was never much of a student, but am I misremembering all that time I spent sorting and manipulating strings with tools that have wound up somewhere on the “merely obsolete” to “actively dangerous” spectrum, unsafe and unportable byproducts of a now-senescent encoding that nobody uses by choice anymore?
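For anyone who hasn’t had the pleasure: this is a minimal sketch of the idiom that line is gesturing at, the textbook pattern behind `strcpy` and friends. The function name here is made up, but the loop body is the real, historical idiom:

```c
/* The classic copy-until-null idiom. Nothing in this loop knows
 * how big dst actually is: if src is longer than dst's buffer, or
 * was never null-terminated at all, the copy walks straight off
 * the end of the destination and keeps going. This is the shape
 * of an entire family of buffer-overflow exploits. */
void unsafe_copy(char *dst, const char *src)
{
    while ((*dst++ = *src++) != '\0')
        ;  /* copy bytes, including the terminator, until '\0' */
}
```

The only thing that ever stops that loop is the trailing null byte, which is precisely why an unterminated or attacker-supplied string turns it from a copy routine into a vulnerability.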
It’s really as though at some point, before about 1975, people just… hadn’t fully come to terms with the fact that the world is big. There’s a long chapter here about the granular implementation details of an industry struggling to come to terms with the fact that Europe and Asia actually exist and use computers, and that even if they didn’t, sorting human text is a subtle problem and encodings aren’t the place it’s going to get solved, only for that discussion to get shunted aside as the ghosts of compatibilities past shamble around the text rattling their chains.
There’s also a few pages in there about “Decimal ASCII” – basically “what if ASCII, but cursed” – a proposal with such powerful Let Us Never Speak Of This Again energy that almost no modern references to it exist, high on the list of mercifully-dodged bullets scattered throughout this document.
But maybe the most interesting thing in here was about how much effort went into sorting out the difference between blank, space, null, zero and minus zero, which turns out to be a really difficult problem for all sorts of reasons. And the most incredible part of that is this:
Null-terminated strings were “produced by the .ASCIZ directive of the PDP-11 assembly languages and the ASCIZ directive of the MACRO-10 macro assembly language for the PDP-10”, per Wikipedia, before manifesting themselves in C. But that’s not where they come from.
In fact, null-terminated strings existed long before C. Using an unpunched column in a Hollerith card (a null column, in a convention that apparently dates to the earliest punch-card collating and sorting machines) to signal that the card’s scan could terminate was a fast, lightweight way to facilitate punch-card re-use and efficient data entry and re-entry, back when that data was entered by punching it into cards.
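A toy model of that convention, to make the lineage concrete. The card layout and column values here are invented for illustration, but the scanning logic is the point: stop at the first null column, exactly the way `strlen` stops at the first null byte:

```c
#include <stddef.h>

#define CARD_COLUMNS 80  /* a standard Hollerith card is 80 columns wide */

/* Scan a card's columns left to right and report how many carry
 * data, stopping early at the first unpunched (all-zero) column.
 * Structurally this is strlen with an upper bound: the null column
 * plays exactly the role the null byte later plays in C strings. */
size_t card_data_length(const unsigned char card[CARD_COLUMNS])
{
    size_t i = 0;
    while (i < CARD_COLUMNS && card[i] != 0)
        i++;
    return i;
}
```

Swap the 80-column card for an unbounded `char` buffer, drop the length check, and you have the string-scanning convention C inherited, bound and all the safety it implied included.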
That is to say, we somehow built the foundations of what would become a longstanding security exploit vector decades before anyone could build the operating systems it could exploit. Long before the invention of the transistor, even.
It’s sort of amazing that anything ever works at all.