The bumps on the road to the future.

Kozlowski is probably right, even if what he has to say kind of sucks. On a vaguely related note, well, read on.

A reinstall, with the subsequent Tweaking To Get Things Just Right, is largely what brought on this little diatribe.

One of the things that strikes me about the World Of Computers is how much of it is make-work. Not make-work in the more common “hiring an idiot relative” sense, but certainly in a “reimplement existing stuff from scratch” sense. Working on a couple of toy projects and being surrounded by other toy projects as I am, I see this a lot, and if some conversations I’ve had with actual professional developers are any indication, it’s pandemic.

One of the toy projects I’m helping out with right now is Ben’s OC-Transpo scheduling project, and my humble contribution is going to be to take the data that he has to work with, currently in a bizarre variety of alien formats, and convert it into YAML, because a bunch of tools that I’d rather not have to, and frankly am probably not smart enough to, reinvent already exist to manipulate that data in a number of languages. I’m doing the same thing with the header-slash-data files in JWZ’s xkeycaps for the same reason – so that I end up with something a little more portable and hopefully a lot less brittle than anything home-rolled.
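To make that concrete, here’s a sketch of the kind of round trip I mean, using PyYAML; the route and stop fields are invented for illustration, not anything from Ben’s actual data:

```python
import yaml  # PyYAML, one of those already-written, already-tested tools

# A hypothetical schedule record; the field names are made up for
# illustration, not taken from the real OC-Transpo data.
route = {
    "route": 95,
    "direction": "eastbound",
    "stops": [
        {"name": "Baseline", "times": ["06:10", "06:25", "06:40"]},
        {"name": "Mackenzie King", "times": ["06:32", "06:47", "07:02"]},
    ],
}

text = yaml.safe_dump(route, default_flow_style=False)
print(text)

# The round trip costs one line each way, and somebody else already
# wrote and debugged the parser.
assert yaml.safe_load(text) == route
```

Every language with a YAML library gets that parser and emitter for free, which is the whole argument.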

I mean, that work’s been done. It’s been tested, other people who are probably smarter than I am rely on it. Do I need a new data format? Probably not. Would I need to reimplement all the related tools that come with it? Probably. Is there a limit to the number of parsing, searching and sorting algorithms I feel like reimplementing in my life? You’re damn right there is.

And yet, how often does this happen? Apparently, all the time. JWZ recently posted a particularly egregious example, but just trying to get this machine set up properly has had me messing with a dozen other microformats that, I’m sure, all need to be parsed in their own special way because, hey, that’s a great way for programmers to be spending their time.

As far as I can tell, the only valid reasons you’d have for inventing your own data format are:

  1. No established format exists that can handle your data (which I’m going to call the “wildly unlikely” option), or

  2. You don’t want anyone else to be able to work with your data, known as “vendor lock-in” in most circles.

I guess you could just be doing all this with C/C++ and not feel like looking around for relevant libraries, and I guess that’s OK; once you’ve decided you’re just going to bang rocks together, it’s hard to get that strictly wrong.

Of course, this is just data, small fish. Simple. But people are hard at work reimplementing codecs, compilers, even entire infrastructures with the same general mindset, and the result is the same brittle incompleteness writ varying degrees of large.

I have only my impressions here, but I read some blogs, I read some news, and I get a sense that the corporate interests of the world, whom free-software advocates typically refer to as “the bad guys”, have pretty much decided that it’s time to stop fucking around. And on the desktop at least, the Free Software side of things is, if Havoc’s assessment is any indication, hosed for the foreseeable future. And given that it’s 2004 and I can’t reliably cut and paste shit between applications if it’s any more complicated than seven-bit ASCII, I’m sorely tempted to believe that.

I tell you, when I read about how Lisp machines had garbage collection back in the late ’70s, I want to cry.

That and, apropos of nothing, I think that a good way to learn what to avoid when you’re programming is just to pay attention to the kind of things that piss old hackers off.

10 Comments

  1. Mike Kozlowski

    It amuses the hell out of me that at the same time you’re inveighing against pointless proliferation of data formats, you’re using a data format that was invented because some people decided XML (which is probably the world’s most universally supported data format by now) wasn’t efficient enough for them.

    I suspect that imaginary efficiency needs are behind most of the non-standard data formats. “Oh, that provides more generality than we need and requires us to use this huge library, why not just use this tightly-crafted, easily-parseable format?”

  2. Mike Hoye

I’m against pointless one-off data formats, yeah, but I’ve got no problems picking one that isn’t XML as long as the tools are there. In this case my two choices aren’t XML or my-own-private-silliness; they’re just format-with-tools-A v. format-with-tools-B. And for what I’m doing, B makes a lot more sense. Which is kind of the point of this whole exercise; find an established format which makes sense for what you’re doing and run with it.

For all its wondrous utility, XML fails two pretty basic requirements for config files: easy human-eye read/writeability, and, if you’re using a strictly-compliant parser, any kind of fault-tolerance, the complete and deliberate absence of which is just a mind-blowingly awful way to live. I dare you to take a look at GNOME’s XML config files v. the nice name-value-pair KDE config files and then tell me that the XML is easier to work with when things need to be fixed.
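    A quick illustration of what that name-value style buys you, sketched with Python’s stock INI parser; the section and key names are invented, but they’re roughly in the KDE flavour:

```python
import configparser

# Roughly KDE-style: sections and Name=Value pairs, readable and
# fixable with any text editor. The section and keys are invented.
ini_text = """\
[General]
Font=Monospace 10
ShowToolbar=true
"""

cfg = configparser.ConfigParser()
cfg.read_string(ini_text)
print(cfg["General"]["Font"])                    # Monospace 10
print(cfg["General"].getboolean("ShowToolbar"))  # True
```

    When a value gets mangled, you can see the problem and fix it with your eyes and a text editor, no schema or special-purpose tooling required.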

    Ultimately, though, it doesn’t really matter what the data format is; it could be anything from bsd-mbox to berkeley-db to the Xresource format, whatever that’s called. The point I’m trying to make is that the Right Thing To Do is reuse some, any established format that’s close enough to what you need rather than building your own.

  3. Mike Kozlowski

    Well, I think the draconian error-handling is a plus, inasmuch as if I fucked up the syntax, I want to know about it, rather than try to figure out what arcane semantics it’s ascribing to my fucked-up syntax.

    To the rest, the main reasons that I prefer XML for config files from a user perspective (as opposed to a developer perspective) are that: 1) I’ve got the tools (XML Spy at times when someone else was paying the software bill; Emacs with nxml right now (which is really nice if you’ve got a RELAX NG schema); and VS.Net if I want (which is really nice if you’ve got a WXS schema)) to edit the file on a syntactic level above “this is text”; and 2) I know the freakin’ syntax rules already. If I see an XML file, I know what I can and can’t type, I know how to put in comments, I know what to do if I need to enter special Unicode characters, I know how to escape otherwise-special characters. If I see a non-XML file, I need to figure it all out again by reading the docs (if they mention it) or trial and error (if they don’t).
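    For instance, every one of those fixed rules shows up in even a trivial document; this sketch uses Python’s bundled parser, and the element names are made up:

```python
import xml.etree.ElementTree as ET

# One tiny document exercising the rules that never change from file
# to file: comments, escaped special characters, numeric Unicode
# references. The element names are invented.
doc = """<?xml version="1.0" encoding="UTF-8"?>
<config>
  <!-- comments always look exactly like this -->
  <greeting>fish &amp; chips, caf&#xE9; style, &quot;quoted&quot;</greeting>
</config>"""

root = ET.fromstring(doc)
print(root.find("greeting").text)   # fish & chips, café style, "quoted"
```

    The same knowledge transfers to every XML file you’ll ever open, which is the point Mike is making.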

    I’m not religious on the issue, but my belief in 2004 is that if you’re going to create a data file, you need to have a solid affirmative reason why XML won’t work for you before going with anything else; and if that reason is “efficiency,” you’d better have some fucking compelling numbers.

  4. Ben Zanin

    I don’t want to wade into some kind of YAML-vs-XML or XML-versus-the-slavering-horde flamefest, but there are a couple of points I’d like to address. Note that the context in which all of my arguments are couched is the OC Transpo project. I am not making generalized statements – some may indeed be, but I will not assume the onus of defending them as such.

    Draconian error handling: big minus in this case. A good chunk of the source data comes from just over 6200 semi-regular screen-scraped HTML pages. If there’s an obscure error somewhere in the aggregate file, I want it reported, but I really don’t want the parsing to halt. I’ll take a batch report of errors and work with that instead of incrementally fixing one markup error at a time.
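    The shape of what I want is something like this (the comma-separated record format here is invented for illustration; the real source data is scraped HTML, not CSV):

```python
# Collect every problem, keep parsing, report in one batch at the end.
# The record format here is invented for illustration.
records = [
    "95,Baseline,06:10",
    "97,,06:32",                # missing stop name
    "route?,Albert,06:45",      # mangled route number
    "85,Bronson,07:00",
]

parsed, errors = [], []
for lineno, line in enumerate(records, 1):
    fields = line.split(",")
    route, stop, time = (fields + ["", "", ""])[:3]
    if not route.isdigit():
        errors.append(f"line {lineno}: bad route number {route!r}")
    elif not stop:
        errors.append(f"line {lineno}: missing stop name")
    else:
        parsed.append({"route": int(route), "stop": stop, "time": time})

print(f"{len(parsed)} records parsed, {len(errors)} problems:")
for problem in errors:
    print("  " + problem)
```

    One pass, one report, and the good records still make it through; a draconian parser would have stopped at the first bad line.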

    XML tools: XML may be great if you’re working in a pipeline or round-tripping to and from the same application, but it’s less fun to take semistructured data and turn it into clean and valid XML. With this project, fun counts.

    If YAML were a hacked-together once-off data format like Mork or any of dozens of config-file languages, I’d agree with your objections to it. I’m also in complete agreement with your aversion to custom formats – $DEITY knows we have enough of those. In this case, though, YAML is a fine fit: a serialization format and no more. Once the tools argument disappears, there’s no extra benefit to using XML. The file format will be used only to hoist bits from disk to data structure.

    I like XML. I like YAML (though I’m admittedly less familiar with it). This is just a case of picking the easiest tool for the job.

    (and Mike, just in case I haven’t thanked you before, thank you for your help with this little sally. I really appreciate it)

  5. Mike Kozlowski

    I’d misread things so as to believe you were using this for a conf file; if you’re using it for an internal data format, and it’s not especially user-facing, then programmer’s convenience/efficiency nearly always wins.

    As for the draconian error handling, there’s nothing in the XML spec that prevents a parser from handing you a huge pile of errors all at once; what it can’t do is keep parsing and try to fix things up, which is what makes it draconian. I’m not sure which parsers actually do that, though; I haven’t paid attention to it.

    And hey, no problem with the help, but I don’t… oh, the other Mike.

  6. Mike Bruce

    I’ll just drive by and add another Mike, here.

    My opinion w/r/t config files is that you should follow the local custom. If you’re writing some swizzy Java web app, you should probably go with XML. If you’re writing a GNOME app, you should use gconf (which usually uses an XML back end, but doesn’t have to) and not worry about files at all.

    To put that another way: the user of your app shouldn’t be surprised by your config file format, unless you have a really good reason for it.

    As far as data files, I agree with everybody else.

  7. shaver

    I was going to comment in here, and then Ben mentioned Mork, so now I have to go scrub with lye.

    (Don’t Java apps use .properties files?)

  8. Mike Hoye

    Yeah, I meant to ask you about that: how the hell did that get in there?

  9. Ben Zanin

    The article by jwz that you linked to up top detailed his attempts (and those of others) to parse the Netscape history file, which is stored in the Mork pseudo-database file format [apparently] designed by David McCusker. I read all the parsing attempts and recursed down into the format docs, Usenet postings and Mozilla documentation. They all referred to that awful format as “Mork”.

    Is there some other horrible data format known by that name?

  10. Mike Hoye

    Nope, that’s what we’re talking about.