Heuristics and bestiaries

I’ve worked on more than what I consider my fair share of projects where I had to deal with crazy input. By crazy input I mean input that fails to meet the basic criteria of the standard to which it’s supposed to conform, or for which there is no standard at all, or in which standard elements are mashed together in unanticipated ways to achieve non-standard effects, or some combination of those.

I suppose that if you dig deep, you’ll find that most of the data in the world is crazy in this sense. So despite my distaste for it, I’ll probably never get away from it. Still, I need to whine about it occasionally. But in addition to whining about it, I’ll present a little practical advice on dealing with it.

The first and most obvious reality in dealing with crazy input is that some heuristic methods will have to be used. For example, in some HTML input, I had to deal with this little number, which, in some documents, occurred between every two paragraphs (well, divs, because these documents don’t use p tags much at all):
<div style="margin-top: 6pt; font-size: 1pt">&nbsp;</div>

These aren’t really paragraphs. Even if you call a div a paragraph, they’re still not paragraphs; they’re just there for spacing. So I have a rule that says something like “If a paragraph is preceded by a paragraph that is essentially empty and has a font size less than 5, remove the empty paragraph and change the spacing on the current one to get the same spacing effect”.
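A minimal sketch of that rule in Python, to make it concrete. The `Block` type, its field names, and the helper names are all hypothetical, and I'm assuming the "5" in the rule means points. Approximating the spacer's height by its font size is also an assumption; real code might need to account for line height.

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for a parsed paragraph or div."""
    text: str
    font_size_pt: float
    margin_top_pt: float = 0.0

# Threshold from the rule; the unit (points) is an assumption.
SPACER_MAX_FONT_PT = 5.0

def is_spacer(block: Block) -> bool:
    # "Essentially empty": nothing but whitespace, counting non-breaking spaces.
    essentially_empty = block.text.replace("\u00a0", " ").strip() == ""
    return essentially_empty and block.font_size_pt < SPACER_MAX_FONT_PT

def collapse_spacers(blocks: List[Block]) -> List[Block]:
    """Drop spacer blocks and move their vertical space onto the next real block."""
    result: List[Block] = []
    pending_space = 0.0
    for block in blocks:
        if is_spacer(block):
            # Assumed height of a spacer: its top margin plus its font size.
            pending_space += block.margin_top_pt + block.font_size_pt
            continue
        if pending_space:
            block = replace(block, margin_top_pt=block.margin_top_pt + pending_space)
            pending_space = 0.0
        result.append(block)
    # A spacer with no paragraph after it is simply dropped.
    return result
```

Given a real paragraph, then a 1pt spacer with a 6pt top margin, then another real paragraph, this returns the two real paragraphs, with the second one carrying the spacer's space as extra top margin.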

That’s not the best/worst example, but it gets the point across. Heuristics will be necessary. That implies two things: the overall code structure must be adaptable to the addition of heuristics, and I’ll need a bestiary to test the code.

It’s part of the software developer’s mindset to try to neatly partition the entire universe into non-overlapping subsets, then write chunks of code to deal with each partition separately. The introduction of heuristics into such a beautiful scheme will cause some pain. In the beautiful world, I’d have code that says “it’s a paragraph, let’s do the paragraph thing with it”. In a world laden with heuristics, I have code that says “it’s a paragraph, but let’s see if it’s _really_ a paragraph, then we’ll either do the paragraph thing or do some wildly different thing”. I guess it’s less the code structure that has to be adaptable than my mindset.
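One way to keep the code open to new heuristics is to run each block past a list of checks before falling through to the normal handling. The sketch below is only an illustration of that shape, not the code from any real project; every name in it (`HEURISTICS`, `spacer_heuristic`, `handle_block`, the dict keys) is made up for the example.

```python
from typing import Callable, List, Optional

# A heuristic looks at a block and returns an outcome if it applies, else None.
Heuristic = Callable[[dict], Optional[str]]

def spacer_heuristic(block: dict) -> Optional[str]:
    # "Is it really a paragraph?" Here: empty and tiny means it is only spacing.
    if not block["text"].strip() and block["font_size_pt"] < 5:
        return "spacing"
    return None

# New heuristics get appended here; order matters when they overlap.
HEURISTICS: List[Heuristic] = [spacer_heuristic]

def handle_paragraph(block: dict) -> str:
    # The "beautiful world" path: do the paragraph thing.
    return "paragraph"

def handle_block(block: dict) -> str:
    for heuristic in HEURISTICS:
        outcome = heuristic(block)
        if outcome is not None:
            return outcome  # some wildly different thing
    return handle_paragraph(block)
```

The first heuristic that claims a block wins, which makes the overlap between heuristics explicit in the ordering of the list rather than hidden in nested conditionals.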

Regardless of the flexibility of my mindset or my code, though, heuristics, by their nature, do not neatly partition the universe. They leave some things out, they overlap, and/or they tangle together in increasingly strange ways. I’ll never remember, when it comes time to add some new code, all the situations that got me to this point or all the ways that things can go wrong.

I’m not yet a full convert to test-driven development, but when dealing with beastly input, I consider a bestiary to be quite necessary. By that I mean a big set of unit tests, with a perfect specimen of each of the beasts I’ve encountered, each named after the ticket in the ticket-tracking system that brought it to me. I had one project in the past where I should have created a bestiary but didn’t, and that project was one of the worst disasters in my professional life. Another one like that and I would have traded my keyboard in for a shovel and started a new career…
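A bestiary can be as simple as a table of specimens keyed by ticket ID, run through one test. Here is a rough sketch using Python's standard `unittest` module. The ticket IDs are invented, and `remove_spacer_divs` is a deliberately crude, hypothetical stand-in for whatever cleanup code is actually under test.

```python
import re
import unittest

# Hypothetical stand-in for the real cleanup code: strips 1pt spacer divs.
SPACER_DIV = re.compile(
    r'<div style="[^"]*font-size:\s*1pt[^"]*">(?:&nbsp;|\s)*</div>'
)

def remove_spacer_divs(html: str) -> str:
    return SPACER_DIV.sub("", html)

# The bestiary: invented ticket ID -> (specimen, expected output).
BESTIARY = {
    "TICKET-101": (
        '<div>A</div>'
        '<div style="margin-top: 6pt; font-size: 1pt">&nbsp;</div>'
        '<div>B</div>',
        '<div>A</div><div>B</div>',
    ),
    # A beast that must be left alone: a real paragraph with a normal font size.
    "TICKET-102": (
        '<div style="font-size: 10pt">Real text</div>',
        '<div style="font-size: 10pt">Real text</div>',
    ),
}

class BestiaryTest(unittest.TestCase):
    def test_specimens(self):
        for ticket, (specimen, expected) in BESTIARY.items():
            # subTest labels any failure with the ticket that produced the specimen.
            with self.subTest(ticket=ticket):
                self.assertEqual(remove_spacer_divs(specimen), expected)
```

Run it with `python -m unittest` against the file. When a new ticket arrives, the specimen goes into the table first, so every later heuristic gets checked against all the beasts that came before it.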
