regex coverage

I’m slowly incorporating more automated testing into my development workflow these days. But there are some holes in the available development tools that make it hard to make that a reality through all my code.

For example, I have some slightly tricky regular expressions, embedded in XPath expressions. These are in code that tries to heuristically recognize certain constructs within a class of human-generated documents and do special processing on them. Developing a heuristic presents special problems, for me, at least, in that the specification is vague and can only be refined by iterative feedback based on data from the field. I get wary about these things… so the more testing I can do, the better. I also would like to ensure that my tests are covering the code I’m writing. In this case, the regexes and XPath expressions are code just as much as the Java and XSLT are.

I asked on Stack Overflow whether there are tools out there for measuring code coverage for regexes. That’s only a subset of my overall problem, but I thought it was interesting enough to see what might be out there. So far, nobody has pointed me to a tool that specifically does that.

I played around for an hour with an automaton library and a graph library to try to visualize coverage of a DFA implementation of a regex against a set of test strings.

DFA coverage graph

Clearly not great, but the approach might yield something if someone (probably not me) were to refine it. I had some fun with it and felt like I was getting something from the process.

Another approach that I thought of while working with this was to generate exemplars from the regex (since that’s pretty easy), verify by hand that they fit the specification, then use those as a starting point for test cases. That might seem a little backward, but it does a couple things: during the verification step, you’re learning whether your regex really recognizes what you expect it to, and once you’ve got your list of verified test strings, then later changes to the regex might cause useful test failures. If you generate ‘enough’ exemplars, then you know you have test coverage.

How you’d define ‘enough’ is a non-trivial question. In my experiment above, I was looking at which states and transitions of the DFA were covered. Those are at least two levels of coverage you could look for, but it seems that some sort of path coverage would be nice, too. There are a couple projects on the web that generate strings from regexes: Xeger and Rex. They both use a randomized approach, so they aren’t going to guarantee any specific sort of coverage. And indeed, it may be that for a given regex that’s so complicated it needs test-coverage-measurement, the set of inputs you’d need to truly cover it would be crazy-big.

Which leads me to another thought: if you are using XPath or regex that complicated, maybe it’s not worth it. I mean, it’s all cool and clever to make one (220-character long) expression that does something complicated, but if you can’t really feel confident about how well it works by inspection, then it’s probably too dense to be readable and maintainable. On the other hand, the XSLT compiler can probably do a more efficient job of finding the nodes in question than some hand-written Java or something, and if you want to avoid mixing other languages in with your XSLT process… It’s a puzzle.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.