Monthly Archives: September 2010

General

XSLT exceptions

As of XSLT 2.0, and as-far-as-I-know-correct-me-if-I’m-wrong, there’s no native mechanism in the language for exception handling. (Update: although that’s still true, I should have looked at 2.1 or Saxon’s extension. Though I’m still going with this method because I don’t have 2.1 and I’m not using the EE version of Saxon.)

I have some stylesheets that attempt to do some processing on specified chunks of an input document, copying everything else unaltered. There are rare exceptional conditions that I can’t easily detect before I start producing output for a given chunk. These are rare enough that when I encounter them, all I want to do is cancel processing on this chunk and emit it unaltered. Some sort of exception handling is in order, but XSLT doesn’t help very much.

Here’s an example of the sort of scenario I’m talking about. Here’s an input document:
[xml]
<doc>
<block>some text, just copy.</block>
<!-- the following table should have B substituted for a -->
<table>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>b</td><td>a</td><td>c</td></tr>
<tr><td>b</td><td>c</td><td>a</td></tr>
</table>
<block>some more text, just copy.</block>
<!-- the following table should be copied unaltered because of the presence of an x -->
<table>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>b</td><td>a</td><td>x</td></tr>
<tr><td>b</td><td>c</td><td>a</td></tr>
</table>
</doc>
[/xml]

I want to look through each table and replace all cell values ‘a’ with ‘B’. However, if there’s an ‘x’ somewhere in the table, I want to just copy the table unmodified. I know that in this case, I could just do a tr/td[.='x'] test on the table to discover this condition. In the real case, though, it’s not so easy to test ahead of time for the condition.

Here’s some XSLT that doesn’t account for the exception:
[xslt]
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="table">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates mode="inner"/>
</xsl:copy>
</xsl:template>

<xsl:template mode="inner" match="td">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:choose>
<xsl:when test=". = 'a'">
<xsl:value-of select="'B'"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="."/>
</xsl:otherwise>
</xsl:choose>
</xsl:copy>
</xsl:template>

<xsl:template mode="inner" match="@*|node()" priority="-10">
<xsl:copy>
<xsl:apply-templates mode="inner" select="@*|node()"/>
</xsl:copy>
</xsl:template>

<xsl:template match="@*|node()" priority="-10">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
[/xslt]

The output of that is:
[xml]
<?xml version="1.0" encoding="UTF-8"?><doc>
<block>some text, just copy.</block>
<!-- the following table should have B substituted for a -->
<table>
<tr><td>B</td><td>b</td><td>c</td></tr>
<tr><td>b</td><td>B</td><td>c</td></tr>
<tr><td>b</td><td>c</td><td>B</td></tr>
</table>
<block>some more text, just copy.</block>
<!-- the following table should be copied unaltered because of the presence of an x -->
<table>
<tr><td>B</td><td>b</td><td>c</td></tr>
<tr><td>b</td><td>B</td><td>x</td></tr>
<tr><td>b</td><td>c</td><td>B</td></tr>
</table>
</doc>
[/xml]

(It did the substitutions in the second table too, which I don’t want.)

My current solution is to do this:

  1. Emit each table into a variable instead of directly into the output.
  2. If the exception occurs, emit an <EXCEPTION/> tag into that variable.
  3. After each table is processed, look through the variable for the <EXCEPTION/> tag.
  4. If the exception happened, copy the original table; otherwise, copy the contents of the variable.

Here’s the modified code and output:
[xslt]
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="table">
<xsl:variable name="result">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates mode="inner"/>
</xsl:copy>
</xsl:variable>
<xsl:choose>
<xsl:when test="$result//EXCEPTION">
<xsl:copy-of select="."/>
</xsl:when>
<xsl:otherwise>
<xsl:copy-of select="$result"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>

<xsl:template mode="inner" match="td">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:choose>
<xsl:when test=". = 'a'">
<xsl:value-of select="'B'"/>
</xsl:when>
<xsl:when test=". = 'x'">
<EXCEPTION/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="."/>
</xsl:otherwise>
</xsl:choose>
</xsl:copy>
</xsl:template>

<xsl:template mode="inner" match="@*|node()" priority="-10">
<xsl:copy>
<xsl:apply-templates mode="inner" select="@*|node()"/>
</xsl:copy>
</xsl:template>

<xsl:template match="@*|node()" priority="-10">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
[/xslt]

[xml]
<?xml version="1.0" encoding="UTF-8"?><doc>
<block>some text, just copy.</block>
<!-- the following table should have B substituted for a -->
<table>
<tr><td>B</td><td>b</td><td>c</td></tr>
<tr><td>b</td><td>B</td><td>c</td></tr>
<tr><td>b</td><td>c</td><td>B</td></tr>
</table>
<block>some more text, just copy.</block>
<!-- the following table should be copied unaltered because of the presence of an x -->
<table>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>b</td><td>a</td><td>x</td></tr>
<tr><td>b</td><td>c</td><td>a</td></tr>
</table>
</doc>
[/xml]

It works, but I’m still wondering if there’s a better approach…
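
For comparison, here’s roughly what the same idea looks like in a language that does have exceptions. This is only a sketch in Python using the standard xml.etree.ElementTree module, not something taken from my stylesheets; `CancelChunk`, `rewrite_cell`, `process_table` and `transform` are names made up for illustration. The shape is the same as the variable trick: build the result off to the side, and throw it away if the exceptional condition shows up.

```python
import copy
import xml.etree.ElementTree as ET


class CancelChunk(Exception):
    """Raised when a chunk turns out to be unprocessable partway through."""


def rewrite_cell(td):
    # Hypothetical per-cell rule mirroring the example: 'x' aborts, 'a' becomes 'B'.
    if td.text == "x":
        raise CancelChunk("found an x")
    if td.text == "a":
        td.text = "B"


def process_table(table):
    # Work on a copy so the original survives if processing is cancelled.
    result = copy.deepcopy(table)
    try:
        for td in result.iter("td"):
            rewrite_cell(td)
    except CancelChunk:
        return table  # roll back: emit the chunk unaltered
    return result


def transform(doc):
    # Snapshot the element list first, then swap each table for its
    # processed (or untouched) version.
    for parent in list(doc.iter()):
        for i, child in enumerate(list(parent)):
            if child.tag == "table":
                parent[i] = process_table(child)
    return doc
```

Parsing the sample document with `ET.fromstring` and passing it to `transform` should leave the second table untouched, same as the XSLT version.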

OK/Cancel

Whenever someone asks me a yes-or-no question, the first answer that pops to mind, though one which I rarely articulate, is always “well… no, and yes, and yes and no.”

Saxon and generate-id()

I’m using Saxon as my XSLT 2.0 processor in a project. The stylesheet relies on generate-id(), and my unit testing currently relies on string comparison of the output to expected output. Occasionally, the values that generate-id() returns change on me and I get annoyed.

After a little investigation, I believe that the first little bit of the ID comes from some internal index of XML documents that Saxon is keeping track of. Things like <xsl:include/> and document() calls can change these indexes. So there ya go.
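
One way to make the string comparison less brittle, sketched here in Python: rewrite the generated IDs in both the actual and the expected output to stable placeholders before comparing. The `d1e42` shape in the pattern is an assumption about what the IDs look like, and `normalize_ids` is a hypothetical helper, not part of Saxon.

```python
import re

# Assumed ID shape (e.g. "d1e42"); adjust to whatever your processor emits.
GENERATED_ID = re.compile(r"\bd\d+e\d+\b")


def normalize_ids(text):
    """Replace each distinct generated ID with a stable placeholder (id1, id2, ...)."""
    seen = {}

    def stable(match):
        # Number IDs by order of first appearance so cross-references still agree.
        return seen.setdefault(match.group(0), f"id{len(seen) + 1}")

    return GENERATED_ID.sub(stable, text)
```

Two outputs that differ only in the document-index part of their IDs should normalize to the same string. The catch is that ordinary text that happens to look like an ID gets rewritten too.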

The anatomy of a semantic mishap

Last night I threw out a bunch of my neighbors’ stuff.

It was an accident. And they recovered their stuff before it was truly lost. And I think this record will show that I did not act too strangely. But I still feel bad about it, and I’m going to be a bit more cautious in such situations in the future. And I like to ramble on uselessly about these sorts of weird corner cases in life, so here goes.

I am the unofficial trash-curber for my duplex (the neighbors I’m referencing are the people in the other apartment). I took on that role when I moved in, after a little discussion with the previous tenant in my apartment, and figuring that it’s easier to just be that guy than to try to work out some scheme to share the work.

Last night, the night before garbage day, I went to put stuff out to the curb. There was the one trash bin full. Closely adjacent to it, probably an inch away, were a computer case and one of those plastic storage bins. Adjacent to those were a sorta homemade-looking plastic/wood thingy and another storage bin. I got the flashlight and inspected more closely. One bin seemed to be mostly empty bottles and stuff. Don’t recall exactly what was in the other bin. Some of this stuff I had seen in a pile elsewhere in the carport for a few weeks.

Overall, my thought process was: the whole list above constitutes a ‘pile’ by virtue of transitive adjacency. On a scale from 0-10, 0 = obvious treasure, 10 = obvious trash, this stuff was all in the 4-9 range by my visual judgment. Nothing in the pile was something I hadn’t seen in trash piles or thrown out myself. Since I’d seen it piled elsewhere, and now it was all in the candidate trash pile, it was probably a final status transition from ‘maybe we need it and will take it into the house some day’ to ‘nah, it’s trash’. So, the candidate trash pile is now a certified trash pile, and I’ll move it all to the curb.

Later that night, I heard some bumping about outside, and when I looked out, all that stuff was gone from the curb. The scavengers in the neighborhood have been known to be fast, but not that thorough, so I kinda figured it must be the neighbors recovering the stuff. This morning, I heard the neighbors leaving, so I ran out and asked, and indeed, I had incorrectly judged the pile, but they got everything back.

This all makes me quite curious about the semantics of trash piles. Trash in general, too. In some web searches, I came across Purity and danger: an analysis of concept of pollution and taboo, which looks pretty interesting. See, I’m not so crazy for being somewhat fascinated by this stuff; some book author is too.

Reading open source

I haven’t done a lot of it, but I’ve done a little, and per my recent thoughts about the value of recording accumulated learning in the form of code, I’m wondering whether the wide availability of open source for so many types of software will make it a lot more prevalent and show a whole facet of the value of open source that hasn’t been addressed too often.

What I mean by ‘it’ is reading source code in order to learn about the field in which the code is used. With good code and skill at navigating and reading it, I bet a lot of answers to questions about a field can be found, along with related structures and details and exceptions that aren’t always captured effectively in books or expert answers. Hmm, I say.

“Software as capital”

This book is pretty interesting so far: Software as capital: an economic perspective on software engineering. He has a nice statement of bit rot in economic terms, for example.

I think one thing that really appeals to me about this book is that it’s helping paint one of the pieces of the puzzle of my life.

If I’m anything, I’m a creature of learning. Baetjer points out quite explicitly something I’ve been coming to realize slowly over my career: software development is, more than anything, a process of social learning. A software developer enters the world of the user and helps establish a process to draw out all the strange little bits of knowledge hidden in the corners of every sort of human endeavor. Code is a way to structure this learning so that it can be shared, studied, remembered, and, maybe most importantly, incrementally accumulated as those bits of knowledge come out.

When I was in school, I could never really take notes. I found that it was easier to pretend to take notes than to just sit there; partly, that kept me awake, partly, it was a social ritual. But it was never much of a learning tool. How could I engage with my notes any better than I could engage with the lecture?

But code is a different sort of record of learning. Maybe it’s that compilers and the actual execution of the code keep us more honest and force us to be more thorough. Maybe it’s the fact that the visual structure of a computer language on the screen fits its semantic structure a lot better than it does with human languages. Those aspects and the discipline that grows around them make it a lot easier to slowly build an effective learning repository that elicits the abstract structures of the knowledge while not losing any of the little details.

I love the feeling I get when I’m in the middle of a big software system… I’m not sure what’s a good analogy to help explain what that means; perhaps everyone’s familiar with the feeling of really knowing the geography of the city they live in. The feeling when someone asks you where something is, and half of your brain activates instantly and simultaneously with a complete map of the route from here to there, with alternate routes and associative connections to attractions and landmarks along the way and estimates of how long it will take to get there. The feeling that it’s inside you as much as you’re inside it. Those sorts of feelings arose only by accident in school; software development is a pretty reliable methodology for generating this very deep sense of understanding.

So surely that has something to do with why I continue to be a software developer…

ASZip bug, fix

In case you use the nice little library ASZip to create zip files from Flex, be aware that there’s a bug in the released code (v0.2), but that there’s a patch. The bug is described and a patch provided in Issue 1. The same bug causes Excel to not like the files (Issue 2).

Just thought I’d blog about this since I rediscovered and refixed the bug before realizing the issue tracker had a patch. I’m not sure why the patch isn’t checked in; apparently the author forgot about it or was otherwise prevented from working on the library more.

regex coverage

I’m slowly incorporating more automated testing into my development workflow these days. But there are some holes in the available development tools that make it hard to make that a reality through all my code.

For example, I have some slightly tricky regular expressions, embedded in XPath expressions. These are in code that tries to heuristically recognize certain constructs within a class of human-generated documents and do special processing on them. Developing a heuristic presents special problems, for me, at least, in that the specification is vague and can only be refined by iterative feedback based on data from the field. I get wary about these things… so the more testing I can do, the better. I also would like to ensure that my tests are covering the code I’m writing. In this case, the regexes and XPath expressions are code just as much as the Java and XSLT are.

I asked on Stack Overflow whether there are tools out there for measuring code coverage for regexes. That’s only a subset of my overall problem, but I thought it was interesting enough to see what might be out there. So far, nobody has pointed me to a tool that specifically does that.

I played around for an hour with an automaton library and a graph library to try to visualize coverage of a DFA implementation of a regex against a set of test strings.

[Image: DFA coverage graph]

Clearly not great, but the approach might yield something if someone (probably not me) were to refine it. I had some fun with it and felt like I was getting something from the process.
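
The core of the idea is small enough to sketch without the libraries. Below, in Python, is a hand-built DFA for a toy regex, a(b|c)*d, with a function that reports how many states and transitions a set of test strings touches. The DFA table and all the names are made up for illustration; this is not the code that produced the graph.

```python
# Hand-built DFA for the toy regex  a(b|c)*d  (hypothetical example).
START = 0
ACCEPTING = {2}
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 1,
    (1, "d"): 2,
}


def run(text, seen_states, seen_edges):
    """Run one string through the DFA, recording every state and edge touched."""
    state = START
    seen_states.add(state)
    for ch in text:
        edge = (state, ch)
        if edge not in TRANSITIONS:
            return False  # dead end: the string is rejected
        seen_edges.add(edge)
        state = TRANSITIONS[edge]
        seen_states.add(state)
    return state in ACCEPTING


def coverage(test_strings):
    """Return (state coverage, edge coverage, uncovered edges) for the tests."""
    seen_states, seen_edges = set(), set()
    for s in test_strings:
        run(s, seen_states, seen_edges)
    all_states = {START} | set(TRANSITIONS.values())
    missed = sorted(set(TRANSITIONS) - seen_edges)
    return (len(seen_states) / len(all_states),
            len(seen_edges) / len(TRANSITIONS),
            missed)
```

With the tests `["ad", "abd"]`, every state is visited but the `c` loop never is, which is exactly the sort of gap I’d want a tool to point out.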

Another approach that I thought of while working with this was to generate exemplars from the regex (since that’s pretty easy), verify by hand that they fit the specification, then use those as a starting point for test cases. That might seem a little backward, but it does a couple things: during the verification step, you’re learning whether your regex really recognizes what you expect it to, and once you’ve got your list of verified test strings, then later changes to the regex might cause useful test failures. If you generate ‘enough’ exemplars, then you know you have test coverage.

How you’d define ‘enough’ is a non-trivial question. In my experiment above, I was looking at which states and transitions of the DFA were covered. Those are at least two levels of coverage you could look for, but it seems that some sort of path coverage would be nice, too. There are a couple projects on the web that generate strings from regexes: Xeger and Rex. They both use a randomized approach, so they aren’t going to guarantee any specific sort of coverage. And indeed, it may be that for a given regex that’s so complicated it needs test-coverage-measurement, the set of inputs you’d need to truly cover it would be crazy-big.
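
Transition coverage, at least, can be had deterministically rather than by random generation. A rough sketch in Python, again using a toy DFA for a(b|c)*d (a made-up example; `shortest_paths_from` and `exemplars` are hypothetical names): for each transition, glue together a shortest path to it, the transition itself, and a shortest path on to an accepting state.

```python
from collections import deque

# Hand-built DFA for the toy regex  a(b|c)*d  (hypothetical example).
START = 0
ACCEPTING = {2}
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 1,
    (1, "d"): 2,
}


def shortest_paths_from(start):
    """BFS: a shortest string leading from `start` to every reachable state."""
    paths = {start: ""}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for (src, ch), dst in sorted(TRANSITIONS.items()):
            if src == state and dst not in paths:
                paths[dst] = paths[state] + ch
                queue.append(dst)
    return paths


def exemplars():
    """One accepted string per transition: prefix + edge + shortest accepting suffix."""
    prefixes = shortest_paths_from(START)
    out = set()
    for (src, ch), dst in TRANSITIONS.items():
        if src not in prefixes:
            continue  # edge can't be reached from the start state
        suffixes = shortest_paths_from(dst)
        endings = [suffixes[s] for s in ACCEPTING if s in suffixes]
        if endings:
            out.add(prefixes[src] + ch + min(endings, key=len))
    return sorted(out)
```

This only buys transition coverage; path coverage would need many more strings, and the hand-verification step still has to happen.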

Which leads me to another thought: if you are using XPath or regex that complicated, maybe it’s not worth it. I mean, it’s all cool and clever to make one (220-character long) expression that does something complicated, but if you can’t really feel confident about how well it works by inspection, then it’s probably too dense to be readable and maintainable. On the other hand, the XSLT compiler can probably do a more efficient job of finding the nodes in question than some hand-written Java or something, and if you want to avoid mixing other languages in with your XSLT process… It’s a puzzle.
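
One middle ground is to keep the single expression but lay it out so it can be read. A sketch in Python, where `re.VERBOSE` allows whitespace and comments inside the pattern; the “section reference” pattern is a made-up example, not one of my real heuristics.

```python
import re

# Hypothetical pattern: a section reference such as "Sec. 12.3(a)".
SECTION_REF = re.compile(r"""
    (?:Sec\.|Section)\s+        # label, abbreviated or spelled out
    (?P<major>\d+)              # major number
    (?:\.(?P<minor>\d+))?       # optional minor number
    (?:\((?P<item>[a-z])\))?    # optional lettered item
""", re.VERBOSE)


def parse_section_ref(text):
    """Return (major, minor, item) for a section reference, or None if no match."""
    m = SECTION_REF.fullmatch(text)
    return m.group("major", "minor", "item") if m else None
```

Each named piece can then get its own test cases, which makes the coverage question a good deal smaller.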

Scanners’n’displays

Just in case you were wondering: laser barcode scanners can’t read off an LCD display, but they can read from an e-ink display (Kindle 3, at least).

(That’s what I predicted.)

Concision

I’m not very good at being concise. I wonder how people do it.