Friday, May 7, 2010

The X (Path) File

by Eduardo Rodrigues

This week I came across one of those mysterious problems. I had some test cases that needed to verify the content of some DOM trees to guarantee that the tests went fine. So, of course, the best way to achieve this is with XPath queries. And because the DOM trees involved were all quite simple, I figured writing the XPath queries to verify them would be a walk in the park. But it wasn’t.

I spent hours and hours trying to figure out what I was doing wrong, googling around, but nothing seemed to make any sense at all. Then, just when I was about to give up and throw myself out the window, I finally noticed the tiny little detail that explained everything and pointed me to the right solution. The culprit was the default namespace specified in my root element!

Turns out, whenever a namespace URI is declared without any prefix (as in xmlns="..."), it is considered the document’s default namespace, and it usually doesn’t affect parsers. But, as I found out, it does affect XPath big time. XPath, by definition, always takes namespaces into account, even the default one. The problem is that, because a default namespace doesn’t have a prefix, we completely lose the ability to use the simplest and most common path-like approach when writing queries to locate nodes in the DOM tree.

Here’s a very simple example that illustrates the issue very well. Consider the following well-formed XML:

<?xml version="1.0" encoding="iso-8859-1"?>
<HR Company="Foo Inc.">
    <Dept id="1" name="Board">
        <Emp id="1">
            <Name>James King</Name>
        </Emp>
        <Emp id="10">
            <Name>Jon Doe</Name>
        </Emp>
        <Emp id="20">
            <Name>Jane Smith</Name>
            <Salary>100000</Salary>
        </Emp>
    </Dept>
</HR>

If I want to check if there’s really an employee named “Jane Smith” earning a 100K salary in the “Board” department, a very simple XPath query such as “//Dept[@name='Board']/Emp[string(Name)='Jane Smith' and number(Salary)=100000]” would easily do the job.
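To make this concrete, here is a minimal Java sketch of that check. It assumes the JDK's built-in javax.xml.xpath API (this post does not depend on any particular XPath engine), and it uses a trimmed-down copy of the document containing only Jane Smith, since hers is the only salary mentioned here. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PlainXPathCheck {
    // Trimmed-down copy of the HR document (assumption: only Jane Smith is kept).
    static final String XML =
          "<HR Company=\"Foo Inc.\">"
        + "<Dept id=\"1\" name=\"Board\">"
        + "<Emp id=\"20\"><Name>Jane Smith</Name><Salary>100000</Salary></Emp>"
        + "</Dept>"
        + "</HR>";

    // The simple path-like query from the text.
    static final String QUERY =
        "//Dept[@name='Board']/Emp[string(Name)='Jane Smith' and number(Salary)=100000]";

    // Returns how many nodes the XPath query selects in the given XML text.
    static int countMatches(String xml, String query) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // parse with namespace information
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(query, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatches(XML, QUERY)); // prints 1
    }
}
```

With no namespace declared anywhere, the query selects exactly one Emp element.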

Now just add an innocent default namespace to the root element:

<HR xmlns="..." Company="Foo Inc.">

and try that very same XPath query that worked so well before. It no longer matches anything. In fact, even a query as simple as “/HR” won’t work as expected anymore. That’s because XPath takes the default namespace into account: every element in the document now belongs to that namespace, while an unprefixed name in an XPath 1.0 query always refers to an element in no namespace. And since the document doesn’t associate any prefix with that namespace, it gives us nothing to refer to it by in the query. My personal opinion is that this represents a huge design flaw in the XPath spec, but that’s a completely different (and now pointless) discussion.
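The effect is easy to reproduce. The sketch below, again assuming the JDK's javax.xml.xpath API and a namespace-aware parser, runs the same query against a trimmed-down document whose root declares a default namespace. The URI http://example.com/hr is a hypothetical placeholder, not the one from my real files, and the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DefaultNamespaceProblem {
    // Same trimmed-down document, now with a default namespace on the root.
    // The URI is a hypothetical placeholder.
    static final String XML =
          "<HR xmlns=\"http://example.com/hr\" Company=\"Foo Inc.\">"
        + "<Dept id=\"1\" name=\"Board\">"
        + "<Emp id=\"20\"><Name>Jane Smith</Name><Salary>100000</Salary></Emp>"
        + "</Dept>"
        + "</HR>";

    static final String QUERY =
        "//Dept[@name='Board']/Emp[string(Name)='Jane Smith' and number(Salary)=100000]";

    // Returns how many nodes the XPath query selects in the given XML text.
    static int countMatches(String xml, String query) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this the DOM carries no namespace info
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(query, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatches(XML, QUERY)); // prints 0: unprefixed names mean "no namespace"
        System.out.println(countMatches(XML, "/HR")); // prints 0: the root is no longer a plain HR
        System.out.println(countMatches(XML, "/*"));  // prints 1: the wildcard ignores the name
    }
}
```

The document content is identical to the earlier one. Only the xmlns declaration changed, and that alone is enough to make every name test in the query miss.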

Unfortunately, there’s no magic in this case. To keep using XPath queries in this kind of situation, we need a more generic (and less compact) syntax that lets us say explicitly whether we care about fully qualified (or expanded) names, namespaces included, or only about local names regardless of namespace. Note that the child elements tested inside the predicates (Name and Salary) are in the default namespace too, so they need the same treatment. Below is the very same query in this more generic syntax, in both naming flavors, each providing the exact same outcome:
  1. If you need (or want) to consider the namespace (here '...' stands for the namespace URI declared on the root element):
    //*[namespace-uri()='...' and local-name()='Dept' and @name='Board']/*[namespace-uri()='...'
    and local-name()='Emp' and string(*[local-name()='Name'])='Jane Smith' and number(*[local-name()='Salary'])=100000]
  2. If you just care about the elements’ names, then just remove the “namespace-uri” conditions:
    //*[local-name()='Dept' and @name='Board']/*[local-name()='Emp' and string(*[local-name()='Name'])='Jane Smith' and number(*[local-name()='Salary'])=100000]
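Both flavors can be tried with a short Java sketch, assuming the JDK's javax.xml.xpath API and a hypothetical namespace URI (http://example.com/hr) standing in for the real one. One detail worth noting: the child tests inside the predicate need the generic syntax as well, because a bare Name or Salary still means "no namespace". The third query below leaves them bare to show the difference. All class, method and constant names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class GenericSyntaxQueries {
    static final String NS = "http://example.com/hr"; // hypothetical namespace URI

    static final String XML =
          "<HR xmlns=\"" + NS + "\" Company=\"Foo Inc.\">"
        + "<Dept id=\"1\" name=\"Board\">"
        + "<Emp id=\"20\"><Name>Jane Smith</Name><Salary>100000</Salary></Emp>"
        + "</Dept>"
        + "</HR>";

    // Child tests written with local-name(), so they match in any namespace.
    static final String CHILDREN =
        "string(*[local-name()='Name'])='Jane Smith' and number(*[local-name()='Salary'])=100000";

    // Flavor 1: namespace URI and local name must both match.
    static final String WITH_NAMESPACE =
          "//*[namespace-uri()='" + NS + "' and local-name()='Dept' and @name='Board']"
        + "/*[namespace-uri()='" + NS + "' and local-name()='Emp' and " + CHILDREN + "]";

    // Flavor 2: only the local name must match.
    static final String LOCAL_NAME_ONLY =
          "//*[local-name()='Dept' and @name='Board']"
        + "/*[local-name()='Emp' and " + CHILDREN + "]";

    // Same as flavor 2, but with bare Name and Salary child tests.
    static final String BARE_CHILDREN =
          "//*[local-name()='Dept' and @name='Board']"
        + "/*[local-name()='Emp' and string(Name)='Jane Smith' and number(Salary)=100000]";

    // Returns how many nodes the XPath query selects in the given XML text.
    static int countMatches(String xml, String query) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(query, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatches(XML, WITH_NAMESPACE));  // prints 1
        System.out.println(countMatches(XML, LOCAL_NAME_ONLY)); // prints 1
        System.out.println(countMatches(XML, BARE_CHILDREN));   // prints 0
    }
}
```

The @name test needs no special handling, since an attribute without a prefix is never in the default namespace.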
The reason why I prefer the local-name() function over name() is simply that, together with namespace-uri(), it is the most generic way of selecting nodes, since local-name() doesn’t include the prefix, even if there is one. In other words, even if you had a node such as <hr:Dept>, local-name() would return simply “Dept”, while name() would return “hr:Dept”. It’s much more likely that the prefix for a particular namespace will vary amongst different XML files than its actual URI. Therefore, predicates that combine namespace-uri() and local-name() should work in any case, regardless of which prefixes are being used at the moment.
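A quick way to see the difference between the three functions is to evaluate them against a prefixed element. The sketch below assumes the JDK's javax.xml.xpath API, a hypothetical namespace URI (http://example.com/hr) and the hr prefix from the example above. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NameFunctions {
    // Prefixed variant of the document; the URI is a hypothetical placeholder.
    static final String XML =
          "<hr:HR xmlns:hr=\"http://example.com/hr\" Company=\"Foo Inc.\">"
        + "<hr:Dept id=\"1\" name=\"Board\"/>"
        + "</hr:HR>";

    // Evaluates an XPath expression against the XML text and returns its string value.
    static String eval(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String dept = "//*[local-name()='Dept']"; // selects the element whatever its prefix
        System.out.println(eval(XML, "name(" + dept + ")"));          // prints hr:Dept
        System.out.println(eval(XML, "local-name(" + dept + ")"));    // prints Dept
        System.out.println(eval(XML, "namespace-uri(" + dept + ")")); // prints http://example.com/hr
    }
}
```

Renaming the prefix in the document changes only the first result, which is why a predicate built on name() is the fragile one.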