Easily getting the correct xml:lang attribute for an element using XPath

This is a quick interlude before I go back to finishing the write-up about dealing with extensions to DDI.

_____________________

A problem I’ve come up against numerous times when processing XML is the correct handling of (human) languages. The issue comes up because the XML 1.0 specification is quite clear about how to determine the language of an element in an XML document: use the element’s xml:lang attribute and, if that doesn’t exist, use the nearest ancestor’s xml:lang attribute (nearest here means the closest in the document, e.g. parent beats grandparent beats great-grandparent, and so on).

Unfortunately, this convention doesn’t always carry over into XML processors. In fact, I don’t think I’ve come up against an XML processor that correctly handles this cascading effect. Fortunately, this is actually a trivial expression in XPath (though not one that I’ve seen before), shown below:

ancestor-or-self::*[attribute::xml:lang][1]/@xml:lang

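What this does is quite simple (in fact, describing it means almost retyping the part above about the XML specification). From the list of this element and all its ancestors, find those that have an xml:lang attribute, grab the first (and thus nearest) one, and return the value of its xml:lang attribute. The [1] picks the nearest match rather than the outermost because ancestor-or-self is a reverse axis, so positions are counted outwards from the current element.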
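For readers working outside XPath, the same lookup can be sketched in Python. This is only an illustration, not part of the Gist: the standard library's ElementTree does not support the ancestor-or-self axis, so the sketch walks up a hand-built parent map instead. The names `effective_lang`, `parents` and the small test document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_lang(element, parents):
    """Return the nearest xml:lang value on element or its ancestors, else ""."""
    node = element
    while node is not None:
        if XML_LANG in node.attrib:
            # An explicit xml:lang="" also stops the search, as in the XPath.
            return node.attrib[XML_LANG]
        node = parents.get(node)  # None once we step above the root
    return ""


# Hypothetical document, trimmed down for illustration.
doc = ET.fromstring(
    '<data><foo xml:lang="en"><bar><tog xml:lang="fr"><wel/></tog></bar></foo></data>'
)

# ElementTree has no parent pointers, so build a child -> parent map once.
parents = {child: parent for parent in doc.iter() for child in parent}

bar = doc.find("foo/bar")
wel = doc.find("foo/bar/tog/wel")
print(effective_lang(bar, parents))        # en
print(effective_lang(wel, parents))        # fr
print(repr(effective_lang(doc, parents)))  # ''
```

Returning an empty string when nothing is found mirrors what the XPath expression gives when converted to a string for an element with no xml:lang anywhere above it.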

I’ve put together an XSLT that demonstrates this in practice by adding the correct implied xml:lang attribute to every element in a document. The Gist also includes sample input and output files that show how it should work, which I’ve included below:

If we run the XSLT across this file:

<?xml version="1.0" encoding="utf-8"?>
<data>
  <foo xml:lang="en">
    <bar>
      <tog xml:lang="fr">
        <wel>
          <vay xml:lang="sv"/>
        </wel>
      </tog>
      <tog>
        <wel>
          <hut xml:lang="it"/>
        </wel>
      </tog>
      <tog xml:lang=""/>
    </bar>
  </foo>
</data>

We should get this output:

<?xml version="1.0" encoding="utf-8"?>
<data xml:lang="">
  <foo xml:lang="en">
    <bar xml:lang="en">
      <tog xml:lang="fr">
        <wel xml:lang="fr">
          <vay xml:lang="sv"/>
        </wel>
      </tog>
      <tog xml:lang="en">
        <wel xml:lang="en">
          <hut xml:lang="it"/>
        </wel>
      </tog>
      <tog xml:lang=""/>
    </bar>
  </foo>
</data>

In practice there is a decision to be made as to whether xml:lang attributes should be added to an XML file at the start of processing, for example to make finding every element that is in English easier. Either way, determining the language of an element is now much simpler.
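As a rough sketch of the "add the attributes up front" option, the Python below annotates every element in one top-down pass and then filters for English. It is not the XSLT from the Gist: it passes the inherited language downwards instead of looking upwards per element, and the `annotate` function and `SAMPLE` string are names assumed for this example. The sample mirrors the input file shown earlier.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<data>
  <foo xml:lang="en">
    <bar>
      <tog xml:lang="fr"><wel><vay xml:lang="sv"/></wel></tog>
      <tog><wel><hut xml:lang="it"/></wel></tog>
      <tog xml:lang=""/>
    </bar>
  </foo>
</data>"""


def annotate(element, inherited=""):
    """Set xml:lang on element and its descendants, passing the value down."""
    # An element's own xml:lang (even an empty one) beats the inherited value.
    lang = element.attrib.get(XML_LANG, inherited)
    element.set(XML_LANG, lang)
    for child in element:
        annotate(child, lang)


root = ET.fromstring(SAMPLE)
annotate(root)

# (tag, language) pairs in document order.
langs = [(el.tag, el.get(XML_LANG)) for el in root.iter()]
print(langs)

# With explicit attributes everywhere, finding English elements is a flat filter.
english = [el.tag for el in root.iter() if el.get(XML_LANG) == "en"]
print(english)  # ['foo', 'bar', 'tog', 'wel']
```

The trade-off is that the document is modified in place, which may or may not be acceptable depending on what happens to it afterwards.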

The full Gist of this is up on GitHub; feel free to fork and improve it if necessary, or even point out where others have done this before or why this may not even be necessary.