The growth of XML as a data exchange standard is often seen as a positive development, given the chaos it replaced. The relationship the world has with XML, however, is not built on pure love. It’s worth knowing a bit about the problems people have with the standard, especially if that helps us grasp why, despite all these problems, XML is still proving to be so pervasive.
1. It’s hard to parse
Just looking at XML data structures can destroy your soul. Now, if this were only a matter of our meatware being wholly inadequate while programs, at least, had an easy time parsing these beasts, that would have been just fine. LISPy S-expressions can be parsed by an amoeba. The spec for JSON fits in your browser without scrolling, for crying out loud. But what do you know, XML syntax can be quite quirky, what with namespaces, escaping and the hundreds of ways to get things wrong.
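To make the namespace and escaping quirks a bit more tangible, here is a small sketch using Python's standard xml.etree.ElementTree module. The document string and the prefix "a" are made up for illustration; the behavior shown is how ElementTree treats a default namespace and escaped character data.

```python
import xml.etree.ElementTree as ET

# A tiny document with a default namespace and escaped character data.
doc = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<title>Tom &amp; Jerry &lt;3</title>"
    "</feed>"
)
root = ET.fromstring(doc)

# The tag name is not simply "feed": the namespace URI is folded into it.
print(root.tag)

# A naive lookup by local name finds nothing, because the child is namespaced too.
naive = root.find("title")

# The lookup only works once the namespace is spelled out via a prefix map.
ns = {"a": "http://www.w3.org/2005/Atom"}
title = root.find("a:title", ns)

# Escapes are decoded by the parser, so the text differs from the bytes on the wire.
print(naive, title.text)
```

None of this is hard once you know it, but each of these is a way for a hand-rolled parser or a casual consumer to get things wrong.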
2. Seriously, it’s insanely hard to parse
Any sufficiently large data structure will look intimidating. People do insane things with XML. Mind-numbingly insane things. It’s like the ‘porn rule’: Think of the stupidest way you can imagine to encode a data structure into XML, used to encode the stupidest data structure you dare to dream of, multiply that by two and then realize that someone is doing that right now.
So why isn’t everybody using your favorite serialization standard?
Since we’ve already established how XML is hard to parse, how come people aren’t flocking to alternatives? Why won’t XML just die? Why, why? I think the answer is attributes. Your favorite serialization standard does not have them:
- JSON - nope
- S-expressions - fuggeddit
- Serialized PHP - you’re joking, right?
- NeXTSTEP plists - nuh-uh
- YAML - no matches for ‘attribute’, plus its complicated rules end up making the standard even more insane than XML.
So why do these standards not support attributes, you may ask? Surely if we take one of these and add attributes, we would end up with a superior standard? This is where reality will bite. I’ve been toying with these ideas and my preliminary conclusion is that adding attributes to an encoding standard will make it, you guessed it: harder to parse!
Who needs attributes anyway?
In the spirit of KISS and in defense of one of the standards that is currently easier to parse, people have been into various forms of denial about attributes. The inescapable fact is that the concept of attributes is tremendously useful. It allows for distinguishing two kinds of properties that can be attached to a single node in a data tree: data attributes that say something about the node, and child nodes that the node owns. Expressing this distinction inside a ‘tree only’ object model takes cumbersome workarounds:
- Using a magic prefix, like object['__myattribute'] versus object['somechild']. Oh, and you just ruined your namespace.
- Making a second step in the hierarchy, so you do object['attr']['myattribute'] and object['children']['somechild']. You’ll clearly enjoy those heroic ventures into /object/children/child1/children/child2/… and you will run into a need for that.
- Some hybrid form, like using object['__attr__']['myattribute']. Not too shabby, but you still lost the ability for a node to have direct data alongside attributes, so you will have to access that through object['__data__'] or some such. You also polluted your namespace again, just not as badly as before.
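The three workarounds above can be sketched side by side in Python, next to the XML equivalent. The key names come from the list; the values, the helper variable names and the sample element are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# 1. Magic prefix: attributes and children share one key namespace,
#    so every consumer has to filter on the "__" prefix.
prefixed = {"__myattribute": "true", "somechild": {}}
attrs_prefixed = {k[2:]: v for k, v in prefixed.items() if k.startswith("__")}

# 2. Extra hierarchy level: clean separation, but each step down the tree
#    now costs an additional ['children'] hop.
split = {
    "attr": {"myattribute": "true"},
    "children": {"somechild": {"attr": {}, "children": {}}},
}
attrs_split = split["attr"]

# 3. Hybrid: attributes under one reserved key, direct data under another,
#    children alongside them in the same dictionary.
hybrid = {
    "__attr__": {"myattribute": "true"},
    "__data__": "some text",
    "somechild": {},
}
attrs_hybrid = hybrid["__attr__"]

# XML keeps attributes, direct data and children apart without reserved keys.
node = ET.fromstring('<object myattribute="true">some text<somechild/></object>')
attrs_xml = dict(node.attrib)

print(attrs_prefixed == attrs_split == attrs_hybrid == attrs_xml)
```

All four end up carrying the same attribute; the difference is how much convention the reader of the structure has to know in advance.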
Since there are data models that only XML can express while XML can easily express the models of other encodings, the standard is not likely to be going anywhere any time soon.
XML and OpenPanel
With that in mind, we still picked JSON for OpenPanel’s RPC because it was assumed to be faster to parse and easier to map to JavaScript options on the GUI side. We dealt with a PHP middle layer during earlier builds that we fed serialized PHP arrays, so we already had to stick to the lowest common denominator with regard to data modeling the RPC structures. With the PHP layer gone and the CLI not giving a hoot about what format is spoken, JSON sounded like the best idea. Adding an XML mapping to this will not be a problem.
On the side of modules, we’ve always been heavy on using XML structures to define the make-up and layout of the object tree inside the opencore database. Lately, this has been bugging me a bit. The module.xml format has mostly grown organically around the changes inside opencore’s object database and module interaction API as they came about. It’s not the most friendly of XML files for developers to edit, though, and module programming is something we want to encourage as much as possible. So I came up with a text format that allows for the mixture of data and attributes that is typically needed, but is easy to parse for humans and machines alike.
Here’s the syntax:
# simple tree
Person john
    string name
    dict address
        string email
        string msn
        string homepage
In Pythonesque fashion, we parse parent/child structuring through the indent level. Let’s add some data:
# tree with values
Person john
    string name : John T. Ripper
    dict address
        string email : johnt@ripper.co.uk
        string msn : johnt152@hotmail.com
        string homepage : http://www.ripper.co.uk/~johnt
And, finally, let’s add those dreaded attributes:
# tree with values and attributes
Person john < recordupdated true
            < recordsaved false
    string name : John T. Ripper
    dict address
        string email : johnt@ripper.co.uk
                     < hide true
                     < sendmarketing false
        string msn : johnt152@hotmail.com < hide true
        string homepage : http://www.ripper.co.uk/~johnt
That’s all. It’s too limited to ever replace XML, but it matches the subject layout of our module.xml perfectly, so it can be used to generate the XML files for people who like it, but sticking to directly editing module.xml is just fine. It’s more an experiment in purpose-driven textualization of a data model than it is trying to be what m4 became for sendmail.cf.
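To back up the "easy to parse" claim, here is a minimal sketch of a parser for the format in Python. This is not OpenPanel's implementation. It assumes space-based indentation, " : " as the value separator, " < " introducing a "name value" attribute pair, and a line starting with "<" adding attributes to the node declared just before it. The function names and the dictionary layout of a node are made up for this example.

```python
def _add_attribute(node, part):
    """Store one 'name value' pair, e.g. 'hide true', on the node."""
    name, _, value = part.strip().partition(" ")
    node["attrs"][name] = value.strip()


def parse(text):
    """Parse the indented text format into a list of top-level nodes."""
    root = {"type": None, "id": None, "value": None, "attrs": {}, "children": []}
    stack = [(-1, root)]  # (indent, node) pairs for the currently open branch
    last = None           # most recently declared node, target of "<" lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("<"):
            # Continuation line: attributes for the previous node.
            if last is None:
                raise ValueError("attribute line before any node")
            for part in line[1:].split(" < "):
                _add_attribute(last, part)
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        head, *attr_parts = line.split(" < ")
        decl, sep, value = head.partition(" : ")
        node_type, _, node_id = decl.strip().partition(" ")
        node = {
            "type": node_type,
            "id": node_id.strip(),
            "value": value.strip() if sep else None,
            "attrs": {},
            "children": [],
        }
        for part in attr_parts:
            _add_attribute(node, part)
        # Close every branch that is indented at least as deep as this line.
        while stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((indent, node))
        last = node
    return root["children"]


SAMPLE = """\
# tree with values and attributes
Person john < recordupdated true
            < recordsaved false
    string name : John T. Ripper
    dict address
        string email : johnt@ripper.co.uk
                     < hide true
                     < sendmarketing false
        string msn : johnt152@hotmail.com < hide true
        string homepage : http://www.ripper.co.uk/~johnt
"""

tree = parse(SAMPLE)
john = tree[0]
address = john["children"][1]
print(john["attrs"], [child["id"] for child in address["children"]])
```

The whole thing is a stack and three string splits, which is about the level of effort the format was aiming for. A value that itself contains " < " would trip this sketch up; whether that needs an escape rule is left open here.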
For a comparison view: module.xml versus module.def.