The thing is: I hate XML.

OK, not exactly. I hate the way they abuse it, especially in the corporate world. XML, and Java (or Python lately), and no alternatives allowed. Often, where even CSV would do better, they stick in some XML-based bloatware and are always happy about it!

But then it turns out that I have this bunch of very different PDFs on my hands, from different sources, produced by various software tools, but with one thing in common: they all contain data I must pull in and sort out. Lucky me, nobody puts limits on what this has to be implemented in: “Raku? What is it? Whatever, just make it work!” Fantastic!

Next turn, what do we have for reading PDFs? Oh, oops… Can I convert them into something different? Sure, you can! Would you be happy about XML? Er… Well, yes, I suppose.

A while later, along with the PDFs, an XLSX spreadsheet pops up. And you all know what it is internally… Moreover, back then it became apparent to me that Spreadsheet::XLSX lacked support for some key features of the format, and I came up with a couple of PRs to implement them. Along the way I basically developed a small core for de-/serializing XML into Raku objects. It’s limited and only sufficient to serve the needs of XLSX parsing, but it felt like it had some interesting potential, especially with regard to the tasks I already had on my hands.

Yet, before doing something as stupid as starting a new project, I looked around first and, sure enough, stumbled upon XML::Class. Unfortunately, the XML module it is based upon proved to be too slow for the files I’m dealing with; LibXML does way better. Still, XML::Class provided me with a couple of great ideas.

And here we are today: welcome the LibXML::Class module! A Swiss Army knife of XML serialization for the Raku language.

First of all, the principle I’m trying to follow any time something new is planned: make it easy to start with, yet make it very capable:

use LibXML::Class;
class Record is xml-element {
    has Str $.field1;
    has Int:D $.field2 is required;
}

say Record.new(
        field1 => 'The Answer',
        field2 => 42
    ).to-xml;

say Record.new( field2 => 12 ).to-xml;

How much easier could this ever be? There is an even better example in the repository, but I’m not including it here because I have different plans for it.

OK, this is about simplicity. What about capability? The full list of the module’s features can be found in its README, and the manual explains them in detail, though due to lack of time no proofreading has been done for it and errors of different kinds are guaranteed. That’s why I tried to cover the most important topics with examples. In this post I’ll cover just the key ones.

Lazy Deserialization

Say we have a huge XML file with a complex structure. Full and immediate deserialization of it would result in hundreds or thousands of instances of Raku classes being created. No fast parser would be of much help here because of the time it takes to run the constructors of each and every object. If one needs just one attribute of a single record somewhere in the structure, it’d definitely be a stupid waste of computing time, of memory and, worst of all, of the end user’s patience.

This is not our way. LibXML::Class doesn’t deserialize until it is really necessary. Consider an example from the repository:

use v6.e.PREVIEW;
use LibXML::Class;

class Record is xml-element {
    has Str:D $.data is required;
    submethod TWEAK {
        say "+ record";
    }
}

class Root is xml-element {
    has Record:D $.record is required;
    submethod TWEAK {
        say "+ root";
    }
}

my $root = Root.new: record => Record.new(:data("some data"));

say "--- deserializing";
my $root-copy = Root.from-xml: $root.to-xml.Str;
say "--- deserialized";
say $root-copy.record;
say "--- all done";

Running it would result in an output like this:

+ record
+ root
--- deserializing
+ root
--- deserialized
+ record
Record.new(data => "some data", xml-name => "Record")
--- all done

The first two lines are rather understandable: we create a record, then the root object. Hence the prints from their constructors. But starting from the '--- deserializing' line, events take a more interesting turn. We only see output from Root’s constructor, but there is nothing from Record. That is because at this point $.record is not initialized yet. LibXML::Class uses AttrX::Mooish to turn the attribute into a lazy one, as if somebody had applied the is mooish(:lazy<xml-deserialize-attr>) trait to it. The effect of this is visible right after the end of deserialization is reported with the '--- deserialized' line. There you can see '+ record' from Record’s TWEAK submethod first, and only after that the gist of the record object itself. Both are easily pinpointed to the say $root-copy.record line, where a read from $.record resulted in the object finally being deserialized and made ready for use.

Now, imagine that the Record itself has sub-records, and sub-sub-records, and there are lots and lots of them. Your code doesn’t waste time on instantiating them unless explicitly requested to do so. Naturally, this behavior can be disabled if necessary. Moreover, it can be turned on or off individually for each attribute.

This functionality is not activated implicitly for basic-type attributes like strings, numerics, etc. But one can enforce it per attribute if this is considered helpful.
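To make the idea tangible without pulling in any modules, here is a hand-rolled sketch of a lazy attribute in plain Raku. It is not how LibXML::Class is implemented (the module relies on AttrX::Mooish, as said above); the class layout, the $.source attribute and the $.builds counter are all invented for this illustration.

```raku
# A hand-rolled sketch of a lazy attribute. LibXML::Class itself relies on
# AttrX::Mooish; every name below is hypothetical.
class Record {
    has Str:D $.data is required;
}

class Root {
    has Str:D $.source is required;  # stand-in for the not-yet-deserialized XML
    has Record $!record;             # stays undefined until the first read
    has Int $.builds = 0;            # counts how many times Record was built

    method record(--> Record:D) {
        unless $!record.defined {
            $!builds++;
            $!record = Record.new(data => $!source);
        }
        $!record
    }
}

my $root = Root.new(source => "some data");
say $root.builds;        # 0: nothing has been built yet
say $root.record.data;   # some data
say $root.record.data;   # some data (served from the cached object)
say $root.builds;        # 1: the record was built exactly once
```

The point of the sketch is only the ordering: constructing Root costs nothing for Record, and the first read pays for it once.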

XML Sequences

Working on Spreadsheet::XLSX introduced me to a pretty curious entity: the XML sequence. From Raku’s perspective it would be a Positional and an Iterable, but neither a List nor a Seq. Well, in theory it is possible to map it into one, but that’d be rather tricky and unnatural. Here are the most challenging features of a sequence:

  • it can be a multi-type thing; i.e. it may contain different XML elements
  • it could contain a huge number of elements
  • elements do not necessarily come from the same namespace

Perhaps I miss something here, but even these three points make it somewhat special.

Sure, with a certain amount of patience and obstinacy, one could implement them as arrays, but here comes one barely solvable problem: an array attribute would still be deserialized as a whole, simply because there are no lazy arrays in Raku!

Here comes a solution (see another example):

use v6.e.PREVIEW;
use LibXML::Class;

class Ref is xml-element<ref> {
    has Str:D $.ISBN is required is xml-element;
    has Int:D $.page is required is xml-element;
    submethod TWEAK {
        say "+ ref for ", $!ISBN;
    }
}

class Index is xml-element( 'index',
                            :sequence( Ref, :idx(Int:D) ) )
{
    has Str:D $.title is required;
}

my $index = Index.new: title => "Experimental";

$index.push: 42;
$index.push: Ref.new(:ISBN<1-2-FAKE>, :page(10));
$index.push: Ref.new(:ISBN<3-4-MOCKED>, :page(1001));

say "--- deserializing";
my $index-copy = Index.from-xml: $index.to-xml.Str;
say "--- deserialized";
say $index-copy[1];
say "--- all done";

Along the way, the sample also demonstrates how the advanced capabilities of LibXML::Class get activated when necessary.

Anyway, running this would result in the following output:

+ ref for 1-2-FAKE
+ ref for 3-4-MOCKED
--- deserializing
--- deserialized
+ ref for 1-2-FAKE
Ref.new(ISBN => "1-2-FAKE", page => 10, xml-name => "ref")
--- all done

And, again, we observe laziness in action! As only a single item of the sequence is read, only a single line of output is produced by TWEAK. There is a difference from attributes, though: an XML sequence is totally lazy, with no exceptions. No sequence item is deserialized until read, not even the basic-type ones.

Now, let’s get back to where it started. In an XLSX worksheet, rows are sequence elements (items, in terms of the LibXML::Class Raku representation); the same applies to individual cells. Now, imagine full deserialization of a sheet consisting of a thousand rows with hundreds of columns! Nah, gimme a break… Of course, that would still mean full parsing of the XML; unfortunately, that’s an unavoidable cost. Yet it doesn’t mean that there are gonna be piles of LibXML::Node objects scattered all around your RAM! Fortunately for us, most of the work would be done at the low level by libxml and only pulled up into Raku land when necessary. In other words, these ops are also mostly lazy.
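For the curious, the "nothing is built until read" behavior can be imitated in plain Raku with a class that provides its own AT-POS. This is only a toy under stated assumptions: LazyItems, @.raw and $.built are made-up names, upper-casing a string stands in for real deserialization, and nothing here reflects the internals of LibXML::Class.

```raku
# A toy lazy sequence: items are "deserialized" only when indexed.
# Hypothetical names throughout; upper-casing stands in for real deserialization.
class LazyItems does Positional {
    has Str @.raw;          # stand-in for undeserialized XML nodes
    has @!cache;            # built items, filled on demand
    has Int $.built = 0;    # how many items have been built so far

    method elems(--> Int:D) { @!raw.elems }

    method AT-POS(Int:D $idx) {
        unless @!cache[$idx]:exists {
            $!built++;
            @!cache[$idx] = @!raw[$idx].uc;
        }
        @!cache[$idx]
    }
}

my $items = LazyItems.new(raw => <a b c>);
say $items[1];      # B
say $items.built;   # 1: only the requested item was built
say $items.elems;   # 3
```

Reading one index touches one item; the rest stay in their raw form until somebody asks for them.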

Namespacing

This is where I both love it and hate it. The use of namespaces in XML helps resolve so many problems that oftentimes XML is the only answer to complex problems. In my view, a way less verbose format could’ve been developed for these tasks, but it’s too late for any discussion now. Hence my only complaint here is about using XML where there are more appropriate formats that would do better.

Anyway, my goal was to cover as many different combinations of using namespaces as I could. Considering that we have default namespaces (those defined with xmlns="..."), and we have prefixes, and we have priorities, inheritance, override, and perhaps some other things I forget about; and there are rules on how they apply and match; and that the way one sees and uses them with Raku objects must look and feel the same as for XML (or, at least, for the LibXML implementation of the standard); so, considering all the above, in retrospect, I’m not surprised that more than 80% of code development time has been spent on namespacing. Parts of the code underwent like 5-10 rewrites – basically, the count has been lost long ago…

But it’s definitely worth it. Just by looking at the module’s SYNOPSIS you can see how simple it is to declare and refer to namespaces!

Then you start reading the manual.

Then you come down to see an example of deriving namespaces.

Then there is an example of imposing namespaces.

And only then does it become apparent how convoluted namespacing can be. Yet we can handle, if not all possible variations of it, then a significant subset, most certainly!

BTW, another feature I won’t focus on but wanna mention anyway is the XML:any technique, which is heavily based upon namespaces. This is an idea I borrowed from XML::Class but gave some extra capabilities, especially in the area of XML sequence items.

Examples are intentionally omitted in this section due to their rather bloated sizes.

Searching

Here comes real magic!

When I encountered LibXML, what attracted me, aside from its speed (let’s avoid emotionally attached terms, though…), was its XPath-based findnodes. And when I came to the idea of LibXML::Class, that method was one of the two most significant reasons I wanted the module to exist in the first place.

Well, you know what? I nearly forgot to implement it, after all. The namespaces, you know: they sucked every last bit of energy out of me. But, down with them…

Can you spot a catch here? I’ll give a hint: laziness. It wouldn’t be a big deal to map a LibXML::Node to its deserialization, because there is the unique-key method which lets us keep track of objects. But what if there is no object to track yet? What if the node we found is so deeply nested in the source XML that not only is there no deserialization for it, but none for a couple of its parent nodes either?

Solved. The feature can be observed in action in a modified version of the SYNOPSIS code. The tests 200-pml-parser.rakutest and 150-find.rakutest are even better at demonstrating the feature, but they’re naturally harder to read. 200-pml-parser.rakutest is specifically focused on searching for not-yet-deserialized nodes.

I’m once again avoiding any full samples here. They’d be too big. Just to give you an impression of how it works, here is the single line which would be at the core of most searches:

my LibXML::Class::XML:D @deserializations = $root.xml-findnodes(q«//*[@idx = 3002]»);

This is all that’s needed to find the deserializations for all XML elements with the idx attribute set to 3002.

Wait, don’t go! Just one more line and we’re almost done here:

my Str:D @names = $root.xml-findnodes(q«//*[@idx = 3002]/@name»);

This is how to find the name attributes of our elements. So, let’s say there is something like this in our Raku:

role Named is xml-element {
    has Str:D $.name is xml-attribute;
}

role Indexed is xml-element {
    has UInt:D $.idx is xml-attribute;
}

class Record is xml-element<rec> does Named does Indexed {
    has SubRecord $.sr is xml-element<subrec>;
}

Then, by adding @name to the XPath, we’d get $.name of the role Named. If there are other classes consuming it and deserialized from the same XML, then we’re gonna get their attributes too, perhaps. Surely, it depends on their $.idx values.

Pardon? Haven’t I told you about roles? Oh, my bad! Well, as you can see, they are supported. As is subclassing. Let’s not focus on this.

What’s more interesting is that search works with object cloning. It means that even if there is a 100% probability that only a single XML element is to be found, a sequence would always be returned for a particular XPath expression. That is because, if a deserialization gets cloned, the first thing the newborn copy does is register itself with the object registry.

And, sure enough, if search is not needed, then turning it off altogether would spare you some memory and processing time.

Done

Now I tell myself: stop! Or this post will turn into a secondary manual. Before we say “see ya!” to each other, I have one single request for you: if this module ever makes you want to use it, please make sure it’s not for a manually editable config of your application! XML is great when used properly; and “properly” means to me: read and written by code and only by code, never by human eyes and human hands!
