People often reach for regular expressions to extract and rearrange information in XML documents. Those usually only work for the limited test cases people specifically target, but are really little time-bombs waiting to go off when the data or the format changes even slightly. The bomb often explodes after the original programmer has disappeared.
It’s much easier to deal with XML with a proper XML tool, such as XML::Twig. It uses expat behind the scenes, so it handles all of the format and structural details for you and lets you focus on the higher-level concepts. There are two ways that XML::Twig can process data: either by parsing it completely into a tree structure first and then processing it, or by parsing and processing it at the same time. Which one you use depends on the size of your data and what you need to do. In this Item, you’ll have tiny data, so you’ll parse it completely before processing it. You’ll see the other method in a later Item.
Start with some XML output from Subversion’s svn command-line tool, just because it’s an easy source for XML text. You can add the --xml option to most of its subcommands, such as info:
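    $ svn info --xml
    <?xml version="1.0"?>
    <info>
    <entry
       kind="dir"
       path="."
       revision="37">
    <url>https://svn.example.com/svn/trunk</url>
    <repository>
    <root>https://svn.example.com/svn</root>
    <uuid>83cfdd44-1cdd-11df-9fc2-77fecae625a6</uuid>
    </repository>
    <wc-info>
    <schedule>normal</schedule>
    </wc-info>
    <commit
       revision="37">
    <author>brian.d.foy</author>
    <date>2010-02-20T00:13:20.439142Z</date>
    </commit>
    </entry>
    </info>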
Say that you want to extract the commit information. You could write a simple regular expression to extract it from exactly the format in that example output, but what happens if the Subversion developers change the order of the elements? Boom! Your bomb explodes!
XML::Twig makes this very simple to implement and even easier to maintain. First, you create a twig object and tell it what to parse:
    use XML::Twig;

    # run from a directory under svn control
    my $xml = `svn info --xml`;

    my $twig = XML::Twig->new;
    $twig->parse( $xml );
Once parsed, the twig is a tree structure of the XML data and you can do various things with it. To start your XML processing, you need to get one of the elements from the twig. There are many ways to go about this, but in this case the easiest thing to do is to get the commit element with first_elt. Save that in $commit because that’s the object you’ll interact with to extract information:
    use XML::Twig;

    # run from a directory under svn control
    my $xml = `svn info --xml`;

    my $twig = XML::Twig->new;
    $twig->parse( $xml );

    my $commit = $twig->first_elt( 'commit' );
You can then use att to extract the attribute named revision from the $commit element:
    use XML::Twig;

    # run from a directory under svn control
    my $xml = `svn info --xml`;

    my $twig = XML::Twig->new;
    $twig->parse( $xml );

    my $commit   = $twig->first_elt( 'commit' );
    my $revision = $commit->att( 'revision' );

    print "Revision $revision\n";
The twig doesn’t care about attribute order, position on the line, or anything else. It knows how to get the right information, so the output now shows just the revision number:
Revision 37
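To make the point about layout concrete, here is a minimal sketch that runs the same lookup against two made-up XML strings (not real svn output) carrying the same data in different layouts. The commit_revision helper, the sample strings, and the extra note attribute are all hypothetical, introduced only for illustration. A regular expression written against the first layout is included for comparison:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: return the revision attribute of the first
# commit element, or undef if there is no commit element.
sub commit_revision {
    my( $xml ) = @_;

    my $twig = XML::Twig->new;
    $twig->parse( $xml );

    my $commit = $twig->first_elt( 'commit' );
    return defined $commit ? $commit->att( 'revision' ) : undef;
}

# Made-up samples: same data, different layout and attribute order.
my $compact = '<info><entry><commit revision="37">'
    . '<author>brian.d.foy</author></commit></entry></info>';

my $shuffled = <<'XML';
<info>
  <entry>
    <commit
        note='hypothetical extra attribute'
        revision = '37'
    >
      <author>brian.d.foy</author>
    </commit>
  </entry>
</info>
XML

# A regex written against the compact layout only
my $naive = qr/<commit revision="(\d+)">/;

for my $xml ( $compact, $shuffled ) {
    my $from_regex = $xml =~ $naive ? $1 : 'no match';
    my $from_twig  = commit_revision( $xml );
    print "regex: $from_regex, twig: $from_twig\n";
}
```

The twig lookup returns 37 for both strings, while the regular expression only matches the layout it was written for.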
Expand this a bit to pull out more information. If you want the committer date and the name, that’s easy too. You can use first_child_text to extract the data for the child tags with the names that you specify:
    use XML::Twig;

    # run from a directory under svn control
    my $xml = `svn info --xml`;

    my $twig = XML::Twig->new;
    $twig->parse( $xml );

    my $root   = $twig->root;
    my $commit = $twig->first_elt( 'commit' );

    my $revision = $commit->att( 'revision' );
    my $author   = $commit->first_child_text( 'author' );
    my $date     = $commit->first_child_text( 'date' );

    print <<"HERE";
    Revision: $revision
    Author: $author
    Date: $date
    HERE
XML::Twig has many methods to work with elements. In the previous example you extracted some information and left the XML data as it was. You can also transform the data so it’s different at the end. If you just want the commit information, you can throw out everything else. Parse the XML data in the same way and get the commit element, but once you have that element, use set_root to make commit the new top-level element:
    use XML::Twig;

    # run from a directory under svn control
    my $xml = `svn info --xml`;

    my $twig = XML::Twig->new( pretty_print => 'nice' );
    $twig->parse( $xml );

    my $root   = $twig->root;
    my $commit = $twig->first_elt( 'commit' );

    $twig->set_root( $commit );

    $twig->print;
At the end of your twig processing, you use the print method to show the new XML structure, which is now just the element that you selected:
    <commit revision="37">
    <author>brian.d.foy</author>
    <date>2010-02-20T00:13:20.439142Z</date>
    </commit>
Now you want to remove that attribute named revision and make it an element instead. You use del_att to remove the attribute, then insert_new_elt to create the new element under commit:
    use XML::Twig;

    my $xml = `svn info --xml`;

    my $twig = XML::Twig->new( pretty_print => 'nice' );
    $twig->parse( $xml );

    my $commit = $twig->first_elt( 'commit' );
    $twig->set_root( $commit );

    my $revision = $commit->att( 'revision' );
    $commit->del_att( 'revision' );
    $commit->insert_new_elt( revision => $revision );

    $twig->print;
Now the output has an extra element for revision:
    <commit>
    <revision>16</revision>
    <author>brian.d.foy</author>
    <date>2010-02-20T00:13:20.439142Z</date>
    </commit>
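If you want to try the attribute-to-element change without a Subversion working copy, a minimal sketch along these lines should do. It assumes a made-up, one-line commit element in place of real svn output. It also passes the optional last_child position to insert_new_elt so the new element lands after the existing children instead of before them. That placement is a choice made for this sketch, not something the example above requires:

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample standing in for the commit element from svn
my $xml = '<commit revision="16"><author>brian.d.foy</author></commit>';

my $twig = XML::Twig->new;
$twig->parse( $xml );

# The sample's top-level element is already commit
my $commit = $twig->root;

# Move the revision attribute into a child element
my $revision = $commit->att( 'revision' );
$commit->del_att( 'revision' );

# The optional first argument picks the position; without it,
# the new element becomes the first child
$commit->insert_new_elt( last_child => 'revision', $revision );

# sprint returns the element as a string instead of printing it
my $result = $commit->sprint;
print "$result\n";
```

The resulting string has the revision element after author, and the commit start tag no longer carries the attribute.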
That's all there is to it. Not only does XML::Twig handle the task correctly, but it's also a lot easier to program and to understand than a regular expression. XML::Twig has many, many more methods that allow you to interact with elements in many different ways to get just the information you need or change the parts that you want.
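As one small illustration of those other methods, the following sketch uses children, tag, and text to collect every child of an element into a hash instead of naming each child up front. The sample XML is made up for the example, and the %field hash is a hypothetical name:

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample standing in for the commit element from svn
my $xml = '<commit revision="37">'
    . '<author>brian.d.foy</author>'
    . '<date>2010-02-20T00:13:20.439142Z</date>'
    . '</commit>';

my $twig = XML::Twig->new;
$twig->parse( $xml );

my $commit = $twig->root;

# Map each child element's tag name to its text content
my %field = map { $_->tag => $_->text } $commit->children;

for my $name ( sort keys %field ) {
    print "$name: $field{$name}\n";
}
```

This works here because every child of the sample commit element is a simple element holding only text. Mixed content would need a condition passed to children to select just the elements you want.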
Thanks. That’s a quick intro to XML::Twig. I would like to link the complete tutorial for more information. http://www.xmltwig.org/xmltwig/tutorial/index.html