Format xml with XML::LibXML

2022-12-09 perl XML::LibXML format XPath

Here are few notes for formatting XML files using XML::LibXML. Simple example for the different formatting styles using toString serializing function

use v5.16;
use XML::LibXML;

my $data = '<tool><name>TPP1</name><version><major>1</major><minor>0</minor><revision>0</revision></version><installer>TPP1_setup.exe</installer><support_files/></tool>';
my $xml = XML::LibXML->new->load_xml(string => $data);

say $xml->toString(0);

If format is 0 the document is dumped as it was originally parsed

<?xml version="1.0"?>
<tool><name>TPP1</name><version><major>1</major><minor>0</minor><revision>0</revision></version><installer>TPP1_setup.exe</installer><support_files/></tool>

If format is 1, libxml2 will add ignorable white spaces, so the nodes content is easier to read. Existing text nodes will not be altered

say $xml->toString(1);

Generates

<?xml version="1.0"?>
<tool>
  <name>TPP1</name>
  <version>
    <major>1</major>
    <minor>0</minor>
    <revision>0</revision>
  </version>
  <installer>TPP1_setup.exe</installer>
  <support_files/>
</tool>

But this still does not provide optimal results if the XML contains some whitespace in the nodes (if it is not significant, we would like to have it removed). When the whitespace is only between the nodes, it helps to specify no_blanks parser option to load_xml method.

use v5.16;
use XML::LibXML;

my $data = <<END_XML;
                                    <nested_nodes>
                                        <nested_node>
                                        <configuration>A</configuration>
                                        <model>45</model>
                                        <added_node>
        <ID>
            <type>D</type>
            <serial>3</serial>
            <kVal>3</kVal>
        </ID>
    </added_node>
</nested_node>
                                    </nested_nodes>
END_XML

say XML::LibXML->load_xml(string => $data)->toString(1);

Prints

<?xml version="1.0"?>
<nested_nodes>
                                        <nested_node>
                                        <configuration>A</configuration>
                                        <model>45</model>
                                        <added_node>
        <ID>
            <type>D</type>
            <serial>3</serial>
            <kVal>3</kVal>
        </ID>
    </added_node>
</nested_node>
                                    </nested_nodes>

Where with loading like this

say XML::LibXML->load_xml(string => $data, { no_blanks => 1 })->toString(1);

Prints

<?xml version="1.0"?>
<nested_nodes>
  <nested_node>
    <configuration>A</configuration>
    <model>45</model>
    <added_node>
      <ID>
        <type>D</type>
        <serial>3</serial>
        <kVal>3</kVal>
      </ID>
    </added_node>
  </nested_node>
</nested_nodes>

Last instance when the whitespace is combined with actual values and it is not significant when adjacent to tags, it is possible to trim it like this

use v5.16;
use XML::LibXML;
use Text::Trim qw(trim);

my $root = XML::LibXML->load_xml(location => 'input.xml', { no_blanks => 1 })->documentElement;
for my $node ($root->findnodes('//text()')) {
    $node->setData(trim($node->getValue()));
}

say $root->toString(1);

That’s about it for the formatting. I like to keep my XMLs clean.