Open Packaging Convention (OPC)

2024-06-13 perl opc sldd Archive::Zip XML::LibXML part .rels XPath

I recently needed to work with Matlab Simulink Data Dictionary (SLDD) files. Playing a bit with the format, I found it is based on Open Packaging Convention, which is specified as part of ECMA-376 standard.

The thing is just a zip with pre-defined structure. Main portion is a file with list of relationships in _rels/.rels file. It is a XML that looks like this:

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Target="data/chunk0.xml"                          Type="http://schemas.mathworks.com/simulink/2010/relationships/dictionaryChunk"/>
  <Relationship Id="rId2" Target="metadata/mwcoreProperties.xml"            Type="http://schemas.mathworks.com/package/2012/relationships/coreProperties"/>
  <Relationship Id="rId3" Target="metadata/mwcorePropertiesExtension.xml"   Type="http://schemas.mathworks.com/package/2014/relationships/corePropertiesExtension"/>
  <Relationship Id="rId4" Target="metadata/mwcorePropertiesReleaseInfo.xml" Type="http://schemas.mathworks.com/package/2019/relationships/corePropertiesReleaseInfo"/>
  <Relationship Id="rId5" Target="metadata/coreProperties.xml"              Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"/>
</Relationships>

The Relationship items then refer to other parts of the package via Target attribute. I noticed that I worked with the format using C# before as covered in Get document category post, where I extracted some data from Word .docx files.

This time I needed a solution in perl. Searching on CPAN did not give me anything that I would like, so I ended up creating my own solution, called Parser::OPC.

First, lets build simple storage for a part

package Parser::OPC::Part;
use Moo;
use Types::Standard qw(Int Str Enum ArrayRef InstanceOf);
use Function::Parameters;

has type   => (is => 'ro');
has id     => (is => 'ro');
has target => (is => 'ro');
has source => (is => 'ro', isa => InstanceOf['Parser::OPC']);

method contents() {
    return scalar $self->source->zip->memberNamed($self->target)->contents();
}

method contents_xml() {
    return XML::LibXML->load_xml(string => $self->contents());
}

1;

It has storage for all attributes of the Relationship, and one reference back to original package, so we can access the part data and parse them. Then something to represent the OPC file itself

package Parser::OPC;
use Moo;
use Function::Parameters;
use Types::Standard qw(ArrayRef InstanceOf);
use Archive::Zip qw(:ERROR_CODES :CONSTANTS);
use XML::LibXML;

use Parser::OPC::Part;

my $RELATIONSHIPS_NS = 'http://schemas.openxmlformats.org/package/2006/relationships';

has filename => (is => 'ro', required => 1);
has zip      => (is => 'lazy', init_arg => undef);
has parts    => (is => 'lazy', isa => ArrayRef [InstanceOf ['Parser::OPC::Part']], init_arg => undef);

method _build_zip() {
    my $zip = Archive::Zip->new();
    unless ($zip->read($self->filename) == AZ_OK) {
        die "Unable to read zip file " . $self->file . ": " . Archive::Zip::Error::zipErrorString();
    }
    return $zip;
}

method _build_parts() {
    my $rels = $self->zip->memberNamed('_rels/.rels');
    my $xml = XML::LibXML->load_xml(string => scalar $rels->contents());
    my $xpc = XML::LibXML::XPathContext->new($xml);
    $xpc->registerNs(r => $RELATIONSHIPS_NS);
    return [
        map { 
            Parser::OPC::Part->new(
                type   => $_->getAttribute('Type'),
                id     => $_->getAttribute('Id'),
                target => $_->getAttribute('Target'),
                source => $self,
            ) 
        } @{ $xpc->findnodes('//r:Relationship') }
    ];
}

sub _options_match {
    my ($elem, $options) = @_;

    my $count = 0;
    for my $k (keys %$options) {
        $count++ if ref($options->{$k}) eq "Regexp"
            ? $elem->$k() =~ /$options->{$k}/
            : $elem->$k() eq $options->{$k};
    }
    return $count == scalar keys %$options;
}

method find_parts(%patterns) {
    return grep { _options_match($_, \%patterns) } @{ $self->parts };
}

1;

This is rather regular stuff. The filename property is mandatory and is supposed to be specified during the initialization. When the data are accessed via zip or parts properties, the operations are done to get the data. I don’t want the properties to be supplied by constructor, so I prevented it with the init_arg declaration.

The find_parts method allows me to easily filter parts based on exact matches or regex matches. In the SLDD case, we are looking for a part with type http://schemas.mathworks.com/simulink/2010/relationships/dictionaryChunk. Its target is a XML that looks like this

<DataSource FormatVersion="1" MinRelease="R2014a" Arch="win64">
    <Object Class="DD.ENTRY">
        <P Name="Name" Class="char">items</P>
        ...
        <P Name="Value">
            <Element Class="Simulink.Parameter">
                <P Name="Value" Class="single">-0.003634</P>
                ...
                <P Name="DataType" Class="char">single</P>
                ...
            </Element>
        </P>
    </Object>
    ...
</DataSource>

Using the module to extract some of the element values above would look like this

use Parser::OPC;

my $sldd = Parser::OPC->new(filename => 'data.sldd');
for my $part ($sldd->find_parts(type => qr/dictionaryChunk/)) {
    my $dd = $part->contents_xml;
    for my $entry ($dd->findnodes('//Object[@Class="DD.ENTRY"]')) {
        say $entry->findvalue('P[@Name="Name"]');
        say $entry->findvalue('P[@Name="Value"]/Element/P[@Name="Value"]');
        say $entry->findvalue('P[@Name="Value"]/Element/P[@Name="DataType"]');
    }
}

It was rather easy to build quite decent parser for this kind of files and extract the data from it. I am quite happy with how the code turned up and how it can be used.