Get document category
2023-09-14 C# Word xml System.IO.Packaging System.Xml.XDocument EnumerateFiles GetPart Package
At work we use Word documents with some helpers to define requirements, so there are many such files. The documents come in different kinds (high-level requirements, low-level requirements, tests, etc.). The kind is stored in the document properties, namely in the category field.
I needed to extract this data from all the documents so that I could analyze it. I put together a simple .NET 7.0 project with the following NuGet packages:
System.IO.Packaging
System.Xml.XDocument
The code is then very straightforward: enumerate all *.docx files in an input directory and extract the category property from each. A .docx file is a ZIP-based package, and the first NuGet package, System.IO.Packaging, makes it easy to open that package and get one of its parts. The part holding the document properties (/docProps/core.xml) is just an XML file, so an XML parser gets us to the actual properties.
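using System;
using System.Diagnostics;
using System.IO;
using System.IO.Packaging;
using System.Xml.Linq;

string inputDirectory = @".";
foreach (string file in Directory.EnumerateFiles(inputDirectory, "*.docx", SearchOption.AllDirectories))
{
    try
    {
        // Open read-only (nothing is modified) and dispose the package so the file handle is released.
        using var package = Package.Open(file, FileMode.Open, FileAccess.Read);
        var corePart = package.GetPart(new Uri("/docProps/core.xml", UriKind.Relative));
        XDocument settings;
        using (TextReader tr = new StreamReader(corePart.GetStream(FileMode.Open, FileAccess.Read)))
            settings = XDocument.Load(tr);
        // The category element lives in the core-properties namespace.
        XNamespace cp = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
        // Fall back to a marker when the document has no category element.
        string category = settings.Root?.Element(cp + "category")?.Value ?? "<NONE>";
        Debug.WriteLine($"{file} - {category}");
    }
    catch (Exception e)
    {
        // A missing core.xml part or a corrupt file ends up here; continue with the next file.
        Debug.WriteLine($"Failed processing {file}: {e}");
    }
}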
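To try the namespace lookup without any Word files at hand, here is a minimal sketch that parses a hand-written fragment shaped like core.xml and then counts documents per category, as a first step towards the analysis. The sample XML, the ReadCategory helper and the category value are made up for illustration; a real core.xml carries more properties than shown here.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical, trimmed-down stand-in for the contents of /docProps/core.xml.
const string sampleCoreXml = @"
<cp:coreProperties
    xmlns:cp='http://schemas.openxmlformats.org/package/2006/metadata/core-properties'
    xmlns:dc='http://purl.org/dc/elements/1.1/'>
  <dc:title>Example document</dc:title>
  <cp:category>Low-level requirements</cp:category>
</cp:coreProperties>";

// Same shape, but without a category element.
const string sampleWithoutCategory = @"
<cp:coreProperties
    xmlns:cp='http://schemas.openxmlformats.org/package/2006/metadata/core-properties' />";

// Illustrative helper (not part of the original code): returns the category,
// or "<NONE>" when the element is missing.
static string ReadCategory(string coreXml)
{
    XNamespace cp = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
    XDocument doc = XDocument.Parse(coreXml);
    return doc.Root?.Element(cp + "category")?.Value ?? "<NONE>";
}

// Collect the categories and print how many documents fall into each one.
var categories = new[]
{
    ReadCategory(sampleCoreXml),
    ReadCategory(sampleWithoutCategory),
};

foreach (var group in categories.GroupBy(c => c).OrderByDescending(g => g.Count()))
    Console.WriteLine($"{group.Key}: {group.Count()}");
```

Note that the element name alone is not enough: looking up "category" without the cp namespace finds nothing, which is why the XNamespace is combined with the local name.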