Lesson 10: Working with XML
CSV is one file format used for data exchange. XML is another widely used one.
XML is used both for internal data serialization and for data exchange. Many metadata profiles and data formats in the cultural heritage sector are also serialized in XML, e.g. LIDO, MODS, METS, PREMIS, MARCXML, PICAXML, DC, ONIX and so on.
XML is processed a little differently from other data serializations: first the XML serialization itself has to be decoded, and then the format it carries has to be handled according to its metadata standard.
Let's start with this simple record:
<?xml version="1.0" encoding="utf-8"?>
<record>
<title>GRM</title>
<author>Sibille Berg</author>
<datePublished>2019</datePublished>
</record>
Let's open it with the following Flux:
inputFile
| open-file
| as-records
| print
;
See it here in the Playground.
Next let’s decode the file and encode it as YAML.
To decode it as XML, the decode-xml command is used. But the decoder alone is not enough. We additionally need a handler for XML. Handlers are specific helpers that turn the decoded XML into records in a certain way, based on the metadata standard the XML follows.
For now we need the handle-generic-xml function:
inputFile
| open-file
| decode-xml
| handle-generic-xml
| encode-yaml
| print
;
See it here in the Playground.
You see this as result:
---
title:
  value: "GRM"
author:
  value: "Sibille Berg"
datePublished:
  value: "2019"
What is special about the handling is that the content of each XML element is not decoded directly as the value of the field but as a subfield called value. This is because XML elements can have both a value and additional attributes, and to capture both MF introduces subfields for the value and for any attributes.
See:
<title attribute="test">Test value</title>
with the Flux:
inputFile
| open-file
| decode-xml
| handle-generic-xml
| encode-yaml
| print
;
results in:
title:
  attribute: "test"
  value: "Test value"
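The mapping described above can be illustrated outside of Metafacture. The following is a minimal Python sketch, not the actual handle-generic-xml implementation: the function name element_to_fields and its value_key parameter are hypothetical, and repeated child elements are not handled. It only shows why a separate value subfield is needed once attributes and element text share one field.

```python
import xml.etree.ElementTree as ET


def element_to_fields(element, value_key="value"):
    """Map one XML element to a dict of subfields (hypothetical helper)."""
    # Attributes become plain subfields of the element's field.
    fields = dict(element.attrib)
    children = list(element)
    if children:
        # Child elements become nested fields (repeated names would overwrite).
        for child in children:
            fields[child.tag] = element_to_fields(child, value_key)
    elif element.text and element.text.strip():
        # The element text gets its own subfield so it cannot clash with attributes.
        fields[value_key] = element.text.strip()
    return fields


title = ET.fromstring('<title attribute="test">Test value</title>')
result = {title.tag: element_to_fields(title)}
print(result)
```

The printed dictionary has the same shape as the YAML above: one field title with the subfields attribute and value.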
For our example above, to get rid of the value subfields in the YAML, we need to change the hierarchy:
inputFile
| open-file
| decode-xml
| handle-generic-xml
| fix(transformationFile)
| encode-yaml
| print
;
with Fix:
move_field("title.value","@title")
move_field("@title","title")
move_field("author.value","@author")
move_field("@author","author")
move_field("datePublished.value","@datePublished")
move_field("@datePublished","datePublished")
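What these move_field calls achieve can be sketched on a plain dictionary. This is an illustrative Python sketch under the assumption that a record is a nested dict; flatten_value_subfields is a hypothetical helper, not part of Metafacture Fix. It lifts the value subfield one level up whenever it is the only subfield.

```python
def flatten_value_subfields(record, value_key="value"):
    """Replace {"value": x} fields by x; leave all other fields untouched."""
    flat = {}
    for name, field in record.items():
        if isinstance(field, dict) and set(field) == {value_key}:
            # Only the value subfield is present: move it up one level.
            flat[name] = field[value_key]
        else:
            # Fields with attributes or nested elements keep their structure.
            flat[name] = field
    return flat


record = {
    "title": {"value": "GRM"},
    "author": {"value": "Sibille Berg"},
    "datePublished": {"value": "2019"},
}
flat_record = flatten_value_subfields(record)
print(flat_record)
```

Fields that carry attributes next to the value are deliberately left as they are, because flattening them would drop the attributes.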
If the records are encoded as XML without such a Fix, the value subfields are kept as well:
inputFile
| open-file
| decode-xml
| handle-generic-xml
| encode-xml
| print
;
results in:
<?xml version="1.0" encoding="UTF-8"?>
<records>
<record>
<title>
<value>GRM</value>
</title>
<author>
<value>Sibille Berg</value>
</author>
<datePublished>
<value>2019</value>
</datePublished>
</record>
</records>
Keep in mind that XML elements can have both attributes and a value. The encode-xml encoder can produce simple flat XML records, but only if you tell it which subfield holds the element value.
To do so, add the valueTag option when encoding XML: | encode-xml(valueTag="value"). This results in:
<?xml version="1.0" encoding="UTF-8"?>
<records>
<record>
<title>GRM</title>
<author>Sibille Berg</author>
<datePublished>2019</datePublished>
</record>
</records>
If you also want the remaining subfields to be written as attributes, you have to tell MF which fields are attributes by setting an “attribute marker” with the attributeMarker option of handle-generic-xml. Here an @ acts as the attribute marker:
inputFile
| open-file
| decode-xml
| handle-generic-xml(attributeMarker="@")
| encode-xml(attributeMarker="@",valueTag="value")
| print
;
When you encode it as YAML you see the magic behind it:
inputFile
| open-file
| decode-xml
| handle-generic-xml(attributeMarker="@")
| encode-yaml
| print
;
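The encoding direction can be sketched in the same way. The following Python sketch assumes that marked attributes arrive as subfields whose names start with the marker, and that the element text sits in the subfield named by the value tag. The function fields_to_element is hypothetical and only mimics the idea behind encode-xml(attributeMarker="@",valueTag="value"); it is not Metafacture's implementation.

```python
import xml.etree.ElementTree as ET


def fields_to_element(name, fields, attribute_marker="@", value_tag="value"):
    """Build an XML element from a dict of subfields (hypothetical helper)."""
    element = ET.Element(name)
    for key, content in fields.items():
        if key.startswith(attribute_marker):
            # Marked subfields become XML attributes, without the marker.
            element.set(key[len(attribute_marker):], content)
        elif key == value_tag:
            # The value subfield becomes the element text.
            element.text = content
        elif isinstance(content, dict):
            # Nested fields become child elements.
            element.append(fields_to_element(key, content, attribute_marker, value_tag))
        else:
            # Plain subfields become child elements with text content.
            child = ET.SubElement(element, key)
            child.text = content
    return element


title = fields_to_element("title", {"@attribute": "test", "value": "Test value"})
xml_text = ET.tostring(title, encoding="unicode")
print(xml_text)
```

Without the marker, the encoder would have no way to tell whether a subfield is meant to be an attribute or a child element, which is why the same marker is passed to both the handler and the encoder.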
Another important thing when working with XML data sets is to specify the record tag. The default record tag is “record”. But other data sets have different tags to separate records:
"http://www.lido-schema.org/documents/examples/LIDO-v1.1-Example_FMobj00154983-LaPrimavera.xml"
| open-http
| decode-xml
| handle-generic-xml(recordtagname="lido")
| encode-yaml
| print
;
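What the record tag does can be shown with a small Python sketch. The sample document and the helper names local_name and split_records are made up for illustration, and the sketch assumes that records are matched by element name without regard to the namespace. It is not how Metafacture is implemented.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def split_records(xml_text, record_tag_name="record"):
    """Return every element whose local name equals the record tag."""
    root = ET.fromstring(xml_text)
    return [el for el in root.iter() if local_name(el.tag) == record_tag_name]


# Hypothetical sample data, loosely modelled on a LIDO wrapper with two records.
sample = """<lidoWrap xmlns:lido="http://www.lido-schema.org">
  <lido:lido><lido:lidoRecID>1</lido:lidoRecID></lido:lido>
  <lido:lido><lido:lidoRecID>2</lido:lidoRecID></lido:lido>
</lidoWrap>"""

print(len(split_records(sample, "lido")))  # records found with the matching tag
print(len(split_records(sample)))          # default tag "record" finds none here
```

With the default tag the sample yields no records at all, which is the situation you run into when the record tag is not set for a data set like LIDO.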
Bonus: Working with namespaces
XML elements often come with namespaces. By default namespaces are not emitted, only the element names are provided. When elements have the same name but belong to different namespaces, or when you want to emit the incoming namespaces, you can use the option emitnamespace="true" for the handle-generic-xml command.
Add this option to the previous example and see that there are elements belonging to lido as well as skos.
"http://www.lido-schema.org/documents/examples/LIDO-v1.1-Example_FMobj00154983-LaPrimavera.xml"
| open-http
| decode-xml
| handle-generic-xml(recordtagname="lido", emitnamespace="true")
| encode-yaml
| print
;
See this in the Playground here.
Metafacture does not add namespace definitions to the output by itself. You have to declare them when using encode-xml, either through a file with the option namespacefile or directly in the Flux with the option namespaces, where multiple namespaces are separated by \n.
Here is an example of adding namespaces in the Flux:
inputFile
| open-file
| as-lines
| decode-formeta
| fix(transformationFile)
| encode-xml(rootTag="collection",namespaces="__default=http://www.w3.org/TR/html4/\ndcterms=http://purl.org/dc/terms/\nschema=http://schema.org/")
| print
;
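The format of the namespaces value can be made explicit with a short Python sketch. It assumes that each line has the form prefix=URI and that the prefix __default stands for the default namespace. Both are assumptions read off the example above, and parse_namespaces and xmlns_attributes are hypothetical helpers, not Metafacture functions.

```python
def parse_namespaces(option_value):
    """Split a 'prefix=uri' list separated by newlines into a dict."""
    namespaces = {}
    for line in option_value.split("\n"):
        line = line.strip()
        if not line:
            continue
        # Split at the first '=' only, because URIs may contain '=' themselves.
        prefix, uri = line.split("=", 1)
        namespaces[prefix.strip()] = uri.strip()
    return namespaces


def xmlns_attributes(namespaces, default_prefix="__default"):
    """Render the namespace declarations as they could appear on the root tag."""
    parts = []
    for prefix, uri in namespaces.items():
        name = "xmlns" if prefix == default_prefix else f"xmlns:{prefix}"
        parts.append(f'{name}="{uri}"')
    return " ".join(parts)


option = (
    "__default=http://www.w3.org/TR/html4/\n"
    "dcterms=http://purl.org/dc/terms/\n"
    "schema=http://schema.org/"
)
parsed = parse_namespaces(option)
print(xmlns_attributes(parsed))
```

The sketch only covers the declaration side. Whether an element is actually written with a prefix such as dcterms depends on the field names produced by the Fix.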
TODO: Add exercises.
Next lesson: 11 Mapping Marc to Dublin Core