In the previous lessons we learned how to construct a Metafacture workflow, how to use the Playground, and how Metafacture Flux and Fix can be used to parse structured information. We saw how Flux can transform JSON into YAML, which is easier to read and contains the same information. We also learned how to retrieve information from the JSON file using a Fix function like retain("title", "publish_date", "notes.value", "type.key").
In this lesson we will go deeper into Metafacture Fix and describe how to pluck data out of structured information.
First, let’s fetch a new book with the Metafacture Playground:
"https://openlibrary.org/books/OL27333998M.json"
| open-http
| as-lines
| decode-json
| encode-yaml
| print
;
The printed output in YAML format contains a collection of key:value pairs.
There are top-level fields like title or publish_date which contain only text values or numbers:
---
title: "Bullshit Jobs"
There are also fields like created that contain a nested structure, where the value is itself a set of key:value pairs:
---
created:
  type: "/type/datetime"
  value: "2019-10-04T04:03:07.194846"
And fields like source_records[] have a list as their value:
---
source_records:
- "amazon:1501143336"
- "bwb:9781501143335"
- "marc:marc_columbia/Columbia-extract-20221130-034.mrc:71583959:3725"
- "promise:bwb_daily_pallets_2023-05-10:W8-BRV-242"
Metafacture Fix uses FixPath to access and selectively extract data from (semi-)structured documents. FixPath is similar to JSONPath but not identical: it also uses dot notation, but the path structure for arrays and repeated fields differs.
You can point to every part of the YAML file using a dot-notation. For simple top level fields the path is just the name of the field:
title
publish_date
notes
latest_revision
For nested objects with deeper structure you add a dot . to point to the subfields:
type.key
created.type
last_modified.value
If you had a deeply nested structure like this object:
x:
  y:
    z:
      a:
        b:
          c: Hello :-)
Then you would point to the c field with this path: x.y.z.a.b.c.
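To make the idea of following a dot path concrete, here is a minimal sketch in Python. It is only a model of the rule described above, not Metafacture code; the function name resolve is made up for this illustration, and the nested object is represented as plain Python dictionaries.

```python
# Illustrative model of dot-notation lookup; this is not Metafacture code.
def resolve(record, path):
    """Follow a dot-separated path through nested dictionaries."""
    current = record
    for key in path.split("."):
        current = current[key]  # raises KeyError if a segment does not exist
    return current


# The deeply nested example object from the text, as Python dictionaries.
record = {"x": {"y": {"z": {"a": {"b": {"c": "Hello :-)"}}}}}}

print(resolve(record, "x.y.z.a.b.c"))  # prints: Hello :-)
```

Each path segment simply descends one level, which is why the path reads left to right from the top-level field down to the value.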
So let's do some simple exercises:
There are two extra path structures that need to be explained:
For both repeated fields and arrays you need to use an index to select an element.
For YAML and JSON arrays specifically you also need to use an array marker, which is generated by the YAML and JSON decoders and interpreted by the YAML and JSON encoders.
In a data set an element sometimes can have multiple instances:
creator: Justus
creator: Peter
creator: Bob
To point to one of the creator elements you need to use an index. The first index has the value 1, the second the value 2, the third the value 3. So the path of the creator Bob would be creator.3. (In contrast, Catmandu uses a zero-based index, starting with 0 as the first index.)
If you want to refer to all creators, you can use the * sign as a wildcard: creator.* refers to all creator elements. The first instance can be selected with the $first wildcard and the last with $last. This is especially handy if you do not know how often an element is repeated. When adding an additional repeated element you can use the $append or $prepend wildcards.
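The index and wildcard rules can be modelled in a few lines of Python. This is a sketch under assumptions, not Metafacture's implementation: the repeated creator field is represented as a Python list, and the helper names select and add are hypothetical.

```python
# Illustrative model only: a repeated field is represented as a Python list.
def select(values, selector):
    """Return the elements that an index or wildcard would address."""
    if selector == "*":
        return list(values)            # all instances
    if selector == "$first":
        return values[:1]              # first instance, if any
    if selector == "$last":
        return values[-1:]             # last instance, if any
    index = int(selector)              # indices start at 1, not 0
    if index < 1 or index > len(values):
        return []
    return [values[index - 1]]


def add(values, position, new_value):
    """Return a new list with new_value added at the end or the front."""
    if position == "$append":
        return values + [new_value]
    if position == "$prepend":
        return [new_value] + values
    raise ValueError("expected $append or $prepend")


creators = ["Justus", "Peter", "Bob"]

print(select(creators, "3"))       # creator.3      -> ['Bob']
print(select(creators, "$first"))  # creator.$first -> ['Justus']
print(select(creators, "$last"))   # creator.$last  -> ['Bob']
print(select(creators, "*"))       # creator.*      -> all three creators
print(add(creators, "$append", "Morton"))  # new element goes to the end
```

Note how creator.3 maps to Python position 2: the model subtracts one because the paths in this lesson count from 1.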
Hint: Sometimes a repeatable field appears only once or not at all. If a record provides the element only once, Metafacture (like Catmandu) interprets the single occurrence not as a list but as a simple field or an object. You have to adjust your transformation to handle both scenarios. One way to deal with this is the list bind, which is agnostic to how often an element is provided. The list bind will be introduced in the next lesson (05).
In JSON or YAML, element repetition is possible but unusual. Instead of repeating elements, an element can have a list or array of values.
In our book example we have an array as value:
source_records:
- "amazon:1501143336"
- "bwb:9781501143335"
- "marc:marc_columbia/Columbia-extract-20221130-034.mrc:71583959:3725"
- "promise:bwb_daily_pallets_2023-05-10:W8-BRV-242"
Our example from above would look like this if creator was a list instead of a repeated field:
creator:
- Justus
- Peter
- Bob
Lists can be deeply nested if the values are not just strings (list of strings) but objects (list of objects):
characters:
- name: Justus
  role: Investigator
- name: Peter
  role: Investigator
- name: Bob
  role: Research & Archive
Here is another example:
my:
  colors:
  - black
  - red
  - yellow
To point to one of the colors you need to use an index but also an array marker: [].
So the path of red would be my.colors[].2
And the path for Peter would be characters[].2.name
Also, if you want to generate an array in the target format JSON or YAML, you need to add [] at the end of the name of a list element, like newArray[].
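The two array paths above can be traced with a small Python model. This is a sketch, not Metafacture's implementation: the function name resolve is hypothetical, and the model assumes plain Python data in which field names carry no marker, so it strips the [] from a path segment before looking the field up.

```python
# Illustrative model of paths with an array marker and 1-based indices.
def resolve(record, path):
    """Resolve paths such as my.colors[].2 against nested dicts and lists."""
    current = record
    for segment in path.split("."):
        if segment.isdigit():
            index = int(segment)
            if index < 1:
                raise IndexError("indices start at 1")
            current = current[index - 1]      # convert to Python's 0-based index
        elif segment.endswith("[]"):
            current = current[segment[:-2]]   # drop the [] marker for the lookup
            if not isinstance(current, list):
                raise TypeError(f"{segment} does not point to a list")
        else:
            current = current[segment]
    return current


record = {
    "my": {"colors": ["black", "red", "yellow"]},
    "characters": [
        {"name": "Justus", "role": "Investigator"},
        {"name": "Peter", "role": "Investigator"},
        {"name": "Bob", "role": "Research & Archive"},
    ],
}

print(resolve(record, "my.colors[].2"))        # prints: red
print(resolve(record, "characters[].2.name"))  # prints: Peter
```

The marker tells the reader (and, in this model, the type check) that the field holds a list, while the number after it picks one element of that list.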
In this post we learned the FixPath syntax and how it can be used to point to parts of a YAML or JSON data set we want to manipulate.
Especially when working with complex bibliographic data, you need to get to know the paths so that you do not have to guess the path to a certain element.
There are multiple ways to find out the path names of records. Two examples:
1) Here is a way to show paths in combination with values.
2) Here is a way to collect and count all paths in all records by using the list-fix-paths command.
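To illustrate what "paths in combination with values" and "counting paths" mean, here is a rough Python model. It does not show how list-fix-paths is implemented and does not reproduce its output format. The function name flatten is hypothetical, the first sample record is a shortened copy of the book example, and the second record is invented for this sketch. The model assumes each record is a dictionary and does not handle lists nested directly inside lists.

```python
from collections import Counter


# Illustrative model: list every path of a record together with its value.
def flatten(value, prefix=""):
    """Yield (path, value) pairs using 1-based indices and the [] marker."""
    if isinstance(value, dict):
        for key, child in value.items():
            name = key + "[]" if isinstance(child, list) else key
            yield from flatten(child, f"{prefix}.{name}" if prefix else name)
    elif isinstance(value, list):
        for position, child in enumerate(value, start=1):
            yield from flatten(child, f"{prefix}.{position}")
    else:
        yield prefix, value


records = [
    {   # shortened copy of the book example
        "title": "Bullshit Jobs",
        "created": {"type": "/type/datetime", "value": "2019-10-04T04:03:07.194846"},
        "source_records": ["amazon:1501143336", "bwb:9781501143335"],
    },
    {   # invented second record, only here to make the counts differ
        "title": "Example",
        "source_records": ["example:1"],
    },
]

# 1) Paths in combination with values, for the first record.
for path, value in flatten(records[0]):
    print(f"{path}\t{value}")

# 2) Count how often each path occurs across all records.
counts = Counter(path for record in records for path, _ in flatten(record))
for path, count in counts.most_common():
    print(count, path)
```

In this model title is counted twice, because both records have it, while created.type is counted once.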
<title>This is the title</title>
The path for the value This is the title is not title but title.value
XML documents do not consist only of simple elements with key:value pairs or objects with subfields; each element can also have additional attributes. In Metafacture the XML decoder (decode-xml with handle-generic-xml) groups the attributes and values as subfields of an object.
<title type="mainTitle" lang="eng">This is the title</title>
The paths for the different attributes and the element value are the following:
title.value
title.type
title.lang
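The grouping of attributes and text under one object can be pictured with a short Python sketch based on the standard library's xml.etree.ElementTree. It only mimics the resulting structure described above for a single element without child elements; it is not the Metafacture decoder, and the function name element_to_record is hypothetical.

```python
import xml.etree.ElementTree as ET


# Illustrative model: turn one XML element into an object with subfields.
def element_to_record(xml_text):
    """Group the attributes and the text of a single element as subfields."""
    element = ET.fromstring(xml_text)
    fields = dict(element.attrib)    # each attribute becomes a subfield
    fields["value"] = element.text   # the element text is stored under "value"
    return {element.tag: fields}


record = element_to_record(
    '<title type="mainTitle" lang="eng">This is the title</title>'
)

print(record["title"]["value"])  # title.value -> This is the title
print(record["title"]["type"])   # title.type  -> mainTitle
print(record["title"]["lang"])   # title.lang  -> eng
```

Seen this way, title alone points to the whole object, which is why the text itself has to be addressed as title.value.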
If you want to create XML with attributes, you need to map to this structure too. We will come back to working with XML in lesson 10.
Next lesson: 05 More Fix Concepts