metafacture-tutorial

Lesson 4: FixPath and more complex transformations in Fix

Over the last lessons we learned how to construct a Metafacture workflow, how to use the Playground, and how Metafacture Flux and Fix can be used to parse structured information. We saw how you can use Flux to transform the JSON format into the YAML format, which is easier to read and contains the same information. We also learned how to retrieve information from the JSON file using a Fix function like retain("title", "publish_date", "notes.value", "type.key").

In this lesson we will go deeper into Metafacture Fix and describe how to pluck data out of structured information.

First, let’s fetch a new book with the Metafacture Playground:

"https://openlibrary.org/books/OL27333998M.json"
| open-http
| as-lines
| decode-json
| encode-yaml
| print
;

The printed output in YAML format contains a collection of key:value pairs.

There are top-level fields like title or publish_date that contain only text values or numbers:

---
title: "Bullshit Jobs"

There are also fields like created that contain a nested structure where the value is again a key:value pair:

---
created:
  type: "/type/datetime"
  value: "2019-10-04T04:03:07.194846"

And fields like source_records[] have a list as their value:

---
source_records:
- "amazon:1501143336"
- "bwb:9781501143335"
- "marc:marc_columbia/Columbia-extract-20221130-034.mrc:71583959:3725"
- "promise:bwb_daily_pallets_2023-05-10:W8-BRV-242"

Metafacture Fix uses FixPath to access and selectively extract data from (semi-)structured documents. FixPath is similar to JSONPath but not identical. It also uses dot notation, but it differs in how paths address arrays and repeated fields.

You can point to every part of the YAML file using dot notation. For simple top-level fields the path is just the name of the field, for example title or publish_date.

For nested objects with deeper structure you add a dot . to point to the subfields, for example created.type or created.value.

If you had a deeply nested structure like this object:

x:
  y:
    z:
      a:
        b:
          c: Hello :-)

Then you would point to the c field with this path: x.y.z.a.b.c.
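The dot notation above can be used directly inside Fix functions. A minimal sketch, assuming the Fix function copy_field and a hypothetical target field named greeting (neither appears in the example record itself):

```perl
# Copy the deeply nested value to a new top-level field.
# "greeting" is a hypothetical field name chosen for this sketch.
copy_field("x.y.z.a.b.c", "greeting")

# Keep only the new field and drop the rest of the record.
retain("greeting")
```

The same pattern should work for the book record, for example with created.value as the source path.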

So let's do some simple exercises:

Try to complete the Fix functions. Transform the element a into title and combine the subfields of b and c into the element author.

Answer See here

Repeated fields and arrays

There are two extra path structures that need to be explained: repeated fields and arrays.

For both repeated fields and arrays you need to use an index to select an element.

For YAML and JSON arrays specifically, you also need to use an array marker ([]). The YAML and JSON decoders generate these markers, and the YAML and JSON encoders interpret them.

Working with repeated fields

In a data set, an element can sometimes have multiple instances:

creator: Justus
creator: Peter
creator: Bob

To point to one of the creator elements you need to use an index. The first index has the value 1, the second the value 2, and the third the value 3. So the path of the creator Bob would be creator.3. (In contrast, Catmandu uses a zero-based index, with 0 as the first index.)

If you want to refer to all creators, you can use the * sign as a wildcard: creator.* refers to all creator elements. The first instance can be selected with the $first wildcard and the last with $last. This is especially handy if you do not know how often an element is repeated. When adding an additional repeated element you can use the $append or $prepend wildcards.
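The wildcards can be tried out on the three creator elements. This is a sketch only: it assumes the Fix functions upcase and add_field accept these wildcard paths, and the value "Morton" is a made-up fourth creator used purely for illustration.

```perl
# Upper-case only the first and the last creator,
# without knowing how many creators the record has.
upcase("creator.$first")
upcase("creator.$last")

# Add one more creator element after the existing ones.
# "Morton" is a hypothetical value for this sketch.
add_field("creator.$append", "Morton")
```

Using $first and $last instead of fixed indexes keeps the transformation valid when a record has more or fewer than three creators.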

Append the correct last name to the three investigators: Justus Jonas, Peter Shaw and Bob Andrews. Also prepend “Investigator” to all of them.

Answer

FIX:

```perl
append("creator.1"," Jonas")
append("creator.2"," Shaw")
append("creator.3"," Andrews")
prepend("creator.*","Investigator ")
```

See here

Hint: Sometimes a repeatable field appears only once or not at all. If a record provides the element only once, Metafacture (like Catmandu) interprets the single occurrence not as a list but as a simple field or an object. Your transformation has to handle both scenarios. One way to deal with this is the list bind, which is agnostic to how often an element is provided. The list bind will be introduced in the next lesson, 05.

Working with JSON and YAML arrays

In JSON or YAML, element repetition is possible but unusual. Instead of repeating elements, an element can have a list or array of values.

In our book example we have an array as value:

source_records:
  - "amazon:1501143336"
  - "bwb:9781501143335"
  - "marc:marc_columbia/Columbia-extract-20221130-034.mrc:71583959:3725"
  - "promise:bwb_daily_pallets_2023-05-10:W8-BRV-242"

Our example from above would look like this if creator was a list instead of a repeated field:

creator:
  - Justus
  - Peter
  - Bob

Lists can be deeply nested if the values are not just strings (list of strings) but objects (list of objects):

characters:
  - name: Justus
    role: Investigator
  - name: Peter
    role: Investigator
  - name: Bob
    role: Research & Archive

Here is another example:

my:
  colors:
    - black
    - red
    - yellow

To point to one of the colors you need to use an index but also an array marker: [].

So the path to red would be: my.colors[].2

And the path for Peter would be characters[].2.name

Also, if you want to generate an array in the target format JSON or YAML, you need to add [] at the end of a list element, like newArray[].
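The array paths above can be sketched in Fix as follows. The functions upcase, add_field and set_array are assumed here from the Metafacture Fix function set (newer releases may prefer add_array for creating arrays), and the values "green", "a" and "b" are made up for illustration.

```perl
# Upper-case the second colour; the [] marker comes before the index.
upcase("my.colors[].2")

# Upper-case Peter's name inside the list of objects.
upcase("characters[].2.name")

# Add a hypothetical fourth colour at the end of the existing array.
add_field("my.colors[].$append", "green")

# Create a new array in the output; the trailing [] marks it as an array.
set_array("newArray[]", "a", "b")
```

Without the [] marker, an encoder has no way to tell that newArray should be written as a JSON or YAML list rather than as repeated fields.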

Exercise:

Only retain the title and the series plus the name and role of Bob Andrews. You have to identify the paths for said elements.

Answer See here

Again, append the last names to the specific characters: Justus Jonas, Peter Shaw and Bob Andrews. Also add the field "type": "Person" to each character.

Answer See here

In this lesson we learned the FixPath syntax and how it can be used to point to the parts of a YAML or JSON data set that we want to manipulate.

Especially when working with complex bibliographic data, you have to get to know the paths so that you do not have to guess the path to a certain element.

There are multiple ways to find out the path names of records. Two examples:

1) Here is a way to show paths in combination with their values.

2) Here is a way to collect and count all paths in all records by using the list-fix-paths command.
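As a rough sketch of the second approach, the book workflow from the beginning of this lesson could be adapted like this. It assumes that list-fix-paths can directly follow decode-json and that its output can be sent to print; the exact output format may depend on the Metafacture version.

```text
"https://openlibrary.org/books/OL27333998M.json"
| open-http
| as-lines
| decode-json
| list-fix-paths
| print
;
```

Running this in the Playground should list the FixPaths found in the record, which you can then copy into your Fix functions.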

Bonus: XML in Metafacture and its paths

<title>This is the title</title>

The path for the value This is the title is not title but title.value

XML documents do not just consist of simple elements with key-value pairs or objects with subfields; each element can also have attributes. In Metafacture, the XML decoder (decode-xml with handle-generic-xml) groups the attributes and the value as subfields of an object.

<title type="mainTitle" lang="eng">This is the title</title>

The paths for the different attributes and the element value are the following:

title.value
title.type
title.lang
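Once the XML has been decoded, these paths can be used like any other FixPath. A minimal sketch, assuming the Fix function copy_field and the hypothetical target field names mainTitle and language:

```perl
# Copy the element text to a flat field.
# "mainTitle" is a hypothetical field name chosen for this sketch.
copy_field("title.value", "mainTitle")

# Copy the lang attribute to its own field.
# "language" is likewise a hypothetical field name.
copy_field("title.lang", "language")
```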

If you want to create XML with attributes, then you need to map to this structure too. We will come back to working with XML in lesson 10.


Next lesson: 05 More Fix Concepts