In the last sessions we learned how to construct a Metafacture workflow, how to use the Playground, and how Metafacture Fix can be used to parse structured information. Today we will go deeper into Metafacture Fix and describe how to pluck data out of structured information.
Today we will fetch a new weather report with the Metafacture Playground:
"https://fcc-weather-api.glitch.me/api/current?lat=50.93414&lon=6.93147"
| open-http
| as-lines
| decode-json
| encode-yaml
| print
;
In the previous post we also saw how Metafacture can transform JSON into YAML, which is easier to read and contains the same information.
We also learned some fixes to retrieve information out of the JSON file, like retain("name","main.temp").
In this post we look more closely at how to point to fields in a JSON or YAML file:
---
coord:
  lon: "6.9315"
  lat: "50.9341"
weather:
  - id: "800"
    main: "Clear"
    description: "clear sky"
    icon: "https://cdn.glitch.com/6e8889e5-7a72-48f0-a061-863548450de5%2F01d.png?1499366022009"
base: "stations"
main:
  temp: "12.56"
  feels_like: "11.93"
  temp_min: "11.62"
  temp_max: "14.68"
  pressure: "1022"
  humidity: "79"
visibility: "10000"
wind:
  speed: "1.03"
  deg: "0"
clouds:
  all: "0"
dt: "1654153727"
sys:
  type: "2"
  id: "43069"
  country: "DE"
  sunrise: "1654140191"
  sunset: "1654198658"
timezone: "7200"
id: "2886242"
name: "Cologne"
cod: "200"
main.temp is called a path. It is JSON Path-like and points to the part of the data set (here our YAML record) you are interested in. The data, as shown above, is structured like a tree.
There are top-level simple fields like base, cod, dt and id, which contain only text values or numbers. Depending on the context, simple fields can also be called elements, properties, attributes or keys.
There are also fields like coord that contain a deeper structure, here lat and lon. Nested elements that contain one or more subfields or subelements are also called objects or hashes.
Metafacture Fix uses Fix Path, a path syntax that is similar to JSON Path but not identical. It also uses dot notation, but the path structure differs for arrays and repeated fields, especially when working with JSON or YAML.
Using a path you can point to every part of the JSON file with dot notation. For simple top-level fields the path is just the name of the field:
base
cod
dt
id
name
For nested objects with a deeper structure you add a dot (.) to point to the subfields:
clouds.all
coord.lat
coord.lon
main.temp
For example, if you had a deeply nested structure like this object:
x:
  y:
    z:
      a:
        b:
          c: Hello :-)
Then you would point to the c field with the path x.y.z.a.b.c.
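To make the dot notation concrete, here is a minimal Python sketch (not Metafacture code) that follows such a path through nested objects. The helper name `get_path` is made up for this illustration, and it only covers plain nested objects, not arrays or repeated fields.

```python
def get_path(record, path):
    """Follow a dot-separated path through nested dicts and return the value."""
    current = record
    for key in path.split("."):
        # Each path segment names one field on the next level of the tree.
        current = current[key]
    return current


nested = {"x": {"y": {"z": {"a": {"b": {"c": "Hello :-)"}}}}}}
weather_report = {"coord": {"lon": "6.9315", "lat": "50.9341"}, "main": {"temp": "12.56"}}

print(get_path(nested, "x.y.z.a.b.c"))       # Hello :-)
print(get_path(weather_report, "main.temp"))  # 12.56
```

Every dot in the path corresponds to one step down the tree, which is why the deeply nested c field needs six segments.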
So let's do some simple exercises:
There are two extra path structures that need to be explained: repeated fields and arrays.
In a data set an element can sometimes have multiple instances. Different data models handle this differently. In XML records any element can be repeated, and many schemas allow repetition at least for some elements. For example, here the subject element exists three times:
<subject>Metadata</subject>
<subject>Datatransformation</subject>
<subject>ETL</subject>
Repeated elements are also possible in JSON and YAML, for example, but are unusual:
creator: Justus
creator: Peter
creator: Bob
In our two examples the subject and creator elements each exist three times. To point to one of the elements you need to use an index. The index is one-based: the first index has the value 1, the second the value 2, the third the value 3. So the path of the creator Bob would be creator.3. (This is a main difference between Catmandu and Metafacture, because Catmandu has a zero-based index.)
If you want to refer to all creators, you can use the array wildcard *, which replaces the concrete index number: creator.* refers to all creator elements. You can also select the first instance with the array wildcard $first and the last with $last. This is especially handy if you do not know how often an element is repeated. When adding an additional repeated element you usually use the $append wildcard.
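The indexing rules above can be mimicked in a few lines of Python. This is only a sketch of the rules as described here, not how Metafacture implements them: the repeated creator field is modelled as a Python list, and the function names `select_repeated` and `append_repeated` are invented for the example.

```python
def select_repeated(values, selector):
    """Return the instances of a repeated field matched by a selector.

    `values` holds the instances in document order. `selector` is a
    one-based index ("1", "2", ...) or one of "*", "$first", "$last".
    """
    if selector == "*":
        return list(values)
    if selector == "$first":
        return values[:1]
    if selector == "$last":
        return values[-1:]
    index = int(selector)
    if index < 1 or index > len(values):
        return []  # nothing at that position
    # One-based index: position 1 is the first instance.
    return [values[index - 1]]


def append_repeated(values, value):
    """Mimic "$append": add one more instance after the existing ones."""
    return values + [value]


creators = ["Justus", "Peter", "Bob"]
print(select_repeated(creators, "3"))       # ['Bob']
print(select_repeated(creators, "$first"))  # ['Justus']
print(select_repeated(creators, "$last"))   # ['Bob']
print(select_repeated(creators, "*"))       # ['Justus', 'Peter', 'Bob']
```

Note that `$last` and `*` work without knowing how many instances exist, which is the situation described above.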
In JSON or YAML, element repetition is possible but unusual. Instead of repeating an element, the repetition is expressed as a list, so that one element can have more than one value. This is called an array.
Our example from above would look like this in YAML if creator were a list instead of a repeated field:
creator:
  - Justus
  - Peter
  - Bob
or:
my:
  colors:
    - black
    - red
    - yellow
Lists can also be deeply nested if they are not just lists of strings (arrays of strings) but lists of objects (arrays of objects):
characters:
  - name: Justus
    role: Investigator
  - name: Peter
    role: Investigator
  - name: Bob
    role: Research & Archive
In the colors example above you see a field my which contains a deeper field colors which has three values. To point to one of the colors you need to use an index, but genuine arrays also have a marker in Metafacture: []. Here, too, the first index in an array has the value 1, the second the value 2, the third the value 3. The array markers are generated by the JSON decoder and the YAML decoder. Likewise, if you want to generate an array in the target format, you need to add [] at the end of a list element, like newArray[]. (While the path handling of Catmandu and Metafacture has been similar so far, they differ at this point.)
So the path of red would be my.colors[].2, and the path for Peter would be characters[].2.name.
There is one array in the JSON weather report from the beginning of this post: the weather field. To point to the description of the weather you need the path weather[].1.description.
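The following Python sketch shows how paths with the [] marker and one-based positions line up with the data. It is an illustration of the path rules described above, not Metafacture's own resolver. The function name `get_path` and the way the record is held in Python dicts and lists are assumptions made for the example.

```python
def get_path(record, path):
    """Resolve a path such as "weather[].1.description" against a record."""
    current = record
    for segment in path.split("."):
        if segment.endswith("[]"):
            # Array marker: look up the field and insist that it is a list.
            current = current[segment[:-2]]
            if not isinstance(current, list):
                raise TypeError(f"{segment} does not point to a list")
        elif isinstance(current, list):
            # Inside a list the segment is a one-based position.
            position = int(segment)
            if position < 1:
                raise IndexError("positions start at 1")
            current = current[position - 1]
        else:
            current = current[segment]
    return current


record = {
    "my": {"colors": ["black", "red", "yellow"]},
    "characters": [
        {"name": "Justus", "role": "Investigator"},
        {"name": "Peter", "role": "Investigator"},
        {"name": "Bob", "role": "Research & Archive"},
    ],
    "weather": [{"id": "800", "main": "Clear", "description": "clear sky"}],
}

print(get_path(record, "my.colors[].2"))             # red
print(get_path(record, "characters[].2.name"))       # Peter
print(get_path(record, "weather[].1.description"))   # clear sky
```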
elements | objects | array/repeated field |
---|---|---|
need path | need dots to mark nested structure | need index/array-wildcards to refer to specific position |
id | title.subtitle | author.*.firstName |
name | very.nested.element | my.color.2 |
Exercise:
TODO: Solution
In this post we learned the JSON Path-like Fix path syntax and how it can be used to point to the parts of a data set you want to manipulate. We explained the Fix path using a YAML transformation as an example, because YAML is easier to read.
Especially when working with complex bibliographic data, you have to get to know the paths so that you do not have to guess the path to a certain element.
There are multiple ways to find out the path names of records:
For example, here is a way to show paths in combination with values.
Here is a way to collect and count all paths in all records by using the list-fix-paths command.
Other ways are possible too.
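To show the idea behind collecting and counting paths, here is a small Python sketch. It is not the list-fix-paths command and does not reproduce its output format. It only walks records held as Python dicts and lists, writes each leaf as a path with [] markers and one-based positions, and counts how often each path occurs. The names `iter_paths` and `count_paths` are made up for this example.

```python
from collections import Counter


def iter_paths(value, prefix=""):
    """Yield one path per leaf value, using [] markers and one-based positions."""
    if isinstance(value, dict):
        for key, child in value.items():
            name = key + "[]" if isinstance(child, list) else key
            yield from iter_paths(child, f"{prefix}.{name}" if prefix else name)
    elif isinstance(value, list):
        for position, child in enumerate(value, start=1):
            yield from iter_paths(child, f"{prefix}.{position}")
    else:
        yield prefix


def count_paths(records):
    """Count how often each path occurs across all records."""
    counts = Counter()
    for record in records:
        counts.update(iter_paths(record))
    return counts


records = [
    {"name": "Cologne", "weather": [{"description": "clear sky"}], "main": {"temp": "12.56"}},
    {"name": "Bonn", "main": {"temp": "13.10"}},
]

for path, count in count_paths(records).items():
    print(path, count)
```

The second record ("Bonn" with its temperature) is invented sample data; with real data you would feed in the decoded records instead.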
<title>This is the title</title>
The path for the value This is the title is not title but title.value.
XML elements are not just simple key-value pairs or objects with subfields; each element can also have attributes. In Metafacture the XML decoder (decode-xml with handle-generic-xml) groups the attributes and values as subfields of an object.
<title type="mainTitle" lang="eng">This is the title</title>
The paths for the element value and the different attributes are the following:
title.value
title.type
title.lang
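A short Python sketch can show the shape of such a record. It uses Python's standard xml.etree.ElementTree module and only imitates the structure described above (attributes and text value side by side as subfields). It is not the Metafacture decoder, and the helper name `element_to_fields` is hypothetical.

```python
import xml.etree.ElementTree as ET


def element_to_fields(element):
    """Put an element's attributes and its text value side by side as subfields."""
    fields = dict(element.attrib)
    if element.text and element.text.strip():
        # The element text ends up in a subfield called "value".
        fields["value"] = element.text.strip()
    return fields


title = ET.fromstring('<title type="mainTitle" lang="eng">This is the title</title>')
record = {title.tag: element_to_fields(title)}

print(record["title"]["value"])  # This is the title
print(record["title"]["type"])   # mainTitle
print(record["title"]["lang"])   # eng
```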
If you want to create XML with attributes, you need to map to this structure too. We will come back to working with XML in lesson 10.
Next lesson: 05 More Fix Concepts