To process data with Metafacture, transformation workflows are configured with Metafacture Flux, a domain-specific scripting language (DSL). With Metafacture Flux we combine different modules for reading, opening, transforming, and writing data sets.
In this lesson we will learn what Flux workflows are and how to combine different Flux modules into a workflow in order to process datasets.
To process data, Metafacture can be used from the command line, as a Java library, or via the Metafacture Playground.
For this introduction we start with the Playground since it allows a quick start without any additional installation. The Metafacture Playground is a web interface to test and share Metafacture workflows. Working with the command line will be the subject of lesson 6. TODO: Add link.
In this tutorial we are going to process structured information. We call data structured when it is organised in such a way that it is easily processable by computers. Literary text documents like War and Peace are structured only in words and sentences; a computer does not know which words are part of the title or which words contain names. We would have to tell the computer that. Today we will download a weather report in a structured format called JSON and inspect it with Metafacture.
Let's jump to the Playground to learn how to create workflows:
See the window called Flux?
Copy the following short code sample into the playground:
"Hello, friend. I'am Metafacture!"
|print
;
Great, you have created your first Metafacture Flux workflow. Congratulations!
Now you can press the Process button or press Ctrl+Enter to execute the workflow.
See the result below? It is Hello, friend. I am Metafacture!
But what have we done here?
We have a short text string "Hello, friend. I am Metafacture!" that is printed with the module print.
A Metafacture workflow is nothing more than an incoming text string and multiple modules that do something with that string. The workflow does not have to start with a literal text string; it can also start with a variable that stands for the text string and is defined before the workflow, like this:
INPUT="Hello, friend. I am Metafacture!";
INPUT
|print
;
Copy this into the Flux window of your playground or just adjust your example.
INPUT is defined as a variable in the first line of the Flux. Instead of the text string, the Flux workflow now starts with the variable INPUT, without quotation marks.
The result is the same when you process the Flux.
Often you want to process data stored in a file.
The playground has an input area called inputFile-Content that pretends to be a local file. It can be addressed with the variable inputFile. In this inputFile text area you can insert data samples. In the Playground you can put the variable inputFile at the beginning of a Flux workflow to process the content of this imaginary file; it refers to the input that is written in the data window at the top of the playground.
So let's use inputFile instead of INPUT and copy the value of the text string into the Data field above the Flux.
Data:
Hello, friend. I am Metafacture!
Flux:
inputFile
|print
;
Hm… There seems to be unusual output: it is a file path. Why?
Because hidden behind the variable inputFile is a path to a file.
To read the content of the file we need to handle the incoming file path differently.
(How to open real files you will learn when we run Metafacture on the command line in lesson 06.)
We need to add two additional Metafacture modules: open-file and as-lines.
Flux:
inputFile
| open-file
| as-lines
| print
;
The inputFile is opened as a file (open-file) and then processed line by line (as-lines).
You can see that in this sample.
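By the way, the same pattern also works outside the Playground: when you run Metafacture on the command line (see lesson 06), the workflow can start with the path to a real file. The path below is only a placeholder for illustration, not an actual file on your machine:
Flux:
// placeholder path, replace it with a real file on your machine
"path/to/weather.json"
| open-file
| as-lines
| print
;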
We usually do not start with random text strings but with data, so let's play around with some data.
Let’s start with a link: https://weather-proxy.freecodecamp.rocks/api/current?lat=50.93414&lon=6.93147
You will see data that looks like this:
{"coord":{"lon":6.9315,"lat":50.9341},"weather":[{"id":800,"main":"Clear","description":"clear sky","icon":"https://cdn.glitch.com/6e8889e5-7a72-48f0-a061-863548450de5%2F01d.png?1499366022009"}],"base":"stations","main":{"temp":15.82,"feels_like":15.02,"temp_min":14.55,"temp_max":18.03,"pressure":1016,"humidity":60},"visibility":10000,"wind":{"speed":4.63,"deg":340},"clouds":{"all":0},"dt":1654101245,"sys":{"type":2,"id":43069,"country":"DE","sunrise":1654053836,"sunset":1654112194},"timezone":7200,"id":2886242,"name":"Cologne","cod":200}
This is data in JSON format, but it is not very readable.
All these fields tell something about the weather in Cologne, Germany, on a certain day. You can recognise in the example above that the sky is clear and the temperature is 15.82 degrees Celsius.
Let's copy the JSON data from the URL into the inputFile content area of the Playground and run the workflow again.
The output in the result is the same as the input, and it is still not very readable.
So let's transform some stuff and use another serialization. How about YAML? With Metafacture you can process this file to make it a bit easier to read by using a small workflow script. Let's turn the one line of JSON data into YAML. YAML is another format for structured information which is a bit easier to read for human eyes. In order to change the serialization of the data we need to decode the data and then encode it.
Metafacture has lots of decoder and encoder modules for specific data formats that can be used in a Flux workflow.
Let's try this out. Add the modules decode-json and encode-yaml to your Flux workflow.
The Flux should now look like this:
Flux:
inputFile
| open-file
| as-lines
| decode-json
| encode-yaml
| print
;
When you process the data, our weather report should now look like this:
---
coord:
  lon: "6.9315"
  lat: "50.9341"
weather:
- id: "800"
  main: "Clear"
  description: "clear sky"
  icon: "https://cdn.glitch.com/6e8889e5-7a72-48f0-a061-863548450de5%2F01d.png?1499366022009"
base: "stations"
main:
  temp: "15.82"
  feels_like: "15.02"
  temp_min: "14.55"
  temp_max: "18.03"
  pressure: "1016"
  humidity: "60"
visibility: "10000"
wind:
  speed: "4.63"
  deg: "340"
clouds:
  all: "0"
dt: "1654101245"
sys:
  type: "2"
  id: "43069"
  country: "DE"
  sunrise: "1654053836"
  sunset: "1654112194"
timezone: "7200"
id: "2886242"
name: "Cologne"
cod: "200"
This is better readable, right?
But we can not only open the data we have in our data field, we can also open resources on the web.
Instead of using inputFile, let's read the live weather data which is provided by the URL from above.
Clear your playground and copy the following Flux workflow:
"https://weather-proxy.freecodecamp.rocks/api/current?lat=50.93414&lon=6.93147"
| open-http
| as-lines
| decode-json
| encode-yaml
| print
;
The result in the playground should be the same as before, but with the module open-http you can get the text that is provided via a URL.
Now let's understand what this Flux workflow does. A Flux workflow is a combination of different modules to process incoming semi-structured data. In our example the modules do the following:
First we have a URL. The URL states the location of the data on the web.
We tell Metafacture with open-http to request the stated URL.
Then we tell it how to handle the incoming data: since the report is written in one line, we tell Metafacture to regard every new line as a new record with as-lines.
Afterwards we tell Metafacture to decode-json in order to translate the incoming JSON into the generic internal data model, which is called metadata events.
Then Metafacture should serialize the metadata events as YAML with encode-yaml.
Finally we tell Metafacture to print everything.
So let's have a small recap of what we have done and learned so far.
We played around with the Metafacture Playground.
We learned that a Metafacture Flux workflow is a combination of modules with an initial text string or a variable.
We got to know different modules like open-http, as-lines, decode-json, encode-yaml and print.
More modules can be found in the documentation of available Flux commands.
Now take some time, play around a little bit more and use some other modules.
1) Try to change the Flux workflow to output formeta (a Metafacture-specific data format) instead of YAML. 2) Configure the style of formeta to multiline. 3) Also try not to print the output but to write it to a file called weather.xml.
A possible solution looks like this:
"https://weather-proxy.freecodecamp.rocks/api/current?lat=50.93414&lon=6.93147"
| open-http
| as-lines
| decode-json
| encode-formeta(style="multiline")
| write("weather.xml")
;
What you see with the modules encode-formeta and write is that modules can have further specifications in brackets.
These can either be a string in "..." or attributes that define options, as with style=.
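As a small example, the style option of encode-formeta should also accept other values such as verbose or concise (this is an assumption here; check the documentation of available flux commands for the exact list). Swapping the value changes how the records are laid out:
Flux:
"https://weather-proxy.freecodecamp.rocks/api/current?lat=50.93414&lon=6.93147"
| open-http
| as-lines
| decode-json
// "verbose" is assumed to be a valid alternative style value
| encode-formeta(style="verbose")
| print
;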
One last thing you should grasp at an abstract level is the general idea of Metafacture Flux workflows: they consist of many different modules through which the data flows. The most abstract and most common process resembles the following steps:
→ read → decode → transform → encode → write →
This process transforms incoming data so that it comes out changed at the end. Each step can be done by one module or a combination of multiple modules. Modules are small tools that each do part of the complete task we want to accomplish.
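Mapped onto our weather example, these abstract steps correspond roughly to the following modules (the // lines are comments, which are explained a bit further below):
Flux:
// read: fetch the data from the web and split it into lines
"https://weather-proxy.freecodecamp.rocks/api/current?lat=50.93414&lon=6.93147"
| open-http
| as-lines
// decode: translate the JSON into metadata events
| decode-json
// transform: nothing yet, this is the topic of the next lesson
// encode: serialize the metadata events as YAML
| encode-yaml
// write: print the result
| print
;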
Each module demands a certain input and gives a certain output. This is called its signature. For example:
The first module open-file expects a string and provides read data (called a reader).
This reader data can be passed on to a module that accepts reader data, e.g. in our case as-lines.
as-lines outputs again a string, which is accepted by the decode-json module.
If you have a look at the documentation of the Flux modules/commands, you will see under "signature" which data a module expects and which data it outputs.
The combination of modules is a Flux workflow.
Each module is separated by a | and every workflow ends with a ;.
Comments can be added with //.
See:
//input string:
"https://weather-proxy.freecodecamp.rocks/api/current?lat=50.93414&lon=6.93147"
// MF Workflow:
| open-http
| as-lines
| decode-json
| encode-formeta(style="multiline")
| write("weather.xml")
;
1) Try to pretty-print the online weather report in JSON (a possible sketch follows after this list).
2) Have a look at decode-xml: what is different compared to decode-json? And what input does it expect and what output does it create (hint: signature)?
3) Fill out the blanks in the Metafacture workflow to transform some local PICA data to YAML.
4) Collect MARC XML from the web and transform it to JSON.
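For the first exercise, one possible sketch is shown below. It assumes that the encode-json module accepts a prettyPrinting option; if the option name differs in your Metafacture version, check the documentation of available flux commands:
Flux:
"https://weather-proxy.freecodecamp.rocks/api/current?lat=50.93414&lon=6.93147"
| open-http
| as-lines
| decode-json
// prettyPrinting is assumed to be the name of the option, see the flux commands documentation
| encode-json(prettyPrinting="true")
| print
;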
As you surely already noticed, I mentioned transform as one step in a Metafacture workflow.
But aside from changing the serialisation we have not played around with transformations yet. That will be the theme of the next lesson.
Next lesson: 03 Introduction into Metafacture-Fix