Lesson 9: Working with CSV and TSV files
CSV and TSV files are widely used to store and exchange simple structured data. Many open datasets are published as CSV or TSV files (see e.g. datahub.io). Within the library community, CSV files are used for the distribution of title lists (KBART), e.g. Knowledge Base+.
Metafacture implements a decoder and an encoder that you can use for both formats: decode-csv and encode-csv.
Reading CSVs
Get some CSV data to work with:
"https://lib.ugent.be/download/librecat/data/goodreads.csv"
| open-http
| as-lines
| print
;
It shows a CSV file with a header row at the beginning.
Convert the data to different serializations, such as JSON, YAML or XML, by decoding the data as CSV and encoding it in the desired serialization:
"https://lib.ugent.be/download/librecat/data/goodreads.csv"
| open-http
| as-lines
| decode-csv
| encode-json(prettyPrinting="true") // or encode-xml or encode-yaml
| print
;
Note that the elements have no literal names but only numbers. Since the CSV has a header row, we need to add the option hasHeader="true" to decode-csv in the Flux.
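With this option set, the pipeline from above becomes:

```
"https://lib.ugent.be/download/librecat/data/goodreads.csv"
| open-http
| as-lines
| decode-csv(hasHeader="true")
| encode-json(prettyPrinting="true") // or encode-xml or encode-yaml
| print
;
```

Now the elements carry the column names from the header row instead of numbers.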
You can extract specific fields while converting to another tabular format by using the Fix. This is quite handy for analysing specific fields or for generating reports. In the following example we only keep three columns ("ISBN", "Title", "Author"):
Flux:
"https://lib.ugent.be/download/librecat/data/goodreads.csv"
| open-http
| as-lines
| decode-csv(hasHeader="true")
| fix(transformationFile)
| encode-csv(includeHeader="true")
| print
;
With Fix:
retain("ISBN","Title","Author")
See the example in the Playground.
By default, Metafacture's decode-csv expects CSV fields to be separated by a comma , and strings to be quoted with double quotes " or single quotes ' . You can specify a different separator character with the option separator and clean special quote signs using the Fix. (In contrast to Catmandu, quote characters cannot yet be changed in the decoder directly.)
Flux:
"12157;$The Journal of Headache and Pain$;2193-1801"
| read-string
| as-lines
| decode-csv(separator=";")
| fix(transformationFile)
| encode-csv(separator="\t", includeHeader="true")
| print;
Fix:
replace_all("?","^\\$|\\$$","")
See the example in the Playground.
In the example above we use the read-string command to read the string as a little CSV fragment for our small test. The tiny CSV string uses ; and $ as separator and quotation characters. The string is then read line by line by as-lines and decoded as CSV with the separator ; . In the Fix, the wildcard ? matches any field whose name is a single character (here the numbered fields 1, 2, 3), and replace_all strips the leading and trailing $ signs from each value.
Writing CSVs
When writing data in tabular format you can also change the field names in the header or omit the header entirely:
Flux:
"https://lib.ugent.be/download/librecat/data/goodreads.csv"
| open-http
| as-lines
| decode-csv(hasHeader="true")
| fix(transformationFile)
| encode-csv(includeHeader="true")
| print;
Fix:
move_field("ISBN","A")
move_field("Title","B")
move_field("Author","C")
retain("A","B","C")
You can transform the data to a TSV file, using the separator \t and omitting the header, like this:
"https://lib.ugent.be/download/librecat/data/goodreads.csv"
| open-http
| as-lines
| decode-csv(hasHeader="true")
| encode-csv(separator="\t", noQuotes="true")
| print;
When you create a CSV from complex/nested data structures, you must "flatten" the data structure. You also have to ensure that every record has the same elements in the same order, so that the header matches the records. In other words: make sure that the nested structure of repeatable elements is identical in every record; otherwise the header and the CSV rows will not align.
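As a minimal sketch, assuming a record with a hypothetical nested author element (the field names and paths here are illustrative, not taken from the dataset above), such a flattening could look like this in the Fix:

```
# Copy nested values into flat, top-level fields (illustrative paths)
copy_field("author.name", "author_name")
copy_field("author.role", "author_role")
# Keep only the flat fields so every record has the same columns
retain("_id", "title", "author_name", "author_role")
```

Because retain fixes the set of output fields, every record passed to encode-csv then has the same columns in the same order.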
Exercises:
- Decode this CSV while keeping the header.
- Create a TSV with the record identifier (_id), title (245>title) and ISBN (020>isbn) from a MARC dump.
Next lesson: 10 Working with XML