Lesson 9: Working with CSV and TSV files
CSV and TSV files are widely used to store and exchange simple structured data. Many open datasets are published as CSV or TSV files (see e.g. datahub.io). Within the library community, CSV files are used for the distribution of title lists (KBART), e.g. Knowledge Base+.
Metafacture implements a decoder and an encoder that you can use for both formats: decode-csv and encode-csv.
Reading CSVs
Get some CSV data to work with:
"https://lib.ugent.be/download/librecat/data/goodreads.csv"
| open-http
| as-lines
| print
;
It shows a CSV file with a header row at the beginning.
Convert the data to different serializations, such as JSON, YAML or XML, by decoding the data as CSV and encoding it in the desired serialization:
"https://lib.ugent.be/download/librecat/data/goodreads.csv"
| open-http
| as-lines
| decode-csv
| encode-json(prettyPrinting="true") // or encode-xml or encode-yaml
| print
;
Note that the elements have no literal names but only numbers. Since the CSV has a header row, we need to add the option hasHeader="true" to decode-csv in the Flux.
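With this option set, the pipeline from above becomes:

```
"https://lib.ugent.be/download/librecat/data/goodreads.csv"
| open-http
| as-lines
| decode-csv(hasHeader="true")
| encode-json(prettyPrinting="true") // or encode-xml or encode-yaml
| print
;
```

Now the elements carry the column names from the header row instead of numbers.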
You can extract specific fields while converting to another tabular format by using the Fix. This is quite handy for analysing specific fields or for generating reports. In the following example we only keep three columns ("ISBN", "Title", "Author"):
Flux:
"https://lib.ugent.be/download/librecat/data/goodreads.csv"
| open-http
| as-lines
| decode-csv(hasHeader="true")
| fix(transformationFile)
| encode-csv(includeHeader="true")
| print
;
With Fix:
retain("ISBN","Title","Author")
See the example in the Playground.
By default, Metafacture's decode-csv expects CSV fields to be separated by a comma , and strings to be quoted with double quotes " or single quotes ' . You can specify a different separator character with the option separator and clean special quote signs using the Fix. (In contrast to Catmandu, quote characters cannot yet be changed in the decoder directly.)
Flux:
"12157;$The Journal of Headache and Pain$;2193-1801"
| read-string
| as-lines
| decode-csv(separator=";")
| fix(transformationFile)
| encode-csv(separator="\t", includeHeader="true")
| print;
Fix:
replace_all("?","^\\$|\\$$","")
See the example in the Playground.
In the example above we use the read-string command to read the string as a little CSV fragment for our small test. The tiny CSV string uses ; and $ as separator and quotation characters. The string is then read line by line by as-lines and decoded as CSV with the separator ; . In the Fix, the wildcard ? matches any field whose name is a single character (here the numbered fields 1, 2, 3), and replace_all strips the leading and trailing $ signs from each value.
Writing CSVs
When writing data in tabular format you can also change the field names in the header or omit the header entirely:
Flux:
"https://lib.ugent.be/download/librecat/data/goodreads.csv"
| open-http
| as-lines
| decode-csv(hasHeader="true")
| fix(transformationFile)
| encode-csv(includeHeader="true")
| print;
Fix:
move_field("ISBN","A")
move_field("Title","B")
move_field("Author","C")
retain("A","B","C")
You can transform the data to a TSV file, using the separator \t and omitting the header, like this:
"https://lib.ugent.be/download/librecat/data/goodreads.csv"
| open-http
| as-lines
| decode-csv(hasHeader="true")
| encode-csv(separator="\t", noQuotes="true")
| print;
When you create a CSV from complex/nested data structures, you must "flatten" the data structure. You also have to ensure that every record has the same elements in the same order, so that the header matches the records. In other words: make sure that the nested structure of repeatable elements is identical in every record; otherwise the header and the CSV rows will not align.
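As a minimal sketch, assuming a record with a hypothetical nested author element (the field names and paths here are illustrative, not taken from the dataset above), such a flattening could look like this in the Fix:

```
# Copy nested values into flat, top-level fields (illustrative paths)
copy_field("author.name", "author_name")
copy_field("author.role", "author_role")
# Keep only the flat fields so every record has the same columns
retain("_id", "title", "author_name", "author_role")
```

Because retain fixes the set of output fields, every record passed to encode-csv then has the same columns in the same order.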
Exercises:
- Decode this CSV while keeping the header.
- Create a TSV with the record identifier (_id), title (245>title) and ISBN (020>isbn) from a MARC dump.
Next lesson: 10 Working with XML