Thoth

The Thoth Telescope downloads, transforms and loads publisher ONIX feeds from Thoth into BigQuery. ONIX is a standard format that book publishers use to share information about the books that they have published.

Thoth is a free, open metadata service that publishers can use to store their metadata. Thoth can provide metadata on request in a number of formats. The Thoth Telescope uses the Thoth Export API to download metadata in ONIX format. This API provides a snapshot of a specified publisher’s metadata at the time of the request. The request URL must include the publisher’s ID, which can be found using the GraphiQL API.
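As a rough illustration of how the publisher’s ID becomes part of the request URL, the sketch below builds an export URL and downloads the ONIX snapshot. The host name, the `/specifications/.../publisher/...` path layout and the `onix_3.0::oapen` format name are assumptions made for this example and should be checked against the Thoth Export API documentation; the function names are hypothetical and are not the telescope’s own code.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed defaults for illustration only; verify against the Thoth Export API docs.
DEFAULT_HOST = "https://export.thoth.pub"
DEFAULT_FORMAT = "onix_3.0::oapen"


def build_export_url(
    publisher_id: str,
    host: str = DEFAULT_HOST,
    format_spec: str = DEFAULT_FORMAT,
) -> str:
    """Build an export URL that embeds the publisher's ID (assumed path layout)."""
    if not publisher_id:
        raise ValueError("publisher_id must not be empty")
    # Percent-encode both path segments; ':' is kept so the format name stays readable.
    spec = quote(format_spec, safe=":")
    pub = quote(publisher_id, safe="")
    return f"{host.rstrip('/')}/specifications/{spec}/publisher/{pub}"


def download_onix(publisher_id: str, output_path: str) -> None:
    """Download the publisher's ONIX snapshot and write it to output_path."""
    url = build_export_url(publisher_id)
    with urlopen(url, timeout=60) as response:
        data = response.read()
    with open(output_path, "wb") as f:
        f.write(data)
```

Because the API returns a snapshot taken at request time, calling `download_onix` on different days may produce different files for the same publisher ID.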

The Thoth Telescope downloads the ONIX metadata files and then uses the ONIX parser Java command line tool to transform the data into a format suitable for loading into BigQuery. This process is nearly identical to the data-transformation step of the ONIX telescope. The transformed data is loaded into BigQuery, where it can be picked up and used by the ONIX Workflow.

The corresponding table in BigQuery is onix.onixYYYYMMDD.
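The `YYYYMMDD` suffix is a date shard. A minimal sketch of how such a table name could be derived from a release date is shown below; the helper name and its defaults are hypothetical and are only meant to illustrate the naming pattern.

```python
from datetime import date


def sharded_table_id(release_date: date, dataset: str = "onix", table: str = "onix") -> str:
    """Return a date-sharded table ID of the form dataset.tableYYYYMMDD."""
    return f"{dataset}.{table}{release_date.strftime('%Y%m%d')}"


# Example: a release dated 7 March 2022
print(sharded_table_id(date(2022, 3, 7)))  # onix.onix20220307
```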

Summary

Average runtime: 5-10 mins

Average download size: 1-10 MB

Harvest Type: URL

Harvest Frequency: Weekly

Runs on remote worker: False

Catchup missed runs: False

Credentials Required: No

Uses Telescope Template: None

Each shard includes all data: Yes

Configuration

The following settings need to be configured for the Thoth telescope.

Telescope API Instance

A Thoth Telescope API instance needs to be created. Unlike the ONIX telescope, it does not require any ‘extra’ fields.

Airflow Connections

The Thoth Telescope does not require any Airflow connections to run, as the Thoth API is freely usable.

Latest schema