OAPEN metadata
The OAPEN Metadata telescope collects data from the OAPEN Metadata feed.
OAPEN enables libraries and aggregators to use the metadata of all available titles in the OAPEN Library.
The metadata is available in different formats and this telescope harvests the data in the XML format.
See the OAPEN Metadata webpage for more information.
The corresponding table in BigQuery is onix.onixYYYYMMDD
.
Summary |
|
---|---|
Average runtime |
10min |
Average download size |
150-200MB |
Harvest Type |
URL |
Harvest Frequency |
Weekly |
Runs on remote worker |
False |
Catchup missed runs |
False |
Table Write Disposition |
Append |
Update Frequency |
Daily |
Credentials Required |
No |
Uses Telescope Template |
Stream |
Configuration
Airflow Connections
The OAPEN metadata is freely accessible, so no credentials are required for it.
Schedule
The XML file containing metadata is updated daily at +0000GMT. This telescope is scheduled to harvest the metadata weekly.
Results
The resulting ONIX table will be stored in BigQuery - onix.onixYYYYMMDD
Tasks
Download
This is where the metadata is downloaded. The XML file containing metadata is downloaded using the XML URL that is available on the OAPEN Metadata webpage mentioned above.
Note that if the metadata file is part-way through an update (occurring daily at +0000GMT and taking upwards of one hour), the XML file will be incomplete and invalid. The telescope has a failesafe to attempt to resolve this during runtime, which can lead to much longer than normal ‘download’ times.
Transform
The transform step modifies the downloaded metadata into a valid ONIX format. This is done in two steps:
The XML is loaded and all unnecessary fields are removed. The fields deemed necessary are described by the header and a supplied product schema (.json file)
The resulting XML is parsed through the Python onixcheck. This reveals any remaining invalid products. These products are removed from the file. The removed products are saved to a separate file and uploaded to the transform bucket for storage.
Load to BigQuery
The valid ONIX feed can now be loaded from the transform bucket into a BigQuery sharded table.