oaebu_workflows.thoth_telescope.thoth_telescope

Module Contents

Classes

ThothRelease

Construct a ThothRelease.

ThothTelescope

Construct a ThothTelescope instance.

Functions

thoth_download_onix(publisher_id, download_path, ...)

Hits the Thoth API and requests the ONIX feed for a particular publisher.

Attributes

THOTH_URL

DEFAULT_HOST_NAME

oaebu_workflows.thoth_telescope.thoth_telescope.THOTH_URL = '{host_name}/specifications/{format_specification}/publisher/{publisher_id}'
oaebu_workflows.thoth_telescope.thoth_telescope.DEFAULT_HOST_NAME = 'https://export.thoth.pub'
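
As a sketch of how these two attributes fit together, the THOTH_URL template can be filled in with str.format. The constants are copied from the values above. The publisher ID is a made-up placeholder, and whether the module builds its request URL in exactly this way is an assumption.

```python
# Values copied from the attribute listing above.
THOTH_URL = "{host_name}/specifications/{format_specification}/publisher/{publisher_id}"
DEFAULT_HOST_NAME = "https://export.thoth.pub"

# Hypothetical publisher ID, for illustration only.
publisher_id = "00000000-0000-0000-0000-000000000000"

# Fill each named placeholder in the template.
url = THOTH_URL.format(
    host_name=DEFAULT_HOST_NAME,
    format_specification="onix_3.0::oapen",  # example value from the ThothTelescope docstring
    publisher_id=publisher_id,
)
print(url)
```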
class oaebu_workflows.thoth_telescope.thoth_telescope.ThothRelease(*, dag_id, run_id, snapshot_date)

Bases: observatory.platform.workflows.workflow.SnapshotRelease

Construct a ThothRelease.

Parameters:
  • dag_id (str) – The ID of the DAG

  • run_id (str) – The Airflow run ID

  • snapshot_date (pendulum.datetime.DateTime) – The date of the snapshot/release

class oaebu_workflows.thoth_telescope.thoth_telescope.ThothTelescope(*, dag_id, cloud_workspace, publisher_id, format_specification, elevate_related_products=False, metadata_partner='thoth', bq_dataset_description='Thoth ONIX Feed', bq_table_description=None, api_dataset_id='onix', host_name='https://export.thoth.pub', observatory_api_conn_id=AirflowConns.OBSERVATORY_API, catchup=False, start_date=pendulum.datetime(2022, 12, 1), schedule='@weekly')

Bases: observatory.platform.workflows.workflow.Workflow

Construct a ThothTelescope instance.

Parameters:
  • dag_id (str) – The ID of the DAG

  • cloud_workspace (observatory.platform.observatory_config.CloudWorkspace) – The CloudWorkspace object for this DAG

  • publisher_id (str) – The Thoth ID for this publisher

  • format_specification (str) – The Thoth ONIX/metadata format specification, e.g. "onix_3.0::oapen"

  • elevate_related_products (bool) – Whether to pull out the related products to the product level

  • metadata_partner (Union[str, oaebu_workflows.oaebu_partners.OaebuPartner]) – The metadata partner name

  • bq_dataset_description (str) – Description for the BigQuery dataset

  • bq_table_description (Optional[str]) – Description for the BigQuery table

  • api_dataset_id (str) – The ID under which the dataset release is stored in the API

  • host_name (str) – The Thoth host name

  • observatory_api_conn_id (str) – Airflow connection ID for the Observatory API

  • catchup (bool) – Whether to catch up the DAG or not

  • start_date (pendulum.datetime.DateTime) – The start date of the DAG

  • schedule (str) – The schedule interval of the DAG
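
Because every constructor argument is keyword-only, it can help to collect them in one mapping before building the workflow. The following is a minimal sketch: thoth_telescope_kwargs is a hypothetical helper, and the DAG ID and publisher ID are placeholders that do not come from the library. The actual ThothTelescope call is left as a comment because it needs oaebu_workflows installed and a real CloudWorkspace object.

```python
from typing import Any, Dict


def thoth_telescope_kwargs(cloud_workspace: Any, publisher_id: str) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for ThothTelescope."""
    return {
        "dag_id": "thoth_onix_example",  # hypothetical DAG ID
        "cloud_workspace": cloud_workspace,
        "publisher_id": publisher_id,
        "format_specification": "onix_3.0::oapen",  # example value from the docstring
        "elevate_related_products": False,  # documented default
        "metadata_partner": "thoth",  # documented default
        "schedule": "@weekly",  # documented default
    }


# Placeholder inputs; a real run needs a CloudWorkspace and a real Thoth publisher ID.
kwargs = thoth_telescope_kwargs(
    cloud_workspace=None,
    publisher_id="00000000-0000-0000-0000-000000000000",
)

# With oaebu_workflows installed, the workflow would then be built as:
#   from oaebu_workflows.thoth_telescope.thoth_telescope import ThothTelescope
#   telescope = ThothTelescope(**kwargs)
print(sorted(kwargs))
```

Arguments left out of the mapping, such as start_date and host_name, fall back to the defaults shown in the signature.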

make_release(**kwargs)

Creates a new Thoth release instance.

Parameters:

kwargs – The context passed from the PythonOperator. See https://airflow.apache.org/docs/stable/macros-ref.html for the keyword arguments that can be passed.

Returns:

The Thoth release instance

Return type:

ThothRelease

download(release, **kwargs)

Task to download the ONIX release from Thoth.

Parameters:

release (ThothRelease) – The Thoth release instance

Return type:

None

upload_downloaded(release, **kwargs)

Upload the downloaded Thoth ONIX XML to the Google Cloud bucket.

Parameters:

release (ThothRelease) –

Return type:

None

transform(release, **kwargs)

Task to transform the Thoth ONIX data.

Parameters:

release (ThothRelease) –

Return type:

None

upload_transformed(release, **kwargs)

Upload the transformed Thoth ONIX .jsonl file to the Google Cloud bucket.

Parameters:

release (ThothRelease) –

Return type:

None

bq_load(release, **kwargs)

Task to load the transformed ONIX .jsonl file to BigQuery.

Parameters:

release (ThothRelease) –

Return type:

None

add_new_dataset_releases(release, **kwargs)

Adds release information to the API.

Parameters:

release (ThothRelease) –

Return type:

None

cleanup(release, **kwargs)

Delete all files, folders and XComs associated with this release.

Parameters:

release (ThothRelease) –

Return type:

None

oaebu_workflows.thoth_telescope.thoth_telescope.thoth_download_onix(publisher_id, download_path, format_spec, host_name=DEFAULT_HOST_NAME, num_retries=3)

Hits the Thoth API and requests the ONIX feed for a particular publisher. Creates a file called onix.xml at the specified location.

Parameters:
  • publisher_id (str) – The ID of the publisher. Can be found using the Thoth GraphiQL API

  • download_path (str) – The path to download the ONIX file to

  • format_spec (str) – The ONIX format specification to use. Options can be found with the /formats endpoint of the API

  • host_name (str) – The Thoth host URL

  • num_retries (int) – The number of times to retry the download, given an unsuccessful return code

Return type:

None
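
The behaviour described above (build the export URL, request it, retry on failure, write the result to disk) can be sketched with the standard library alone. This is not the module's implementation: download_onix_sketch, http_get, and the fetch and wait_seconds parameters are hypothetical names introduced for illustration, and treating num_retries as retries on top of a first attempt is an assumption.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Values copied from the attribute listing at the top of this page.
THOTH_URL = "{host_name}/specifications/{format_specification}/publisher/{publisher_id}"
DEFAULT_HOST_NAME = "https://export.thoth.pub"


def http_get(url: str) -> bytes:
    """Fetch a URL; urllib raises URLError (or its subclass HTTPError) on failure."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def download_onix_sketch(
    publisher_id: str,
    download_path: str,
    format_spec: str,
    host_name: str = DEFAULT_HOST_NAME,
    num_retries: int = 3,
    fetch: Callable[[str], bytes] = http_get,  # hypothetical hook, eases testing
    wait_seconds: float = 1.0,  # hypothetical pause between attempts
) -> None:
    """Download the ONIX feed to download_path, retrying on failed requests."""
    url = THOTH_URL.format(
        host_name=host_name,
        format_specification=format_spec,
        publisher_id=publisher_id,
    )
    last_error = None
    # Assumption: one initial attempt plus up to num_retries retries.
    for attempt in range(num_retries + 1):
        try:
            content = fetch(url)
        except urllib.error.URLError as error:
            last_error = error
            if attempt < num_retries:
                time.sleep(wait_seconds)
            continue
        # Success: write the feed to disk and stop retrying.
        with open(download_path, "wb") as f:
            f.write(content)
        return
    raise RuntimeError(f"Download failed after {num_retries} retries: {url}") from last_error
```

Passing the fetch function as a parameter keeps the retry logic testable without network access; a caller that only wants the default behaviour can ignore it.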