oaebu_workflows.onix_utils

Module Contents

Classes

OnixParser

Class for storing information on the Java ONIX parser

OnixTransformer

Constructor for the OnixTransformer class.

OnixProduct

Represents a single ONIX product and its identifying reference for simplicity

Functions

onix_parser_download([download_dir])

Downloads the ONIX parser from GitHub

onix_parser_execute(parser_path, input_dir, output_dir)

Executes the Java ONIX parser. Requires a .xml file in the input directory.

collapse_subjects(onix)

The book product table creation requires the keywords (under Subjects.SubjectHeadingText) to occur only once

create_personname_fields(onix_products)

Given an ONIX product list, attempts to populate the Contributors.PersonName and/or Contributors.PersonNameInverted

elevate_product_identifiers(related_product)

Given a single <RelatedProduct>, returns a list of <RelatedProduct> elements by elevating

normalise_related_products(onix_products)

Many products have incorrect related products. The related products should be a list of <RelatedProduct> but is

deduplicate_related_products(onix_products)

Removes any duplicated <RelatedProduct> elements. Will also remove parent ISBNs present in the <RelatedProduct>

elevate_related_products(onix_products)

Makes a copy of an ONIX product for each of its unique related products with an ISBN.

_get_product_isbn(product)

Finds the ISBN of a product or related product

_get_related_products(product)

Finds the related products of a product

_get_product_identifiers(product)

Finds the product identifiers of a product

find_onix_product(all_lines, line_index)

Finds the range of lines encompassing a <Product> tag, given a line index that is contained in the product

remove_invalid_products(input_xml, output_xml[, ...])

Attempts to validate the input xml as an ONIX file. Will remove any products that contain errors.

filter_through_schema(input, schema)

This function recursively traverses the input dictionary and compares it to the provided schema.

class oaebu_workflows.onix_utils.OnixParser[source]

Class for storing information on the Java ONIX parser

Parameters:
  • filename – The name of the java ONIX parser file

  • url – The url to use for downloading the parser

  • template (bash) – The path to the bash template “onix_parser.sh.jinja2”

filename = 'coki-onix-parser-1.2-SNAPSHOT-shaded.jar'[source]
url = 'https://github.com/The-Academic-Observatory/onix-parser/releases/download/v1.3.0/coki-onix-parser...'[source]
cmd = 'java -jar {parser_path} {input_dir} {output_dir}'[source]
class oaebu_workflows.onix_utils.OnixTransformer(*, input_path, output_dir, filter_products=False, error_removal=False, normalise_related_products=False, deduplicate_related_products=False, elevate_related_products=False, add_name_fields=False, collapse_subjects=False, filter_schema=os.path.join(schema_folder(workflow_module='oapen_metadata_telescope'), 'oapen_metadata_filter.json'), invalid_products_name='invalid_products.xml', save_format='jsonl.gz', keep_intermediate=False)[source]

Constructor for the OnixTransformer class.

Parameters:
  • input_path (str) – The path to the metadata file

  • output_dir (str) – The directory to output the transformed metadata

  • filter_products (bool) – Filter the metadata through a filter schema

  • error_removal (bool) – Remove products containing errors

  • normalise_related_products (bool) – Fix improperly formatted related products

  • deduplicate_related_products (bool) – Deduplicate related products

  • add_name_fields (bool) – Add the Contributors.PersonName and Contributors.PersonNameInverted fields where possible

  • collapse_subjects (bool) – Collapse subjects into semicolon-separated strings

  • filter_schema (str) – The filter schema to use. Required if filter_products is True

  • invalid_products_name (str) – The name of the invalid products file.

  • save_format (Literal[json, jsonl, jsonl.gz]) – The format to save the transformed metadata in - json, jsonl, or jsonl.gz

  • keep_intermediate (bool) – Keep the intermediate files

  • elevate_related_products (bool) – Elevate related products to the product level

property current_metadata: List[dict] | Mapping[str, Any][source]
Return type:

Union[List[dict], Mapping[str, Any]]

__del__()[source]
transform()[source]

Transform the oapen metadata XML file based on the supplied options.

The transformations are applied in the following order. Transforms that are not enabled are skipped:

  1) Filter the XML metadata using a schema to keep the desired fields only
  2) Remove remaining products containing errors
  3) Fix incorrectly formatted related products
  4) Elevate related products to the product level
  5) Construct the Contributors.PersonName and Contributors.PersonNameInverted fields where possible
  6) Parse through the Java parser to return .jsonl format (this is always done)
  7) Collapse subjects into semicolon-separated strings
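
The ordering rule above can be pictured as a list of optional steps that run in a fixed order, with the parser step always enabled. The following is a minimal sketch of that pattern only: `run_steps` and the toy step functions are hypothetical and are not part of oaebu_workflows.

```python
from typing import Any, Callable, List, Tuple

# (step name, enabled flag, function applied to the data)
Step = Tuple[str, bool, Callable[[Any], Any]]


def run_steps(data: Any, steps: List[Step]) -> Tuple[Any, List[str]]:
    """Apply each enabled step in list order and record which ones ran."""
    applied = []
    for name, enabled, func in steps:
        if not enabled:
            continue  # Transforms that were not requested are skipped
        data = func(data)
        applied.append(name)
    return data, applied


# Toy stand-ins for the real transformations (hypothetical).
steps = [
    ("filter_products", False, lambda d: d),
    ("error_removal", True, lambda d: [p for p in d if p.get("valid", True)]),
    ("apply_parser", True, lambda d: d),  # always enabled, as documented
    ("collapse_subjects", False, lambda d: d),
]
result, applied = run_steps([{"id": 1}, {"id": 2, "valid": False}], steps)
```

Keeping the steps in one ordered list makes the execution order explicit and independent of the order in which the flags are passed.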

_intermediate_file_path(file_name)[source]
_save_metadata(metadata, file_path)[source]
Parameters:
  • metadata (Union[List[dict], Mapping[str, Any]]) –

  • file_path (str) –

_load_metadata(file_path)[source]
Parameters:

file_path (str) –

_filter_products()[source]
_remove_errors()[source]
_apply_parser()[source]
_apply_name_fields()[source]
_collapse_subjects()[source]
oaebu_workflows.onix_utils.onix_parser_download(download_dir=observatory_home('bin'))[source]

Downloads the ONIX parser from GitHub

Parameters:

download_dir (str) – The directory to download the file to

Returns:

(Whether the download operation was a success, The (expected) location of the downloaded file)

Return type:

Tuple[bool, str]

oaebu_workflows.onix_utils.onix_parser_execute(parser_path, input_dir, output_dir)[source]

Executes the Java ONIX parser. Requires a .xml file in the input directory.

Parameters:
  • parser_path (str) – Filepath of the parser

  • input_dir (str) – The input directory - first argument of the parser

  • output_dir (str) – The output directory - second argument of the parser

Returns:

Whether the task succeeded or not (return code 0 means success)

Return type:

bool
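
As an illustration of the "return code 0 means success" convention, the sketch below fills the documented `cmd` template and maps a process return code to a bool. `build_command` and `run_command` are hypothetical helpers, the paths are placeholders, and the library's own implementation may differ (for example, it may render a bash template instead).

```python
import shlex
import subprocess
import sys
from typing import List

# Template copied from the OnixParser.cmd attribute documented on this page.
CMD_TEMPLATE = "java -jar {parser_path} {input_dir} {output_dir}"


def build_command(parser_path: str, input_dir: str, output_dir: str) -> List[str]:
    """Fill the template, quoting each path, and split it into an argument list."""
    command = CMD_TEMPLATE.format(
        parser_path=shlex.quote(parser_path),
        input_dir=shlex.quote(input_dir),
        output_dir=shlex.quote(output_dir),
    )
    return shlex.split(command)


def run_command(args: List[str]) -> bool:
    """Run a command and report success as 'return code is 0'."""
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.returncode == 0


# Placeholder paths; building the argument list does not start a Java process.
args = build_command("/opt/parser.jar", "/tmp/in dir", "/tmp/out")

# A Python one-liner stands in for the Java process to show the return-code check.
ok = run_command([sys.executable, "-c", "raise SystemExit(0)"])
```

Passing an argument list to `subprocess.run` avoids shell interpretation of paths that contain spaces.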

oaebu_workflows.onix_utils.collapse_subjects(onix)[source]

The book product table creation requires the keywords (under Subjects.SubjectHeadingText) to occur only once. Some ONIX feeds return all keywords as separate entries. This function finds and collapses each keyword into a semicolon-separated string. Other common separators will be replaced with semicolons.

Parameters:

onix (List[dict]) – The onix feed

Returns:

The onix feed after collapsing the keywords of each row

Return type:

List[dict]
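
A minimal sketch of the collapsing idea follows. It assumes each row has a `Subjects` list whose entries hold `SubjectHeadingText` as a list of strings, and it treats commas and pipes as the "other common separators"; both the record shape and the separator set are assumptions, and `collapse_keywords` is a hypothetical name rather than the library function.

```python
import re
from typing import List

# Assumed separator set; the real function may recognise different characters.
SEPARATOR_PATTERN = re.compile(r"[;,|]")


def collapse_keywords(onix: List[dict]) -> List[dict]:
    """Join every keyword of a subject into one semicolon-separated string."""
    for product in onix:
        for subject in product.get("Subjects", []):
            keywords = []
            for text in subject.get("SubjectHeadingText", []):
                # Split on any known separator, then drop empty fragments
                for part in SEPARATOR_PATTERN.split(text):
                    part = part.strip()
                    if part:
                        keywords.append(part)
            if keywords:
                subject["SubjectHeadingText"] = ["; ".join(keywords)]
    return onix


feed = [{"Subjects": [{"SubjectHeadingText": ["history, art", "science"]}]}]
collapsed = collapse_keywords(feed)
```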

oaebu_workflows.onix_utils.create_personname_fields(onix_products)[source]

Given an ONIX product list, attempts to populate the Contributors.PersonName and/or Contributors.PersonNameInverted fields by concatenating the Contributors.NamesBeforeKey and Contributors.KeyNames fields where possible

Parameters:
  • onix_products (List[dict]) – The input ONIX feed

Returns:

The onix feed with the additional fields populated where possible

Return type:

List[dict]
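
The concatenation described above might look like the sketch below. It assumes each product has a `Contributors` list of dicts, that the name parts are plain strings, and that the inverted form is "KeyNames, NamesBeforeKey"; these are assumptions, and `add_person_names` is a hypothetical name.

```python
from typing import List


def add_person_names(onix_products: List[dict]) -> List[dict]:
    """Fill PersonName / PersonNameInverted when both name parts are present."""
    for product in onix_products:
        for contributor in product.get("Contributors", []):
            before = contributor.get("NamesBeforeKey")
            key = contributor.get("KeyNames")
            if not (before and key):
                continue  # Not enough information to build a name
            # Only populate fields that are missing or empty
            if not contributor.get("PersonName"):
                contributor["PersonName"] = f"{before} {key}"
            if not contributor.get("PersonNameInverted"):
                contributor["PersonNameInverted"] = f"{key}, {before}"
    return onix_products


products = [
    {
        "Contributors": [
            {"NamesBeforeKey": "Jane", "KeyNames": "Doe"},
            {"KeyNames": "Smith"},
        ]
    }
]
named = add_person_names(products)
```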

oaebu_workflows.onix_utils.elevate_product_identifiers(related_product)[source]

Given a single <RelatedProduct>, returns a list of <RelatedProduct> elements by elevating <ProductIdentifier> elements that shouldn’t be there. A <ProductIdentifier> element should only appear once in the <RelatedProduct> unless it has the same <IDValue> and a different <ProductIDType>

    <RelatedProduct>
        <ProductRelationCode></ProductRelationCode>
        <ProductIdentifier>
            <IDValue></IDValue>
            <ProductIDType></ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>

Parameters:

related_product (dict) – The single <RelatedProduct> element with product identifiers to elevate

Returns:

A list of <RelatedProduct> elements

Return type:

List[dict]
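
One way to picture the elevation is to group identifiers by `IDValue` and emit one <RelatedProduct> per group. The sketch below assumes the element has already been parsed into a dict with a `ProductIdentifier` list; `split_related_product` is a hypothetical name and the real function may handle more cases.

```python
from typing import Dict, List


def split_related_product(related_product: dict) -> List[dict]:
    """Return one related product per distinct IDValue, keeping input order."""
    grouped: Dict[str, List[dict]] = {}
    for identifier in related_product.get("ProductIdentifier", []):
        # Identifiers sharing an IDValue (with different types) stay together
        grouped.setdefault(identifier["IDValue"], []).append(identifier)
    return [
        {
            "ProductRelationCode": related_product.get("ProductRelationCode"),
            "ProductIdentifier": identifiers,
        }
        for identifiers in grouped.values()
    ]


related = {
    "ProductRelationCode": "06",
    "ProductIdentifier": [
        {"IDValue": "9780000000001", "ProductIDType": "15"},
        {"IDValue": "9780000000001", "ProductIDType": "03"},
        {"IDValue": "9780000000002", "ProductIDType": "15"},
    ],
}
elevated = split_related_product(related)
```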

oaebu_workflows.onix_utils.normalise_related_products(onix_products)[source]

Many products have incorrect related products. The related products should be a list of <RelatedProduct> elements but are instead a single <RelatedProduct> with many <ProductIdentifier> elements. This function fixes the issue by creating a new <RelatedProduct> for each <ProductIdentifier> where necessary. Note that a <RelatedProduct> can still have more than one <ProductIdentifier> if the <IDValue> elements are the same but the <ProductIDType> is different.

Parameters:
  • onix_products (List[dict]) – The list of ONIX products to fix

Returns:

The amended products

Return type:

List[dict]

oaebu_workflows.onix_utils.deduplicate_related_products(onix_products)[source]

Removes any duplicated <RelatedProduct> elements. Will also remove parent ISBNs present in the <RelatedProduct>, as they are effectively duplicates.

Parameters:
  • onix_products (List[dict]) – The list of ONIX products to fix

Returns:

The amended products

Return type:

List[dict]
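
The deduplication rule can be sketched as follows. The sketch assumes a simplified product dict with a flat `RelatedProduct` list and treats `ProductIDType` "15" as the ISBN-13 identifier; the flat layout and the names `get_isbn` and `dedupe_related` are hypothetical.

```python
import json
from typing import List, Optional

ISBN13_TYPE = "15"  # ONIX product identifier type code for ISBN-13


def get_isbn(item: dict) -> Optional[str]:
    """Return the first ISBN-13 found among the item's identifiers."""
    for identifier in item.get("ProductIdentifier", []):
        if identifier.get("ProductIDType") == ISBN13_TYPE:
            return identifier.get("IDValue")
    return None


def dedupe_related(product: dict) -> dict:
    """Drop repeated related products and any that repeat the parent ISBN."""
    parent_isbn = get_isbn(product)
    seen = set()
    kept: List[dict] = []
    for related in product.get("RelatedProduct", []):
        isbn = get_isbn(related)
        if isbn is not None and isbn == parent_isbn:
            continue  # Same ISBN as the parent: effectively a duplicate
        # Compare by ISBN when there is one, otherwise by full content
        key = isbn if isbn is not None else json.dumps(related, sort_keys=True)
        if key in seen:
            continue
        seen.add(key)
        kept.append(related)
    product["RelatedProduct"] = kept
    return product


def _rp(value: str) -> dict:
    return {"ProductIdentifier": [{"IDValue": value, "ProductIDType": ISBN13_TYPE}]}


product = {**_rp("P"), "RelatedProduct": [_rp("A"), _rp("A"), _rp("P"), _rp("B")]}
deduped = dedupe_related(product)
```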

oaebu_workflows.onix_utils.elevate_related_products(onix_products)[source]

Makes a copy of an ONIX product for each of its unique related products with an ISBN. The copies swap the original ISBN with their own ISBN and receive a unique record reference. This “elevates” all related products to the product level. Note that if the product identifier for the related product is not an ISBN, it will not be elevated.

Related Product Structure:

    <RelatedProduct>
        <ProductRelationCode></ProductRelationCode>
        <ProductIdentifier>
            <IDValue></IDValue>
            <ProductIDType></ProductIDType>
        </ProductIdentifier>
        <ProductIdentifier>
            <IDValue></IDValue>
            <ProductIDType></ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>

Parameters:
  • onix_products (List[dict]) – The ONIX product list

Returns:

A list containing the original ONIX product and its related children products.

Return type:

List[dict]
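
A rough sketch of the copy-and-swap step for a single product is shown below. It assumes a simplified dict with `RecordReference`, `ProductIdentifier` and a flat `RelatedProduct` list, and it builds the new record reference by appending the ISBN; that naming scheme, the layout, and the function names are all hypothetical.

```python
import copy
from typing import List, Optional

ISBN13_TYPE = "15"  # ONIX product identifier type code for ISBN-13


def get_isbn(item: dict) -> Optional[str]:
    """Return the first ISBN-13 found among the item's identifiers."""
    for identifier in item.get("ProductIdentifier", []):
        if identifier.get("ProductIDType") == ISBN13_TYPE:
            return identifier.get("IDValue")
    return None


def elevate_one(product: dict) -> List[dict]:
    """Return the product followed by one copy per related product with an ISBN."""
    elevated = [product]
    for related in product.get("RelatedProduct", []):
        isbn = get_isbn(related)
        if isbn is None:
            continue  # Non-ISBN related products are not elevated
        child = copy.deepcopy(product)
        for identifier in child["ProductIdentifier"]:
            if identifier.get("ProductIDType") == ISBN13_TYPE:
                identifier["IDValue"] = isbn  # Swap in the related ISBN
        # Hypothetical scheme for keeping record references unique
        child["RecordReference"] = f"{product['RecordReference']}_{isbn}"
        elevated.append(child)
    return elevated


parent = {
    "RecordReference": "ref1",
    "ProductIdentifier": [{"IDValue": "P", "ProductIDType": "15"}],
    "RelatedProduct": [
        {"ProductIdentifier": [{"IDValue": "A", "ProductIDType": "15"}]},
        {"ProductIdentifier": [{"IDValue": "doi-x", "ProductIDType": "06"}]},
    ],
}
children = elevate_one(parent)
```

The deep copy matters here: a shallow copy would share the identifier dicts with the parent, so swapping the ISBN in the child would also change the parent.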

oaebu_workflows.onix_utils._get_product_isbn(product)[source]

Finds the ISBN of a product or related product

Parameters:

product (dict) –

Return type:

Union[str, None]

oaebu_workflows.onix_utils._get_related_products(product)[source]

Finds the related products of a product

Parameters:

product (dict) –

oaebu_workflows.onix_utils._get_product_identifiers(product)[source]

Finds the product identifiers of a product

Parameters:

product (dict) –

class oaebu_workflows.onix_utils.OnixProduct[source]

Represents a single ONIX product and its identifying reference for simplicity

product: dict[source]
record_reference: str[source]
oaebu_workflows.onix_utils.find_onix_product(all_lines, line_index)[source]

Finds the range of lines encompassing a <Product> tag, given a line index that is contained in the product

Parameters:
  • all_lines (list) – All lines in the onix file

  • line_index (int) – The line index associated with the product

Returns:

A two-tuple of the start and end line numbers of the product

Raises:

ValueError – Raised if the return would encompass a negative index, indicating the input line was not in a product

Return type:

OnixProduct
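
The search for the enclosing <Product> tag can be sketched as a backward and a forward scan from the given index. The sketch follows the "Returns" description and yields a (start, end) tuple of line indexes rather than an OnixProduct; it also assumes the opening and closing tags carry no attributes. `find_product_range` is a hypothetical name.

```python
from typing import List, Tuple


def find_product_range(all_lines: List[str], line_index: int) -> Tuple[int, int]:
    """Return the indexes of the <Product> and </Product> lines around line_index."""
    start = line_index
    while start >= 0 and "<Product>" not in all_lines[start]:
        # Meeting a closing tag first means the line sits between products
        if start != line_index and "</Product>" in all_lines[start]:
            raise ValueError(f"Line {line_index} is not inside a product")
        start -= 1
    if start < 0:
        raise ValueError(f"Line {line_index} is not inside a product")

    end = line_index
    while end < len(all_lines) and "</Product>" not in all_lines[end]:
        end += 1
    if end == len(all_lines):
        raise ValueError(f"No closing </Product> found after line {line_index}")
    return start, end


lines = [
    "<ONIXMessage>",
    "<Product>",
    "<RecordReference>r1</RecordReference>",
    "</Product>",
    "</ONIXMessage>",
]
span = find_product_range(lines, 2)
```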

oaebu_workflows.onix_utils.remove_invalid_products(input_xml, output_xml, invalid_products_file=None)[source]

Attempts to validate the input xml as an ONIX file. Will remove any products that contain errors.

Parameters:
  • input_xml (str) – The filepath of the xml file to validate

  • output_xml (str) – The output filepath

  • invalid_products_file (str) – The filepath to write the invalid products to. Ignored if unsupplied.

Return type:

None

oaebu_workflows.onix_utils.filter_through_schema(input, schema)[source]

This function recursively traverses the input dictionary and compares it to the provided schema. It retains only the fields and values that exist in the schema structure, and discards any fields that do not match the schema.

Example usage with a dictionary and schema:

    input_dict = {
        "name": "John",
        "age": 30,
        "address": {"street": "123 Main St", "city": "New York", "zip": "10001"},
    }
    schema = {
        "name": None,
        "age": None,
        "address": {"street": None, "city": None},
    }
    filtered_dict = filter_through_schema(input_dict, schema)

filtered_dict will be:

    {
        "name": "John",
        "age": 30,
        "address": {"street": "123 Main St", "city": "New York"},
    }

Parameters:
  • input (Union[dict, list]) – The dictionary to filter

  • schema (dict) – The schema describing the desired structure of the dictionary
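
The recursive traversal might be sketched as below. This is not the library's implementation: `filter_by_schema` is a hypothetical name, and the handling of lists (filtering every item against the same schema) is an assumption.

```python
from typing import Any


def filter_by_schema(data: Any, schema: dict) -> Any:
    """Keep only the keys of data that also appear in schema, recursively."""
    if isinstance(data, list):
        # Assumed behaviour: every list item is filtered with the same schema
        return [filter_by_schema(item, schema) for item in data]
    if not isinstance(data, dict):
        return data  # Leaf values are returned unchanged
    result = {}
    for key, value in data.items():
        if key not in schema:
            continue  # Field is not described by the schema: discard it
        sub_schema = schema[key]
        if isinstance(sub_schema, dict) and isinstance(value, (dict, list)):
            result[key] = filter_by_schema(value, sub_schema)
        else:
            result[key] = value
    return result


input_dict = {
    "name": "John",
    "age": 30,
    "address": {"street": "123 Main St", "city": "New York", "zip": "10001"},
}
schema = {"name": None, "age": None, "address": {"street": None, "city": None}}
filtered = filter_by_schema(input_dict, schema)
```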