
Module Contents



Class for storing information on the Java ONIX parser


Constructor for the OnixTransformer class.


Represents a single ONIX product and its identifying reference for simplicity



Downloads the ONIX parser from GitHub

onix_parser_execute(parser_path, input_dir, output_dir)

Executes the Java ONIX parser. Requires a .xml file in the input directory.


The book product table creation requires the keywords (under Subjects.SubjectHeadingText) to occur only once


Given an ONIX product list, attempts to populate the Contributors.PersonName and/or Contributors.PersonNameInverted


Given a single <RelatedProduct>, returns a list of <RelatedProduct> elements by elevating


Many products have incorrect related products. The related products should be a list of <RelatedProduct> but is


Removes any duplicated <RelatedProduct> elements. Will also remove parent ISBNs present in the <RelatedProduct>


Makes a copy of an ONIX product for each of its unique related products with an ISBN.


Finds the ISBN of a product or related product


Finds the related products of a product


Finds the product identifiers of a product

find_onix_product(all_lines, line_index)

Finds the range of lines encompassing a <Product> tag, given a line index that falls inside the product

remove_invalid_products(input_xml, output_xml[, ...])

Attempts to validate the input xml as an ONIX file. Will remove any products that contain errors.

filter_through_schema(input, schema)

This function recursively traverses the input dictionary and compares it to the provided schema.

class oaebu_workflows.onix_utils.OnixParser[source]

Class for storing information on the Java ONIX parser

  • filename – The name of the java ONIX parser file

  • url – The url to use for downloading the parser

  • template (bash) – The path to the bash template “onix_parser.sh.jinja2”

filename = 'coki-onix-parser-1.2-SNAPSHOT-shaded.jar'[source]
url = 'https://github.com/The-Academic-Observatory/onix-parser/releases/download/v1.3.0/coki-onix-parser...'[source]
cmd = 'java -jar {parser_path} {input_dir} {output_dir}'[source]
class oaebu_workflows.onix_utils.OnixTransformer(*, input_path, output_dir, filter_products=False, error_removal=False, normalise_related_products=False, deduplicate_related_products=False, elevate_related_products=False, add_name_fields=False, collapse_subjects=False, filter_schema=os.path.join(schema_folder(workflow_module='oapen_metadata_telescope'), 'oapen_metadata_filter.json'), invalid_products_name='invalid_products.xml', save_format='jsonl.gz', keep_intermediate=False)[source]

Constructor for the OnixTransformer class.

  • input_path (str) – The path to the metadata file

  • output_dir (str) – The directory to output the transformed metadata

  • filter_products (bool) – Filter the metadata through a filter schema

  • error_removal (bool) – Remove products containing errors

  • normalise_related_products (bool) – Fix improperly formatted related products

  • deduplicate_related_products (bool) – Deduplicate related products

  • add_name_fields (bool) – Add the Contributors.PersonName and Contributors.PersonNameInverted fields where possible

  • collapse_subjects (bool) – Collapse subjects into semicolon-separated strings

  • filter_schema (str) – The filter schema to use. Required if filter_products is True

  • invalid_products_name (str) – The name of the invalid products file.

  • save_format (Literal[json, jsonl, jsonl.gz]) – The format to save the transformed metadata in - json, jsonl, or jsonl.gz

  • keep_intermediate (bool) – Keep the intermediate files

  • elevate_related_products (bool) – Elevate related products to the product level

property current_metadata: List[dict] | Mapping[str, Any][source]
Return type:

Union[List[dict], Mapping[str, Any]]


Transform the oapen metadata XML file based on the supplied options.

The transformations are applied in the following order. Transforms that are not enabled are skipped:

  1) Filter the XML metadata using a schema to keep only the desired fields
  2) Remove remaining products containing errors
  3) Fix incorrectly formatted related products
  4) Elevate related products to the product level
  5) Construct the Contributors.PersonName and Contributors.PersonNameInverted fields where possible
  6) Parse through the Java parser to return .jsonl format (this is always done)
  7) Collapse subjects into semicolon-separated strings
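The ordering can be pictured as a list of optional steps applied in sequence, with the parsing step always enabled. The following is a minimal sketch of that idea only: `run_pipeline`, `tag`, and the stand-in transforms are hypothetical and are not part of `oaebu_workflows`.

```python
from typing import Callable, List, Tuple

# Each entry is (enabled, name, step). All names are illustrative placeholders.
Step = Tuple[bool, str, Callable[[str], str]]


def run_pipeline(data: str, steps: List[Step]) -> Tuple[str, List[str]]:
    """Apply the enabled steps in list order and record which ones ran."""
    applied = []
    for enabled, name, step in steps:
        if not enabled:
            continue  # Transforms that are not requested are skipped
        data = step(data)
        applied.append(name)
    return data, applied


def tag(label: str) -> Callable[[str], str]:
    """Build a stand-in transform that just appends a label to the data."""
    return lambda data: f"{data}>{label}"


steps = [
    (True, "filter_products", tag("filter")),
    (False, "error_removal", tag("errors")),
    (True, "normalise_related_products", tag("normalise")),
    (False, "elevate_related_products", tag("elevate")),
    (False, "add_name_fields", tag("names")),
    (True, "parse", tag("parse")),  # The parsing step is always enabled
    (True, "collapse_subjects", tag("collapse")),
]

result, applied = run_pipeline("xml", steps)
```

Keeping the order in a single list makes it easy to see that a disabled option simply drops out without affecting the position of the others.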

_save_metadata(metadata, file_path)[source]
  • metadata (Union[List[dict], Mapping[str, Any]]) –

  • file_path (str) –


file_path (str) –


Downloads the ONIX parser from GitHub


download_dir (str) – The directory to download the file to


A tuple of (whether the download operation succeeded, the expected location of the downloaded file)

Return type:

Tuple[bool, str]

oaebu_workflows.onix_utils.onix_parser_execute(parser_path, input_dir, output_dir)[source]

Executes the Java ONIX parser. Requires a .xml file in the input directory.
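As a rough illustration of how such a call might be assembled, the sketch below fills in the command template listed on the OnixParser class and runs it with `subprocess.run`. The helper names `build_parser_command` and `run_parser` are hypothetical, and the actual implementation may build or launch the command differently.

```python
import shlex
import subprocess
from typing import List

# Command template as listed on the OnixParser class
CMD_TEMPLATE = "java -jar {parser_path} {input_dir} {output_dir}"


def build_parser_command(parser_path: str, input_dir: str, output_dir: str) -> List[str]:
    """Fill in the template and split it into an argument list."""
    cmd = CMD_TEMPLATE.format(
        parser_path=shlex.quote(parser_path),  # Quote so paths with spaces survive the split
        input_dir=shlex.quote(input_dir),
        output_dir=shlex.quote(output_dir),
    )
    return shlex.split(cmd)


def run_parser(parser_path: str, input_dir: str, output_dir: str) -> bool:
    """Run the parser and report success as 'return code is 0'."""
    completed = subprocess.run(build_parser_command(parser_path, input_dir, output_dir))
    return completed.returncode == 0
```

Passing an argument list rather than a shell string avoids shell interpretation of the directory names.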

  • parser_path (str) – Filepath of the parser

  • input_dir (str) – The input directory - first argument of the parser

  • output_dir (str) – The output directory - second argument of the parser


Whether the task succeeded or not (return code 0 means success)

Return type:



The book product table creation requires the keywords (under Subjects.SubjectHeadingText) to occur only once. Some ONIX feeds return all keywords as separate entries. This function finds and collapses each keyword into a semicolon-separated string. Other common separators will be replaced with semicolons.
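A minimal sketch of the collapsing idea, operating on a plain list of keyword strings rather than on the full feed. The helper name `collapse_keywords` is hypothetical, and treating commas as the only "other common separator" is an assumption made for the example.

```python
import re
from typing import List


def collapse_keywords(keywords: List[str]) -> str:
    """Join separate keyword entries into one semicolon-separated string."""
    joined = ";".join(k for k in keywords if k)  # Skip empty or missing entries
    # Assumption: commas are the only other separator treated as a delimiter
    parts = re.split(r"[;,]", joined)
    cleaned = [part.strip() for part in parts if part.strip()]
    return "; ".join(cleaned)


collapsed = collapse_keywords(["history, art", "science; law", ""])
```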


onix (List[dict]) – The onix feed


The onix feed after collapsing the keywords of each row

Return type:



Given an ONIX product list, attempts to populate the Contributors.PersonName and/or Contributors.PersonNameInverted fields by concatenating the Contributors.NamesBeforeKey and Contributors.KeyNames fields where possible
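The concatenation can be sketched as follows. This assumes each product is a dict with a `Contributors` list of dicts, and that the inverted form is "KeyNames, NamesBeforeKey"; both the data shape and the helper name `add_person_names` are assumptions for illustration.

```python
from typing import List


def add_person_names(onix_products: List[dict]) -> List[dict]:
    """Fill in missing name fields from NamesBeforeKey and KeyNames."""
    for product in onix_products:
        for contributor in product.get("Contributors", []):
            first = contributor.get("NamesBeforeKey")
            last = contributor.get("KeyNames")
            if not (first and last):
                continue  # Not enough information to build either name
            if not contributor.get("PersonName"):
                contributor["PersonName"] = f"{first} {last}"
            if not contributor.get("PersonNameInverted"):
                contributor["PersonNameInverted"] = f"{last}, {first}"
    return onix_products


products = add_person_names(
    [{"Contributors": [{"NamesBeforeKey": "Ada", "KeyNames": "Lovelace"}, {"KeyNames": "Plato"}]}]
)
```

Existing values are left untouched, so the sketch only populates fields "where possible".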

  • onix_products (List[dict]) – The input ONIX feed


The onix feed with the additional fields populated where possible

Return type:



Given a single <RelatedProduct>, returns a list of <RelatedProduct> elements by elevating <ProductIdentifier> elements that shouldn’t be there. A <ProductIdentifier> element should only appear once in the <RelatedProduct> unless it has the same <IDValue> and a different <ProductIDType>


Related Product Structure:

    <RelatedProduct>
        <ProductRelationCode></ProductRelationCode>
        <ProductIdentifier>
            <IDValue></IDValue>
            <ProductIDType></ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
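One way to picture the elevation rule is to group identifiers by their IDValue, so that identifiers sharing a value stay together and every other identifier gets its own <RelatedProduct>. The sketch below assumes the element has been parsed into a dict whose `ProductIdentifier` entry is a dict or a list of dicts; the function name `split_related_product` is hypothetical.

```python
from typing import Dict, List


def split_related_product(related_product: dict) -> List[dict]:
    """Return one related product per distinct IDValue."""
    identifiers = related_product.get("ProductIdentifier", [])
    if isinstance(identifiers, dict):
        identifiers = [identifiers]  # A single identifier may not be wrapped in a list

    # Group identifiers by IDValue; dicts preserve insertion order
    groups: Dict[str, List[dict]] = {}
    for identifier in identifiers:
        groups.setdefault(identifier["IDValue"], []).append(identifier)

    # Copy the other fields (e.g. ProductRelationCode) into each new element
    return [{**related_product, "ProductIdentifier": group} for group in groups.values()]


split = split_related_product(
    {
        "ProductRelationCode": "06",
        "ProductIdentifier": [
            {"IDValue": "A", "ProductIDType": "15"},
            {"IDValue": "A", "ProductIDType": "03"},
            {"IDValue": "B", "ProductIDType": "15"},
        ],
    }
)
```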




related_product (dict) – The single <RelatedProduct> element with product identifiers to elevate


A list of <RelatedProduct> elements

Return type:


Many products have incorrectly structured related products. The related products should be a list of <RelatedProduct> elements, but are instead a single <RelatedProduct> with many <ProductIdentifier> elements. This function fixes the issue by creating a new <RelatedProduct> for each <ProductIdentifier> where necessary. Note that a <RelatedProduct> can still have more than one <ProductIdentifier> if the <IDValue> elements are the same but the <ProductIDType> is different.

  • onix_products (List[dict]) – The list of ONIX products to fix


The amended products

Return type:


Removes any duplicated <RelatedProduct> elements. Will also remove parent ISBNs present in the <RelatedProduct> as they are effectively duplicates
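The deduplication can be sketched by tracking the ISBNs already seen and seeding that set with the parent ISBN. The helper `dedupe_related` and the `isbn_of` callback are hypothetical, and keeping related products that have no ISBN is an assumption of this sketch.

```python
from typing import Callable, List, Optional


def dedupe_related(
    parent_isbn: Optional[str],
    related: List[dict],
    isbn_of: Callable[[dict], Optional[str]],
) -> List[dict]:
    """Drop related products whose ISBN repeats an earlier one or the parent's."""
    seen = {parent_isbn}  # Seeding with the parent removes related products that repeat it
    kept = []
    for item in related:
        isbn = isbn_of(item)
        if isbn is not None and isbn in seen:
            continue
        if isbn is not None:
            seen.add(isbn)
        kept.append(item)  # Assumption: items without an ISBN are kept as-is
    return kept


related = [{"isbn": "1"}, {"isbn": "2"}, {"isbn": "2"}, {"isbn": "P"}, {"id": "x"}]
unique = dedupe_related("P", related, lambda rp: rp.get("isbn"))
```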

  • onix_products (List[dict]) – The list of ONIX products to fix


The amended products

Return type:


Makes a copy of an ONIX product for each of its unique related products with an ISBN. The copies will swap the original ISBN with their own ISBN and make a unique record reference. This “elevates” all related products to the product level. Note that if the product identifier for the related product is not an ISBN, it will not be elevated

Related Product Structure:

    <RelatedProduct>
        <ProductRelationCode></ProductRelationCode>
        <ProductIdentifier>
            <IDValue></IDValue>
            <ProductIDType></ProductIDType>
        </ProductIdentifier>
        <ProductIdentifier>
            <IDValue></IDValue>
            <ProductIDType></ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
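The copy-and-swap step can be illustrated on a deliberately flattened product. The keys `RecordReference`, `ISBN13`, and `RelatedISBNs`, the helper name `elevate`, and the `"{reference}_{isbn}"` naming scheme are all assumptions made for this sketch; the real product structure is nested as shown above.

```python
import copy
from typing import List


def elevate(product: dict) -> List[dict]:
    """Return the product plus one copy per unique related ISBN."""
    elevated = [product]
    # dict.fromkeys removes repeated ISBNs while preserving order
    for isbn in dict.fromkeys(product.get("RelatedISBNs", [])):
        if isbn == product["ISBN13"]:
            continue  # The parent is already in the output
        child = copy.deepcopy(product)
        child["ISBN13"] = isbn  # Swap the original ISBN for the related one
        child["RecordReference"] = f"{product['RecordReference']}_{isbn}"  # Hypothetical scheme
        elevated.append(child)
    return elevated


elevated = elevate({"RecordReference": "ref", "ISBN13": "111", "RelatedISBNs": ["222", "222", "111", "333"]})
```

A deep copy is used so that later edits to one record cannot leak into its siblings.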



  • onix_products (List[dict]) – The ONIX product list


A list containing the original ONIX product and its related children products.

Return type:



Finds the ISBN of a product or related product


product (dict) –

Return type:

Union[str, None]

Finds the related products of a product


product (dict) –


Finds the product identifiers of a product


product (dict) –

class oaebu_workflows.onix_utils.OnixProduct[source]

Represents a single ONIX product and its identifying reference for simplicity

product: dict[source]
record_reference: str[source]
oaebu_workflows.onix_utils.find_onix_product(all_lines, line_index)[source]

Finds the range of lines encompassing a <Product> tag, given a line index that falls inside the product
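A minimal sketch of such a search, scanning backwards for the opening tag and forwards for the closing tag. It assumes the tags appear literally as `<Product>` and `</Product>` with no attributes; `find_product_range` is a hypothetical name, not the library function.

```python
from typing import List, Tuple


def find_product_range(all_lines: List[str], line_index: int) -> Tuple[int, int]:
    """Return (start, end) line indices of the product containing line_index."""
    start = line_index
    while start >= 0 and "<Product>" not in all_lines[start]:
        # Meeting a closing tag on an earlier line means we began outside a product
        if start != line_index and "</Product>" in all_lines[start]:
            start = -1
            break
        start -= 1
    if start < 0:
        raise ValueError(f"Line {line_index} is not inside a <Product>")

    end = line_index
    while end < len(all_lines) and "</Product>" not in all_lines[end]:
        end += 1
    if end == len(all_lines):
        raise ValueError(f"No closing </Product> found after line {line_index}")
    return start, end


lines = ["<ONIXMessage>", "<Product>", "<a/>", "</Product>", "</ONIXMessage>"]
product_range = find_product_range(lines, 2)
```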

  • all_lines (list) – All lines in the onix file

  • line_index (int) – The line number associated with the product


A two-tuple of the start and end line numbers of the product


ValueError – Raised if the return would encompass a negative index, indicating the input line was not in a product

Return type:


oaebu_workflows.onix_utils.remove_invalid_products(input_xml, output_xml, invalid_products_file=None)[source]

Attempts to validate the input xml as an ONIX file. Will remove any products that contain errors.

  • input_xml (str) – The filepath of the xml file to validate

  • output_xml (str) – The output filepath

  • invalid_products_file (str) – The filepath to write the invalid products to. Ignored if unsupplied.

Return type:


oaebu_workflows.onix_utils.filter_through_schema(input, schema)[source]

This function recursively traverses the input dictionary and compares it to the provided schema. It retains only the fields and values that exist in the schema structure, and discards any fields that do not match the schema.

    # Example usage with a dictionary and schema:
    input_dict = {
        "name": "John",
        "age": 30,
        "address": {"street": "123 Main St", "city": "New York", "zip": "10001"},
    }
    schema = {
        "name": None,
        "age": None,
        "address": {"street": None, "city": None},
    }
    filtered_dict = filter_through_schema(input_dict, schema)
    # filtered_dict will be:
    # {
    #     "name": "John",
    #     "age": 30,
    #     "address": {"street": "123 Main St", "city": "New York"},
    # }

  • input (Union[dict, list]) – The dictionary to filter

  • schema (dict) – The schema describing the desired structure of the dictionary
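For readers who want to see the recursion spelled out, here is a minimal sketch of a schema filter with the behaviour described above. It is not the library's implementation: `filter_by_schema_sketch` is a hypothetical name, and applying the same schema to every item of a list is an assumption.

```python
from typing import Any


def filter_by_schema_sketch(data: Any, schema: dict) -> Any:
    """Keep only the keys of data that also appear in schema, recursively."""
    if isinstance(data, list):
        # Assumption: the same schema applies to every item of a list
        return [filter_by_schema_sketch(item, schema) for item in data]
    if not isinstance(data, dict):
        return data  # Plain values have nothing to filter

    filtered = {}
    for key, value in data.items():
        if key not in schema:
            continue  # Discard fields that the schema does not mention
        sub_schema = schema[key]
        if isinstance(sub_schema, dict):
            filtered[key] = filter_by_schema_sketch(value, sub_schema)
        else:
            filtered[key] = value  # A None entry in the schema means "keep as-is"
    return filtered


example = filter_by_schema_sketch(
    {"name": "John", "age": 30, "address": {"street": "123 Main St", "city": "New York", "zip": "10001"}},
    {"name": None, "age": None, "address": {"street": None, "city": None}},
)
```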