oaebu_workflows.onix_utils
Module Contents
Classes
Class for storing information on the Java ONIX parser
Constructor for the OnixTransformer class.
Represents a single ONIX product and its identifying reference for simplicity
Functions
Downloads the ONIX parser from GitHub
Executes the Java ONIX parser. Requires a .xml file in the input directory.
The book product table creation requires the keywords (under Subjects.SubjectHeadingText) to occur only once.
Given an ONIX product list, attempts to populate the Contributors.PersonName and/or Contributors.PersonNameInverted fields.
Given a single <RelatedProduct>, returns a list of <RelatedProduct> elements by elevating <ProductIdentifier> elements that shouldn’t be there.
Many products have incorrect related products.
Removes any duplicated <RelatedProduct> elements.
Makes a copy of an ONIX product for each of its unique related products with an ISBN.
Finds the ISBN of a product or related product
Finds the related products of a product
Finds the product identifiers of a product
Finds the range of lines encompassing a <Product> tag, given a line index that is contained in the product
Attempts to validate the input xml as an ONIX file. Will remove any products that contain errors.
This function recursively traverses the input dictionary and compares it to the provided schema.
- class oaebu_workflows.onix_utils.OnixParser[source]
Class for storing information on the Java ONIX parser
- Parameters:
filename – The name of the Java ONIX parser file
url – The url to use for downloading the parser
template (bash) – The path to the bash template “onix_parser.sh.jinja2”
- class oaebu_workflows.onix_utils.OnixTransformer(*, input_path, output_dir, filter_products=False, error_removal=False, normalise_related_products=False, deduplicate_related_products=False, elevate_related_products=False, add_name_fields=False, collapse_subjects=False, filter_schema=os.path.join(schema_folder(workflow_module='oapen_metadata_telescope'), 'oapen_metadata_filter.json'), invalid_products_name='invalid_products.xml', save_format='jsonl.gz', keep_intermediate=False)[source]
Constructor for the OnixTransformer class.
- Parameters:
input_path (str) – The path to the metadata file
output_dir (str) – The directory to output the transformed metadata
filter_products (bool) – Filter the metadata through a filter schema
error_removal (bool) – Remove products containing errors
normalise_related_products (bool) – Fix improperly formatted related products
deduplicate_related_products (bool) – Deduplicate related products
add_name_fields (bool) – Add the Contributor.PersonName and Contributor.InvertedPersonName fields where possible
collapse_subjects (bool) – Collapse subjects into semicolon-separated strings
filter_schema (str) – The filter schema to use. Required if filter_products is True
invalid_products_name (str) – The name of the invalid products file.
save_format (Literal[json, jsonl, jsonl.gz]) – The format to save the transformed metadata in - json, jsonl, or jsonl.gz
keep_intermediate (bool) – Keep the intermediate files
elevate_related_products (bool) – Elevate related products to the product level
- property current_metadata: List[dict] | Mapping[str, Any][source]
- Return type:
Union[List[dict], Mapping[str, Any]]
- transform()[source]
Transform the oapen metadata XML file based on the supplied options.
The transformations will be done in the following order. Transforms not included will be skipped:
1) Filter the XML metadata using a schema to keep the desired fields only
2) Remove remaining products containing errors
3) Fix incorrectly formatted related products
4) Elevate related products to the product level
5) Construct the Contributor.PersonName and Contributor.InvertedPersonName fields where possible
6) Parse through the Java parser to return .jsonl format (this is always done)
7) Collapse subjects into semicolon-separated strings
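The idea of a fixed order with skippable steps can be pictured as a list of (flag, name, function) entries applied in sequence. The following is a minimal sketch under that assumption, not the OnixTransformer implementation; `run_steps` and the toy step functions are hypothetical names used only for illustration.

```python
from typing import Callable, List, Tuple

# Each entry: (enabled flag, step name, function applied to the data).
Step = Tuple[bool, str, Callable[[list], list]]


def run_steps(data: list, steps: List[Step]) -> Tuple[list, List[str]]:
    """Apply the enabled steps in list order and report which ones ran."""
    applied: List[str] = []
    for enabled, name, func in steps:
        if not enabled:
            continue  # transforms that were not requested are skipped
        data = func(data)
        applied.append(name)
    return data, applied


# Toy stand-ins for the real transforms: each one only tags the data.
steps: List[Step] = [
    (False, "filter_products", lambda d: d + ["filtered"]),
    (True, "error_removal", lambda d: d + ["errors_removed"]),
    (True, "parse", lambda d: d + ["parsed"]),  # the parse step is always enabled
    (True, "collapse_subjects", lambda d: d + ["collapsed"]),
]
result, applied = run_steps([], steps)
print(applied)
```

Keeping the order in one list means that enabling or disabling a flag never changes the relative order of the remaining steps.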
- _save_metadata(metadata, file_path)[source]
- Parameters:
metadata (Union[List[dict], Mapping[str, Any]]) –
file_path (str) –
- oaebu_workflows.onix_utils.onix_parser_download(download_dir=observatory_home('bin'))[source]
Downloads the ONIX parser from GitHub
- Parameters:
download_dir (str) – The directory to download the file to
- Returns:
(Whether the download operation was a success, The (expected) location of the downloaded file)
- Return type:
Tuple[bool, str]
- oaebu_workflows.onix_utils.onix_parser_execute(parser_path, input_dir, output_dir)[source]
Executes the Java ONIX parser. Requires a .xml file in the input directory.
- Parameters:
parser_path (str) – Filepath of the parser
input_dir (str) – The input directory - first argument of the parser
output_dir (str) – The output directory - second argument of the parser
- Returns:
Whether the task succeeded or not (return code 0 means success)
- Return type:
bool
- oaebu_workflows.onix_utils.collapse_subjects(onix)[source]
The book product table creation requires the keywords (under Subjects.SubjectHeadingText) to occur only once. Some ONIX feeds return all keywords as separate entries. This function finds and collapses each keyword into a semicolon-separated string. Other common separators will be replaced with semicolons.
- Parameters:
onix (List[dict]) – The onix feed
- Returns:
The onix feed after collapsing the keywords of each row
- Return type:
List[dict]
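As an illustration of the collapsing step, the sketch below joins a list of keyword entries into one semicolon-separated string and normalises a few other separators. It is a hedged example rather than the library code: the helper name `collapse_keywords` and the choice of separators (comma and pipe) are assumptions.

```python
import re
from typing import List

# Separators treated as equivalent to a semicolon (an assumption for this sketch).
_SEPARATORS = re.compile(r"\s*[;,|]\s*")


def collapse_keywords(keywords: List[str]) -> str:
    """Join keyword entries into a single '; '-separated string."""
    parts: List[str] = []
    for entry in keywords:
        # Split entries that already contain several keywords.
        for piece in _SEPARATORS.split(entry):
            piece = piece.strip()
            if piece:  # drop empty fragments such as a trailing separator
                parts.append(piece)
    return "; ".join(parts)


collapsed = collapse_keywords(["history", "art, design", "maps | atlases"])
print(collapsed)
```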
- oaebu_workflows.onix_utils.create_personname_fields(onix_products)[source]
Given an ONIX product list, attempts to populate the Contributors.PersonName and/or Contributors.PersonNameInverted fields by concatenating the Contributors.NamesBeforeKey and Contributors.KeyNames fields where possible.
- Parameters:
onix_products (List[dict]) – The input onix feed
- Returns:
The onix feed with the additional fields populated where possible
- Return type:
List[dict]
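The concatenation can be sketched for a single contributor record as follows. This is an assumption-laden example, not the library's code: `add_person_names` is a hypothetical helper, the contributor is assumed to be a flat dict, and the "KeyNames, NamesBeforeKey" layout of the inverted name is a guess at the intended format.

```python
def add_person_names(contributor: dict) -> dict:
    """Return a copy with PersonName / PersonNameInverted filled in where possible."""
    result = dict(contributor)
    before = result.get("NamesBeforeKey")
    key = result.get("KeyNames")
    # Both parts are needed to build a name; otherwise leave the record untouched.
    if before and key:
        if not result.get("PersonName"):
            result["PersonName"] = f"{before} {key}"
        if not result.get("PersonNameInverted"):
            # Assumed format for the inverted name.
            result["PersonNameInverted"] = f"{key}, {before}"
    return result


filled = add_person_names({"NamesBeforeKey": "Ada", "KeyNames": "Lovelace"})
print(filled["PersonName"], "|", filled["PersonNameInverted"])
```

Existing values are kept as they are, so a feed that already supplies PersonName is not overwritten.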
- oaebu_workflows.onix_utils.elevate_product_identifiers(related_product)[source]
Given a single <RelatedProduct>, returns a list of <RelatedProduct> elements by elevating <ProductIdentifier> elements that shouldn’t be there. A <ProductIdentifier> element should only appear once in the <RelatedProduct> unless it has the same <IDValue> and a different <ProductIDType>
<RelatedProduct>
    <ProductRelationCode></ProductRelationCode>
    <ProductIdentifier>
        <IDValue></IDValue>
        <ProductIDType></ProductIDType>
    </ProductIdentifier>
</RelatedProduct>
- Parameters:
related_product (dict) – The single <RelatedProduct> element with product identifiers to elevate
- Returns:
A list of <RelatedProduct> elements
- Return type:
List[dict]
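One way to picture this elevation is to group the identifiers by their IDValue and emit one related product per group. The sketch below assumes the <RelatedProduct> has been parsed into a dict with "ProductRelationCode" and "ProductIdentifier" keys; `split_related_product` is a hypothetical name and the real function may differ in detail.

```python
from typing import Dict, List


def split_related_product(related_product: dict) -> List[dict]:
    """Split one related product into one per distinct IDValue."""
    identifiers = related_product.get("ProductIdentifier", [])
    if isinstance(identifiers, dict):  # a single identifier parsed as a dict
        identifiers = [identifiers]

    # Identifiers sharing an IDValue (differing only in ProductIDType) stay together.
    groups: Dict[str, List[dict]] = {}
    for identifier in identifiers:
        groups.setdefault(identifier["IDValue"], []).append(identifier)

    return [
        {
            "ProductRelationCode": related_product.get("ProductRelationCode"),
            "ProductIdentifier": group,
        }
        for group in groups.values()
    ]


example = {
    "ProductRelationCode": "06",
    "ProductIdentifier": [
        {"IDValue": "9780000000001", "ProductIDType": "15"},
        {"IDValue": "9780000000002", "ProductIDType": "15"},
        {"IDValue": "9780000000001", "ProductIDType": "03"},
    ],
}
split = split_related_product(example)
print(len(split))
```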
Many products have incorrect related products. The related products should be a list of <RelatedProduct> elements but are instead a single <RelatedProduct> with many <ProductIdentifier> elements. This function fixes this issue by creating a new <RelatedProduct> for each <ProductIdentifier> where necessary. Note that a <RelatedProduct> can still have more than one <ProductIdentifier> if the <IDValue> elements are the same, but the <ProductIDType> is different.
- Parameters:
onix_products (List[dict]) – The list of onix products to fix
- Returns:
The amended products
- Return type:
List[dict]
Removes any duplicated <RelatedProduct> elements. Will also remove parent ISBNs present in the <RelatedProduct>, as they are effectively duplicates.
- Parameters:
onix_products (List[dict]) – The list of onix products to fix
- Returns:
The amended products
- Return type:
List[dict]
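The deduplication rule, including the treatment of the parent ISBN as a duplicate, can be sketched on plain ISBN strings. This is a simplified illustration only: `dedupe_related_isbns` is a hypothetical helper, and the real function works on full ONIX product dictionaries rather than lists of strings.

```python
from typing import List


def dedupe_related_isbns(parent_isbn: str, related_isbns: List[str]) -> List[str]:
    """Drop repeated ISBNs and any occurrence of the parent's own ISBN."""
    seen = {parent_isbn}  # the parent's ISBN counts as already seen
    unique: List[str] = []
    for isbn in related_isbns:
        if isbn in seen:
            continue
        seen.add(isbn)
        unique.append(isbn)  # first occurrence wins, order is preserved
    return unique


deduped = dedupe_related_isbns("111", ["222", "111", "333", "222"])
print(deduped)
```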
Makes a copy of an ONIX product for each of its unique related products with an ISBN. The copies will swap the original ISBN with their own ISBN and make a unique record reference. This “elevates” all related products to the product level. Note that if the product identifier for the related product is not an ISBN, it will not be elevated.
Related Product Structure:
<RelatedProduct>
    <ProductRelationCode></ProductRelationCode>
    <ProductIdentifier>
        <IDValue></IDValue>
        <ProductIDType></ProductIDType>
    </ProductIdentifier>
    <ProductIdentifier>
        <IDValue></IDValue>
        <ProductIDType></ProductIDType>
    </ProductIdentifier>
</RelatedProduct>
- Parameters:
onix_products (List[dict]) – The ONIX product list
- Returns:
A list containing the original ONIX product and its related children products.
- Return type:
List[dict]
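A simplified sketch of this elevation is shown below. It assumes a flattened product dict with hypothetical "RecordReference", "ISBN" and "RelatedISBNs" keys; the record reference format and the exact meaning of the ISBN swap are guesses made for illustration, not the library's behaviour.

```python
import copy
from typing import List


def elevate_related(product: dict) -> List[dict]:
    """Return the product followed by one copy per unique related ISBN."""
    elevated = [product]
    seen = {product["ISBN"]}
    for isbn in product.get("RelatedISBNs", []):
        if isbn in seen:
            continue  # only unique related ISBNs are elevated
        seen.add(isbn)
        child = copy.deepcopy(product)
        child["ISBN"] = isbn
        # Swap: the child now lists the parent's ISBN where its own ISBN used to be.
        child["RelatedISBNs"] = [
            product["ISBN"] if related == isbn else related
            for related in product["RelatedISBNs"]
        ]
        # Assumed scheme for keeping record references unique.
        child["RecordReference"] = f"{product['RecordReference']}_{isbn}"
        elevated.append(child)
    return elevated


parent = {"RecordReference": "ref1", "ISBN": "111", "RelatedISBNs": ["222", "333", "222"]}
products = elevate_related(parent)
print([p["ISBN"] for p in products])
```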
- oaebu_workflows.onix_utils._get_product_isbn(product)[source]
Finds the ISBN of a product or related product
- Parameters:
product (dict) –
- Return type:
Union[str, None]
Finds the related products of a product
- Parameters:
product (dict) –
- oaebu_workflows.onix_utils._get_product_identifiers(product)[source]
Finds the product identifiers of a product
- Parameters:
product (dict) –
- class oaebu_workflows.onix_utils.OnixProduct[source]
Represents a single ONIX product and its identifying reference for simplicity
- oaebu_workflows.onix_utils.find_onix_product(all_lines, line_index)[source]
Finds the range of lines encompassing a <Product> tag, given a line index that is contained in the product
- Parameters:
all_lines (list) – All lines in the onix file
line_index (int) – The line number associated with the product
- Returns:
A two-tuple of the start and end line numbers of the product
- Raises:
ValueError – Raised if the return would encompass a negative index, indicating the input line was not in a product
- Return type:
Tuple[int, int]
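The search can be sketched as a scan backwards for the opening tag and forwards for the closing tag. This is a minimal illustration, assuming the tags appear literally as `<Product>` and `</Product>` on their own lines; `find_product_range` is a hypothetical name and the real function may handle more cases.

```python
from typing import List, Tuple


def find_product_range(all_lines: List[str], line_index: int) -> Tuple[int, int]:
    """Return (start, end) indices of the <Product> block containing line_index."""
    start = line_index
    while start >= 0 and "<Product>" not in all_lines[start]:
        # Meeting a closing tag first means the line sits between two products.
        if start != line_index and "</Product>" in all_lines[start]:
            raise ValueError("Line is not inside a <Product> element")
        start -= 1
    if start < 0:
        raise ValueError("Line is not inside a <Product> element")

    end = line_index
    while end < len(all_lines) and "</Product>" not in all_lines[end]:
        end += 1
    if end >= len(all_lines):
        raise ValueError("No closing </Product> tag found")
    return start, end


lines = [
    "<ONIXMessage>",
    "<Product>",
    "<RecordReference>a</RecordReference>",
    "</Product>",
    "<Product>",
    "<RecordReference>b</RecordReference>",
    "</Product>",
    "</ONIXMessage>",
]
print(find_product_range(lines, 5))
```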
- oaebu_workflows.onix_utils.remove_invalid_products(input_xml, output_xml, invalid_products_file=None)[source]
Attempts to validate the input xml as an ONIX file. Will remove any products that contain errors.
- Parameters:
input_xml (str) – The filepath of the xml file to validate
output_xml (str) – The output filepath
invalid_products_file (str) – The filepath to write the invalid products to. Ignored if unsupplied.
- Return type:
None
- oaebu_workflows.onix_utils.filter_through_schema(input, schema)[source]
This function recursively traverses the input dictionary and compares it to the provided schema. It retains only the fields and values that exist in the schema structure, and discards any fields that do not match the schema.
# Example usage with a dictionary and schema:
input_dict = {
    "name": "John",
    "age": 30,
    "address": {"street": "123 Main St", "city": "New York", "zip": "10001"},
}
schema = {
    "name": None,
    "age": None,
    "address": {"street": None, "city": None},
}
filtered_dict = filter_through_schema(input_dict, schema)
# filtered_dict will be:
# {
#     "name": "John",
#     "age": 30,
#     "address": {"street": "123 Main St", "city": "New York"},
# }
- Parameters:
input (Union[dict, list]) – The dictionary to filter
schema (dict) – The schema describing the desired structure of the dictionary
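A recursive filter of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the library's implementation: `filter_by_schema` is a hypothetical name, a schema value of None marks a leaf to keep as-is, and list items are each filtered against the same schema.

```python
from typing import Any


def filter_by_schema(data: Any, schema: Any) -> Any:
    """Keep only the keys of data that also appear in schema, recursively."""
    if isinstance(data, list):
        # Every item in a list is filtered against the same schema.
        return [filter_by_schema(item, schema) for item in data]
    if isinstance(data, dict) and isinstance(schema, dict):
        filtered = {}
        for key, value in data.items():
            if key not in schema:
                continue  # field is not part of the schema, discard it
            sub_schema = schema[key]
            # None marks a leaf: keep the value unchanged.
            filtered[key] = value if sub_schema is None else filter_by_schema(value, sub_schema)
        return filtered
    return data


record = {
    "name": "John",
    "age": 30,
    "address": {"street": "123 Main St", "city": "New York", "zip": "10001"},
}
wanted = {"name": None, "age": None, "address": {"street": None, "city": None}}
print(filter_by_schema(record, wanted))
```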