oaebu_workflows.onix_workflow.onix_work_aggregation

Module Contents

Classes

UnionFind

Union Find using weighted quick union and path compression.

BookWork

A book work is an abstract entity comprising the intellectual property embodied in a manifestation.

BookWorkFamily

A book work family aggregates different editions of a work together.

BookWorkAggregator

Aggregates ONIX records into "works" (BookWork object). If the WorkID exists in the ONIX record, it will use that.

BookWorkFamilyAggregator

Aggregates different editions of works into a family. The methodology will be similar to BookWorkAggregator.

Functions

get_pref_product_id(relprod)

Get the most preferred product ID. It will return the one with the lowest preference number in the id_pref list.

class oaebu_workflows.onix_workflow.onix_work_aggregation.UnionFind(size)[source]

Union Find using weighted quick union and path compression. Instead of working on objects and requiring further implementation of comparators and size methods, this will operate on integers. Users should handle the mapping from objects to a distinct integer, e.g., by using the index of the objects if they are in a list. See: https://en.wikipedia.org/wiki/Disjoint-set_data_structure and https://www.cs.princeton.edu/~rs/AlgsDS07/01UnionFind.pdf

Parameters:

size (int) – Number of elements we are dealing with in total.

root(node)[source]

Find the root (representative) of an object. Update root mappings along the way (path compression).

Parameters:

node (int) – Object to find the root for.

Returns:

The object’s root representative.

Return type:

int

find(p, q)[source]

Check if two objects have the same root representative.

Parameters:
  • p (int) – First object.

  • q (int) – Second object.

Returns:

Whether p and q have the same root representative.

Return type:

bool

unite(p, q)[source]

Merge two objects into the same class, i.e., make the two objects have the same root representative. Weighted union is used so that the smaller tree is merged into the bigger tree.

Parameters:
  • p (int) – First object.

  • q (int) – Second object.

get_partition()[source]

Get the current class partition of the objects.

Returns:

Partition of the objects as a list of lists (no guaranteed ordering).

Return type:

List[List[int]]
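To make the technique concrete, the following is a minimal, illustrative sketch of weighted quick union with path compression over integers. It is not the module's source: the class name `UnionFindSketch` and its internal attributes are assumptions, and only the method names mirror the documented interface.

```python
from typing import Dict, List


class UnionFindSketch:
    """Illustrative weighted quick union with path compression over 0..size-1."""

    def __init__(self, size: int) -> None:
        self.parent = list(range(size))  # every element starts as its own root
        self.tree_size = [1] * size      # number of nodes in the tree rooted at i

    def root(self, node: int) -> int:
        # Walk up to the root, pointing each visited node at its grandparent
        # (path halving, one common form of path compression).
        while self.parent[node] != node:
            self.parent[node] = self.parent[self.parent[node]]
            node = self.parent[node]
        return node

    def find(self, p: int, q: int) -> bool:
        return self.root(p) == self.root(q)

    def unite(self, p: int, q: int) -> None:
        rp, rq = self.root(p), self.root(q)
        if rp == rq:
            return
        # Weighted union: attach the smaller tree under the larger one.
        if self.tree_size[rp] < self.tree_size[rq]:
            rp, rq = rq, rp
        self.parent[rq] = rp
        self.tree_size[rp] += self.tree_size[rq]

    def get_partition(self) -> List[List[int]]:
        classes: Dict[int, List[int]] = {}
        for node in range(len(self.parent)):
            classes.setdefault(self.root(node), []).append(node)
        return list(classes.values())


# Objects are mapped to integers by their list index, as the class description suggests.
isbns = ["9780000000001", "9780000000002", "9780000000003"]
index_of = {isbn: i for i, isbn in enumerate(isbns)}
uf = UnionFindSketch(len(isbns))
uf.unite(index_of["9780000000001"], index_of["9780000000003"])
partition = sorted(sorted(group) for group in uf.get_partition())
```

Because the partition has no guaranteed ordering, the example sorts it before inspecting it.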

class oaebu_workflows.onix_workflow.onix_work_aggregation.BookWork(*, work_id, work_id_type, products)[source]

A book work is an abstract entity comprising the intellectual property embodied in a manifestation. Works manifest themselves as products of different form, e.g., paperback, PDF.

Parameters:
  • work_id (str) – The Work ID.

  • work_id_type (str) – Type scheme used in the Work ID.

  • products (List[Dict]) – List of products that manifest the work.

add_product(product)[source]

Add a product to the work.

Parameters:

product (dict) – A product record.

class oaebu_workflows.onix_workflow.onix_work_aggregation.BookWorkFamily(*, works, work_family_id=None, work_family_id_type=None)[source]

A book work family aggregates different editions of a work together.

Parameters:
  • works (List[BookWork]) – List of works in the family.

  • work_family_id – Work family ID.

  • work_family_id_type (Union[None, str]) – Type of Work family ID.

oaebu_workflows.onix_workflow.onix_work_aggregation.get_pref_product_id(relprod)[source]

Get the most preferred product ID. It will return the one with the lowest preference number in the id_pref list, or an arbitrary identifier from the list if the listed preferences are not found.

Parameters:

relprod (dict) – Related product record.

Returns:

The Product ID type and the Product ID, as a pair.

Return type:

Tuple[str, str]
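The selection rule described above can be sketched as follows. This is only an illustration: the contents of the preference table, the flattened `{"type": ..., "value": ...}` identifier shape, and the function name `pick_preferred_id` are all assumptions, not the module's actual `id_pref` list or record layout.

```python
from typing import Dict, List, Tuple

# Hypothetical preference table: a lower number means more preferred.
ID_PREF: Dict[str, int] = {"ISBN-13": 0, "GTIN-13": 1, "Proprietary": 2}


def pick_preferred_id(identifiers: List[Dict[str, str]]) -> Tuple[str, str]:
    """Return (type, value) of the identifier with the lowest preference number."""
    # Types missing from the table all share the worst rank. min() returns the
    # first of equally ranked items, so an unlisted-only input yields its first entry.
    best = min(identifiers, key=lambda ident: ID_PREF.get(ident["type"], len(ID_PREF)))
    return best["type"], best["value"]


listed = pick_preferred_id(
    [
        {"type": "DOI", "value": "10.1234/example"},
        {"type": "ISBN-13", "value": "9780000000001"},
    ]
)
unlisted = pick_preferred_id([{"type": "DOI", "value": "10.1234/example"}])
```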

class oaebu_workflows.onix_workflow.onix_work_aggregation.BookWorkAggregator(records)[source]

Aggregates ONIX records into “works” (BookWork object). If the WorkID exists in the ONIX record, it will use that. The order of preference for the work identifier types is: ISBN-13 > DOI > Proprietary > everything else. If the ONIX record specifies no work identifier, i.e., it has no RelatedWorks info, one of the ISBN-13s in the work is used as the representative ID.

Parameters:

records (List[dict]) – List of ONIX Product records.

filter_out_duplicate_records(records)[source]

Filter out records with duplicate ISBNs. Logs the duplicates, and returns the filtered records.

Parameters:

records (dict) – Product records.

Returns:

Tuple of a list of filtered records, and a list of ISBNs which appear more than once in a record.

Return type:

List[dict]

log_duplicate_isbns(duplicates)[source]

Log the list of duplicate ISBNs encountered.

Parameters:

duplicates (List[str]) – List of duplicate ISBNs.

get_pref_work_id(identifiers)[source]

Tries to map the work identifier back to ISBN.

Parameters:

identifiers (List[Dict]) – List of WorkIdentifiers.

Returns:

Preferred work ID type and the work ID. None represents an unknown work_id type.

Return type:

Tuple[Union[None, str], Union[None, str]]
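As a rough sketch of the preference order stated in the class description (ISBN-13 > DOI > Proprietary > everything else), the ranking could look like the code below. The identifier shape, the function name `pick_preferred_work_id`, and the choice to return `(None, None)` for an empty list are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Order taken from the class description; any other type ranks last.
WORK_ID_ORDER = ["ISBN-13", "DOI", "Proprietary"]


def pick_preferred_work_id(
    identifiers: List[Dict[str, str]],
) -> Tuple[Optional[str], Optional[str]]:
    """Return the (type, value) pair that ranks highest in WORK_ID_ORDER."""
    if not identifiers:
        return None, None  # assumption: nothing to choose from

    def rank(ident: Dict[str, str]) -> int:
        id_type = ident["type"]
        return WORK_ID_ORDER.index(id_type) if id_type in WORK_ID_ORDER else len(WORK_ID_ORDER)

    best = min(identifiers, key=rank)
    return best["type"], best["value"]


preferred = pick_preferred_work_id(
    [
        {"type": "Proprietary", "value": "pub-42"},
        {"type": "DOI", "value": "10.1234/example"},
    ]
)
empty = pick_preferred_work_id([])
```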

aggregate()[source]

Run the aggregation process. Separate the records into those containing RelatedWorks info, those containing RelatedProducts info, and those containing neither. For a single publisher, only one of these cases should ever apply. A different aggregation procedure is run for each of the three cases. If a publisher mixes these approaches, this procedure needs to be revisited.

Returns:

List of BookWork objects representing the aggregated product records.

Return type:

List[BookWork]

is_relevant_work_relation(relwork)[source]

Check if the work relation code indicates a manifestation.

Parameters:

relwork (dict) – Related work.

Returns:

Whether the related work is a manifestation of the current work.

Return type:

bool

log_agg_relworks_errors(pisbn, wtype, wid)[source]

Log any errors from aggregating along RelatedWorks.

Parameters:
  • pisbn (str) – The product’s ISBN.

  • wtype (str) – Type of WorkID.

  • wid (str) – WorkID of the related work.

Returns:

True if an error was logged, False otherwise.

agg_relworks()[source]

Collect the entries with “Manifestation of” relation codes into a single work. The Work ID from that field will be used. This assumes that every product record has a “Manifestation of” field entry that points either to itself or to another record. Revisit this if a publisher does it differently.

Returns:

List of BookWork objects that categorise the product records.

Return type:

List[BookWork]

get_pid_idx(pid_type, pid)[source]

Get the product index (in self.records) using the Product ID information.

Parameters:
  • pid_type (str) – Product ID type.

  • pid (str) – Product ID.

Returns:

Index to the product, or None if the record doesn’t exist.

Return type:

Union[None, int]

set_relevant_product_relation_codes()[source]
Returns:

Set of relevant product codes indicating a manifestation of the current work.

Return type:

Set[str]

is_relevant_product_relation(relprod)[source]

Check if the related product has a relation indicating it’s the same work as the current work.

Parameters:

relprod (dict) – Related product.

Returns:

Whether the product is a manifestation.

Return type:

bool

Log the errors from aggregating along RelatedProducts.

Parameters:
  • pisbn (str) – The product’s ISBN.

  • relation (str) – The relation code.

  • ptype (str) – Related product’s identifier type.

  • pid (str) – Related product’s identifier.

get_works_from_partition(partition)[source]

Convert the partition of equivalence classes of record indices into equivalence classes of BookWork objects.

Parameters:

partition (List[List[int]]) – Partition of product record indices as works equivalence classes.

Returns:

Partition as BookWork objects.

Return type:

List[BookWork]

agg_relproducts()[source]

Aggregate the entries with targeted relation codes into a single work. Currently the only targeted relation is “Alternative format”. The Work ID will be an arbitrary ISBN13 representative from each work.

Returns:

List of BookWork objects that categorise the product records.

Return type:

List[BookWork]
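The overall flow (look up the index of each related product, unite linked records, then turn each equivalence class into a work) can be sketched as below. The record layout with `isbn13` and `alternative_formats` keys, the helper `partition_by_links`, and the plain-dict works are hypothetical simplifications. The real class operates on ONIX product records and BookWork objects.

```python
from typing import Dict, List, Tuple


def partition_by_links(n: int, links: List[Tuple[int, int]]) -> List[List[int]]:
    """Group indices 0..n-1 into classes connected by the given links."""
    parent = list(range(n))

    def root(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in links:
        parent[root(a)] = root(b)  # unweighted union, kept short for the sketch

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(root(i), []).append(i)
    return list(groups.values())


# Hypothetical, simplified records: an ISBN13 plus the ISBN13s of its alternative formats.
records = [
    {"isbn13": "9780000000001", "alternative_formats": ["9780000000002"]},
    {"isbn13": "9780000000002", "alternative_formats": []},
    {"isbn13": "9780000000003", "alternative_formats": ["9780000000009"]},  # target not in the list
]
index_of = {rec["isbn13"]: i for i, rec in enumerate(records)}
links = [
    (i, index_of[other])
    for i, rec in enumerate(records)
    for other in rec["alternative_formats"]
    if other in index_of  # skip relations that point outside the record list
]
works = [
    {"work_id": records[group[0]]["isbn13"], "isbns": [records[i]["isbn13"] for i in group]}
    for group in partition_by_links(len(records), links)
]
grouped_isbns = sorted(sorted(work["isbns"]) for work in works)
```

Here the first ISBN13 of each class stands in for the "arbitrary ISBN13 representative". Any member of the class would serve equally well.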

log_get_works_lookup_table_errors(manifestations, isbn)[source]

Log an error when an ISBN is assigned to multiple WorkIDs.

Parameters:
  • manifestations (Set[str]) – Set of work IDs manifested by an ISBN.

  • isbn (str) – ISBN that has multiple WorkID assignments.
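One way to detect the condition this method reports is to collect the set of work IDs seen for each ISBN and flag any set with more than one member. The sketch below assumes works are plain dicts with `work_id` and `isbns` keys, and `find_conflicting_isbns` is a hypothetical helper, not part of the module.

```python
from collections import defaultdict
from typing import Dict, List, Set


def find_conflicting_isbns(works: List[Dict]) -> Dict[str, Set[str]]:
    """Map each ISBN that appears under more than one work ID to those work IDs."""
    manifestations: Dict[str, Set[str]] = defaultdict(set)
    for work in works:
        for isbn in work["isbns"]:
            manifestations[isbn].add(work["work_id"])
    return {isbn: ids for isbn, ids in manifestations.items() if len(ids) > 1}


conflicts = find_conflicting_isbns(
    [
        {"work_id": "W1", "isbns": ["9780000000001", "9780000000002"]},
        {"work_id": "W2", "isbns": ["9780000000002"]},
    ]
)
```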

get_works_lookup_table()[source]

Aggregate the products into works, and output a list of dicts ready for jsonline conversion and BQ loading. Keys: ISBN, Work ID.

Returns:

List of dicts.

Return type:

List[dict]
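A minimal sketch of the output shape follows, assuming works are plain dicts and that the two keys are named `isbn13` and `work_id`. Both the key names and the helper `works_to_rows` are assumptions. Each row serialises to one line of JSON Lines text.

```python
import json
from typing import Dict, List


def works_to_rows(works: List[Dict]) -> List[Dict[str, str]]:
    """Flatten works into one row per (ISBN, work ID) pair."""
    rows = []
    for work in works:
        for isbn in work["isbns"]:
            rows.append({"isbn13": isbn, "work_id": work["work_id"]})
    return rows


rows = works_to_rows(
    [{"work_id": "9780000000001", "isbns": ["9780000000001", "9780000000002"]}]
)
# JSON Lines: one JSON object per line.
jsonl = "\n".join(json.dumps(row) for row in rows)
```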

class oaebu_workflows.onix_workflow.onix_work_aggregation.BookWorkFamilyAggregator(works)[source]

Aggregates different editions of works into a family. The methodology will be similar to BookWorkAggregator. This works with lists of BookWork objects rather than product records, so you need to have already aggregated products to works.

Parameters:

works (List[BookWork]) – List of work objects.

set_relevant_product_codes()[source]
Returns:

Set of relevant product codes indicating different editions.

Return type:

Set[str]

aggregate()[source]

Run the aggregation process. Things that hint at edition information:

  1. “Replaces”, “Replaced by”, “Is later edition of first edition” relation in RelatedProducts.

  2. [Not implemented] “Derived from” and “Related work is derived from this” relation in RelatedWorks. This might need to be supplemented with other info.

  3. [Not implemented] Title, Authors, EditionNumber.

Returns:

List of book work families.

Return type:

List[BookWorkFamily]

get_identifier_to_index_table(identifier)[source]

Create a lookup table mapping identifiers to the index of the works list.

Parameters:

identifier (str) – Identifier name, e.g., ISBN13.

Returns:

Lookup table.

Return type:

Dict
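Such a lookup table might be built as in the sketch below. It assumes each work is a dict with a `products` list and that each product stores its identifier value directly under the identifier name. That product layout and the function name `identifier_to_index` are hypothetical.

```python
from typing import Dict, List


def identifier_to_index(works: List[Dict], identifier: str) -> Dict[str, int]:
    """Map every product's identifier value to the index of the work that contains it."""
    table: Dict[str, int] = {}
    for idx, work in enumerate(works):
        for product in work["products"]:
            value = product.get(identifier)
            if value is not None:  # products without this identifier are skipped
                table[value] = idx
    return table


works = [
    {"products": [{"ISBN13": "9780000000001"}, {"ISBN13": "9780000000002"}]},
    {"products": [{"ISBN13": "9780000000003"}, {"GTIN_13": "0000000000017"}]},
]
isbn13_to_index = identifier_to_index(works, "ISBN13")
```

A table like this lets a related product's identifier be resolved to a work index in constant time, which is what a union-find over work indices needs.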

get_wid_idx(pid_type, pid, isbn13_to_index, gtin13_to_index, proprietary_to_index)[source]

Get the index into the works list for the product id.

Parameters:
  • pid_type (str) – Product identifier type.

  • pid (str) – Product ID.

  • isbn13_to_index (dict) – ISBN lookup table.

  • gtin13_to_index (dict) – GTIN13 lookup table.

  • proprietary_to_index (dict) – Proprietary ID lookup table.

Returns:

Index for the work in the works list, or None if the record doesn’t exist.

Return type:

Union[None, int]

is_relevant_product_relation(relprod)[source]

Check whether a related product contains a code indicating different (equivalent content) editions.

Parameters:

relprod (dict) – Related product.

Returns:

Whether the related product is a different edition.

Return type:

bool

Partition the works into equivalence classes of work families based on product relation codes.

Returns:

Partition of the works into work families (using work indices of the works list).

Return type:

List[List[int]]

agg_relproducts()[source]

Collect the entries with “Replaces”, “Replaced by”, “Is later edition of first edition” relation codes into a single family. The Work Family ID will be an arbitrary WorkID representative.

Returns:

List of BookWorkFamily objects that group the works.

Return type:

List[BookWorkFamily]

get_works_family_lookup_table()[source]

Aggregate the works into work families, and output a list of dicts ready for jsonline conversion and BQ loading. Keys: ISBN, Work family ID.

Returns:

List of dicts.

Return type:

Dict