Google Books
The Google Books Partner program enables selling books through the Google Play store and offering a preview on Google Books.
The program makes books discoverable to Google users around the world on Google Books. When readers find a book on Google Books, they can preview a limited number of pages to decide if they’re interested in it.
Readers can also follow links to buy the book or borrow or download it when applicable.
As a publisher you can download reports on Google Books data from https://play.google.com/books/publish/.
Currently there are 3 report types available:
Google Play sales summary report
Google Play sales transaction report
Google Books Traffic Report
In this telescope we collect data from the last 2 reports.
The corresponding tables created in BigQuery are google.google_books_salesYYYYMMDD and google.google_books_trafficYYYYMMDD.
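As a small illustration of the table naming scheme, the sketch below builds the date-sharded table names for a given date. The helper `sharded_table_id` is hypothetical and not part of the telescope code; only the dataset and table names are taken from the text above, and the example date is arbitrary.

```python
from datetime import date


def sharded_table_id(dataset: str, table: str, shard_date: date) -> str:
    """Return '<dataset>.<table>YYYYMMDD' for the given shard date."""
    return f"{dataset}.{table}{shard_date.strftime('%Y%m%d')}"


if __name__ == "__main__":
    # Example date chosen for illustration only
    shard_date = date(2021, 1, 31)
    print(sharded_table_id("google", "google_books_sales", shard_date))
    print(sharded_table_id("google", "google_books_traffic", shard_date))
```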
| Summary | |
|---|---|
| Average runtime | 5 min |
| Average download size | 1-100 MB |
| Harvest Type | SFTP |
| Harvest Frequency | Weekly |
| Runs on remote worker | False |
| Catchup missed runs | True |
| Table Write Disposition | Truncate |
| Update Frequency | Daily |
| Credentials Required | Yes |
| Uses Telescope Template | Snapshot |
| Each shard includes all data | No |
Telescope object ‘extra’
This telescope is created using the Observatory API. There is one ‘extra’ field that is optional for the corresponding Telescope object, namely the ‘accounts’ field.
accounts
This field is only required if a publisher uses more than 1 Google Books account.
If there are multiple accounts for 1 publisher, the reports of these accounts (for the same report type and month) are combined in the ‘transform’ step of the telescope.
To distinguish the reports of the same type and date, but from different accounts, a file suffix is used.
When uploading the reports to the SFTP server, this file suffix should be included in the file name.
There are instructions on how to download and correctly name the reports manually, as well as how to do it semi-automatically using Selenium.
A list of the file suffixes described above should be passed on to the Telescope ‘extra’ object.
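To make the role of the file suffix concrete, here is a minimal sketch of how files that follow the naming convention used in this document could be grouped by report type and month, so that reports from different accounts end up in the same group. This is not the telescope's actual ‘transform’ code; `group_reports` and the regular expression are assumptions made for illustration.

```python
import re
from collections import defaultdict

# <ReportType>_<file_suffix>YYYY_MM.csv, where <file_suffix> may be empty
REPORT_PATTERN = re.compile(
    r"^(GoogleSalesTransactionReport|GoogleBooksTrafficReport)_(.*?)(\d{4})_(\d{2})\.csv$"
)


def group_reports(file_names):
    """Group report file names by (report type, year, month).

    Each group maps to a sorted list of (file suffix, file name) pairs,
    one pair per account. File names that do not match are ignored.
    """
    groups = defaultdict(list)
    for file_name in file_names:
        match = REPORT_PATTERN.match(file_name)
        if match is None:
            continue
        report_type, suffix, year, month = match.groups()
        groups[(report_type, year, month)].append((suffix, file_name))
    return {key: sorted(value) for key, value in groups.items()}


if __name__ == "__main__":
    example_files = [
        "GoogleSalesTransactionReport_suffix12021_01.csv",
        "GoogleSalesTransactionReport_suffix22021_01.csv",
        "GoogleBooksTrafficReport_2021_01.csv",
    ]
    for key, reports in group_reports(example_files).items():
        print(key, reports)
```

Because the date part has a fixed width and sits at the end of the file name, the suffix can be separated from it even when the suffix itself ends in a digit.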
Authentication
The reports are downloaded from https://play.google.com/books/publish/. To get access to the reports, the publisher needs to grant access to a Google service account.
This service account can then be used to log in to this webpage and download each report manually.
Setting up a service account
Create a service account from IAM & Admin - Service Accounts
Create a JSON key and download the file with key
For each organisation/publisher of interest, ask them to add this service account for Google Books
Downloading Reports Manually
There is no API available to download the Google Books reports, and it is quite challenging to automate the Google login process with tools such as Selenium, because Google’s bot detection triggers a reCAPTCHA.
Until this step can be automated, the reports need to be downloaded manually.
For each publisher and for both the sales transaction report and the traffic report:
A report should be created for exactly 1 month (e.g. starting 2021-01-01 and ending 2021-01-31).
All titles should be selected.
All countries should be selected.
The traffic report is organised by ‘Book’.
It is important to save the file with the right name, which should be in one of the following formats (<file_suffix> is optional):
GoogleSalesTransactionReport_<file_suffix>YYYY_MM.csv
or
GoogleBooksTrafficReport_<file_suffix>YYYY_MM.csv
Upload each report to the SFTP server at https://oaebu.exavault.com/
Add it to the folder
/telescopes/google_books/<publisher>/upload
Files are moved between folders automatically; please do not move files between folders manually.
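The date range and naming rules above can be sketched as follows. The helpers `month_bounds` and `report_file_name` are hypothetical names introduced for illustration and are not part of the telescope; the file name prefixes are taken from the formats listed above.

```python
import calendar
from datetime import date

REPORT_PREFIXES = {
    "sales": "GoogleSalesTransactionReport",
    "traffic": "GoogleBooksTrafficReport",
}


def month_bounds(year: int, month: int):
    """Return the first and last day of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def report_file_name(report_type: str, year: int, month: int, file_suffix: str = "") -> str:
    """Build '<prefix>_<file_suffix>YYYY_MM.csv' for a 'sales' or 'traffic' report."""
    prefix = REPORT_PREFIXES[report_type]
    return f"{prefix}_{file_suffix}{year:04d}_{month:02d}.csv"


if __name__ == "__main__":
    start, end = month_bounds(2021, 1)
    print(start, end)
    print(report_file_name("sales", 2021, 1))
    print(report_file_name("traffic", 2021, 1, file_suffix="suffix1"))
```

Using `calendar.monthrange` avoids hard-coding month lengths, so leap years are handled without extra logic.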
Using Selenium to help download reports
When downloading many reports it might be faster to use the script below that helps to download the reports.
The script must be run in debug mode, so that a breakpoint can be set at the right spot (marked in the code) and you can manually log in with your Google account.
From there on, the reports are automatically downloaded on a monthly basis between the given start and end date, for the given publisher account numbers.
To use Selenium you need the Chrome WebDriver (ChromeDriver), which has to be downloaded separately.
import os
import shutil
import time

import pendulum
from selenium import webdriver


def main():
    """Download Google Books traffic and sales report using Selenium.
    Needs to be run in debug mode, because it requires manual sign in at breakpoint (to avoid bot detection).

    Reports are downloaded at a monthly granularity between the start_date and end_date.
    They are downloaded for each publisher in the 'account_numbers' dict and moved to the corresponding subdirectory
    in the download directory.
    If a publisher has more than 1 account linked a tuple should be used with the publisher name and a file suffix.
    The file suffix will be added to the filepath and is used to distinguish reports from different accounts for
    the same publisher.
    The file suffixes that are used here should be passed on to the telescope 'extra' information as described in the
    docs.

    The traffic report is organised by 'Book'.

    :return: None.
    """

    """ Customise values """
    download_dir = "/path/to/download/dir"
    driver_path = "/path/to/chromedriver"
    # Account numbers can be found in the page path when you are signed in to the google books partner center
    account_numbers = {
        "account_number1": "publisher_name1",
        "account_number2": "publisher_name2",
        "account_number3": ("publisher_name3", "suffix1"),
        "account_number4": ("publisher_name3", "suffix2"),
    }
    start_date = pendulum.datetime(2018, 1, 1)
    end_date = pendulum.now()
    """ Customise values """

    # Set download dir for webdriver
    chrome_options = webdriver.ChromeOptions()
    prefs = {"download.default_directory": download_dir}
    chrome_options.add_experimental_option("prefs", prefs)

    # Initialise webdriver and go to books url to login
    driver = webdriver.Chrome(executable_path=driver_path, chrome_options=chrome_options)
    driver.get("https://play.google.com/books/publish/")
    fmt = "%Y,%-m,%-d"  # <-------- set breakpoint here and manually sign in

    # Create download dir
    if not os.path.exists(download_dir):
        os.mkdir(download_dir)

    # Loop through publishers
    for account_number, publisher in account_numbers.items():
        # Get publisher name and file suffix if given
        if isinstance(publisher, tuple):
            name = publisher[0]
            file_suffix = publisher[1]
        else:
            name = publisher
            file_suffix = ""

        # Create publisher dir
        publisher_dir = os.path.join(download_dir, name)
        if not os.path.exists(publisher_dir):
            os.mkdir(publisher_dir)

        # Loop through months
        period = pendulum.period(start_date, end_date)
        for dt in period.range("months"):
            # Skip month if month is not finished yet
            if dt.end_of("month") >= pendulum.now():
                continue

            # Get start and end date in correct string format
            start = dt.strftime(fmt)
            end = dt.end_of("month").strftime(fmt)

            # Download traffic report
            traffic_report_src = os.path.join(download_dir, "GoogleBooksTrafficReport.csv")
            traffic_report_dst = os.path.join(
                publisher_dir, f'GoogleBooksTrafficReport_{file_suffix}{dt.strftime("%Y_%m")}.csv'
            )
            url = (
                f"https://play.google.com/books/publish/u/2/a/{account_number}/downloadTrafficReport?"
                f"f.req=[[null,{start}],[null,{end}],2,false]"
            )
            download_report(driver, url, traffic_report_src, traffic_report_dst)

            # Download sales report
            sales_report_src = os.path.join(download_dir, "GoogleSalesTransactionReport.csv")
            sales_report_dst = os.path.join(
                publisher_dir,
                f'GoogleSalesTransactionReport_{file_suffix}{dt.strftime("%Y_%m")}.csv',
            )
            url = (
                f"https://play.google.com/books/publish/a/{account_number}/downloadSalesTransactionReport?"
                f"f.req=[[null,{start}],[null,{end}],[],null,null,null,[],[]]"
            )
            download_report(driver, url, sales_report_src, sales_report_dst)


def download_report(driver: webdriver, url: str, src_path: str, dst_path: str):
    """Download a traffic or sales report from url and move report to a different location.

    :param driver: The chrome webdriver
    :param url: Download url
    :param src_path: File path where file is automatically downloaded to
    :param dst_path: File path where file is moved to
    :return: None.
    """
    # Check if report already exists
    if os.path.exists(dst_path):
        return

    # Download from url
    driver.get(url)
    while not os.path.exists(src_path):
        time.sleep(2)

    # Move to correct dir and add date to filename
    shutil.move(src_path, dst_path)
    print(f"Downloaded: {dst_path}")


if __name__ == "__main__":
    main()
Airflow connections
Note that all values need to be URL-encoded.
In the config.yaml file, the following Airflow connection is required:
sftp_service
The sftp_service Airflow connection is used to connect to the SFTP service and download the reports.
The username and password are created by the SFTP service and the host is e.g. oaebu.exavault.com.
The host key is optional; you can get it by running ssh-keyscan, e.g.:
ssh-keyscan oaebu.exavault.com
sftp_service: ssh://<username>:<password>@<host>:<port>?host_key=<host_key>
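Since every value in the connection string has to be URL-encoded, a small helper can reduce mistakes with special characters in usernames, passwords or host keys. The sketch below is an illustration only: `sftp_connection_uri` is a hypothetical helper, the default port of 22 is an assumption (the standard SSH port), and the example credentials are made up.

```python
from urllib.parse import quote


def sftp_connection_uri(username, password, host, port=22, host_key=None):
    """Build an 'ssh://' connection URI with every value URL-encoded."""

    def encode(value):
        # safe="" also encodes '/', which quote() leaves untouched by default
        return quote(value, safe="")

    uri = f"ssh://{encode(username)}:{encode(password)}@{encode(host)}:{port}"
    if host_key:
        uri += f"?host_key={encode(host_key)}"
    return uri


if __name__ == "__main__":
    # Made-up credentials, for illustration only
    print(sftp_connection_uri("user@example.com", "p@ss/word", "oaebu.exavault.com"))
```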