Git scraping with GitHub Actions

Using GitHub and GitHub Actions is an excellent way to scrape a website and get a time series of its changes. A good description of how to do this is Simon Willison's post about Git Scraping.

My first attempt at using this pattern scrapes the "Gelbe Karten" (yellow cards) from the Stuttgart city website. The code is on GitHub: https://github.com/mfa/gelbekarten_stuttgart.
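The scrape script itself can be quite small. Below is a minimal sketch of what such a script could look like, with a placeholder URL and an assumed table structure (the actual scraper is in the repository linked above):

import csv

import requests
from bs4 import BeautifulSoup

# hypothetical URL and selectors -- the real ones are in the linked repository
URL = "https://example.org/gelbe-karten"

response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# collect one row per table entry on the page
rows = []
for tr in soup.select("table tr"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    if cells:
        rows.append(cells)

# overwrite the CSV on every run; git only records a commit when the content changed
with open("data/gelbe_karten_stuttgart.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

The file is rewritten on every run, so the commit history of the CSV becomes the time series of changes.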

GitHub Actions can be triggered by a cron schedule, for example:

on:
  schedule:
    - cron:  '23 * * * *'

This YAML triggers every hour at minute 23. An important note: scheduled Actions don't run exactly at that minute; the run is queued and starts when enough workers are available. The full code of the Action: https://github.com/mfa/gelbekarten_stuttgart/blob/main/.github/workflows/scrape.yml.

The git log shows the updates of the website as commits to the CSV file: https://github.com/mfa/gelbekarten_stuttgart/commits/main/data/gelbe_karten_stuttgart.csv.

Google Cloud Run as message queue for AX Webhooks

Introduction

In the evaluation phase of a new hobby project I built a Google Cloud Run service that accepts and validates webhooks from AX Semantics.

The current version uses Google Cloud Datastore to store the texts. The advantages of saving to Datastore are speed and low cost. Processing the texts can be done later by a cronjob, as described in the next section.
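A minimal sketch of such a service, assuming Flask and a hypothetical shared-secret header for validation (how AX Semantics webhooks are actually validated is not shown here); the entity fields mirror the query example in the next section:

import os

from flask import Flask, abort, request
from google.cloud import datastore

app = Flask(__name__)
client = datastore.Client()


@app.route("/webhook", methods=["POST"])
def webhook():
    # hypothetical validation via a shared-secret header
    if request.headers.get("X-Webhook-Token") != os.environ["WEBHOOK_TOKEN"]:
        abort(403)

    payload = request.get_json()

    # store the payload as an embedded entity; the payload field names are assumptions
    data = datastore.Entity()
    data.update(payload)
    entity = datastore.Entity(client.key("AX-NLG-Text"))
    entity["uid"] = payload.get("uid")
    entity["collection_id"] = payload.get("collection_id")
    entity["data"] = data
    client.put(entity)
    return "", 204


if __name__ == "__main__":
    # Cloud Run passes the port to listen on via the PORT environment variable
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))

The service does nothing but validate and persist, which keeps the webhook endpoint fast and cheap; all real work happens later.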

An example of processing with a cronjob

First: Authentication!

If the cronjob runs inside Google Cloud, everything is already set. When running outside Google Cloud (locally or elsewhere), see the Google Auth API documentation.

For this example I chose the service account with JSON keyfile option.
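With such a keyfile the client can be created directly from it (the filename below is a placeholder); alternatively, point the GOOGLE_APPLICATION_CREDENTIALS environment variable at the file and use the plain constructor as in the snippet that follows:

from google.cloud import datastore

# build the client from a service account JSON keyfile (placeholder filename)
client = datastore.Client.from_service_account_json("service-account.json")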

The snippet gets all texts for the collection 12345 and prints them:

from google.cloud import datastore

client = datastore.Client()

# query all entities of kind "AX-NLG-Text" that belong to collection 12345
query = client.query(kind="AX-NLG-Text")
query.add_filter("collection_id", "=", 12345)

for item in query.fetch():
    print(item.get("uid"), " -- ", item.get("data").get("text"))

A cronjob would probably delete each item after processing it, to prevent it from being processed again on the next run. To delete an item, use client.delete(item.key).
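Continuing the snippet above, the processing loop could look like this; handle_text is a hypothetical function standing in for whatever the cronjob actually does with a text:

for item in query.fetch():
    handle_text(item.get("data").get("text"))  # hypothetical processing step
    client.delete(item.key)  # remove the entity so the next run does not see it again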

Goodreads to Org-Mode

After Goodreads announced that all existing API keys are revoked and no new API keys will be issued, it is time to move away. Without API access, the migration has to be done by parsing HTML.

All code is in this GitHub repository: https://github.com/mfa/goodreads-to-orgmode

The first step is to download all "My Books" pages: set the list to 100 books per page and save the HTML files into the data folder. The convert.py script then converts these HTML files into one big Org-Mode file.
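A rough sketch of how the parsing could work; the CSS selectors and the extracted fields are assumptions, the actual logic lives in convert.py:

from pathlib import Path

from bs4 import BeautifulSoup

books = []
for path in sorted(Path("data").glob("*.html")):
    soup = BeautifulSoup(path.read_text(), "html.parser")
    # one table row per book on the "My Books" page (selectors are assumptions)
    for row in soup.select("tr.bookalike"):
        books.append(
            {
                "title": row.select_one("td.field.title a").get_text(strip=True),
                "author": row.select_one("td.field.author a").get_text(strip=True),
            }
        )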

For example, the entry for "The God Engines" by John Scalzi from my library has this internal representation:

{
  "author": "Scalzi, John",
  "date_added": "2017-02-20",
  "date_read": "2017-02-20",
  "isbn13": "9781596062801",
  "rating": "4 of 5",
  "title": "The God Engines",
  "url": "https://www.goodreads.com/book/show/6470498-the-god-engines"
}

And the resulting Org-Mode block looks like:

*** The God Engines
:PROPERTIES:
:Author: Scalzi, John
:Added: 2017-02-20
:Read: 2017-02-20
:ISBN13: 9781596062801
:Rating: 4 of 5
:Url: https://www.goodreads.com/book/show/6470498-the-god-engines
:END:

After the conversion I split fiction from non-fiction books by creating new headlines and moving the entries around.

The rendering of the target markup is done via Jinja2. For a target other than Org-Mode, change the template accordingly.
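A minimal sketch of such a rendering step, using an inline Jinja2 template that reproduces the Org-Mode block above (the actual template in the repository may look different):

from jinja2 import Template

TEMPLATE = Template(
    """*** {{ book.title }}
:PROPERTIES:
:Author: {{ book.author }}
:Added: {{ book.date_added }}
:Read: {{ book.date_read }}
:ISBN13: {{ book.isbn13 }}
:Rating: {{ book.rating }}
:Url: {{ book.url }}
:END:
"""
)

book = {
    "author": "Scalzi, John",
    "date_added": "2017-02-20",
    "date_read": "2017-02-20",
    "isbn13": "9781596062801",
    "rating": "4 of 5",
    "title": "The God Engines",
    "url": "https://www.goodreads.com/book/show/6470498-the-god-engines",
}
print(TEMPLATE.render(book=book))

Swapping the template string is all that is needed to emit Markdown or any other plain-text format instead.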