Git scraping with GitHub Actions

GitHub together with GitHub Actions is a convenient way to scrape a website on a schedule and keep a time series of its changes in the commit history. Simon Willison describes the approach well in his post about Git Scraping.

My first attempt at using this pattern scrapes the "Gelbe Karten" (yellow cards) from the Stuttgart website. The code is on GitHub: https://github.com/mfa/gelbekarten_stuttgart.

GitHub Actions can be triggered by a cron schedule, for example:

on:
  schedule:
    - cron:  '23 * * * *'

This YAML triggers the workflow every hour at minute 23. An important note here: scheduled workflows don't start exactly on that minute. The run is queued at the scheduled time and starts once a runner is available, so it can be delayed. The full code of the Action: https://github.com/mfa/gelbekarten_stuttgart/blob/main/.github/workflows/scrape.yml.
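
To show how the schedule fits into a complete workflow, here is a minimal sketch of the general Git scraping pattern. It is not the workflow from the repository linked above. The URL, the file path data/example.csv, and the single curl step are placeholders I made up for illustration. A real scraper would probably run a script that parses the page before writing the file.

```yaml
# Hypothetical sketch of a Git scraping workflow, not the one from the linked repository.
name: scrape

on:
  workflow_dispatch:          # allows starting a run manually
  schedule:
    - cron: '23 * * * *'      # every hour at minute 23 (UTC); the start may be delayed

permissions:
  contents: write             # the job pushes commits back to the repository

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch the latest data
        # Placeholder URL and output path; replace both with the real source and target.
        run: |
          mkdir -p data
          curl --fail --silent --show-error -o data/example.csv https://example.com/data.csv
      - name: Commit and push if the file changed
        run: |
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add data/example.csv
          # Commit only when the staged file differs from the last commit.
          git diff --cached --quiet || git commit -m "Latest data: $(date -u)"
          git push
```

Because a commit is only created when the fetched file differs from the previous version, each commit in the history corresponds to an actual change on the website.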

The git log shows the updates to the website as commits on the CSV file: https://github.com/mfa/gelbekarten_stuttgart/commits/main/data/gelbe_karten_stuttgart.csv.