Stuttgart Gelbe Karten Scraping Update

I started scraping the data for Stuttgart's "Gelbe Karten" (issues reported by residents) at the beginning of 2021 and wrote about it. Back then I opted to append everything to a single CSV file. After more than three years that file had nearly 70k lines and was no fun to use. So I decided to split the one CSV into daily CSV files organized in yearly folders. After migrating everything and updating the scripts I force-pushed to GitHub to get rid of the thousands of Git commits the GitHub Action had made while scraping. As a result of the force-push the timeseries of changes now lives only in the CSV files and not in the commit history anymore -- which seems like a good tradeoff versus the local clone time for the repo.
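The split described above can be sketched roughly like this; a minimal example using only the standard library, assuming the combined CSV has an ISO-formatted date column (the column name `date` and the function name are my own, not taken from the actual repo):

```python
import csv
from collections import defaultdict
from pathlib import Path

def split_csv_by_day(source_csv, out_dir, date_column="date"):
    """Split one combined CSV into daily CSV files grouped in yearly folders."""
    rows_by_day = defaultdict(list)
    with open(source_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            # Assumes an ISO date like "2021-03-05" (a time part is cut off)
            day = row[date_column][:10]
            rows_by_day[day].append(row)

    for day, rows in rows_by_day.items():
        year_dir = Path(out_dir) / day[:4]
        year_dir.mkdir(parents=True, exist_ok=True)
        with open(year_dir / f"{day}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
```

One nice property of this layout is that the daily scraper only has to rewrite one small file per run instead of appending to (and diffing against) one huge file.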

The code is on GitHub:

The whole scraping experiment showed me a few things:

  • GitHub Actions for scraping are quite reliable; thanks again Simon Willison for starting this trend.

  • The city of Stuttgart didn't change anything about their infrastructure (good for scraping, but it also means no improvement happened).

  • Scraping every hour generates way too many commits.