Git annex -- lessons learned

While storing my 2.3TB of images with git-annex I learned a few things.

Instead of saving every single image in git-annex, I store every folder of images as a tar file. git-annex gets pretty slow when tracking more than a million files; with the big tar files this problem is solved.
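A minimal sketch of that workflow, assuming one folder per batch of images and an already initialized annex (the paths and the archive naming are made up for illustration):

import subprocess
import tarfile
from pathlib import Path

# hypothetical locations, adjust to the actual layout
IMAGE_ROOT = Path("/data/images")
ANNEX_ROOT = Path("/data/annex")

for folder in sorted(p for p in IMAGE_ROOT.iterdir() if p.is_dir()):
    archive = ANNEX_ROOT / f"{folder.name}.tar"
    # pack the whole folder into a single tar file ...
    with tarfile.open(archive, "w") as tar:
        tar.add(folder, arcname=folder.name)
    # ... so git-annex only has to track one file per folder
    subprocess.run(["git", "annex", "add", archive.name],
                   cwd=ANNEX_ROOT, check=True)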

To use the files I extract them to a folder that is shared with other servers via sshfs. Everything runs inside my tinc VPN, so access to the computers at home is also possible.

git-annex helps me keep the files in sync between the full storage at home and the subset of images synced to internet servers.

Git annex and rsync

I started to use git-annex for my GoPro archive (currently 2TB of jpg files). But my upload scripts for OpenStreetCam and Mapillary need the actual files, not the symlinks git-annex uses.

rsync to the rescue.

From the rsync man page:

-L, --copy-links            transform symlink into referent file/dir

In my scripts I now use rsync -aL source target to get the files out of a git-annex repository.
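As an illustration, a hedged Python version of that step (the paths are placeholders; the real upload scripts may call rsync differently):

import subprocess

def materialize(annex_dir, staging_dir):
    # -a: archive mode; -L: replace the git-annex symlinks with
    # the files they point to
    subprocess.run(["rsync", "-aL", annex_dir + "/", staging_dir],
                   check=True)

materialize("/data/annex/gopro/2019-08", "/tmp/upload-staging")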

Bundesbank bank data to sqlite

The German Bundesbank releases a list of all banks in Germany every quarter. A while ago we built a Django-based API website for this data: https://github.com/opendata-stuttgart/banking-api. The site went offline after some time.

After experimenting a bit with Datasette, I built a converter from the Bundesbank Excel file to sqlite.

The code uses sqlite-utils for the sqlite part and openpyxl for the Excel part.

The main conversion loop:

from openpyxl import load_workbook


def convert(db, xlsx_path):
    wb = load_workbook(filename=xlsx_path, read_only=True)

    # use the first worksheet
    ws = wb[wb.sheetnames[0]]

    # map the German column headers to English field names
    translate_fieldnames = {
        'Bezeichnung': 'name',
        'Bank-leitzahl': 'blz',
        'BIC': 'bic',
        'PLZ': 'zipcode',
        'Ort': 'city',
        'Kurzbezeichnung': 'short_description',
        'PAN': 'pan',
        'Prüfziffer-berechnungs-methode': 'check_calculation_method',
        'Datensatz-nummer': 'dataset_number',
        'Merkmal': 'merkmal',
        'Änderungs-kennzeichen': 'change_type',
        'Bankleitzahl-löschung': 'is_deletion',
        'Nachfolge-Bankleitzahl': 'following_blz',
    }

    # translate the header row (row 1) to the English field names
    field_names = []
    for item in list(ws.iter_rows(min_row=1, max_row=1, values_only=True))[0]:
        field_names.append(translate_fieldnames[item])

    # insert one record per data row; "blz" is the primary key and
    # replace=True overwrites records that already exist
    for row in ws.iter_rows(min_row=2, values_only=True):
        db["bundesbank_blz"].insert(
            record=dict(zip(field_names, row)), pk="blz", replace=True)

Especially db["tablename"].insert() from sqlite-utils makes adding data to a sqlite database really easy.
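A hedged usage sketch (the database and Excel file names are placeholders):

import sqlite_utils

db = sqlite_utils.Database("bundesbank.db")
convert(db, "blz-aktuell-xlsx-data.xlsx")

# look up a bank by its Bankleitzahl
# (assuming openpyxl read the BLZ column as an integer)
print(db["bundesbank_blz"].get(10000000)["name"])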

The full code is on GitHub: https://github.com/mfa/bundesbank-to-sqlite.