Splitting a big file with split

Currently I am processing Wikidata a lot. The trouble here is that the current JSON dump (wikidata-20190930-all.json.bz2) is 38 GB, and reading a stored, decompressed copy from disk costs more in I/O than spending CPU time to decompress the file on the fly.

I need to process the full Wikidata dump a few times. To get some kind of savepoints, the best way was to split the file into files of 100k lines each. Then I can resume at a specific 100k-line step.

To split I used this command:

bzcat wikidata-20190930-all.json.bz2 | pv | split -l 100000 -d -a 4 --filter='bzip2 > $FILE.json.bz2' - split-

The first step is of course bzcat to decompress the bz2 part. The second step is pv, which adds some kind of progress display, in particular how many GiB were already processed and at what MiB/s rate. The third step is the split operation. The split is by 100000 lines, the numbers added to the file names are decimal and 4 digits long. The most interesting part is the filter, which is the command the split data is written to, so every resulting split file is recompressed with bzip2 right away. The variable $FILE is the file name generated by split for this splitting step. The - tells split to read from stdin, which is the data from the commands before, and the last parameter is the prefix for the newly created files.
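
To actually use those savepoints later, I can skip the chunks that are already done and only decompress the remaining ones. A minimal sketch in Python (assuming the chunks ended up as split-0000.json.bz2, split-0001.json.bz2, … in the current directory, and process_line stands in for the real work):

import bz2
from pathlib import Path


def process_line(line):
    # placeholder for the real work on one line of the dump
    pass


def process_chunks(start_index=0):
    for chunk in sorted(Path(".").glob("split-*.json.bz2")):
        # file names look like split-0042.json.bz2 -> index 42
        index = int(chunk.name.split("-")[1].split(".")[0])
        if index < start_index:
            continue  # already handled in a previous run
        with bz2.open(chunk, "rt") as f:
            for line in f:
                process_line(line)
        print(f"chunk {index} done")

Calling process_chunks(42), for example, would continue right after the first 42 chunks of 100k lines each.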

Process folder with images using OpenCV

This post is a followup to https://madflex.de/posts/find-and-crop-using-opencv/.

The goal is to process a folder of images and find the ones with a display in them. To spot errors, a version with the detected rectangle and its height/width drawn in is saved, as well as a version cropped to the display.

The resulting image with the detected rectangle drawn in:

[image: rect_image]

And the crop:

[image: crop_image]

Source code:

from pathlib import Path
import imutils
import cv2


def find_and_crop(filename):
    fn = filename.stem
    year, month = str(filename.parents[0]).split("/")[-2:]
    output_rect = filename.parents[3] / "output" / f"{year}-{month}-{fn}-rect.png"
    output_crop = filename.parents[3] / "output" / f"{year}-{month}-{fn}-crop.png"

    image = cv2.imread(str(filename))
    image = image[300:1300, 500:1400]
    height, width, channels = image.shape

    # rotate
    center = (width / 2, height / 2)
    angle = 87
    M = cv2.getRotationMatrix2D(center, angle, 1)
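    # warpAffine expects dsize as (width, height); passing (height, width) here
    # swaps the canvas dimensions so the roughly 90 degree rotated image fits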
    image = cv2.warpAffine(image, M, (height, width))

    # resize
    resized = imutils.resize(image, width=300)
    ratio = height / float(resized.shape[0])

    # greyscale
    gray = cv2.cvtColor(image.copy(), cv2.COLOR_BGR2GRAY)
    gray = cv2.blur(gray, (11, 11))
    thresh = cv2.threshold(gray, 60, 255, cv2.THRESH_BINARY)[1]

    # keep the first contour whose bounding box roughly matches the expected
    # size of the display
    contours, hierarchy = cv2.findContours(thresh, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    for cnt in contours:
        x, y, w, h = cv2.boundingRect(cnt)
        if 200 < w < 400 and 100 < h < 230:
            break
    else:
        # set because "break" was not triggered -> no rectangle / crop
        x = None

    if x is not None:
        rect = cv2.rectangle(image.copy(), (x, y), (x + w, y + h), (0, 255, 0), 2)

        rect = cv2.putText(
            rect,
            f"h:{h} | w:{w}",
            (10, 50),
            fontFace=cv2.FONT_HERSHEY_SIMPLEX,
            fontScale=1.5,
            color=(255, 0, 0),
            lineType=3,
        )

        cv2.imwrite(str(output_rect), rect)

        # crop image and increase brightness
        cropped = image[y : y + h, x : x + w]
        contrast = cv2.convertScaleAbs(cropped, alpha=3, beta=0)
        cv2.imwrite(str(output_crop), contrast)


def main(base_folder=Path("../images/")):

    for fn in sorted(base_folder.glob("201*/*/*jpg")):
        find_and_crop(fn)
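
if __name__ == "__main__":
    # run the processing for the default image folder
    main()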

I now have 3700 cropped displays with digits in them. The next step is detecting the digits (for real this time).

Get EXIF data for all files in a folder

Goal: Generate a JSON file for every folder with the GPS data of all the images in that folder.

import json
from exif import get_location
from pathlib import Path
import exifread

base = Path("images")
for folder in sorted(base.glob("201*")):
    if folder.is_dir():
        gps_data = {}
        json_fn = folder / "gps.json"
        if json_fn.exists():
            continue
        for fn in sorted(folder.glob("*.JPG")):
            # glob already yields the full path, so open fn directly
            with open(fn, "rb") as f:
                gps_data[fn.name] = get_location(exifread.process_file(f))
        if gps_data:
            with open(json_fn, "w") as f:
                json.dump(gps_data, f)

This code uses get_location from this blog post: https://madflex.de/posts/get-gps-data-from-images/
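
get_location itself is not repeated here. Just to show the idea, a rough sketch of what such a helper could look like on top of the exifread tags (an illustration, not the exact version from that post):

def get_location(tags):
    # tags is the dict returned by exifread.process_file()
    def to_degrees(tag):
        # GPS coordinates are stored as three Ratios: degrees, minutes, seconds
        d, m, s = [v.num / v.den for v in tag.values]
        return d + m / 60 + s / 3600

    try:
        lat = to_degrees(tags["GPS GPSLatitude"])
        lon = to_degrees(tags["GPS GPSLongitude"])
    except KeyError:
        # image has no GPS information
        return None

    if str(tags["GPS GPSLatitudeRef"]) != "N":
        lat = -lat
    if str(tags["GPS GPSLongitudeRef"]) != "E":
        lon = -lon
    return lat, lon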