Django: run save on all elements in a table

To speed up a slow model I added cache fields to my Django model that are updated on save. New records now have the cached fields filled, but I needed to update the old ones too. The table is pretty big, so I wanted a progress bar (as always, tqdm).

The second problem is that Django's model save method returns None, so mapping it over a lot of elements produces a pretty big list of None values. The Python built-in library collections to the rescue: a deque created with maxlen=0 consumes the iterator without storing any of the elements.

The code I ran in my shell_plus:

import collections
from tqdm import tqdm

iterator = map(lambda x: x.save(), MyModel.objects.all())
with tqdm(iterator, total=MyModel.objects.count(), ascii=True) as pbar:
    collections.deque(pbar, maxlen=0)

For really big tables, it also takes quite some time before the first save() runs, because the whole queryset is loaded from the database as soon as the iteration starts!
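
The deque trick works independently of Django. A minimal sketch, with a hypothetical FakeModel class standing in for MyModel and a made-up helper consume (neither is part of the code I ran), shows that every save() call is executed while nothing is kept in memory:

```python
import collections


class FakeModel:
    """Hypothetical stand-in for a Django model; save() returns None."""

    saved = 0  # class-wide counter of save() calls

    def save(self):
        FakeModel.saved += 1


def consume(iterator):
    """Exhaust an iterator for its side effects without storing the results."""
    queue = collections.deque(iterator, maxlen=0)
    return len(queue)


objects = [FakeModel() for _ in range(1000)]
kept = consume(map(lambda obj: obj.save(), objects))
print(FakeModel.saved, kept)  # prints: 1000 0
```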

Splitting a big file with split

Currently I am processing Wikidata a lot. The trouble is that the current JSON dump (wikidata-20190930-all.json.bz2) is 38 GB, and reading a decompressed version from disk costs more in I/O than decompressing the file on the fly costs in CPU.

I need to process the full Wikidata dump a few times. To get some kind of savepoints, the best way was to split the dump into files of 100k lines each. Then I can resume at any 100k-line boundary.

To split I used this command:

bzcat wikidata-20190930-all.json.bz2 | pv | split -l 100000 -d -a 4 --filter='bzip2 > $FILE.json.bz2' - split-

The first step is of course bzcat to decompress the bz2 file. The second step is pv, which shows progress: here especially how many GiB have already been processed and at what MiB/s rate. The third step is the split operation. The split is by 100000 lines (-l 100000), and the suffixes added are decimal (-d) with 4 digits (-a 4). The most interesting part is --filter, the command the split data is written to, so each resulting file is recompressed with bzip2. The variable $FILE is the file name generated by split for the current chunk. The - tells split to read from stdin, which is the output of the preceding commands, and the last parameter is the prefix of the newly created files.
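
To resume at a savepoint, the chunk files can be read back in order while skipping the ones that are already done. A minimal sketch, assuming the files are named split-0000.json.bz2, split-0001.json.bz2 and so on, as produced by the split command; the function names chunk_files and iter_lines and the start parameter are made up for illustration:

```python
import bz2
from pathlib import Path


def chunk_files(folder, start=0):
    """Return the split-NNNN.json.bz2 files in order, beginning at chunk `start`."""
    files = sorted(Path(folder).glob("split-*.json.bz2"))
    # "split-0003.json.bz2" -> "0003.json.bz2" -> "0003" -> 3
    return [f for f in files if int(f.name.split("-")[1].split(".")[0]) >= start]


def iter_lines(folder, start=0):
    """Yield (chunk file name, line) pairs from all chunks beginning at `start`."""
    for path in chunk_files(folder, start):
        with bz2.open(path, mode="rt", encoding="utf-8") as handle:
            for line in handle:
                yield path.name, line.rstrip("\n")
```

Calling iter_lines(".", start=42) would skip the first 42 files and continue with split-0042.json.bz2.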

Process folder with images using OpenCV

This post is a followup to https://madflex.de/posts/find-and-crop-using-opencv/.

The goal is to process a folder of images and keep the ones that contain a display. To spot errors, two versions are saved: one with the detected rectangle and its height/width drawn in, and one cropped to the display.

The resulting image with the rectangle:

rect_image

And the crop:

crop_image

Source code:

from pathlib import Path
import imutils
import cv2


def find_and_crop(filename):
    fn = filename.stem
    # the two parent folders are named after year and month
    year, month = filename.parent.parts[-2:]
    output_rect = filename.parents[3] / "output" / f"{year}-{month}-{fn}-rect.png"
    output_crop = filename.parents[3] / "output" / f"{year}-{month}-{fn}-crop.png"

    image = cv2.imread(str(filename))
    image = image[300:1300, 500:1400]
    height, width, channels = image.shape

    # rotate
    center = (width / 2, height / 2)
    angle = 87
    M = cv2.getRotationMatrix2D(center, angle, 1)
    image = cv2.warpAffine(image, M, (height, width))

    # resize (note: resized and ratio are not used further below)
    resized = imutils.resize(image, width=300)
    ratio = height / float(resized.shape[0])

    # greyscale
    gray = cv2.cvtColor(image.copy(), cv2.COLOR_BGR2GRAY)
    gray = cv2.blur(gray, (11, 11))
    thresh = cv2.threshold(gray, 60, 255, cv2.THRESH_BINARY)[1]

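    # 1 and 2 are the numeric values of RETR_LIST and CHAIN_APPROX_SIMPLE
    contours, hierarchy = cv2.findContours(
        thresh, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE
    )
    for cnt in contours:
        x, y, w, h = cv2.boundingRect(cnt)
        # keep the first contour whose bounding box has the size of the display
        if 200 < w < 400 and 100 < h < 230:
            break
    else:
        # set because "break" was not triggered -> no rectangle / crop
        x = None

    # compare with None, because x == 0 is a valid position
    if x is not None: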
        rect = cv2.rectangle(image.copy(), (x, y), (x + w, y + h), (0, 255, 0), 2)

        rect = cv2.putText(
            rect,
            f"h:{h} | w:{w}",
            (10, 50),
            fontFace=cv2.FONT_HERSHEY_SIMPLEX,
            fontScale=1.5,
            color=(255, 0, 0),
            lineType=3,
        )

        cv2.imwrite(str(output_rect), rect)

        # crop image and increase contrast (alpha=3 scales all pixel values)
        cropped = image[y : y + h, x : x + w]
        contrast = cv2.convertScaleAbs(cropped, alpha=3, beta=0)
        cv2.imwrite(str(output_crop), contrast)


def main(base_folder=Path("../images/")):
    # folder layout: <base_folder>/<year>/<month>/<name>.jpg

    for fn in sorted(base_folder.glob("201*/*/*jpg")):
        find_and_crop(fn)
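
The size filter inside the contour loop can also be tried without OpenCV. A minimal sketch, assuming bounding boxes are plain (x, y, w, h) tuples like the ones cv2.boundingRect returns; the helper name find_display_box and its keyword parameters are hypothetical. It returns None when nothing matches, so a box that starts at x = 0 still counts as a hit:

```python
def find_display_box(boxes, min_w=200, max_w=400, min_h=100, max_h=230):
    """Return the first (x, y, w, h) box whose size fits the display, else None."""
    for x, y, w, h in boxes:
        if min_w < w < max_w and min_h < h < max_h:
            return (x, y, w, h)
    return None


# made-up example boxes: too small, matching (with x == 0), matching
boxes = [(5, 5, 20, 20), (0, 40, 300, 150), (10, 10, 350, 200)]
box = find_display_box(boxes)
if box is not None:
    x, y, w, h = box
    print(f"h:{h} | w:{w}")  # prints: h:150 | w:300
```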

I now have 3700 cropped displays with digits in them. The next step is detecting the digits (for real this time).