Use Parallel to split by line – follow-up

After using the parallel pattern described in my previous blog post a few times and improving the speed by quite a bit, I had to write a follow-up.

The previous solution with -N 1 caused too much I/O wait. The easiest fix was to hand each job a larger chunk of lines, e.g. -N 1000.
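
With -N 1000 each job receives a chunk of 1000 lines on its stdin instead of a single line. A sketch of what the invocation might look like, assuming parallel's --pipe mode as in the previous post; extract.py and big.json are placeholder names:

parallel --pipe -N 1000 python3 extract.py < big.json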

For this to work, the script has to cope with more than one line at a time. The example from the last post now looks like this:

import json
import sys

def extract(lines):
    for line in lines:
        line = line.strip()
        if line.endswith(","):
            # drop the trailing comma if the input is a JSON array, not JSON Lines
            line = line[:-1]
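        # skip the opening and closing brackets of a pretty-printed JSON array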
        if line.startswith(("[", "]")):
            continue
        data = json.loads(line)
        value = data.get("value")
        if value:
            print(f"X:{value}", flush=True)

extract(sys.stdin.readlines())
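
To sanity-check the script on its own, you can feed it a small JSON array by hand (again assuming it is saved as extract.py):

printf '[\n{"value": 1},\n{"value": 2}\n]\n' | python3 extract.py
# prints:
# X:1
# X:2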

The important changes are:

  • readlines() instead of readline() to read all lines provided on stdin

  • a loop over the lines instead of handling only one string

  • continue instead of return so the remaining lines get processed too

For some of my bigger files the speedup was more than 10x.