Use Parallel to split by line – follow-up

After using the parallel pattern described in my previous blog post a few times and improving the speed by quite a bit, I had to write a follow-up.

The previous solution with -N 1 caused too much I/O wait. The easiest fix was to hand each job a larger chunk of lines, e.g. -N 1000.
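
With -N 1000 each job receives a chunk of 1000 lines on its stdin instead of a single line. A sketch of what the invocation might look like, assuming parallel's --pipe mode as in the previous post; extract.py and big.json are placeholder names:

parallel --pipe -N 1000 python3 extract.py < big.json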

For this to work, the script has to cope with more than one line at a time. The example from the last post now looks like this:

import json
import sys

def extract(lines):
    for line in lines:
        line = line.strip()
        if line.endswith(","):
            # drop the trailing comma if the input is a JSON array, not JSON Lines
            line = line[:-1]
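        # skip the opening and closing brackets of a pretty-printed JSON array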
        if line.startswith(("[", "]")):
            continue
        data = json.loads(line)
        value = data.get("value")
        if value:
            print(f"X:{value}", flush=True)

extract(sys.stdin.readlines())
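
To sanity-check the script on its own, you can feed it a small JSON array by hand (again assuming it is saved as extract.py):

printf '[\n{"value": 1},\n{"value": 2}\n]\n' | python3 extract.py
# prints:
# X:1
# X:2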

The important changes are:

  • readlines() instead of readline() to read all lines provided on stdin

  • a loop over the lines instead of handling only one string

  • continue instead of return so the remaining lines get processed too

For some of my bigger files the speedup was more than 10x.