Use Parallel to split by line – follow-up
After using the parallel pattern described in my previous blog post a few times and improving the speed quite a bit, I had to write a follow-up.
The previous solution with -N 1 had too much IO wait. The easiest way to solve this was to hand each job a larger chunk of lines, e.g. -N 1000.
For this to work, the script has to cope with more than one line. The example from the last post now looks like this:
```python
import json
import sys


def extract(lines):
    for line in lines:
        line = line.strip()
        if line.endswith(","):  # if the file is json and not jsonl
            line = line[:-1]
        if line.startswith(("[", "]")):
            continue
        data = json.loads(line)
        value = data.get("value")
        if value:
            print(f"X:{value}", flush=True)


extract(sys.stdin.readlines())
```
The important changes are:
- readlines() instead of readline(), to get all lines provided on stdin
- a loop over the lines instead of handling only one string
- continue instead of return, so the remaining lines are processed too
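The first change is the crucial one. A minimal sketch of the difference, using io.StringIO as a stand-in for stdin:

```python
import io

# A chunk as parallel --pipe would deliver it: several newline-terminated lines.
chunk = "one\ntwo\nthree\n"

# readline() returns only the first line of the chunk...
assert io.StringIO(chunk).readline() == "one\n"

# ...while readlines() returns every line, which the loop can then consume.
assert io.StringIO(chunk).readlines() == ["one\n", "two\n", "three\n"]
```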
For some of my bigger files, the speedup was more than 10x.