After using the parallel pattern described in my previous blog post a few times and improving the speed quite a bit, I had to write a follow-up.
The previous solution with -N 1 had too much IO wait. The easiest way to solve this was to increase the chunk size, i.e. pass more than one line per invocation. For this to work, the script has to cope with more than one line.
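The difference can be sketched with an in-memory stand-in for stdin (io.StringIO replaces sys.stdin here, and the sample records are made up):

```python
import io

# Stand-in for sys.stdin: a chunk containing two JSONL records.
fake_stdin = io.StringIO('{"value": 1}\n{"value": 2}\n')

# With -N 1 the script only ever consumed a single line:
first = fake_stdin.readline()
print(first)   # '{"value": 1}\n'

# With larger chunks, the script must drain everything it is handed:
rest = fake_stdin.readlines()
print(rest)    # ['{"value": 2}\n']
```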
The example from the last post now looks like this:
for line in lines:
    line = line.strip()
    # if the file is JSON and not JSONL, drop the trailing comma
    if line.endswith(","):
        line = line[:-1]
    # the array brackets carry no data
    if line.startswith(("[", "]")):
        continue
    data = json.loads(line)
    value = data.get("value")
The important changes are:

- readlines() instead of readline() to get all lines given by stdin
- a loop over the lines instead of only one string
- continue instead of return, so the remaining lines are processed too
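Putting the pieces together, a minimal self-contained sketch of the whole filter could look like this (the "value" field and the hard-coded chunk are stand-ins; in the real script the chunk would come from sys.stdin.readlines()):

```python
import json

def process(lines):
    """Extract the "value" field from each JSON line (field name is a stand-in)."""
    values = []
    for line in lines:
        line = line.strip()
        # In a pretty-printed JSON array (as opposed to JSONL), each object
        # ends with a comma that json.loads() would choke on.
        if line.endswith(","):
            line = line[:-1]
        # The array brackets themselves carry no data, and blank lines
        # can show up at chunk boundaries.
        if line.startswith(("[", "]")) or not line:
            continue
        values.append(json.loads(line).get("value"))
    return values

# A hard-coded chunk stands in for sys.stdin.readlines() here.
chunk = ['[', '  {"value": 1},', '  {"value": 2}', ']']
print(process(chunk))  # [1, 2]
```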
For some of my bigger files the speedup was more than 10x.