After using the parallel pattern described in my previous blog post a few times and improving the speed quite a bit, I had to write a follow-up.
The previous solution with -N 1 had too much IO wait. The easiest way to solve this was to increase the chunk size, i.e. pass more than one line per invocation. For this to work, the script has to cope with more than one line.
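The difference can be sketched with an in-memory stand-in for stdin (io.StringIO replaces sys.stdin here, and the sample records are made up):

```python
import io

# Stand-in for sys.stdin: a chunk containing two JSONL records.
fake_stdin = io.StringIO('{"value": 1}\n{"value": 2}\n')

# With -N 1 the script only ever consumed a single line:
first = fake_stdin.readline()
print(first)   # '{"value": 1}\n'

# With larger chunks, the script must drain everything it is handed:
rest = fake_stdin.readlines()
print(rest)    # ['{"value": 2}\n']
```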
The example from the last post now looks like this:
for line in lines:
    line = line.strip()
    # if the file is JSON and not JSONL, drop the trailing comma
    if line.endswith(","):
        line = line[:-1]
    # the array brackets carry no data
    if line.startswith(("[", "]")):
        continue
    data = json.loads(line)
    value = data.get("value")
The important changes are:

- readlines() instead of readline() to get all lines given by stdin
- a loop over the lines instead of only one string
- continue instead of return, so the remaining lines are processed too
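Putting the pieces together, a minimal self-contained sketch of the whole filter could look like this (the "value" field and the hard-coded chunk are stand-ins; in the real script the chunk would come from sys.stdin.readlines()):

```python
import json

def process(lines):
    """Extract the "value" field from each JSON line (field name is a stand-in)."""
    values = []
    for line in lines:
        line = line.strip()
        # In a pretty-printed JSON array (as opposed to JSONL), each object
        # ends with a comma that json.loads() would choke on.
        if line.endswith(","):
            line = line[:-1]
        # The array brackets themselves carry no data, and blank lines
        # can show up at chunk boundaries.
        if line.startswith(("[", "]")) or not line:
            continue
        values.append(json.loads(line).get("value"))
    return values

# A hard-coded chunk stands in for sys.stdin.readlines() here.
chunk = ['[', '  {"value": 1},', '  {"value": 2}', ']']
print(process(chunk))  # [1, 2]
```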
For some of my bigger files the speedup was more than 10x.