Use GNU Parallel to split input by line
The problem: process a big jsonl file line by line. The source file is bz2-compressed to save disk space.
GNU parallel can split the decompressed stream by line and run a command for every line:
bzcat bigfile.json.bz2 | parallel -j 16 --pipe --block 100M -N 1 python extract.py | gzip > output_`date +%s`.csv.gz
Parallel command options used:

-j 16 - run 16 worker processes in parallel
--pipe - split stdin into chunks and pipe them into the command, instead of the default mode of passing arguments
--block 100M - increase the chunk size from the default of 1M to 100M to be sure every record fits into one block -- this is maybe not necessary
-N 1 - pass exactly one record (here: one line) to each invocation of the called process
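As a point of reference, the bzcat step can also be done in pure Python with the standard-library bz2 module, which decompresses incrementally and yields one line at a time (a minimal sketch; cat_bz2_lines is an illustrative name, not part of the pipeline above):

```python
import bz2


def cat_bz2_lines(path):
    """Yield decoded lines from a bz2-compressed text file.

    bz2.open in text mode ("rt") decompresses incrementally, so the
    whole file never has to fit in memory.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            yield line
```

Piping through the external bzcat as in the one-liner keeps decompression in its own process, so it runs concurrently with the downstream workers.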
Python usage example
The Python script reads from stdin and extracts the wanted information, for example:
import json
import sys

def extract(string):
    string = string.strip()
    if string.endswith(","):  # if the file is json and not jsonl
        string = string[:-1]
    if string.startswith(("[", "]")):
        return
    data = json.loads(string)
    value = data.get("value")
    if value:
        print(f"X:{value}", flush=True)

extract(sys.stdin.readline())
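The single readline() works because -N 1 guarantees exactly one record per invocation. With a larger -N (for example -N 1000, to amortize the Python startup cost over many lines), each worker receives several lines and must loop over stdin instead; a sketch under that assumption, with the same extract logic as above:

```python
import json
import sys


def extract(string):
    # Same per-line logic as the -N 1 script above.
    string = string.strip()
    if string.endswith(","):  # trailing comma: plain json, not jsonl
        string = string[:-1]
    if string.startswith(("[", "]")):
        return
    value = json.loads(string).get("value")
    if value:
        print(f"X:{value}", flush=True)


def main(stream=sys.stdin):
    for line in stream:  # one call per record; works for any -N
        extract(line)


if __name__ == "__main__":
    main()
```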