Use Parallel to split by line

The problem: process a big JSONL file line by line. The source file is bzip2-compressed to save disk space.

Now use GNU parallel to split the decompressed stream by line and run a command for every line:

bzcat bigfile.json.bz2 | parallel -j 16 --pipe --block 100M -N 1 python extract.py | gzip > output_$(date +%s).csv.gz

Parallel command options used:

-j 16

run up to 16 jobs in parallel

--pipe

use pipe mode, which splits stdin into records and feeds them to each command's stdin, instead of the default argument mode

--block 100M

increase the block size from the default of 1 MB to 100 MB so that even a very long line still fits into one block; with -N 1 this is probably not necessary

-N 1

pass exactly one record (here: one line) to each invocation of the command
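Conceptually, --pipe -N 1 chops stdin into one-line records and hands each record to a worker, running up to -j of them at a time. A minimal Python sketch of that fan-out (illustrative only; it uses a thread pool in one process, where real GNU parallel spawns separate processes):

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(lines, worker, jobs=16):
    """Mimic `parallel -j 16 --pipe -N 1`: each input line is one
    record, handed to `worker`, with up to `jobs` running at once."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map preserves input order, like parallel's --keep-order
        return list(pool.map(worker, lines))

# Toy worker standing in for `python extract.py`
results = fan_out(["a\n", "b\n", "c\n"], lambda rec: rec.strip().upper())
```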

Python usage example

The Python script reads a single line from stdin and extracts the wanted information, for example:

import json
import sys

def extract(string):
    string = string.strip()
    if string.endswith(","):
        # strip the trailing comma if the file is json rather than jsonl
        string = string[:-1]
    if not string or string.startswith(("[", "]")):
        # skip blank lines and the enclosing array brackets of a json file
        return
    data = json.loads(string)
    value = data.get("value")
    if value is not None:
        print(f"X:{value}", flush=True)

# parallel -N 1 passes exactly one record, so a single readline() suffices
extract(sys.stdin.readline())
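The extraction logic can be sanity-checked before wiring it into the pipeline by calling it on a few sample records and capturing stdout (the function is restated here so the snippet is self-contained; the X: prefix is just this example's output format):

```python
import io
import json
from contextlib import redirect_stdout

def extract(string):
    string = string.strip()
    if string.endswith(","):
        string = string[:-1]
    if not string or string.startswith(("[", "]")):
        return
    data = json.loads(string)
    value = data.get("value")
    if value is not None:
        print(f"X:{value}", flush=True)

buf = io.StringIO()
with redirect_stdout(buf):
    extract('{"value": 42},\n')  # json line with trailing comma
    extract('[\n')               # array bracket line is skipped
out = buf.getvalue()
```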