I am currently processing Wikidata a lot.
The trouble here is that the current JSON dump is 38 GB in size, and storing a decompressed version costs more in I/O than decompressing the file on the fly costs in CPU.
I need to process the full Wikidata dump a few times. To get some kind of savepoints, the best way was to split the dump into files of 100k lines each. Then I can resume at a specific 100k-line boundary.
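Resuming from such a savepoint could look roughly like the sketch below. It assumes the chunks are named split-0000.json.bz2, split-0001.json.bz2 and so on, which is the naming the split command further down produces. The helper name chunks_from and the consumer my_processor are hypothetical, not part of any tool.

```shell
# Hypothetical helper: print all chunk files whose number is >= START.
# Usage: chunks_from START [DIR]
# The body runs in a subshell, so its variables do not leak to the caller.
chunks_from() (
    start=$1
    dir=${2:-.}
    for f in "$dir"/split-*.json.bz2; do
        # Skip the literal pattern if no chunk file exists.
        [ -e "$f" ] || continue
        n=${f##*/split-}      # drop directory and "split-" prefix
        n=${n%.json.bz2}      # drop the extension, e.g. "0042"
        zeros=${n%%[!0]*}     # the run of leading zeros, e.g. "00"
        n=${n#"$zeros"}       # number without leading zeros, e.g. "42"
        if [ "${n:-0}" -ge "$start" ]; then
            printf '%s\n' "$f"
        fi
    done
)

# Example: continue at chunk 42 (my_processor is a placeholder):
# chunks_from 42 | while IFS= read -r f; do bzcat "$f" | my_processor; done
```

Because the suffix has a fixed width of 4 digits, the shell's alphabetical glob order is also the numeric order, so the chunks come out in the order in which they were written.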
To split I used this command:
bzcat wikidata-20190930-all.json.bz2 | pv | split -l 100000 -d -a 4 --filter='bzip2 > $FILE.json.bz2' - split-
The first step is of course bzcat, which decompresses the bz2 file.
The second step is pv, which adds a progress display: in particular, how many GiB have been processed so far and at what rate in MiB/s.
The third step is the split operation. The split is by 100000 lines (-l 100000), and the suffixes added are decimal (-d) and 4 digits long (-a 4).
The most interesting part is the filter, which is the command that each chunk of split data is written to. This way every resulting chunk is recompressed with bzip2.
$FILE is the file name generated by split for the current chunk. It is in single quotes so that it is expanded by the shell that split starts for the filter, not by the calling shell.
- tells split to read from stdin, which is the output of the preceding commands, and the last parameter is the prefix for the names of the newly created files.
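To see what the command produces without touching the real dump, the same pipeline can be tried on a tiny stand-in file. This is only a sketch: it assumes GNU split (the --filter option is a GNU extension) and bzip2 are installed, uses 10 lines per chunk instead of 100000, and leaves out pv. The file name sample.bz2 is made up for the demo.

```shell
# Work in a scratch directory so nothing else gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 25 numbered lines stand in for the dump.
seq 1 25 | bzip2 > sample.bz2

# Same options as the real command, only with a smaller chunk size.
bzcat sample.bz2 | split -l 10 -d -a 4 --filter='bzip2 > $FILE.json.bz2' - split-

# List every chunk together with its number of lines.
for f in split-*.json.bz2; do
    printf '%s %s\n' "$f" "$(bzcat "$f" | wc -l)"
done
```

This should leave three files, split-0000.json.bz2, split-0001.json.bz2 and split-0002.json.bz2, holding 10, 10 and 5 lines.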