For a slow model I added cache fields to my Django model; these fields are updated on save.
New records now have the cached fields, but I needed to update the old ones too.
The table is pretty big, so I wanted a progress bar (as always, tqdm).
The second problem is that the save method of a Django model returns None, so collecting the results for a lot of elements gives a pretty big list of None values.
The Python built-in collections library to the rescue: a deque with maxlen=0 consumes the iterator without storing any of the elements.
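To see the deque trick in isolation, here is a minimal sketch without Django. The function fake_save is a hypothetical stand-in for the model's save method; it only records that it was called and returns None.

```python
import collections

processed = []


def fake_save(item):
    """Hypothetical stand-in for Model.save(): does some work, returns None."""
    processed.append(item)
    return None


iterator = map(fake_save, range(5))

# maxlen=0 makes the deque discard every element immediately,
# so the iterator is fully consumed but nothing is kept in memory.
sink = collections.deque(iterator, maxlen=0)

print(processed)  # every item was visited once
print(len(sink))  # the deque itself stays empty
```

Compared to list(iterator), this never builds a list of None values, which is the whole point for a big table.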
The code I ran in my shell_plus:
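    import collections

    from tqdm import tqdm

    # save() returns None, so map() yields a stream of None values
    iterator = map(lambda x: x.save(), MyModel.objects.all())
    with tqdm(iterator, total=MyModel.objects.count(), ascii=True) as pbar:
        # maxlen=0 consumes the iterator without keeping any results
        collections.deque(pbar, maxlen=0)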
For really big tables, it also takes quite some time before the first item is processed, because Django fetches the whole queryset before iteration starts!
To save memory on really large tables
I am currently processing Wikidata a lot. The trouble is that the current JSON dump (wikidata-20190930-all.json.bz2) is 38 GB, and working from a stored decompressed version costs more in I/O than using the CPU to decompress the file.
I need to process the full Wikidata dump a few times. To get some kind of savepoints, the best way was to split the file into files of 100k lines each. Then I can resume at any 100k-line boundary.
To split I used this command:
bzcat wikidata-20190930-all.json.bz2 | pv | split -l 100000 -d -a 4 --filter='bzip2 > $FILE.json.bz2' - split-
The first step is of course bzcat to decompress the bz2 data. The second step is pv, which adds some kind of progress display, here especially how many GiB have already been processed and at what rate in MiB/s. The third step is the split operation. The split is by 100000 lines (-l 100000), and the suffixes added are decimal (-d) with 4 digits (-a 4). The most interesting part is the filter, which is the command the split data is written to, so each resulting chunk is recompressed using bzip2. The variable $FILE is the file name generated by split for the current chunk. The - tells split to read from stdin, which is the output of the commands before, and the last parameter (split-) is the prefix of the newly created files. The result is a series of files named split-0000.json.bz2, split-0001.json.bz2, and so on.
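Resuming from a savepoint could then look roughly like the sketch below. The function names iter_entities and process_chunks and the start_at parameter are made up for illustration, and the per-chunk work is just a count. It also assumes the dump is one JSON array with one entity per line, so each line may end with a comma and the first and last lines are the brackets.

```python
import bz2
import json
from pathlib import Path


def iter_entities(chunk_path):
    """Yield one parsed entity per line of a bz2-compressed chunk."""
    with bz2.open(chunk_path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            # Assumption: one entity per line inside a JSON array, so drop
            # the trailing comma and skip the "[" and "]" lines.
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)


def process_chunks(directory, start_at=0):
    """Walk split-NNNN.json.bz2 files in order, skipping finished ones."""
    # Zero-padded suffixes sort correctly as plain strings.
    chunks = sorted(Path(directory).glob("split-*.json.bz2"))
    for index, chunk in enumerate(chunks):
        if index < start_at:
            continue  # already handled in an earlier run
        # Placeholder for the real work: count the entities in this chunk.
        count = sum(1 for _ in iter_entities(chunk))
        yield index, chunk.name, count
```

After a crash in chunk 42, for example, calling process_chunks(".", start_at=42) would pick up again at split-0042.json.bz2 without decompressing the earlier files.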