Use Parallel to split by line – follow up

After using the parallel pattern described in my previous blog post a few times and improving the speed by quite a bit, I had to write a follow-up.

The previous solution with -N 1 spent too much time in IO wait. The easiest way to solve this was to pass a larger chunk of lines to each process, e.g. -N 1000.
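
For reference, the full command from the previous post then only differs in the -N value:

bzcat bigfile.json.bz2 | parallel -j 16 --pipe --block 100M -N 1000 python extract.py | gzip > output_`date +%s`.csv.gz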

For this to work, the script has to cope with more than one line. The example from the last post now looks like this:

import json
import sys

def extract(lines):
    for line in lines:
        line = line.strip()
        if line.endswith(","):
            # strip the trailing comma if the file is json and not jsonl
            line = line[:-1]
        if line.startswith(("[", "]")):
            # skip the enclosing brackets of a json array
            continue
        data = json.loads(line)
        value = data.get("value")
        if value:
            print(f"X:{value}", flush=True)

extract(sys.stdin.readlines())

The important changes are:

  • readlines() instead of readline() to read all lines from stdin

  • a loop over the lines instead of only one string

  • continue instead of return to process the remaining lines too
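
To check the multi-line handling without parallel, the script can be fed a couple of made-up records directly:

printf '{"value": 1}\n{"value": 2}\n' | python extract.py

which should print X:1 and X:2.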

For some of my bigger files, the speedup was more than 10x.

Use Parallel to split by line

The problem: process a big jsonl file line by line. The source file is bz2 compressed to save disk space.
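
To make this concrete, a decompressed input might look like this (made-up sample; one json object per line):

{"id": 1, "value": "foo"}
{"id": 2, "value": "bar"}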

Now use GNU parallel to split the file by line and run a command for every line.

bzcat bigfile.json.bz2 | parallel -j 16 --pipe --block 100M -N 1 python extract.py | gzip > output_`date +%s`.csv.gz

Parallel command options used:

-j 16

spawn 16 processes in parallel

--pipe

use pipe mode instead of the default argument mode

--block 100M

increase the block size from the default of 1M to 100M to be sure to get the full dataset -- this may not be necessary

-N 1

always give exactly one dataset (one line) to the called process -- see the small toy example after this list
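
A toy pipeline shows how --pipe and -N interact: with -N 2, each wc -l invocation sees at most two lines, so this prints 2, 2 and 1 (output order may vary):

seq 5 | parallel --pipe -N 2 wc -l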

Python usage example

The Python script reads from stdin and extracts the wanted information, for example:

import json
import sys

def extract(string):
    string = string.strip()
    if string.endswith(","):
        # if the file is json and not jsonl
        string = string[:-1]
    if string.startswith(("[", "]")):
        # skip the enclosing brackets of a json array
        return
    data = json.loads(string)
    value = data.get("value")
    if value:
        print(f"X:{value}", flush=True)

extract(sys.stdin.readline())
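
For a quick test without parallel, a single made-up record works too:

echo '{"value": 42}' | python extract.py

which should print X:42.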

Google Cloud Translate API Batch requests

After spending some time getting Google Cloud Translate API batch requests running, I'm documenting this here for future me.

This step-by-step post requires the Google Cloud SDK to be installed!

First, the API needs to be activated.

Second, we need a way to authenticate. I chose a service account with the rights to use the Translate API and to write to Google Cloud Storage. The service account key is downloaded as a json file, and its filename has to be set as an environment variable, e.g.

export GOOGLE_APPLICATION_CREDENTIALS=your-projectid-123456-d6835a365891.json
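
As an optional sanity check (not part of the original steps), a minimal sketch with the google-auth Python package, assuming it is installed, confirms that the credentials are picked up:

import google.auth

# google.auth.default() honors the GOOGLE_APPLICATION_CREDENTIALS
# environment variable set above
credentials, project_id = google.auth.default()
print(project_id)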

The API request is a json file too, with a specific structure. Mine looked like this:

{
  "sourceLanguageCode": "en",
  "targetLanguageCodes": ["ja"],
  "inputConfigs": [
    {
      "gcsSource": {
        "inputUri": "gs://YOUR-STORAGE-BUCKET/input/inputdata.tsv"
      }
    }
  ],
  "outputConfig": {
    "gcsDestination": {
      "outputUriPrefix": "gs://YOUR-STORAGE-BUCKET/output/"
    }
  }
}

Then I uploaded inputdata.tsv to Google Cloud Storage. I used the web interface, but gsutil -m cp inputdata.tsv gs://YOUR-STORAGE-BUCKET/input/ should work too.

And now, finally, the request to translate the tsv file:

curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://translation.googleapis.com/v3/projects/<PROJECT_ID>/locations/us-central1:batchTranslateText

Replace request.json with the filename of your json file (see above) and <PROJECT_ID> with the id of your Google Cloud Project.

The command returns the operation-id, e.g.

{
  "name":
      "projects/123456/locations/us-central1/operations/20210406-15021617746540-606bd714-0000-2d87-9290-001a114b3fbf",
  "metadata": {
      "@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateMetadata",
      "state": "RUNNING"
  }
}

This operation-id can be used to get the status of the translation request:

curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) https://translation.googleapis.com/v3/projects/<PROJECT_ID>/locations/us-central1/operations/20210406-15021617746540-606bd714-0000-2d87-9290-001a114b3fbf

For example:

{
  "name": "projects/123456/locations/us-central1/operations/20210406-15021617746540-606bd714-0000-2d87-9290-001a114b3fbf",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateMetadata",
    "state": "RUNNING",
    "totalCharacters": "19121",
    "submitTime": "2021-04-06T22:11:31Z"
  }
}
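
The submit-and-poll cycle can also be scripted. This is not what I used, but a minimal sketch with the google-cloud-translate Python client library (v3), assuming the package is installed and using the same placeholders as above, would look like this:

from google.cloud import translate_v3 as translate

client = translate.TranslationServiceClient()

# same parameters as in request.json above; replace <PROJECT_ID>
# and YOUR-STORAGE-BUCKET with your own values
operation = client.batch_translate_text(
    request={
        "parent": "projects/<PROJECT_ID>/locations/us-central1",
        "source_language_code": "en",
        "target_language_codes": ["ja"],
        "input_configs": [
            {"gcs_source": {"input_uri": "gs://YOUR-STORAGE-BUCKET/input/inputdata.tsv"}}
        ],
        "output_config": {
            "gcs_destination": {"output_uri_prefix": "gs://YOUR-STORAGE-BUCKET/output/"}
        },
    }
)

# result() blocks until the batch job is done, so no manual polling is needed
response = operation.result()
print(response.total_characters)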

When the operation is finished, the result can be downloaded from Google Cloud Storage via gsutil, e.g.

gsutil -m cp \
  "gs://YOUR-STORAGE-BUCKET/output/index.csv" \
  "gs://YOUR-STORAGE-BUCKET/output/YOUR-STORAGE-BUCKET_input_inputdata_ja_translations.tsv" \
  .