Google Cloud Translate API Batch requests

After spending some time getting the Google Cloud Translate API running with batch requests, I'm documenting it here for future me.

This step-by-step post needs the Google Cloud SDK installed!

First, the API needs to be activated.
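
This can be done in the web console; enabling it via the SDK should work too (assuming I remember the service name correctly):

gcloud services enable translate.googleapis.com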

Second, we need a way to authenticate. I chose a service account with the rights to use the Translate API and to write to Google Cloud Storage. The service account key is downloaded as a JSON file, and the file path has to be set as an environment variable, e.g.

export GOOGLE_APPLICATION_CREDENTIALS=your-projectid-123456-d6835a365891.json
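
Creating the service account and key also works from the CLI; a rough sketch with placeholder names (the exact role names are my assumption, adjust to your setup):

gcloud iam service-accounts create translate-batch
gcloud projects add-iam-policy-binding your-projectid-123456 \
  --member="serviceAccount:translate-batch@your-projectid-123456.iam.gserviceaccount.com" \
  --role="roles/cloudtranslate.user"
gcloud projects add-iam-policy-binding your-projectid-123456 \
  --member="serviceAccount:translate-batch@your-projectid-123456.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"
gcloud iam service-accounts keys create your-projectid-123456-d6835a365891.json \
  --iam-account=translate-batch@your-projectid-123456.iam.gserviceaccount.com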

The API request is a JSON file too, with a specified structure. Mine looked like this:

{
  "sourceLanguageCode": "en",
  "targetLanguageCodes": ["ja"],
  "inputConfigs": [
    {
      "gcsSource": {
        "inputUri": "gs://YOUR-STORAGE-BUCKET/input/inputdata.tsv"
      }
    }
  ],
  "outputConfig": {
    "gcsDestination": {
      "outputUriPrefix": "gs://YOUR-STORAGE-BUCKET/output/"
    }
  }
}
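
If I read the docs correctly, the input TSV may have one or two columns: an optional ID in the first column and the text to translate in the second, separated by a tab, e.g.:

id-1	The quick brown fox jumps over the lazy dog.
id-2	A second sentence to translate.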

Then I uploaded the inputdata.tsv to Google Cloud Storage. I used the web interface, but gsutil should work too:

gsutil -m cp inputdata.tsv gs://YOUR-STORAGE-BUCKET/input/

And now, finally, the request to translate the TSV file:

curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://translation.googleapis.com/v3/projects/<PROJECT_ID>/locations/us-central1:batchTranslateText

Replace request.json with the filename of your JSON file (see above) and <PROJECT_ID> with the ID of your Google Cloud project.

The command returns the operation ID, e.g.

{
  "name":
      "projects/123456/locations/us-central1/operations/20210406-15021617746540-606bd714-0000-2d87-9290-001a114b3fbf",
  "metadata": {
      "@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateMetadata",
      "state": "RUNNING"
  }
}

This operation ID can be used to get the status of the translation request:

curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) https://translation.googleapis.com/v3/projects/<PROJECT_ID>/locations/us-central1/operations/20210406-15021617746540-606bd714-0000-2d87-9290-001a114b3fbf

For example:

{
  "name": "projects/123456/locations/us-central1/operations/20210406-15021617746540-606bd714-0000-2d87-9290-001a114b3fbf",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.translation.v3.BatchTranslateMetadata",
    "state": "RUNNING",
    "totalCharacters": "19121",
    "submitTime": "2021-04-06T22:11:31Z"
  }
}
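
Instead of re-running that by hand, a small polling loop does the job (OPERATION_NAME is the full name from the first response; treating SUCCEEDED and FAILED as the terminal states is my assumption):

OPERATION_NAME="projects/123456/locations/us-central1/operations/20210406-15021617746540-606bd714-0000-2d87-9290-001a114b3fbf"
while true; do
  # extract the state field from the pretty-printed JSON response
  STATE=$(curl -s -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://translation.googleapis.com/v3/$OPERATION_NAME" | grep -o '"state": "[A-Z]*"')
  echo "$STATE"
  case "$STATE" in *SUCCEEDED*|*FAILED*) break ;; esac
  sleep 30
done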

When finished, the results can be downloaded from Google Cloud Storage via gsutil, e.g.

gsutil -m cp \
  "gs://YOUR-STORAGE-BUCKET/output/index.csv" \
  "gs://YOUR-STORAGE-BUCKET/output/YOUR-STORAGE-BUCKET_input_inputdata_ja_translations.tsv" \
  .

Google Cloud Storage cleanup

I've accumulated quite a lot of storage usage since I started using Google Cloud Run for a growing number of applications over the last months.
So I need to clean up a bit. Regularly.

If you have only a few applications this can be done manually on the Google Container Registry website: https://console.cloud.google.com/gcr/.
But I want to automate this.

The first step is to find out what is there:

gcloud container images list --repository eu.gcr.io/$PROJECT_ID

This lists all services with container images in my European registry folder (eu.gcr.io).

Next list all images in one of the folders:

gcloud container images list-tags $IMAGE --limit=unlimited

$IMAGE is the full name of the folder, e.g. eu.gcr.io/$PROJECT_ID/helloworld.

This returns a list of images with tags and dates, e.g.:

DIGEST        TAGS    TIMESTAMP
101211eecef5  latest  2021-01-19T14:01:04
83e3f7541cf5          2021-01-19T13:55:24

To delete an image we need its SHA256 digest.

The same command as before, but with a --format added at the end:

gcloud container images list-tags $IMAGE --limit=unlimited --format='get(digest)'

The result is only the SHA256 digests, without tags and dates. So it's quite hard to decide which images I want to delete. :(
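
--format can combine fields though; something like this should print digest and timestamp together:

gcloud container images list-tags $IMAGE --limit=unlimited \
  --format='get(digest, timestamp.datetime)'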

Now delete one of the images; for safety reasons I chose the oldest one:

gcloud container images delete -q --force-delete-tags \
  "eu.gcr.io/$PROJECT_ID/helloworld@sha256:b280f4f858492e50176470c26edb9cd4903cf69dc78070c806340edc1a1c84bc"

Doing this manually feels wrong and is quite time-consuming.

But others have had this problem before.
A good starting point for me was this gist: https://gist.github.com/ahmetb/7ce6d741bd5baa194a3fac6b1fec8bb7.
I made it a bit more verbose, but kept the approach of deleting everything up to a given date.
Some forks improved the gist to keep a given number of images.
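
The core of the gist boils down to something like this (the date filter syntax is taken from there; IMAGE and DATE are placeholders, and this deletes everything older than DATE, so better dry-run with an echo first):

IMAGE="eu.gcr.io/$PROJECT_ID/helloworld"
DATE="2021-01-01"
# list digests older than DATE, then delete them one by one
for digest in $(gcloud container images list-tags "$IMAGE" --limit=unlimited \
    --filter="timestamp.datetime < '$DATE'" --format='get(digest)'); do
  gcloud container images delete -q --force-delete-tags "$IMAGE@$digest"
done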

IMDB Ratings to SQLite

A few weeks ago I demoed Datasette at a small private conference.
One of my examples was the IMDB ratings CSV file converted to SQLite using csvs-to-sqlite.
The resulting SQLite database was usable but I was missing some facets and filtering possibilities.
To fix this I created imdb-to-sqlite.
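
For reference, the csvs-to-sqlite conversion itself is a one-liner (ratings.csv being the export from the IMDB website):

csvs-to-sqlite ratings.csv ratings.db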

imdb-to-sqlite adds the genres and directors as array fields.
Additionally, it adds year_rating and weekday_rating fields to make filtering on year and weekday easier.
The filters would be possible using SQL, but adding them as fields seems easier.
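
For comparison, the plain-SQL variant would look something like this via the sqlite3 shell (table and column names are my assumption, check the real schema with .schema):

sqlite3 ratings.db "SELECT title, strftime('%Y', date_rated) AS year_rating, strftime('%w', date_rated) AS weekday_rating FROM ratings LIMIT 5;"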

Try imdb-to-sqlite with your IMDB ratings and open issues if something should be added.