AllenNLP: sequence to sequence attention plots

As a follow-up to last week's post about implementing a reversing sequence-to-sequence model in AllenNLP, this post is about visualizing the attention.

As last week, this post was tested with Python 3.7 and AllenNLP 0.8.4.
All code is in this repository: https://github.com/mfa/allennlp-reverse-seq2seq/

To get the information needed to plot the attention, a few methods of the SimpleSeq2Seq class in simple_seq2seq.py have to be modified.

The lines changed compared to version 0.8.4 of AllenNLP:

diff --git a/simple_seq2seq.py b/simple_seq2seq.py
index 849da8a..9c7e3da 100644
--- a/simple_seq2seq.py
+++ b/simple_seq2seq.py
@@ -323,6 +323,7 @@ class SimpleSeq2Seq(Model):

         step_logits: List[torch.Tensor] = []
         step_predictions: List[torch.Tensor] = []
+        attn: List[torch.Tensor] = []
         for timestep in range(num_decoding_steps):
             if self.training and torch.rand(1).item() < self._scheduled_sampling_ratio:
                 # Use gold tokens at test time and at a rate of 1 - _scheduled_sampling_ratio
@@ -338,6 +339,8 @@ class SimpleSeq2Seq(Model):

             # shape: (batch_size, num_classes)
             output_projections, state = self._prepare_output_projections(input_choices, state)
+            if not self.training:
+                attn.append(torch.squeeze(state["attention_weights"]))

             # list of tensors, shape: (batch_size, 1, num_classes)
             step_logits.append(output_projections.unsqueeze(1))
@@ -358,6 +361,9 @@ class SimpleSeq2Seq(Model):

         output_dict = {"predictions": predictions}

+        if not self.training:
+            output_dict["attentions"] = torch.unsqueeze(torch.stack(attn), 0)
+
         if target_tokens:
             # shape: (batch_size, num_decoding_steps, num_classes)
             logits = torch.cat(step_logits, 1)
@@ -412,7 +418,8 @@ class SimpleSeq2Seq(Model):

         if self._attention:
             # shape: (group_size, encoder_output_dim)
-            attended_input = self._prepare_attended_input(decoder_hidden, encoder_outputs, source_mask)
+            attended_input, input_weights = self._prepare_attended_input(decoder_hidden, encoder_outputs, source_mask)
+            state["attention_weights"] = input_weights

             # shape: (group_size, decoder_output_dim + target_embedding_dim)
             decoder_input = torch.cat((attended_input, embedded_input), -1)
@@ -451,7 +458,7 @@ class SimpleSeq2Seq(Model):
         # shape: (batch_size, encoder_output_dim)
         attended_input = util.weighted_sum(encoder_outputs, input_weights)

-        return attended_input
+        return attended_input, input_weights

     @staticmethod
     def _get_loss(logits: torch.LongTensor,

The same diff in the github repository: https://github.com/mfa/allennlp-reverse-seq2seq/commit/d9ca4c9c5f8f489b14f091a974b7e7a9cdbd7fef

Additionally, the class name and the registered name are changed to avoid a naming conflict with the built-in model, as sketched below.
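
A minimal sketch of the renaming, assuming the file lives in the library package passed via --include-package below; the class and registered names here are my choice, the real file is in the repository linked above:

# library/simple_seq2seq.py
from allennlp.models.model import Model

@Model.register("my_simple_seq2seq")
class MySimpleSeq2Seq(Model):
    """Modified copy of AllenNLP's SimpleSeq2Seq, see the diff above."""
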
Now we have to use the new model in the configuration and train with the additional --include-package parameter:

allennlp train configurations/reverse_starting_point.json -s output --include-package library

For prediction we need a custom predictor that adds the source and target sequences to the output so they can be plotted later.
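
A minimal sketch of such a predictor, assuming the field names of the seq2seq dataset reader; the registered name matches the --predictor parameter below, and the real implementation is in the repository linked above:

# library/predictor.py
from allennlp.common.util import JsonDict
from allennlp.data import Instance
from allennlp.predictors.predictor import Predictor

@Predictor.register("my_seq2seq")
class MySeq2SeqPredictor(Predictor):
    def predict_instance(self, instance: Instance) -> JsonDict:
        output = super().predict_instance(instance)
        # keep the raw tokens so the plot script can label the axes later
        output["source_tokens"] = [str(t) for t in instance.fields["source_tokens"].tokens]
        output["target_tokens"] = [str(t) for t in instance.fields["target_tokens"].tokens]
        return output
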
To predict and then generate the plots, run:

allennlp predict output/model.tar.gz --use-dataset-reader examples.csv --predictor my_seq2seq --output-file output/examples.output --include-package library
python tools/plot_attention.py
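
The plot script could look like this minimal sketch using matplotlib; the JSON keys match the model and predictor changes above, everything else (file names, colormap) is my choice:

# tools/plot_attention.py
import json

import matplotlib.pyplot as plt

with open("output/examples.output") as fd:
    for index, line in enumerate(fd):
        data = json.loads(line)
        # "attentions" has shape (1, num_decoding_steps, source_length);
        # keep only as many rows as tokens were actually predicted
        attention = data["attentions"][0][: len(data["predicted_tokens"])]
        fig, ax = plt.subplots()
        ax.imshow(attention, cmap="Greys_r")
        ax.set_xticks(range(len(data["source_tokens"])))
        ax.set_xticklabels(data["source_tokens"])
        ax.set_yticks(range(len(data["predicted_tokens"])))
        ax.set_yticklabels(data["predicted_tokens"])
        fig.savefig(f"output/attention_{index}.png")
        plt.close(fig)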

One example plot looks like this:

[Image: attention plot for one example sequence]

AllenNLP: reverse sequence example

The joeynmt tutorial inspired me to replicate it using AllenNLP.

This post was tested with Python 3.7 and AllenNLP 0.8.4.

Data

Download the generator script from joeynmt

mkdir -p tools; cd tools
wget https://raw.githubusercontent.com/joeynmt/joeynmt/master/scripts/generate_reverse_task.py
cd ..

As a prerequisite to running the script you should already have AllenNLP installed (for numpy).

mkdir data; cd data
python ../tools/generate_reverse_task.py
cd ..

For AllenNLP's seq2seq dataset reader the data needs to be converted to tab-separated CSV.
paste should be part of every unix installation.

cd data
paste dev.src dev.trg > dev.csv
paste test.src test.trg > test.csv
paste train.src train.trg > train.csv
cd ..
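
Each line of the resulting files contains a source and a target sequence separated by a tab; the numbers are generated randomly, so your values will differ:

14 7 31 3 28	28 3 31 7 14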

Configuration

This configuration is as close as possible to the one given by the joeynmt tutorial (reverse.yaml).

{
  "dataset_reader": {
    "type": "seq2seq",
    "source_tokenizer": {
      "type": "word"
    },
    "target_tokenizer": {
      "type": "word"
    }
  },
  "train_data_path": "data/train.csv",
  "validation_data_path": "data/dev.csv",
  "test_data_path": "data/test.csv",
  "model": {
    "type": "simple_seq2seq",
    "max_decoding_steps": 30,
    "use_bleu": true,
    "beam_size": 10,
    "attention": {
      "type": "bilinear",
      "vector_dim": 128,
      "matrix_dim": 128
    },
    "source_embedder": {
      "tokens": {
        "type": "embedding",
        "embedding_dim": 16
      }
    },
    "encoder": {
      "type": "lstm",
      "input_size": 16,
      "hidden_size": 64,
      "bidirectional": true,
      "num_layers": 1,
      "dropout": 0.1
    }
  },
  "iterator": {
    "type": "bucket",
    "batch_size": 50,
    "sorting_keys": [["source_tokens", "num_tokens"]]
  },
  "trainer": {
    "cuda_device": 0,
    "num_epochs": 100,
    "learning_rate_scheduler": {
      "type": "reduce_on_plateau",
      "factor": 0.5,
      "mode": "max",
      "patience": 5
    },
    "optimizer": {
      "lr": 0.001,
      "type": "adam"
    },
    "num_serialized_models_to_keep": 2,
    "patience": 10
  }
}

Change cuda_device to -1 if you have no GPU.
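
Instead of editing the file, the setting can also be overridden on the command line with allennlp's -o/--overrides parameter:

allennlp train reverse_configuration.json -s output -o '{"trainer": {"cuda_device": -1}}'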

How to train

allennlp train reverse_configuration.json -s output

How to predict one sequence

echo '{"source": "15 28 32 4", "target": "4 32 28 15"}' > reverse_example.json
allennlp predict output/model.tar.gz reverse_example.json --predictor simple_seq2seq

Results in:

{
  "class_log_probabilities": [-0.000591278076171875, -9.015453338623047, -9.495574951171875, -9.83004093170166, -10.022026062011719, -10.089068412780762, -10.098409652709961, -10.247438430786133, -10.416641235351562, -10.431619644165039],
  "predictions": [[24, 17, 22, 4, 3], [24, 16, 22, 4, 3], [24, 15, 22, 4, 3], [24, 35, 22, 4, 3], [24, 17, 6, 4, 3], [24, 17, 14, 4, 3], [24, 17, 22, 12, 3], [24, 17, 22, 10, 3], [24, 17, 22, 5, 3], [24, 11, 22, 4, 3]],
  "predicted_tokens": ["4", "32", "28", "15"]
}
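
The predictions field contains the beam (beam_size is 10 in the configuration) as sequences of vocabulary indices; predicted_tokens is the best sequence mapped back to tokens. The mapping can be reproduced from the archived vocabulary, a minimal sketch (the default target namespace of simple_seq2seq is "tokens"):

from allennlp.models.archival import load_archive

archive = load_archive("output/model.tar.gz")
vocab = archive.model.vocab
# best beam from the output above, including the end-of-sequence index
best_beam = [24, 17, 22, 4, 3]
print([vocab.get_token_from_index(i, namespace="tokens") for i in best_beam])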

Some training insights

The plots (from TensorBoard) show the expected training and validation loss. The BLEU score is nearly 1.0.
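
AllenNLP writes the TensorBoard event files into the serialization directory, so the plots can be reproduced with:

tensorboard --logdir output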

Training loss:

[Image: training loss curve]

Validation loss:

[Image: validation loss curve]

BLEU:

[Image: BLEU score curve]

Future work

The attention visualizations shown in the joeynmt tutorial are not yet implemented in AllenNLP. This will be a future blog post (hopefully).

SSH jump host

I am using a server in my tinc VPN to jump into a private network that is behind a NAT router.

Using the ssh command line:

ssh -J 10.1.7.25 root@192.168.1.44

Where 10.1.7.25 is the jump host and root@192.168.1.44 is the server in the private network.

Or by using the ~/.ssh/config file:

Host jump
  Hostname 192.168.1.44
  User root
  ProxyJump 10.1.7.25

And connect using ssh jump.
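
The Host entry works for every tool that runs over ssh, for example scp (the file name is only an example):

scp backup.tar.gz jump:/tmp/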