Learning about the Attention Mechanism and the Transformer Model

Baptiste Amato, Alexis Durocher, Gabriel Hurtado, Alexandre Jouandin, Vincent Marois

Georgia Tech - Spring 2019 CS 7643 Deep Learning Class Project - Prof. Zsolt Kira


We propose an in-depth analysis and reimplementation of the Transformer model (Vaswani et al., NIPS 2017). Its non-recurrent behavior and sole use of attention makes it an intriguing model to analyze. We perform a hyper-parameters search, as well as a memory-profiling study, both of these allowing us to successfully train and semantically evaluate the model on the IWSLT TED Translation task. Our experiments further enable us to detail particular insights on the behavior of the model and its training process. This article is aligned with the current question of reproducibility in Deep Learning Research.

Teaser figure

Here is an image of the Transformer architecture, drawn from here:

The architecture of the Transformer model.

Introduction, Background and Motivation

The Transformer model, published in 2017 by Vaswani et al (cite). has established state-of-the-art results in Neural Machine Translation, using an Encoder-Decoder architecture which does not present recurrence, breaking with previously established models (the same motivation can be found in other models, such as ConvS2S (cite) and ByteNet (cite)), supposedly yielding major improvements in computational complexity.

Additionally, it only relies on the attention mechanism (cite) to compute representations of its inputs and outputs. Intrinsically, this makes it an exciting problem to study. Moreover, several new models using its architecture have been published (Universal Transformers (cite), OpenAI's GPT-2 (cite), and more recently their Sparse Transformer (cite)), reinforcing its interest as a robust neural brick for complex models.

The Attention Mechanism

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

Dot-product attention can be written as: $$Attention(Q, K, V) = Softmax(QK^T)V$$ The paper proposes scaled dot-product attention, where the dot product is divided by the dimensionality of the keys.

A sole instance of scaled dot-product attention is not sufficient to track the various dependencies between distant positions in the sequence. This is counter-acted with multi-head attention, where the attention is the concatenation of 8 parallel dot-product attention heads. This allows the model to jointly attend to information from different representation subspaces at different positions.

The Transformer's architecture

The Transformer model can be described as follows:

  • It follows an encoder-decoder architecture, well-known for sequences transduction tasks, such as neural machine translation,

  • Both the encoder and the decoder are a stack of N identical layers,

  • The Encoder layer is constituted of a self multi-head attention block (i.e. where the queries, keys and values are all equal), followed by a feed-forward block. Both blocks are augmented by residual connections.
  • The Decoder layer is constituted of a self multi-head attention block, followed a multi-head attention block linking the output of the encoder stack (i.e. the queries come from the Decoder, while the values & keys come from the Encoder), followed by a feed-forward block. All blocks are augmented by residual connections.
  • A final softmax classifier generates next-token probabilities on the output vocabulary set.
  • The model contains 65M parameters, all learnable.

The model's hyper-parameters

The model contains the following hyper-parameters:

  • N: The number of layers in the encoder & decoder stack (default: 6),
  • d_model: the overall dimensionality of the model (for the attention vectors, inputs / outputs shape etc.) Set to 512.
  • n_head: Number of heads in the multi-head attention blocks.
  • dropout: Dropout is used at several locations in the model (e.g. in the residual connections). Set to 0.1.

N, d_model and n_head impact the complexity of the model. The original paper actually mentions 2 versions of the Transformer:

  • A base version, with the default values as above,
  • A big version, where d_model = 1024 and n_head = 16*.

Yet, while the paper is clearly written, several questions come to mind when reading it, and designing its implementation:

  • Is the model's behavior identical during training and inference? Indeed, given that the model is not recurrent, can it perform greedy decoding (i.e. step-by-step prediction of the next word, starting from an initial token)?
  • What are the mechanisms of the model to replace recurrence, if any?
  • What is the sensibility of the model's training to hyper-parameters?
  • Can the results shown in the paper be reproduced, or approached, with limited time and computational resources?
The problem

Therefore, the problem we are tackling in this project is centered around the issue of reproducibility. This question has recently been tracting more attention from the research community (for instance, a workshop at ICLR 2019 is dedicated to it), and is thus a valid research question.


Our objectives were thus to:

  • Deeply understand the architecture of the Transformer model,
  • Reimplement (correctly) the Transformer model,
  • Reproduce its training process and have the model converging to acceptable results,
  • Answer the above questions, i.e. provide additional food for thoughts on the model to the research community, from our student's perspective.

How is it done today, and what are the limits of current practice?

Several references can be indicated here:

  • The original paper, from Vaswani et al., published at NIPS 2017. They implemented the model in the tensor2tensor framework (cite, now part of Tensorflow), and trained it for 3.5 days on 8 GPUs. A limitation here is thus in terms of available computing resources. Realistically, we are not able to get access to similar machines, which motivates a deeper analysis of the training procedure, to understand which characteristics are necessary, and others optional.
  • The Annotated Transformer from Harvard's NLP group. This is an detailed walkthrough of the original paper, with code snippets showing how to implement the model's architecture. While this is a great resource to understand the model, its limitations concerns the training process. Indeed, it was not designed to perform a hyper-parameter search, or to be deployed on GPU instances for training. Thus, this will be an additional focus for us.
  • The Transformer is a relatively popular model, and several repositories propose a reimplementation in diverse frameworks (a rapid search shows ~150 repositories existing on GitHub).
Potential impact

Who cares? If you are successful, what difference will it make?

While we, from a realistic point of view, do not expect to propose significant improvements on the Transformer architecture, we do hope to provide significants insights on the model, its training & inference behavior, as well as a basis for further investigations. For instance, we have implement an hyper-parameter search (for both the training's parameters and the model structure): we hope this will serve to the research community for future models based on the Transformer.

The code repository can be found here.


We can now detail our approach.


From our initial dataset (IWSLT 2014 TED Translation), a word vocabulary is generated. Following, all encountered words are tokenized, so that close words are associated to the same token (i.e. index) and rarely used words are considered as "unknown" (replaced by a <unk> token).

The last step of data pre-processing is to embed the tokenized sentences: every tokenized word of the vocabulary is associated to a vector (of 512 units). Different embeddings methods exist (such as GloVe (cite) or Word2Vec (cite)); we followed the method of the paper: trainable, random initialized embeddings.

The model thus takes in a batch of tokenized sentences (as the embeddings look-up-tables changes with backpropagation).


A prediction consists of an estimated distribution over the output vocabulary. We use the KL Divergence loss to compute the "distance" between the model's predictions and the ground truth labels. Label smoothing (cite), i.e regularizing the model by penalizing over-confident output distributions, is used with an initial smoothing factor of 0.1.

The optimizer is an adapted version of Adam (cite), with a specific learning rate decay guided by the following: $$lrate = d_{\text{model}}^{-0.5}\cdot \min({step\_num}^{-0.5}, {step\_num} \cdot {warmup\_steps}^{-1.5})$$ The warmup_steps factor hence controls when the learning rate starts decaying after its initial increase.

Backpropagation is done after every step; we compute the validation loss at the end of each epoch.

Initial implementation and test on the copy task

We started by implementing the model (using PyTorch v1.0), dividing each architectural blocks (EncoderLayer, DecoderLayer, MultiHeadAttention, ResidualConnection etc.) into classes, and ensured all were passing simple unit tests (i.e. verifying the shape of the output of the forward pass, the absence of NaN etc.).

This took about a week and half to finish. At this stage, we decided to ensure that the flow of the overall model (i.e. a stack of 6 EncoderLayer, connected to a stack of 6 DecoderLayer, each layer having several sub-blocks) was correct. Thus, we wanted to verify if the model was able to fully converge on a simple algorithmic task: copying its inputs to outputs. Indeed, a working implementation of such a heavy model should not encounter any issue copying inputs to outputs.

We thus created a dataset generating random algorithmic sequences and trained the model on it. Doing so further helped us on 2 aspects of the training:

  • Adapt the code for CUDA support. This requires several steps, such as loading the model into GPU memory, converting the input tensors to CUDA types etc.
  • Debug the model. An aspect we initially struggled with was the use of masking in the model's forward pass. Masks are used for:
    • Hiding the padding elements in the batches. Indeed, a batch is constituted of several sequences (algorithmic or tokenized sentences), each having a different length. In order to have a fixed batch shape, the shorter sequences are padded (e.g. with 0s) to match the length of the longest one. A boolean mask is generated alongside the batch and passed to the model in order to hide padding elements.
    • Masking out subsequents elements of a sequence in the Decoder. As the model is not recurrent, the entire input sequence is passed at once in the Encoder, which does not cause problems (the model is thus able to learn how to relate words at different positions in the sentence). During training, the entire target sentence is fed to the Decoder (in a teacher forcing approach), essentially adding an extra dimension to the tensors (and thus benefiting from parallelism). Nevertheless, in order to simulate greedy decoding during training, but keep the advantage of the parallelism, a mask is created so that elements at position i in a sequence cannot be related to elements at position i+1 and above. This is done to prevent an invalid leftward information flow, and preserve the performance during inference (which is obligatorily step-by-step, i.e. we iteratively feed the prediction of the model back into the input of the Decoder, and cannot look ahead).

Training and Validation loss of the model on the copy task.

Figure 1: Training and Validation loss curves of the model on the copy task.

As expected, the model was able to quickly & fully converge on this task, which gave us confidence that the model's forward pass was (most likely) correct.

Training on the real dataset

We then processed to implement all missing requirements for the training, and optimize our implementation:

  • Collecting the statistics (loss, episode index etc.) to file and to TensorBoard,
  • Implement the dataset and analyze it. We explain our approach below,
  • Implement multi-GPU support (this is easily done using PyTorch's DataParallel mechanism),
  • Parameterize the training (specify a random seed, batch size etc.)
  • Profile and try to reduce the memory usage of the model.

Analyzing the dataset delivered some interesting insights. For specifications, we used the IWSLT 2014 TED Translation dataset (available in torchtext), with:

  • 220k training samples, 1025 for validation, 1305 for test,
  • Average sentence length: 20 (train) - 21 (val) - 19 (test),

Plotting the distribution of the samples with respect to the sequence length gave the following graphs:

Histogram of the training set with respect to the sequence length.

Figure 2: Histogram (normalized and cumulated) of the training set with respect to the sequence length.

We can notice a long-tail distribution on the sequence length, meaning that for a sequence length of 40 (over a max of 102), we get 90% of the training set. A shorter sequence length provides the benefit of a larger batch size (for the same memory use), which generally stabilizes the training.

Convergence issue

What problems did you encounter?

Having analyzed the dataset, we started by first running the model on 40% of the training set (corresponding to a sequence length of 18) for 15 epochs to observe if we could get a correct convergence with the default hyper-parameters as indicated in the paper. We expected the validation loss to plateau, if not increase, which is what we observed:

Histogram of the training set with respect to the sequence length.

Figure 3: Training (orange) and validation (blue) loss evolution over 10 epochs on 40% of the training set.

As can be seen, the validation loss clearly plateaus, and increases at the end of training, indicating an overfitting. Several causes could be pointed out here, such as a need for better regularization or more data.

Additionally, several hyper-parameters were present in the training, and we suspected that they would greatly influence the convergence of the model. We decided to perform a hyper-parameter search (using Google Cloud's Hyperparameter Tuning feature; please refer to the README of the code repository on more indications on how to run the hyper-parameters search).

We found a working combination of hyper-parameters and will analyze the training we obtained in the Experiments section.

Memory Use issue

Once the maximum sequence length fixed, we started experiments on GPUs. We quickly ran into issues of memory, the full model (6 layers in both the Encoder and Decoder) taking up a lot of space, leaving little for the batches, thus forcing a small batch size.

To solve this issue, we profiled the memory usage of the model's forward pass, to see where the bottleneck was. We present the results in the next section and, to the best of our knowledge, this represents an analysis not previously done for the Transformer (Is anything new in your approach?).

To address the issue, we also:

  • pinned the generation of the datasets to CPU, and only move the batch to CUDA memory when needed,
  • Implemented multi-gpu support. This allows a larger batch (e.g. of size 4*128 = 512) to be split on several GPUs, each receiving a chunk of the batch (e.g. 128 samples for 4 GPUs). Yet, this only partially solves the issue, as according to our understanding and analysis, one of the GPU devices behaves as the master node and thus sees a higher memory usage as it computes gradients and backpropagates them. The other GPUs (behaving as slave nodes) are only used for computing a forward pass. This may be linked to a PyTorch limitation.

Experiments and Results

How did you measure success? What experiments were used? What were the results, both quantitative and qualitative? Did you succeed? Did you fail? Why?

We present here our experiments and results.

Memory Profiling

We recorded the memory usage (on a Tesla M60 with 8Gb of memory) at different instants of the forward pass (before and after the encoder, when computing loss etc.) over an epoch, to observe where the memory increase happens.

The following graph shows the memory use evolution over the first 4 iterations:

Memory Use evolution over the first 4 iterations.

Figure 4: Detailed Memory Use evolution over the first 4 iterations.

Several remarks can be done here:

  • The first iteration shows memory increases caused by the Encoder and the Decoder, which indicates that the corresponding weight matrices are being loaded into memory.
  • The loss computation provokes a memory increase several times over the course of the first 4 iterations. Our hypothesis is that this is linked to the computations of the gradients, which takes up space. Yet, this space is not freed at each iteration, but appear to be cached, which is why the memory use continues to initially increase.
  • While it is not shown here, after the memory peak at ~7Gb, it stabilizes at 6Gb over the rest of the epoch. This indicates that PyTorch is most likely optimizing by caching up tensors initially, and reusing these afterwards for a higher efficiency (we may think that deallocating & reallooating memory at each iteration slows training down).

We have been able to alleviate the memory bottleneck by using GPUs with more memory (12Gb, and Colab recently added Tesla T4 with 16Gb of memory).

Hyper-parameter tuning

To obtain a set of parameters which would lead to convergence during training, we decided to perform a hyper-parameter tuning on a subset of the training parameters.

Here is the subset of parameters we tuned:

  • warmup: The warmup factor hence controls when the learning rate starts decaying after its initial increase.
  • smoothing: the KL Divergence loss uses label smoothing (cite). This parameter thus controls the probability of the ground-truth labels.
  • beta_0, beta_1: the parameters of the underlying Adam (cite) optimizer.

These parameters were tuned using Google's AI Platform. We initially used a validation loss summed over all batches of our validation, but this validation loss was impacted by the smoothing parameter that we tuned. As such, it wasn't very relevant to compare validation loss values from two different trials with different smoothing values.

A measure such as BLEU would mitigate this kind of issue: BLEU is "smoothing-agnostic", and considers the semantic consistency between the target and predicted sequence.

Nevertheless, this HyperParameter search steered our choice of values when training the final models.

Training & Validation loss of the final model.

Figure 5: A few hyper-parameter tuning trial results from Google's HyperTune.

We have initially rapidly observed a relation between warmup and smoothing, so we wanted to test several values for these 2 parameters. We added beta_0, beta_1 as well as we hypothesized that their values would be influenced by warmup.

The optimal validation loss was obtained for the following combination of the above parameters: $$warmup = 9505, smoothing= 0.241, \beta_0=0.916, \beta_1=0.907$$

Following, we trained a Transformer model from scratch using the above combination of parameters. We used a Tesla T4 with 16Gb of memory. We trained on 90% of the training set with a batch size of 128 over 25 epochs. Training took 8.5h (~21 minutes / epoch).

Training & Validation loss of the final model.

Figure 6: Training (orange) and validation (blue) loss evolution of the best model over 25 epochs on 90% of the training set.

This model reached a loss of 0.7574 on the validation at the end of training, with a best obtained loss of 0.6691 (at epoch 19). There thus has been a slight overfitting towards the end of the training, but this still represents a major improvement over our initial runs where the validation loss would quickly plateau.


In this section, we analyze the results and insights we have been able to gather.

Training, validation and testing workflow

The Transformer model is stated as being a non-recurrent deep-learning model. While this is true during training, this statement is, from our opinion, misleading during inference.

During training, recurrence is actually simulated by using a mask which hides the subsequent words of the sentence.

As the model is not recurrent, the entire input sequence is passed at once in the Encoder. This essentially adds an extra dimension to the model, allowing to benefit from parallelism for a lower computational cost.

The entire target sentence is also fed to the Decoder (in a teacher forcing approach). Nevertheless, in order to simulate greedy decoding, but keep the advantage of the parallelism of the model, a mask is created so that elements at position i in a sequence cannot be related to elements at position i+1 and above.

However, during inference, the model works in a greedy way, by iteratively predicting (and concatenating) each word of the sentence. Hence, the same (learned anf now fixed) function is recurrently applied to output the next word prediction until the end-of-sentence token is outputted. We thus lose the benefit of no recurrence here, which will slow inference down. We expect the inference time to be linear in the number of words to predict.

Semantically testing the model

The task at hand is to translate from French to English sentences.

To test the model's semantic performance, we use the BLEU (bilingual evaluation understudy) score (cite), to evaluate the quality of text which has been machine-translated from one natural language to another. The BLEU score thus compares (on different scales, i.e. considering variable-sized groups of words) a translation prediction and a target.

Another metric is simply the human evaluation. Below are several samples from the test dataset - i.e unseen by our model during training nor during hyperparameters fine-tuning. An informed reader can then judge the quality of the translation.

Sample 1:

French source: C'est un soleil placé avec l'origine , car le Japon est à l'Est de la Chine .
English groundtruth (target): This is a sun placed with the origin , because Japan lies to the east of China .

English traduction (output): It 's a sun put in place , because Japan is in the east of China .

BLEU score (normalized between 0 and 1): 0.3414

Sample 2:

French source: Si quelqu'un marche derrière , ça veut dire « suivre » .
English groundtruth (target): If someone walks behind , that is '' to follow . ''

English traduction (output): If someone walks behind it , it means '' follow . ''

BLEU score (normalized between 0 and 1) 0.4682


From the sample 1, we can see that the model correctly identified the topic, the structure and the semantic of the sentence. However, we also note that the model translated certain French words in a litteral manner, instead of using the more appropriate and possibly sophisticated English words. For instance, "C'est" and "est" have been translated to "It's" and "is" instead of the targets "This is" and "lies".

This might be due to the fact that direct French-English traductions such as "est" in "is" are much more frequent in our dataset (e.g. several sentences contain formations such as "Il est" and "He is"). The model was not trained enough to realize that given the "sun" subject, "est" should be translated to "lies" instead of "is".

From sample 2, we can see that the model again correctly identified the topic, the structure and the semantic of the sentence. Even more impressively, it managed to provide a sentence which may appear more understandable than the original target. Again, the model translated directly french words. Indeed, "ça veut dire" has been translated in "it means" instead of the target "that is". We believe that again, this is a matter of training occurences.

Overall, we find the translations to be relatively good, given the lower amount of training (25 epochs - 8.5h) and the complexity of this model.

Visualizing the Attention

Finally, we have written a script to visualize the different attention layers of the model.

Given the plurality and diversity of the use of attention in the Transformer model, we only display here some of visualizations matrices in the encoder and decoder.

We obtained the following attention matrices for the sample 1: 4 of the 8 heads in the first layer of the encoder stack.

4 of the 8 heads in the first layer of the encoder stack.

4 of the 8 heads in the last layer of the encoder stack.

4 of the 8 heads in the last layer of the encoder stack.

4 of the 8 heads in the first layer of the decoder stack (memory attention).

4 of the 8 heads in the first layer of the decoder stack (memory attention).

4 of the 8 heads in the last layer of the decoder stack (memory attention).

4 of the 8 heads in the last layer of the decoder stack (memory attention).

4 of the 8 heads in the first layer of the decoder stack (self attention).

4 of the 8 heads in the first layer of the decoder stack (self attention).

4 of the 8 heads in the last layer of the decoder stack (self attention).

4 of the 8 heads in the last layer of the decoder stack (self attention).

We can observe an evolution of the attention weights between the first and the last layer of both the encoder and the decoder. For instance, we see a "refinement" and sharpening of the attention weights between the first layer and the last layer in the memory attention block of the decoder. This implies that the decoder is able to iteratively, from the first to the last layer, refine the relationships between its predictions and the output of the encoder.

Interestingly, the inverse phenomenon is observed for the self attention block of the decoder. The attention here appears more "diluted" across the sentence, implying that each word now relates to a bigger part of the sentence in the last stack of the decoder than in the first.

Team Member Identification

All project members provided equal contribution in the undertanding of the Transformer model and project webpage writing.

Name Description
Baptiste Amato Multi-GPU Support & Cloud instances script setup, Implementation of the optimizer / loss / training loop, hyper-parameters search
Alexis Durocher Initial implementation of the main classes, Dataset implementation, memory profiling, attention visualization
Gabriel Hurtado Dataset implementation, Implementation of the optimizer / loss / training loop, final model training
Alexandre Jouandin Initial implementation of the main classes, Dataset implementation, Convergence on copy task, hyper-parameters search
Vincent Marois Implementation of the optimizer / loss / training loop, Convergence on copy task, memory profiling

Future work - Axes of Research

In this section we develop some ideas that we believe could be interesting to dig into, after this project.

  • First, we think that the loss used to train the Transformer - the KL divergence - may not be the most appropriate one for neural machine translation. Indeed, to the best of our knowledge, the KL divergence does not consider the semantic of the vocabulary to evaluate the performance of the model and hence adapt the gradient magnitude during training. Currently, considering a target word such as "king", a predicted word such as "prince", close semantically, would have the same rate of error as "machine" - which is further in meaning. Thus, considering semantic distances (e.g. synonyms) could possibly yield improvements in the translation quality. An other metric - such as BLEU - would consider the semantic and hence provide a better evaluation of the model performance. However, BLEU is not differentiable, hurting its capacity to be used in the context of backpropagation.
  • Finally, the smoothing factor used in the KL Divergence loss helps reducing the magnitude of the loss in the first iterations of training - i.e. when the error rate is relatively high - by "smoothing" the targets vector using an almost uniform distribution of non-confidence (ref. Vaswani et al.). The loss is thus reduced compared to no smoothing, as for a specific word target, the rest of the output vocabulary has a very little yet non-zero probability. We believe that adding a decaying smoothing factor during training could yield an interesting research axis. Progressively reducing the smoothing factor would keep the benefit of smoothing the loss value for the first epochs of training, without impacting the training in the last epochs, where the model hopefully reaches a high confidence level.
In [ ]: