What is attention and how does it work?
What we generally call Artificial Intelligence is, in most cases, actually machine learning; "Artificial Intelligence" or AI is simply the catchier phrase, so people prefer to use it for machine learning.
If we look at a definition of machine learning, it goes like this:
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” — Tom Mitchell
This is a very clear definition of machine learning and covers the important components: "learning from experience" and "improvement in performance on some task with experience".
In the above definition, "experience" means data. There can be many types of learning, and one of the most common is to look at examples and learn from them; this is what supervised machine learning does. However, we can also learn by selectively paying attention to some parts of the data for a particular task. What we are then learning is what is relevant and what is not, and this is broadly what "attention" achieves. Now let us look at the technical part.
If we look at any form of data, we can notice one important feature: the ordering of the data points as we receive or process them is either irrelevant or important. For example, if we are making a list of items sold in a departmental store on a single day, the order in which we record them does not matter; the same holds for the exam marks of students. There is, however, a kind of data where the ordering matters, such as time series and text. Since here we will mostly focus on textual data, let us look at some of its important properties.
The meaning of a sentence (semantics) and its grammatical structure (syntax) depend on the ordering of the words: "the dog chased the cat" and "the cat chased the dog" contain the same words but mean different things. As I discussed in my earlier article on sequence to sequence models, an LSTM-based encoder-decoder works well for data like text, where ordering matters and we therefore need memory. However, sequence to sequence models have two limitations.
- Sequence to sequence models are strictly sequential in nature, so they are very difficult to parallelize, which creates a performance bottleneck.
- A sequence to sequence model squeezes all the information of the input sentence (in the case of language translation) into a fixed-size context vector, so for long sentences it fails to capture information from words/tokens far in the past. In other words, these models have limited memory.
The attention mechanism was proposed in order to overcome these issues. Its development went through two main stages. In the first stage, only the long-term memory issue of sequence to sequence models was addressed, by adding an attention mechanism to the model. A few papers proposed this scheme; one of them, "Neural Machine Translation by Jointly Learning to Align and Translate", is available at https://arxiv.org/abs/1409.0473.
Before I discuss the attention mechanism, it is important to understand the following key equations:
Bayesian formalism is extremely important in the statistical modeling of text or language. Any sentence is made of tokens or words arranged as a sequence (ordering is important). Language models, as shown in the equations below, give us the probabilities of different sequences of tokens. For example, Prob("The sky is blue") may be 0.34 while Prob("Green under the pen") may be just 0.003.
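The two relations in play can be sketched as follows (a standard formulation: the product rule for joint probabilities, and the chain rule that lets a language model score a whole sequence of tokens):

P(A, B) = P(A|B) · P(B)

P(y[1], y[2], ..., y[n]) = P(y[1]) · P(y[2] | y[1]) · ... · P(y[n] | y[1], ..., y[n-1])

The second line says that the probability of a whole sentence factorizes into the probability of each token given the tokens that came before it.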
Here we must understand two kinds of probabilities: 1) joint probability and 2) conditional probability. In the above equation, P(A,B) is the joint probability: the probability of events 'A' and 'B' taking place together. For example, for a six-faced die, P("the number is even", "the number is larger than 5") is 1/6, since only '6' satisfies both conditions.
The conditional probability P(A|B) is used when we ask for the probability of some event 'A' given that some other event 'B' has already happened.
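To make the dice numbers concrete, here is a throwaway enumeration (purely illustrative, not part of the translation code given further below):

# Enumerate the outcomes of a fair six-faced die
outcomes = [1, 2, 3, 4, 5, 6]
# Joint probability: "even" AND "larger than 5" -> only the outcome 6 qualifies
p_joint = len([n for n in outcomes if n % 2 == 0 and n > 5]) / len(outcomes)
# Conditional probability: "even" GIVEN "larger than 5"
p_cond = len([n for n in outcomes if n % 2 == 0 and n > 5]) / len([n for n in outcomes if n > 5])
print(p_joint, p_cond)   # -> 0.1666..., 1.0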
In language modeling we are interested in predicting the probability of the next word. Let us assume the size of our vocabulary is 10,000 and we want to predict the next word for the sentence "The sky is ...". Our language model must give us the probabilities of all the words in our vocabulary, and from these we can predict the word for which the probability is maximum. In this case it will most probably be "blue".
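As a toy illustration (the vocabulary and probabilities below are made up, not taken from any real model), greedy next-word prediction simply picks the highest-probability entry from the distribution the language model returns:

import numpy as np

# Hypothetical vocabulary and a made-up probability distribution a language
# model might return for the prefix "The sky is ..."
vocab = ["blue", "green", "pen", "under", "the"]
probs = np.array([0.62, 0.11, 0.02, 0.01, 0.24])

# Greedy prediction: take the word with the maximum probability
next_word = vocab[int(np.argmax(probs))]
print(next_word)   # -> blue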
Equation (4) in the above picture shows the conditional probability P(y[i] | y[1], y[2], ..., y[i-1], x) of the next token y[i], given the tokens already generated and the input sequence x.
If you remember the sequence to sequence model, you may also remember that when the decoder generates the next token it uses the previous token, the hidden state, and the context vector (shown as 'c_i' in the above equation).
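Roughly, in the notation of the paper linked above, the decoder models each output token as

P(y[i] | y[1], ..., y[i-1], x) = g(y[i-1], s[i], c[i])

where g is the decoder network, s[i] is its hidden state at step i, and c[i] is the context vector.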
There is a serious problem with the context vector we get from the encoder and use for generating tokens from the decoder: it has no separate information about the individual tokens/words given to the encoder for translation. In other words, squeezing all the information about the input sentence into a single vector is a bottleneck in machine translation.
The attention mechanism was proposed in order to fix the issue of the fixed-size context vector and to incorporate information about all the input tokens fed to the encoder. In this mechanism, in place of using the context vector from the hidden state of the last encoder token, we take a weighted sum of all the encoder hidden states (corresponding to all the input tokens). The weights, or 'attention coefficients', are learned by the system during training. The attention mechanism helps the decoder to focus on particular input words when generating a token, for example when doing translation.
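A sketch of that weighted sum, following the same paper: if h[j] is the encoder hidden state for input token j and s[i-1] is the previous decoder state, then

c[i] = Σ_j α[i,j] · h[j]
α[i,j] = exp(e[i,j]) / Σ_k exp(e[i,k])   (a softmax over the input positions)
e[i,j] = a(s[i-1], h[j])   (a small learned alignment/scoring network)

The attention coefficients α[i,j] are exactly the weights the model learns during training; they tell the decoder how strongly to attend to each input position when generating token i.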
The heat map of the attention coefficients for an English-French translation (from the paper mentioned above) illustrates this. Along the top are the tokens given to the encoder, and from top to bottom on the left are the tokens produced by the decoder. The attention coefficients increase with the brightness of the squares in the heat map. For example, when the decoder produces 'accord' it pays more attention to the word 'Agreement', and the same holds for '1992', which is common to both the input sequence passed to the encoder and the output sequence produced by the decoder.
Here is a full sequence to sequence code with attention that can be used to translate from English to French.
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Attention
from tensorflow.keras.utils import plot_model
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from tensorflow.keras.optimizers.legacy import Adam

input_data_path = r"C:\Users\jayanti.prasad\Data\NLP_DATA\seq2seq_data\fra.txt"
def build_vocabulary(input_data, output_data):
    # Build a frequency-sorted vocabulary (as a DataFrame) for the input and output texts
    V = []
    for data in [input_data, output_data]:
        tokens = " ".join(data).split(" ")
        vocab = Counter(tokens)
        token_dict = dict(vocab)
        df = pd.DataFrame(columns=['word', 'count'])
        df['word'] = list(token_dict.keys())
        df['count'] = list(token_dict.values())
        df = df.sort_values(by=['count'], ascending=False, ignore_index=True)
        V.append(df)
    return V
def text2vec(input_text, df_vocab, vocab_size, vec_len):
    # Map each text to a list of integer token ids, keeping only the
    # vocab_size most frequent words and truncating to vec_len tokens
    D = df_vocab.iloc[:vocab_size]
    ids = [i for i in range(0, len(D))]
    D = D.assign(id=ids)
    D.index = D['word'].to_list()
    text_vecs = []
    for text in input_text:
        words = text.split(" ")
        words = [w for w in words if w in D.index]
        vec = [D.loc[w]['id'] for w in words]
        text_vecs.append(vec[:vec_len])
    return text_vecs
class LSTM_Attention:
    def __init__(self, workspace_dir, num_encoder_tokens, num_decoder_tokens, latent_dim):
        self.workspace = workspace_dir
        self.model_dir = self.workspace + os.sep + "trained_model"
        self.log_dir = self.workspace + os.sep + "log"
        os.makedirs(self.workspace, exist_ok=True)
        os.makedirs(self.model_dir, exist_ok=True)
        os.makedirs(self.log_dir, exist_ok=True)
        self.num_encoder_tokens = num_encoder_tokens
        self.num_decoder_tokens = num_decoder_tokens
        self.latent_dim = latent_dim
        self.build_model()
    def build_model(self):
        # Encoder: returns the full sequence of hidden states (needed by the attention layer)
        encoder_inputs = Input(shape=(None, self.num_encoder_tokens), name='Encoder-Input-Layer')
        encoder = LSTM(self.latent_dim, return_sequences=True, return_state=True, name='Encoder-LSTM')
        encoder_outputs, state_h, state_c = encoder(encoder_inputs)
        # Keep the final encoder states to initialize the decoder
        encoder_states = [state_h, state_c]
        # Decoder: returns full output sequences; the internal states are not
        # used in the training model but are useful for inference
        decoder_inputs = Input(shape=(None, self.num_decoder_tokens), name='Decoder-Input-Layer')
        decoder_lstm = LSTM(self.latent_dim, return_sequences=True, return_state=True, name='Decoder-LSTM')
        decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
        # Attention layer: the decoder states act as the query and the encoder
        # states as the value, so the output has one vector per decoder step
        attention = Attention(name='Attention-Layer')
        attention_output = attention([decoder_outputs, encoder_outputs])
        # Dense layer producing a distribution over the output vocabulary
        decoder_dense = Dense(self.num_decoder_tokens, activation='softmax', name='Decoder-Dense')
        decoder_outputs = decoder_dense(attention_output)
        # Define the model
        self.model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    def fit_model(self, encoder_in_data, decoder_in_data, decoder_target_data, nepochs):
        batch_size = 64
        chkpt = ModelCheckpoint(filepath=self.model_dir + os.sep + "model.hdf5",
                                save_weights_only=True, monitor='val_loss', mode='min', save_best_only=True)
        tboard = TensorBoard(log_dir=self.log_dir)
        callbacks = [chkpt, tboard]
        optimizer = Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, decay=0.001)
        # Compile and train the model with the optimizer defined above
        self.model.compile(optimizer=optimizer, metrics=['accuracy'],
                           loss='categorical_crossentropy')
        hist = self.model.fit([encoder_in_data, decoder_in_data],
                              decoder_target_data, callbacks=callbacks, batch_size=batch_size,
                              epochs=nepochs, validation_split=0.2)
        return hist
if __name__ == "__main__":

    workspace = "tmp"
    os.makedirs(workspace, exist_ok=True)
    max_encoder_vec_len = 8
    max_decoder_vec_len = 8
    latent_dim = 300
    num_epochs = 40

    with open(input_data_path, 'r', encoding='utf-8') as f:
        lines = f.read().split('\n')

    # Each line of fra.txt is "English<TAB>French<TAB>..."
    input_texts = []
    output_texts = []
    for i in range(1, 20000):
        row = lines[i].split('\t')
        if row[0] not in input_texts:
            input_texts.append(row[0])
            output_texts.append("__START__ " + row[1] + " __STOP__")

    print("number of data points:", len(input_texts))
    # Build the vocabularies, save them and keep only reasonably frequent words
    [vocab_in, vocab_out] = build_vocabulary(input_texts, output_texts)
    vocab_in.to_csv(workspace + os.sep + "input_vocab.csv")
    vocab_out.to_csv(workspace + os.sep + "output_vocab.csv")
    vocab_in = vocab_in[vocab_in['count'] > 20]
    vocab_out = vocab_out[vocab_out['count'] > 20]

    num_encoder_tokens = len(vocab_in)
    num_decoder_tokens = len(vocab_out)
    print("Num encoder tokens", num_encoder_tokens)
    print("Num decoder tokens", num_decoder_tokens)

    M = LSTM_Attention(workspace, num_encoder_tokens, num_decoder_tokens, latent_dim)
    print(M.model.summary())
    plot_model(M.model)

    input_vecs = text2vec(input_texts, vocab_in, num_encoder_tokens, max_encoder_vec_len)
    output_vecs = text2vec(output_texts, vocab_out, num_decoder_tokens, max_decoder_vec_len)
    # One-hot encode the token-id sequences
    encoder_in_data = np.zeros((len(input_vecs), max_encoder_vec_len, num_encoder_tokens), dtype='float32')
    decoder_in_data = np.zeros((len(output_vecs), max_decoder_vec_len, num_decoder_tokens), dtype='float32')
    decoder_target_data = np.zeros((len(output_vecs), max_decoder_vec_len, num_decoder_tokens), dtype='float32')

    for i in range(0, len(input_vecs)):
        if (len(input_vecs[i]) > 0) and (len(output_vecs[i]) > 0):
            for j, token_id in enumerate(input_vecs[i]):
                encoder_in_data[i, j, token_id] = 1
            # Teacher forcing: the decoder input starts at __START__ and the
            # target is the same sequence shifted one step ahead
            for j, token_id in enumerate(output_vecs[i][:-1]):
                decoder_in_data[i, j, token_id] = 1
            for j, token_id in enumerate(output_vecs[i][1:]):
                decoder_target_data[i, j, token_id] = 1
    # Model summary and data shapes
    print(M.model.summary())
    print("encoder_in_data shape:", encoder_in_data.shape)
    print("decoder_in_data shape:", decoder_in_data.shape)
    print("decoder_target_data shape:", decoder_target_data.shape)
    print("num_epochs:", num_epochs)

    hist = M.fit_model(encoder_in_data, decoder_in_data, decoder_target_data, num_epochs)

    # Plot the training history
    fig, axs = plt.subplots(2, 1, figsize=(8, 6))
    axs[0].plot(hist.history['loss'], label='Training Loss')
    axs[0].plot(hist.history['val_loss'], label='Validation Loss')
    axs[0].legend()
    axs[1].plot(hist.history['accuracy'], label='Training Accuracy')
    axs[1].plot(hist.history['val_accuracy'], label='Validation Accuracy')
    axs[1].legend()
    plt.show()
The architecture of the model is as follows:
In order to run this code you will need to have the data file ‘fra.txt’ that is available at http://www.manythings.org/bilingual/
This is just the training component; you can use the inference component I have given here: https://prasad-jayanti.medium.com/sequence-to-sequence-model-2c9d0fc4808
Without tuning the hyper-parameters and cleaning the data we should not expect very good performance, but that is not the aim here.
This article is part of a series, so keep checking. If you find it useful, please comment, like & share.