What is Query, Key, Value (QKV) Attention?

Jayanti Prasad, Ph.D.
Oct 10, 2023

The famous paper “Attention Is All You Need” [1] states:

“An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.”

On a first reading, this definition looks like a sermon! The statement carries no reference, so there is no way to trace the definition back to its origin, and the authors did not feel the need to give any pseudo-code or motivation for how they arrived at it. The attention I describe here is well motivated and easy to understand, but the attention in the original paper is presented as a mystery, a point on which the paper has been very well criticized by “Formal Algorithms for Transformers” [2].

After going through many papers and tutorials and watching videos by experts, this is what I understood about QKV attention. If something does not fit well, please let me know.

  • Technically, any piece of text can be decomposed into a set of questions and their answers (you do not have to do this yourself; neural networks can do it!).
  • If we need an answer to a question (a query), we match it against the questions belonging to different pieces of text.
  • Once our query matches a question, we return that question’s answer as the response.

In the above discussion we can simply replace ‘question’ and ‘answer’ with ‘key’ and ‘value’ respectively.

Now here comes the concept of hard indexing: when the ‘query’ exactly matches a ‘key’, we call it hard indexing. What I will discuss here is ‘soft indexing’, where the ‘query’ only partially matches the ‘key’.

Reference: https://jalammar.github.io/illustrated-gpt2/
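
As a toy example of hard indexing (the dictionary and its entries below are my own illustration), an exact key lookup in a Python dictionary either succeeds or fails, with nothing in between:

```python
# Hard indexing: the query must match a stored key exactly,
# otherwise the lookup fails outright.
kv_store = {
    "what grows on trees?": "fruit",
    "what is edible?": "food",
}

print(kv_store["what grows on trees?"])   # exact match -> "fruit"
# kv_store["what grows on a tree?"]       # near-miss  -> KeyError
```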

Now let us come to our problem of text modeling: sequence-to-sequence tasks such as machine translation.

After going through an embedding layer (or any other vectorization scheme), all of our tokens are represented by fixed-size vectors (word vectors). This simply means that our query, key, and value are all real-valued vectors, so the match between a query and a key need not be perfect (0 or 1); it can be anything in between.

The most common matching score is something like the scalar (dot) product between the query and key vectors. This is exactly how we can motivate the case for attention, based on ‘soft indexing’.
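
Here is a minimal NumPy sketch of that idea (the vectors and their values are made up for illustration): every key gets a score from its dot product with the query, the scores are turned into weights between 0 and 1 with a softmax, and the returned ‘answer’ is the weighted average of all the values:

```python
import numpy as np

def soft_lookup(query, keys, values):
    """Soft indexing: return a mix of all the values, weighted by
    how well each key matches the query."""
    scores = keys @ query                            # one dot product per key
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> weights in (0, 1)
    return weights @ values                          # weighted average of the values

query  = np.array([1.0, 0.0])               # what we are asking about
keys   = np.array([[0.9, 0.1],              # stored "questions"
                   [0.1, 0.9]])
values = np.array([[1.0, 2.0],              # their "answers"
                   [3.0, 4.0]])

print(soft_lookup(query, keys, values))     # a blend dominated by the first value
```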

Look at the following algorithm:

[Figure: QKV attention]

Let us define the problem: compute the ‘attention’ for a token ‘B’ from the token ‘A’ in a sentence. Since both tokens are from the same sentence, we can call this ‘self-attention’.

In order to find self-attention, we can map token ‘B’ to a ‘query vector’ using the following equation:

Q = W_Q * e_B + b_Q

Note that in the above equation W_Q is a (weight) matrix, b_Q is a (bias) vector, and e_B is the embedding vector corresponding to token ‘B’.

Now we can create two vectors, a key (K) and a value (V), for the token ‘A’ using the following equations:

K = W_K * e_A + b_K

V = W_V * e_A + b_V
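
In code, these three maps are just affine transformations of the embedding vectors. A minimal sketch with randomly initialized parameters (in a real model W_Q, W_K, W_V and the biases are learned; the dimension 8 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                       # embedding dimension (arbitrary here)

# Randomly initialized parameters; in practice these are learned.
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
b_Q, b_K, b_V = (np.zeros(d) for _ in range(3))

e_A = rng.standard_normal(d)                # embedding of token 'A'
e_B = rng.standard_normal(d)                # embedding of token 'B'

Q = W_Q @ e_B + b_Q                         # query comes from token 'B'
K = W_K @ e_A + b_K                         # key and value come from token 'A'
V = W_V @ e_A + b_V
```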

Now we have three vectors, query (Q), key (K), and value (V), and we compute the attention function in the following way:

Attention(Q, K, V) = softmax((Q * K^T) / sqrt(d_k)) * V

where d_k is the dimensionality of the key vectors, Q * K^T is the scalar product between the query and the key that determines the strength of the attention, and the softmax converts these raw scores into weights that lie between 0 and 1 and sum to one. This is exactly the formula given in the paper [1].
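
Putting the pieces together for a whole sequence, here is a short NumPy sketch of scaled dot-product attention (Q, K, and V are now matrices with one row per token; the shapes are my own choice for illustration, but the formula is the one above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise query-key match strengths
    return softmax(scores) @ V          # attention-weighted mix of the values

# Toy example: 4 tokens, dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)         # (4, 8): one output vector per token
```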

The attention function discussed above somehow quantifies the dependency or relationship between any two tokens in a sentence. But two tokens can have more than one type of relationship (a mango and an apple are not just both edible; they both also grow on trees!). This logic motivates multi-head attention, which is at the heart of the transformer architecture on which many large language models are based.

Multi-head attention

We can have many attention heads (each with its own set of W matrices and b vectors), concatenate all of their outputs into one large vector, and then multiply the result by another learned matrix W^O to get the final attention output. Note that all of these matrices need to be learned during training.
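
A minimal sketch of this (the head count, dimensions, and weight names below are my own illustration; biases are dropped for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, W_O):
    """Run one attention head per (W_Q, W_K, W_V) triple, concatenate
    the head outputs, and project the result with W_O."""
    outputs = [attention(X @ W_Q, X @ W_K, X @ W_V)
               for W_Q, W_K, W_V in heads]
    return np.concatenate(outputs, axis=-1) @ W_O

rng = np.random.default_rng(0)
n, d, h, d_head = 4, 8, 2, 4                    # 4 tokens, 2 heads of size 4
heads = [tuple(rng.standard_normal((d, d_head)) for _ in range(3))
         for _ in range(h)]
W_O = rng.standard_normal((h * d_head, d))      # output projection, also learned
X = rng.standard_normal((n, d))                 # one embedding vector per token
print(multi_head_attention(X, heads, W_O).shape)  # (4, 8)
```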

Note that QKV attention can be computed either among the tokens of the encoder input (self-attention) or between the tokens of the decoder and those of the encoder (cross-attention), as the paper [1] describes.
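
With the attention function from the sketch above, the only difference between the two cases is where Q, K, and V come from (the shapes below are made up; this snippet continues from the previous one):

```python
# Continuing from the previous snippet (attention, rng, d = 8).
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
enc = rng.standard_normal((6, d))   # 6 encoder token vectors
dec = rng.standard_normal((3, d))   # 3 decoder token vectors

# Self-attention: Q, K, V all come from the same set of tokens.
self_out = attention(enc @ W_Q, enc @ W_K, enc @ W_V)    # shape (6, 8)

# Cross-attention: queries from the decoder, keys/values from the encoder.
cross_out = attention(dec @ W_Q, enc @ W_K, enc @ W_V)   # shape (3, 8)
```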

The scope of this article was just to explain QKV attention as it is used in transformer models. I will explain the full transformer model in another article.

If you find this article useful, please comment, like & share. Thanks!

References:

  1. Vaswani et al., “Attention Is All You Need”, https://arxiv.org/abs/1706.03762
  2. Phuong & Hutter, “Formal Algorithms for Transformers”, https://arxiv.org/abs/2207.09238
