Sequence to sequence model

Jayanti prasad Ph.D
Sep 27, 2023


The sequence-to-sequence model, or Seq2Seq for short, is one of the most important neural network architectures ever proposed. It is sometimes also called the Encoder-Decoder architecture, and I will use both names here. The Seq2Seq model can be considered an important stepping stone towards Large Language Models (LLMs), although most current LLMs are not based on Seq2Seq but on another mechanism, called 'Attention', which I will cover in a future article.

There are many outstanding articles and tutorials on the implementation and use of Seq2Seq, so here I will mainly focus on the core ideas of the model. I will, however, also give a full-fledged implementation in Keras/TensorFlow, which you can copy from here or get from my GitHub page. Before getting to the model itself, let me give some background.

In general, we have two kinds of data. In the first kind, the different data points are independent, or uncorrelated: for example, subject-wise marks of students in a spreadsheet, or prices of different items in a grocery store. In this type of data we are free to exchange the positions of data points in the table. In the other kind of data, ordering is extremely important. For example, in a picture the positions of the pixels matter: if you exchange pixels between one part of the picture and another, the picture gets distorted. The same is true for time series data, in which what happens today very much depends on what happened yesterday, and what happens today will affect what happens tomorrow. This association is particularly important in the case of text, where every word we use depends on its predecessors. The same holds for speech, in which the audio signals follow a definite order. Seq2Seq models cater to this type of data, particularly one-dimensional or sequential data in which the ordering matters.

Apart from this association, there is another property that makes sequential data different. In most datasets, different data points have the same size: in tabular data all the rows have the same number of columns, and in image data we can scale all the images to the same size without losing much information. This is not the case for textual data. For example, the sentences in a paragraph are not of the same length, and we cannot truncate sentences without distorting their meaning and syntactic structure. Here Seq2Seq models come to the rescue. Since the Seq2Seq models I plan to discuss here are based on Recurrent Neural Networks, let us first look at some important properties of Recurrent Neural Networks (RNNs).

Recurrent Neural Network (RNN)

Artificial Neural Networks (ANNs), or Neural Networks (NNs) for short, are known as universal approximators because they can represent any kind of nonlinear mapping from input data to output data. Note that supervised machine learning is all about learning a mapping from input to output data, for example from an image to its label. It is no surprise that neural networks rule the world of machine learning, in particular neural networks with many layers of stacked neurons. Machine learning based on neural networks with many layers of neurons is called "deep learning".

The fundamental building blocks of neural networks are "neurons", which are inspired by the neurons in the human brain. The main properties of neurons that we need to know are as follows:

  • Every neuron has multiple inputs and one output.
  • A neuron creates its output by multiplying the inputs by weights, summing them, and passing the sum through a non-linear function called an activation function (see the sketch after this list).
  • In a stacked-layer network (deep learning), apart from the input and output layers, every layer takes its input from the layer to its left and passes its output to the layer to its right.
  • The input and output layers sit at the left-most and right-most sides respectively.
  • We keep the number of neurons in the input and output layers equal to the dimensionality of the input and output data.
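To make this concrete, here is a minimal sketch of a single neuron in NumPy (the input values, weights and bias are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single neuron: weighted sum of the inputs plus a bias,
# passed through a non-linear activation function.
x = np.array([0.5, -1.2, 3.0])   # inputs (hypothetical values)
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.05                         # bias

output = sigmoid(np.dot(w, x) + b)
print(output)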

Once we have a network and input and output data, we can start training the model using an approach called back-propagation, which works in the following way (a minimal sketch of this loop is given after the list):

  1. Set all the weights in the neural network to some random / guess values.
  2. Pass the input data to the input layer and compute the output for the existing weights.
  3. Find the mismatch between the actual output and the predicted output, i.e., the loss.
  4. Compute the gradients of the loss with respect to all the weights used in the network (you do not need to write code for this since it is taken care of by machine learning libraries such as Keras).
  5. Once the gradients are computed, update the weights and go back to step (2) until convergence is reached.
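To make the loop concrete, here is a toy sketch of my own (not part of the article's pipeline) that fits a single weight by gradient descent; in Keras, steps 2 to 5 are handled for you by model.compile() and model.fit():

import numpy as np

# Toy example: fit y = 2x with a single weight w.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = np.random.randn()   # step 1: random initial weight
lr = 0.01               # learning rate

for epoch in range(200):
    y_pred = w * x                              # step 2: forward pass
    loss = np.mean((y - y_pred) ** 2)           # step 3: mean-squared-error loss
    grad = -2.0 * np.mean((y - y_pred) * x)     # step 4: gradient of the loss w.r.t. w
    w = w - lr * grad                           # step 5: update the weight and repeat

print(w)   # should be close to 2.0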

Common neurons do not have any memory, so for them it does not matter in what order we pass the data points. In fact, in many cases we shuffle the data points before passing them to the neural network.

In order to keep track of the order, for example in sequential data, a very special type of neuron has been proposed which keeps track of the ordering of the data by employing a special variable called the hidden state 'h'. The value of 'h' changes every time a data point passes through the unit. Neural networks based on this special type of neuron are called Recurrent Neural Networks (RNNs), and they have a wide range of applications in different areas of science and engineering, including text, speech and time-series modelling.

Figure 1: An RNN unit updates the memory state after every data point that passes through it.

Note that there are two ways in which we draw RNN units: a rolled one, as shown on the left of Figure 1, and an unrolled one, as shown on the right of Figure 1. There is a maximum sequence length that controls how many "time" steps we have, i.e., how many times the memory state 'h' is updated. We pass one token at a time to the RNN, and the hidden state produced at the first step is combined with the second token, after multiplying them by the weight matrices 'U' and 'W' respectively.

The hidden state is updated in the following way:

h[t] = f(W * X[t] + U * h[t-1] + b)

We can easily get the output from the hidden state in the following way:

y'[t] = g(h[t])

In the above equations W and U are weight matrices and 'b' is the bias. If the dimensionality of the input data 'x' is N and that of the memory state is M, then the dimensionality of 'W' is M*N, that of 'U' is M*M and that of 'b' is M. Note that 'f' and 'g' are non-linear activation functions, generally a 'sigmoid' or a 'tanh'.
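As a minimal sketch (with made-up dimensions, not the article's code), this update rule can be written as a small NumPy function:

import numpy as np

N, M = 8, 16                      # input and hidden-state dimensionality (hypothetical)
W = np.random.randn(M, N) * 0.1   # input-to-hidden weights
U = np.random.randn(M, M) * 0.1   # hidden-to-hidden weights
b = np.zeros(M)                   # bias

def rnn_step(x_t, h_prev):
    # h[t] = f(W * X[t] + U * h[t-1] + b), with f = tanh
    return np.tanh(W @ x_t + U @ h_prev + b)

# Run a toy sequence of 5 time steps through the unit.
h = np.zeros(M)
for x_t in np.random.randn(5, N):
    h = rnn_step(x_t, h)
print(h.shape)   # (16,)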

The loss function 'L' depends on the actual output y[t] and the predicted output y'[t]:

Loss = CE(y[t], y'[t])

where Cross-Entropy (CE) is a common choice of loss function.

Note that every time we give an input the output changes, and we need to keep track of all those changes in terms of the gradients of the loss function. The change in the loss function with respect to the weights needs to be propagated back to earlier and earlier tokens in the sequence. This is called Back-Propagation Through Time (BPTT). What happens in common RNN cells is that the gradients become smaller and smaller for earlier and earlier tokens, and eventually vanish. This stops RNN units from remembering the earlier tokens. In order to fix this problem a new type of RNN unit was proposed, the Long Short-Term Memory (LSTM) unit, which controls the flow of information by employing a special kind of 'gates', as I will discuss below.
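A rough way to see the vanishing-gradient problem (a toy illustration, not an actual BPTT computation) is to note that the gradient reaching a token k steps in the past contains a product of k factors; if each factor is smaller than one in magnitude, the product decays exponentially:

# Each backward step through time multiplies the gradient by a factor
# (roughly the recurrent weight times the activation derivative).
# If that factor is, say, 0.9 in magnitude, the gradient reaching a token
# k steps back shrinks like 0.9**k.
factor = 0.9
for k in [1, 5, 10, 50, 100]:
    print(k, factor ** k)
# 100 steps back the gradient is ~2.7e-05 of its original size,
# so the network effectively stops learning from distant tokens.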

Long Short-Term Memory (LSTM)

In order to address the issue of vanishing gradients, a new type of RNN cell was proposed: the Long Short-Term Memory (LSTM) cell. It employs a set of gates to control the flow of information, which makes sure that some information is always propagated from the farthest part of the sequence. Apart from the hidden state 'h', LSTM units have one extra state 'c', called the cell state.

Figure 2: An LSTM Cell

In order to understand the working of the LSTM cells we need to define a set of variables.

X[t] : Input vector at time ‘t’ of dimensionality N

h[t] : Hidden state at ‘t’ of dimensionality M

c[t] : Cell state

f[t] : Forget gate

i[t] : Input gate

~C[t] : Cell gate

o[t]: Output gate

Note that all four gates f, i, ~C and o take two inputs, X[t] and h[t-1]. The output of the ~C gate lies in the range [-1, 1] and the outputs of the other gates lie in the range [0, 1]. The states and the output are updated in the following way:

f[t] = sigmoid(W_f * X[t] + U_f * h[t-1] + b_f)

i[t] = sigmoid(W_i * X[t] + U_i * h[t-1] + b_i)

~C[t] = tanh(W_c * X[t] + U_c * h[t-1] + b_c)

o[t] = sigmoid(W_o * X[t] + U_o * h[t-1] + b_o)

Once we have the gates we can update the cell state and the hidden state in the following way:

c[t] = i[t] * ~C[t] + f[t] * c[t-1]

h[t] = o[t] * tanh(c[t])

Note that here '*' represents element-wise (Hadamard) multiplication.
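Written exactly as above, one LSTM step fits in a short NumPy function; this is an illustrative sketch with arbitrary dimensions rather than production code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N, M = 8, 16   # input and state dimensionality (hypothetical)

# Four input weight matrices, four recurrent matrices and four biases.
Wf, Wi, Wc, Wo = [np.random.randn(M, N) * 0.1 for _ in range(4)]
Uf, Ui, Uc, Uo = [np.random.randn(M, M) * 0.1 for _ in range(4)]
bf, bi, bc, bo = [np.zeros(M) for _ in range(4)]

def lstm_step(x_t, h_prev, c_prev):
    f = sigmoid(Wf @ x_t + Uf @ h_prev + bf)         # forget gate
    i = sigmoid(Wi @ x_t + Ui @ h_prev + bi)         # input gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev + bc)   # cell gate
    o = sigmoid(Wo @ x_t + Uo @ h_prev + bo)         # output gate
    c = i * c_tilde + f * c_prev                     # new cell state
    h = o * np.tanh(c)                               # new hidden state
    return h, c

h, c = np.zeros(M), np.zeros(M)
for x_t in np.random.randn(5, N):   # a toy sequence of 5 steps
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)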

The unit thus has many parameters, which can be counted as follows:

4 W matrices : 4 * M * N parameters

4 U matrices : 4 * M * M parameters

4 b vectors : 4 * M parameters

So the total number of parameters = 4 * M * (N + M + 1)
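You can verify this count directly in Keras; a quick check (with arbitrary values of N and M) looks like this:

from tensorflow import keras

N, M = 8, 16   # input dimensionality and number of LSTM units (hypothetical)
model = keras.Sequential([keras.Input(shape=(None, N)), keras.layers.LSTM(M)])
print(model.count_params())    # 1600
print(4 * M * (N + M + 1))     # 1600, matching the formula above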

The main action of the LSTM unit is to keep updating the cell state and the hidden state as tokens are passed to it. We can either keep track of every update or just take the cell and hidden states after the entire sequence has been passed.

Note that the LSTM unit takes 'X', a fixed-dimensional vector, and not a token, so it is also necessary to vectorize the words before they are passed to the LSTM one by one. Thankfully, this is quite easy to achieve with one extra layer, called an 'Embedding' layer, which does the job but has its own parameters that also get trained during training. In the simple sequence-to-sequence model we use the value of the hidden state 'h', after the whole sequence has been passed, as the context vector.
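As a small sketch (the vocabulary size and dimensions are made up), an Embedding layer simply maps integer token ids to dense trainable vectors before they reach the LSTM:

from tensorflow import keras

vocab_size, embed_dim, latent_dim = 1000, 64, 128   # hypothetical sizes

model = keras.Sequential([
    keras.Input(shape=(None,)),                      # a sequence of integer token ids
    keras.layers.Embedding(vocab_size, embed_dim),   # ids -> trainable dense vectors
    keras.layers.LSTM(latent_dim),                   # final hidden state = context vector
])
model.summary()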

Sequence to Sequence Model

This is a very simple model with two main components, an encoder and a decoder, as shown in the following figure.

Figure 3: Sequence to Sequence Model

This model is trained on a set of sequence pairs: source and target language sentences in the case of translation. Before passing the input and output sequences to the model, they are properly tokenized and vectorized (check the code at the end). We also pad the input sequences to a fixed size and wrap the target sequences with two special tokens, one at the beginning ("__start__") representing the start of the sequence and one at the end ("__end__") representing the end of the sentence. Training takes place in the following way:

The vectorized sequence of tokens is passed to the encoder one token at a time, and the cell or memory state is updated after every token. Once the input sequence is finished, we use the state after the last input token to initialise the memory state of the decoder and start producing the output sequence one token at a time from the decoder.

Note that at the first step we do not have any previous output for the decoder, so we start with the start token and produce the first output token using the memory state handed over by the encoder. For each subsequent token the decoder again needs an input token and the updated memory state. However, when producing the third token we do not feed back the second token the decoder produced; instead we use the actual second token of the target sequence as input. This process is called "teacher forcing".
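In practice teacher forcing amounts to shifting the target sequence by one position, which is what the training code below does when it builds decoder_input_data and decoder_target_data. A tiny illustration with made-up tokens:

# Toy example of teacher forcing: the decoder input is the target sequence
# delayed by one step, so at every step the "previous token" is the true one.
target = ["__start__", "je", "suis", "la", "__end__"]

decoder_input = target[:-1]    # ['__start__', 'je', 'suis', 'la']
decoder_target = target[1:]    # ['je', 'suis', 'la', '__end__']

for inp, out in zip(decoder_input, decoder_target):
    print(f"feed '{inp}' -> predict '{out}'")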

Once we have produced the full output sequence for one input sequence, we move to the next pair, and training continues until all the training data has been used.

Once the training is completed, the inference part comes, for which we use the pre-trained encoder and decoder.

For inference we also use the decoder to produce the output tokens, but since in this case we do not have the actual output tokens, no teacher forcing is done. The process is as follows:

Take the test sequence, tokenize and vectorise it using the pre-trained embedding, and get a cell state from the pre-trained encoder. Use that state to produce tokens from the decoder one at a time, and stop once the 'stop' token appears or the length of the sequence crosses a pre-set limit.

At the end you are able to produce, for a given input sequence, an output sequence which could be a translation, the answer to a question, or anything else.

Below is a full implementation of the code in Keras/TensorFlow.

Note that the code is written in a modular and structured way so that it is easy to follow. I generally use the following four components in any machine learning pipeline:

  1. Data reader
  2. Data pre-processor
  3. Training
  4. Inference

I avoid hard-coding numbers in the code and prefer to use an 'ini' file. So I highly advise you to familiarize yourself with argparse, configparser and ini files.
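For example, a minimal sketch of reading hyper-parameters from a hypothetical config.ini file with configparser (the section and key names are made up) could look like this:

import configparser

# config.ini might contain:
# [training]
# batch_size = 100
# epochs = 100
# latent_dim = 60

config = configparser.ConfigParser()
config.read("config.ini")

batch_size = config.getint("training", "batch_size")
epochs = config.getint("training", "epochs")
latent_dim = config.getint("training", "latent_dim")
print(batch_size, epochs, latent_dim)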

Code

Training notebook:

First import all the required libraries. In the next block I have shown the functions that are used to read the input data and build the vocabulary. Once we are sure that the data has been read and processed properly, the actual model is given in the block after that, followed by the main program. You can just copy and paste these programs into your notebook (make sure you have got the data and the path has been set properly).

import os
import sys
import shutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard

def get_data():
    num_samples = 10000
    # Vectorize the data.
    data_path = r"C:\Users\jayanti.prasad\Data\NLP_DATA\seq2seq_data\fra.txt"
    input_texts = []
    target_texts = []
    input_characters = set()
    target_characters = set()

    with open(data_path, "r", encoding="utf-8") as f:
        lines = f.read().split("\n")

    for line in lines[: min(num_samples, len(lines) - 1)]:
        raw_line = line.split("\t")
        input_text, target_text = raw_line[0], raw_line[1]
        # We use "tab" as the "start sequence" character
        # for the targets, and "\n" as "end sequence" character.
        target_text = "\t" + target_text + "\n"
        input_texts.append(input_text)
        target_texts.append(target_text)
        for char in input_text:
            if char not in input_characters:
                input_characters.add(char)
        for char in target_text:
            if char not in target_characters:
                target_characters.add(char)
    return input_texts, target_texts, input_characters, target_characters


def get_params(input_texts, target_texts, input_characters, target_characters):

    input_characters = sorted(list(input_characters))
    target_characters = sorted(list(target_characters))
    num_encoder_tokens = len(input_characters)
    num_decoder_tokens = len(target_characters)
    max_encoder_seq_length = max([len(txt) for txt in input_texts])
    max_decoder_seq_length = max([len(txt) for txt in target_texts])

    print("Number of samples:", len(input_texts))
    print("Number of unique input tokens:", num_encoder_tokens)
    print("Number of unique output tokens:", num_decoder_tokens)
    print("Max sequence length for inputs:", max_encoder_seq_length)
    print("Max sequence length for outputs:", max_decoder_seq_length)

    input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
    target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

    df_config = pd.DataFrame(columns=['key', 'value'])
    df_config['key'] = ['num_encoder_tokens', 'num_decoder_tokens', 'max_encoder_seq_length', 'max_decoder_seq_length']
    df_config['value'] = [num_encoder_tokens, num_decoder_tokens, max_encoder_seq_length, max_decoder_seq_length]

    df_inp_vocab = pd.DataFrame(columns=['token'])
    df_inp_vocab['token'] = list(input_token_index.keys())

    df_out_vocab = pd.DataFrame(columns=['token'])
    df_out_vocab['token'] = list(target_token_index.keys())

    return df_config, df_inp_vocab, df_out_vocab


class Seq2Seq_Train:
    def __init__(self, workspace_dir, df_config):

        self.workspace = workspace_dir
        self.model_dir = self.workspace + os.sep + "trained_model"
        self.log_dir = self.workspace + os.sep + "log"

        if os.path.exists(self.log_dir) and os.path.isdir(self.log_dir):
            shutil.rmtree(self.log_dir)

        os.makedirs(self.workspace, exist_ok=True)
        os.makedirs(self.model_dir, exist_ok=True)
        os.makedirs(self.log_dir, exist_ok=True)

        P = dict(zip(df_config['key'].to_list(), df_config['value'].to_list()))

        self.num_encoder_tokens = P['num_encoder_tokens']
        self.num_decoder_tokens = P['num_decoder_tokens']
        self.latent_dim = P['latent_dim']

        self.max_encoder_seq_length = P['max_encoder_seq_length']
        self.max_decoder_seq_length = P['max_decoder_seq_length']

        self.build_model()

    def build_model(self):

        # Define an input sequence and process it.
        encoder_inputs = keras.Input(shape=(None, self.num_encoder_tokens), name='Encoder-Input')
        encoder = keras.layers.LSTM(self.latent_dim, return_state=True, name='Encoder-LSTM')
        encoder_outputs, state_h, state_c = encoder(encoder_inputs)

        # We discard `encoder_outputs` and only keep the states.
        encoder_states = [state_h, state_c]

        # Set up the decoder, using `encoder_states` as initial state.
        decoder_inputs = keras.Input(shape=(None, self.num_decoder_tokens), name='Decoder-Input')

        # We set up our decoder to return full output sequences,
        # and to return internal states as well. We don't use the
        # return states in the training model, but we will use them in inference.
        decoder_lstm = keras.layers.LSTM(self.latent_dim, return_sequences=True, return_state=True, name='Decoder-LSTM')
        decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
        decoder_dense = keras.layers.Dense(self.num_decoder_tokens, activation="softmax", name='Decoder-Dense')
        decoder_outputs = decoder_dense(decoder_outputs)

        # Define the model that will turn
        # `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
        self.model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs, name='Sequence-to-Sequence-Model')

    def fit_model(self, input_texts, target_texts, input_token_index, target_token_index, epochs, batch_size):

        encoder_input_data = np.zeros((len(input_texts), self.max_encoder_seq_length, self.num_encoder_tokens), dtype="float32")
        decoder_input_data = np.zeros((len(input_texts), self.max_decoder_seq_length, self.num_decoder_tokens), dtype="float32")
        decoder_target_data = np.zeros((len(input_texts), self.max_decoder_seq_length, self.num_decoder_tokens), dtype="float32")

        for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
            for t, char in enumerate(input_text):
                encoder_input_data[i, t, input_token_index[char]] = 1.0
            encoder_input_data[i, t + 1:, input_token_index[" "]] = 1.0
            for t, char in enumerate(target_text):
                # decoder_target_data is ahead of decoder_input_data by one timestep
                decoder_input_data[i, t, target_token_index[char]] = 1.0
                if t > 0:
                    # decoder_target_data will be ahead by one timestep
                    # and will not include the start character.
                    decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
            decoder_input_data[i, t + 1:, target_token_index[" "]] = 1.0
            decoder_target_data[i, t:, target_token_index[" "]] = 1.0

        # Run training
        chkpt = ModelCheckpoint(filepath=self.model_dir + os.sep + "model.hdf5",
                                save_weights_only=False, monitor='val_loss', mode='min', save_best_only=True)

        tboard = TensorBoard(log_dir=self.log_dir)
        callbacks = [chkpt, tboard]

        optimizer = keras.optimizers.Adam(learning_rate=0.01)
        self.model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=["accuracy"])
        history = self.model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
                                 batch_size=batch_size, epochs=epochs, callbacks=callbacks, validation_split=0.2)
        return history


if __name__ == "__main__":

    workspace_dir = "tmp"
    batch_size = 100    # Batch size for training.
    epochs = 100        # Number of epochs to train for.
    latent_dim = 60     # Latent dimensionality of the encoding space.

    # Make sure the workspace exists before writing the CSV files below.
    os.makedirs(workspace_dir, exist_ok=True)

    input_texts, target_texts, input_characters, target_characters = get_data()

    # Sort the character sets so that these indices match the vocabularies
    # written by get_params (which sorts them internally).
    input_characters = sorted(list(input_characters))
    target_characters = sorted(list(target_characters))

    input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
    target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

    df_config, df_inp_vocab, df_out_vocab = get_params(input_texts, target_texts, input_characters, target_characters)
    df_config.loc[len(df_config)] = ['latent_dim', latent_dim]

    df_inp_vocab.to_csv(workspace_dir + os.sep + "vocab_in.csv")
    df_out_vocab.to_csv(workspace_dir + os.sep + "vocab_out.csv")
    df_config.to_csv(workspace_dir + os.sep + "params.csv")

    M = Seq2Seq_Train(workspace_dir, df_config)
    print(M.model.summary())

    hist = M.fit_model(input_texts, target_texts, input_token_index, target_token_index, epochs, batch_size)

    fig, axs = plt.subplots(2, 1, figsize=(10, 8))
    axs[0].plot(hist.history['loss'], label='Training Loss')
    axs[0].plot(hist.history['val_loss'], label='Validation Loss')
    axs[1].plot(hist.history['accuracy'], label='Training Accuracy')
    axs[1].plot(hist.history['val_accuracy'], label='Validation Accuracy')
    axs[0].legend()
    axs[1].legend()
    plt.show()

If everything works fine, then when you run the notebook you should get something like the following for the model architecture.

Figure 4: Model Architecture

The main program also produces plots for the loss and accuracy as shown below.

Figure 5: Loss
Figure 6: Accuracy

Note that the performance metrics (loss and accuracy) are not great, since this is a small model trained on a small dataset and the purpose here is to explain the model. Once the training is done, the model checkpointed to the model directory and the vocabulary data frames written, we can do the inference, for which the code is given below.

Inference

The first block shows the inference module and the second block the main program for the same.


import os
import pandas as pd
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input
import numpy as np

def get_encoder_decoder(model_path):
    model = load_model(model_path)

    encoder_inputs = model.input[0]   # Encoder-Input
    encoder_lstm = model.get_layer('Encoder-LSTM')
    latent_dim = encoder_lstm.output_shape[-1][1]
    encoder_outputs, state_h_enc, state_c_enc = encoder_lstm.output

    encoder_states = [state_h_enc, state_c_enc]
    encoder_model = Model(encoder_inputs, encoder_states, name='Encoder-Model')

    decoder_inputs = model.input[1]   # Decoder-Input
    decoder_state_input_h = Input(shape=(latent_dim,))
    decoder_state_input_c = Input(shape=(latent_dim,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

    decoder_lstm = model.get_layer('Decoder-LSTM')
    decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)

    decoder_states = [state_h_dec, state_c_dec]
    decoder_dense = model.get_layer('Decoder-Dense')
    decoder_outputs = decoder_dense(decoder_outputs)

    decoder_model = Model([decoder_inputs] + decoder_states_inputs,
                          [decoder_outputs] + decoder_states, name='Decoder-Model')
    return encoder_model, decoder_model


class Decoder:
    def __init__(self, encoder_model, decoder_model, input_token2id, target_token2id, P):

        self.encoder_model = encoder_model
        self.decoder_model = decoder_model
        self.latent_dim = P['latent_dim']
        self.num_decoder_tokens = P['num_decoder_tokens']
        self.max_decoder_seq_length = P['max_decoder_seq_length']

        self.input_token2id = input_token2id
        self.target_token2id = target_token2id

        self.input_id2token = dict((i, char) for char, i in self.input_token2id.items())
        self.target_id2token = dict((i, char) for char, i in self.target_token2id.items())

    def decode_sequence(self, input_seq):
        # Encode the input as state vectors.
        states_value = self.encoder_model.predict(input_seq)

        # Generate empty target sequence of length 1.
        target_seq = np.zeros((1, 1, self.num_decoder_tokens))
        # Populate the first character of target sequence with the start character.
        target_seq[0, 0, self.target_token2id["\t"]] = 1.0

        # Sampling loop for a batch of sequences
        # (to simplify, here we assume a batch of size 1).
        stop_condition = False
        decoded_sentence = ""
        while not stop_condition:
            output_tokens, h, c = self.decoder_model.predict([target_seq] + states_value)

            # Sample a token
            sampled_token_index = np.argmax(output_tokens[0, -1, :])
            sampled_char = self.target_id2token[sampled_token_index]
            decoded_sentence += sampled_char

            # Exit condition: either hit max length
            # or find the stop character.
            if sampled_char == "\n" or len(decoded_sentence) > self.max_decoder_seq_length:
                stop_condition = True

            # Update the target sequence (of length 1).
            target_seq = np.zeros((1, 1, self.num_decoder_tokens))
            target_seq[0, 0, sampled_token_index] = 1.0

            # Update states
            states_value = [h, c]
        return decoded_sentence


if __name__ == "__main__":

    workspace_dir = r"C:\Users\jayanti.prasad\Projects-Dev\Seq2Seq\seq2seq_char1\tmp"
    model_path = r"C:\Users\jayanti.prasad\Projects-Dev\Seq2Seq\seq2seq_char1\tmp\trained_model\model.hdf5"

    df_config = pd.read_csv(workspace_dir + os.sep + "params.csv")
    df_vocab_in = pd.read_csv(workspace_dir + os.sep + "vocab_in.csv", encoding='utf-8')
    df_vocab_out = pd.read_csv(workspace_dir + os.sep + "vocab_out.csv", encoding='utf-8')

    P = dict(zip(df_config['key'].to_list(), df_config['value'].to_list()))

    inp_tokens = df_vocab_in['token'].to_list()
    out_tokens = df_vocab_out['token'].to_list()

    input_token2id = dict(zip(inp_tokens, range(len(inp_tokens))))
    target_token2id = dict(zip(out_tokens, range(len(out_tokens))))

    encoder_model, decoder_model = get_encoder_decoder(model_path)

    print(encoder_model.summary())
    print(decoder_model.summary())

    # Get some input data for decoding.
    data_path = r"C:\Users\jayanti.prasad\Data\NLP_DATA\seq2seq_data\fra.txt"
    with open(data_path, "r", encoding="utf-8") as f:
        lines = f.read().split("\n")

    input_texts = []
    target_texts = []
    num_samples = 2000

    for line in lines[: min(num_samples, len(lines) - 1)]:
        raw_line = line.split("\t")
        input_text, target_text = raw_line[0], raw_line[1]
        target_text = "\t" + target_text + "\n"
        input_texts.append(input_text)
        target_texts.append(target_text)

    encoder_input_data = np.zeros((len(input_texts), P['max_encoder_seq_length'], P['num_encoder_tokens']), dtype="float32")

    for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
        for t, char in enumerate(input_text):
            encoder_input_data[i, t, input_token2id[char]] = 1.0
        encoder_input_data[i, t + 1:, input_token2id[" "]] = 1.0

    D = Decoder(encoder_model, decoder_model, input_token2id, target_token2id, P)

    for seq_index in range(20):
        # Take one sequence (part of the training set)
        # for trying out decoding.
        input_seq = encoder_input_data[seq_index : seq_index + 1]
        decoded_sentence = D.decode_sequence(input_seq)
        print("-")
        print("Input sentence:", input_texts[seq_index])
        print("Decoded sentence:", decoded_sentence)

Note that the model presented here works on sequences of characters and not of words, which is not very practical; however, I will link the code for a word-level sequence-to-sequence model in the future.

Please like, comment and share this article if you have found it useful and keep checking since the article will be updated.

References

1. Sepp Hochreiter and Jürgen Schmidhuber (1995), Long Short-Term Memory

2. Sepp Hochreiter and Jürgen Schmidhuber (1997), Long Short-Term Memory

3. Daniel Jurafsky & James H. Martin (2023), Speech and Language Processing [Chapter 7: Neural Networks and Neural Language Models]

4. Daniel Jurafsky & James H. Martin (2023), Speech and Language Processing [Chapter 8: Sequence Labeling for Parts of Speech and Named Entities]

5. Daniel Jurafsky & James H. Martin (2023), Speech and Language Processing [Chapter 9: RNNs and LSTMs]

6. A Ten-Minute Introduction to Sequence-to-Sequence Learning in Keras

7. Understanding LSTM Networks
