In this post we will learn how to translate text from one language to another using Neural Machine Translation (NMT) with attention, implemented in Python with Keras and TensorFlow eager execution.
Let's start with some theory.
What is Neural Machine Translation?
A neural machine translation system is a neural network that directly models the conditional probability p(y|x) of translating a source sentence x1, . . . , xn into a target sentence y1, . . . , ym.
Figure: a stacking recurrent architecture for translating a source sequence A B C D into a target sequence X Y Z; here, <eos> marks the end of a sentence.
The decoder emits one target word at a time, as illustrated in the figure above. NMT is often a large neural network that is trained in an end-to-end fashion and generalizes well to very long word sequences. This means the model does not have to explicitly store gigantic phrase tables and language models as in standard phrase-based MT; hence, NMT has a small memory footprint.
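Concretely (this is the standard formulation, not something specific to the code in this post), that probability is factorized one target word at a time: p(y|x) = p(y1|x) · p(y2|y1, x) · … · p(ym|y1, …, ym-1, x), so the decoder predicts each word conditioned on the words generated so far and on the source sentence.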
Attention model
Attention-based models are classified into two broad categories, global and local. Common to these two types of models is the fact that at each time step t in the decoding phase, both approaches first take as input the hidden state ht at the top layer of a stacking LSTM. The goal is then to derive a context vector ct that captures relevant source-side information to help predict the current target word yt. While these models differ in how the context vector ct is derived, they share the same subsequent steps.
1. Global attention model
The idea of a global attentional model is to consider all the hidden states of the encoder when deriving the context vector ct.
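To make the idea of a context vector concrete, here is a minimal NumPy sketch of global attention with a simple dot-product score. It is purely illustrative and is not the scoring function used later in this post (the model below uses an additive score, tanh(FC(EO) + FC(H))).

import numpy as np

def global_attention(enc_states, dec_state):
    # enc_states: (src_len, hidden) - one encoder hidden state per source word
    # dec_state: (hidden,) - current decoder hidden state h_t
    scores = enc_states @ dec_state          # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over source positions -> attention weights
    context = weights @ enc_states           # weighted sum of encoder states -> context vector c_t
    return context, weights

# toy example: 4 source positions, hidden size 3
context, weights = global_attention(np.random.rand(4, 3), np.random.rand(3))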
2. Local attention model
Global attention has a drawback: it has to attend to all words on the source side for each target word, which is expensive and can potentially make it impractical to translate longer sequences, e.g., paragraphs or documents. To address this deficiency, the authors propose a local attentional mechanism that chooses to focus only on a small subset of the source positions per target word.
Prerequisites -
- Python installed
- Python libraries (NumPy, TensorFlow, scikit-learn, Matplotlib) installed (for step-by-step help, go to the 'Setting up the environment' section of this post)
So let's start with the code.
1- Importing important libraries
from __future__ import absolute_import, division, print_function
# Import TensorFlow >= 1.10 and enable eager execution
import tensorflow as tf
tf.enable_eager_execution()
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import unicodedata
import re
import numpy as np
import os
import time
print(tf.__version__)  # check the TensorFlow version
2- Download and prepare the dataset
We’ll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs in the format:
May I borrow this book? ¿Puedo tomar prestado este libro?
There are a variety of languages available, but we'll use the English-Spanish dataset. You can also download your own copy. After downloading the dataset, here are the steps we'll take to prepare the data (an example of the result follows the list):
- Add a start and end token to each sentence.
- Clean the sentences by removing special characters.
- Create a word index and reverse word index (dictionaries mapping from word → id and id → word).
- Pad each sentence to a maximum length.
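For example, assuming <start> and <end> are used as the sentence markers (as in the preprocessing code below), the pair shown above would end up roughly as:

<start> may i borrow this book ? <end>
<start> ¿ puedo tomar prestado este libro ? <end>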
# Download the file
path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://download.tensorflow.org/data/spa-eng.zip', extract=True)
path_to_file = os.path.dirname(path_to_zip) + "/spa-eng/spa.txt"
3- Some preprocessing on the dataset
Preprocessing includes
- Converting the unicode file to ascii
- Creating a space between a word and the punctuation following it, e.g.: "he is a boy." => "he is a boy ."
- Replacing everything except (a-z, A-Z, ".", "?", "!", ",", "¿") with a space
- Adding a start and an end token to each sentence so that the model knows when to start and stop predicting.
- Removing the accents
- Cleaning the sentences
- Return word pairs in the format: [ENGLISH, SPANISH]
- Creating a word -> index mapping (e.g., "dad" -> 5) and vice versa (e.g., 5 -> "dad") for each language.
(Everything else is explained in the code comments.)
# Converts the unicode file to ascii
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())

    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ."
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)

    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",", "¿")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

    w = w.rstrip().strip()

    # adding a start and an end token to the sentence
    # so that the model knows when to start and stop predicting
    w = '<start> ' + w + ' <end>'
    return w
# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]
def create_dataset(path, num_examples):
    lines = open(path, encoding='UTF-8').read().strip().split('\n')
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]
    return word_pairs
# This class creates a word -> index mapping (e.g., "dad" -> 5) and vice versa
# (e.g., 5 -> "dad") for each language.
class LanguageIndex():
    def __init__(self, lang):
        self.lang = lang
        self.word2idx = {}
        self.idx2word = {}
        self.vocab = set()
        self.create_index()

    def create_index(self):
        for phrase in self.lang:
            self.vocab.update(phrase.split(' '))
        self.vocab = sorted(self.vocab)

        self.word2idx['<pad>'] = 0
        for index, word in enumerate(self.vocab):
            self.word2idx[word] = index + 1
        for word, index in self.word2idx.items():
            self.idx2word[index] = word
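As a quick illustration (not needed for the pipeline), indexing two toy sentences gives:

toy = LanguageIndex(["<start> hola <end>", "<start> adios <end>"])
print(toy.word2idx)     # {'<pad>': 0, '<end>': 1, '<start>': 2, 'adios': 3, 'hola': 4}
print(toy.idx2word[4])  # hola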
def max_length(tensor):
    return max(len(t) for t in tensor)
def load_dataset(path, num_examples):
    # creating cleaned input, output pairs
    pairs = create_dataset(path, num_examples)

    # index language using the class defined above
    inp_lang = LanguageIndex(sp for en, sp in pairs)
    targ_lang = LanguageIndex(en for en, sp in pairs)

    # Vectorize the input and target languages

    # Spanish sentences
    input_tensor = [[inp_lang.word2idx[s] for s in sp.split(' ')] for en, sp in pairs]

    # English sentences
    target_tensor = [[targ_lang.word2idx[s] for s in en.split(' ')] for en, sp in pairs]

    # Calculate max_length of input and output tensor
    # Here, we'll set those to the longest sentence in the dataset
    max_length_inp, max_length_tar = max_length(input_tensor), max_length(target_tensor)

    # Padding the input and output tensor to the maximum length
    input_tensor = tf.keras.preprocessing.sequence.pad_sequences(input_tensor, maxlen=max_length_inp, padding='post')
    target_tensor = tf.keras.preprocessing.sequence.pad_sequences(target_tensor, maxlen=max_length_tar, padding='post')

    return input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_tar
4- Getting vectors from the load_dataset function
# Try experimenting with the size of the dataset
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang, max_length_inp, max_length_targ = load_dataset(path_to_file, num_examples)
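Optionally, you can sanity-check the tensors by mapping a row of ids back to words (a small helper written for this post, not part of the pipeline):

def tensor_to_sentence(lang, tensor_row):
    # skip id 0, which is the padding token
    return ' '.join(lang.idx2word[t] for t in tensor_row if t != 0)

print(tensor_to_sentence(inp_lang, input_tensor[0]))
print(tensor_to_sentence(targ_lang, target_tensor[0]))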
5- Splitting the dataset into training and validation sets
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)
6- Predefining some values
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
N_BATCH = BUFFER_SIZE//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word2idx)
vocab_tar_size = len(targ_lang.word2idx)
dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
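With eager execution enabled you can iterate the dataset directly, so a quick shape check of one batch looks like this (optional):

example_inp, example_targ = next(iter(dataset))
print(example_inp.shape, example_targ.shape)  # (64, max_length_inp) and (64, max_length_tar)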
7- Encoder and Decoder model
Here, we'll implement an encoder-decoder model with attention, which you can read about in the TensorFlow Neural Machine Translation (seq2seq) tutorial. Each input word is assigned a weight by the attention mechanism, which the decoder then uses to predict the next word in the sentence.
The input is put through an encoder model which gives us the encoder output of shape (batch_size, max_length, hidden_size) and the encoder hidden state of shape (batch_size, hidden_size).
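In short, the attention used in the decoder below can be summarized as (EO = encoder output, H = decoder hidden state, FC = fully connected layer):

score = FC(tanh(FC(EO) + FC(H)))
attention_weights = softmax(score, axis=1)
context_vector = sum(attention_weights * EO, axis=1)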
The shapes of all the vectors at each step have been specified in the comments in the code:
def gru(units):
    # If you have a GPU, we recommend using CuDNNGRU (provides a 3x speedup over GRU);
    # the code below picks it automatically.
    if tf.test.is_gpu_available():
        return tf.keras.layers.CuDNNGRU(units,
                                        return_sequences=True,
                                        return_state=True,
                                        recurrent_initializer='glorot_uniform')
    else:
        return tf.keras.layers.GRU(units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_activation='sigmoid',
                                   recurrent_initializer='glorot_uniform')
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = gru(self.enc_units)

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = gru(self.dec_units)
        self.fc = tf.keras.layers.Dense(vocab_size)

        # used for attention
        self.W1 = tf.keras.layers.Dense(self.dec_units)
        self.W2 = tf.keras.layers.Dense(self.dec_units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)

        # hidden shape == (batch_size, hidden_size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # score shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying tanh(FC(EO) + FC(H)) to self.V
        score = self.V(tf.nn.tanh(self.W1(enc_output) + self.W2(hidden_with_time_axis)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * enc_output
        context_vector = tf.reduce_sum(context_vector, axis=1)

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)

        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))

        # output shape == (batch_size * 1, vocab)
        x = self.fc(output)

        return x, state, attention_weights

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.dec_units))
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)
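Before training, it can be useful to push one batch through the models to confirm the shapes (an optional sanity check, not required for the rest of the tutorial):

sample_inp, sample_targ = next(iter(dataset))
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(sample_inp, sample_hidden)
print(sample_output.shape)  # (64, max_length_inp, 1024)
print(sample_hidden.shape)  # (64, 1024)

sample_dec_input = tf.expand_dims([targ_lang.word2idx['<start>']] * BATCH_SIZE, 1)
sample_logits, _, sample_attention = decoder(sample_dec_input, sample_hidden, sample_output)
print(sample_logits.shape)  # (64, vocab_tar_size)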
8- Defining Optimizer, Loss Function and Checkpoints
optimizer = tf.train.AdamOptimizer()
def loss_function(real, pred):
    mask = 1 - np.equal(real, 0)
    loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=real, logits=pred) * mask
    return tf.reduce_mean(loss_)
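The mask simply zeroes out the loss at padded positions, for example (illustrative only):

real_example = np.array([5, 12, 0, 0])  # two real words followed by padding
print(1 - np.equal(real_example, 0))    # [1 1 0 0] - padded positions contribute nothing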
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
encoder=encoder,
decoder=decoder)
9- Let’s start the training
- Pass the input through the encoder, which returns the encoder output and the encoder hidden state.
- The encoder output, the encoder hidden state and the decoder input (which is the start token) are passed to the decoder.
- The decoder returns the predictions and the decoder hidden state.
- The decoder hidden state is then passed back into the model and the predictions are used to calculate the loss.
- Use teacher forcing to decide the next input to the decoder.
- Teacher forcing is the technique where the target word is passed as the next input to the decoder.
- The final step is to calculate the gradients and apply them with the optimizer to backpropagate.
EPOCHS = 10

for epoch in range(EPOCHS):
    start = time.time()

    hidden = encoder.initialize_hidden_state()
    total_loss = 0

    for (batch, (inp, targ)) in enumerate(dataset):
        loss = 0

        with tf.GradientTape() as tape:
            enc_output, enc_hidden = encoder(inp, hidden)
            dec_hidden = enc_hidden
            dec_input = tf.expand_dims([targ_lang.word2idx['<start>']] * BATCH_SIZE, 1)

            # Teacher forcing - feeding the target as the next input
            for t in range(1, targ.shape[1]):
                # passing enc_output to the decoder
                predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

                loss += loss_function(targ[:, t], predictions)

                # using teacher forcing
                dec_input = tf.expand_dims(targ[:, t], 1)

        batch_loss = (loss / int(targ.shape[1]))
        total_loss += batch_loss

        variables = encoder.variables + decoder.variables
        gradients = tape.gradient(loss, variables)
        optimizer.apply_gradients(zip(gradients, variables))

        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))

    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)

    print('Epoch {} Loss {:.4f}'.format(epoch + 1, total_loss / N_BATCH))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
10- Now let's make a prediction function that gives the actual output
The evaluate function is similar to the training loop, except we don’t use teacher forcing here. The input to the decoder at each time step is its previous predictions along with the hidden state and the encoder output.
def evaluate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ):
    attention_plot = np.zeros((max_length_targ, max_length_inp))

    sentence = preprocess_sentence(sentence)

    inputs = [inp_lang.word2idx[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs], maxlen=max_length_inp, padding='post')
    inputs = tf.convert_to_tensor(inputs)

    result = ''

    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word2idx['<start>']], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)

        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang.idx2word[predicted_id] + ' '

        if targ_lang.idx2word[predicted_id] == '<end>':
            return result, sentence, attention_plot

        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot
11- Make a function for plotting the attention weights, and then a translate function that connects prediction and plotting
# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')

    fontdict = {'fontsize': 14}

    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

    plt.show()
def translate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ):
    result, sentence, attention_plot = evaluate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ)

    print('Input: {}'.format(sentence))
    print('Predicted translation: {}'.format(result))

    attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
    plot_attention(attention_plot, sentence.split(' '), result.split(' '))
12- Restore checkpoints
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
13- Now the final task: prediction!
We input some text in Spanish, and our model outputs its English translation along with a plot of the attention weights.
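For example (assuming the checkpoint has been restored as above; the sample sentence is just an illustration and the translation quality will depend on how long you trained):

translate(u'hace mucho frio aqui.', encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ)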
Research paper - Effective Approaches to Attention-based Neural Machine Translation (Luong, Pham and Manning, 2015), from which the attention theory above is summarized.
Congrats! You've made it.
Do leave your comments below.
Have fun, keep learning, and always keep coding.