Automated Response for Email using Deep Learning

Pratik Sen
Dec 16, 2020 · 24 min read

Table Of Contents

  1. Problem Statement
  2. Problem Description
  3. Research-Paper
  4. Data Preparation and Overview
  5. Mapping the problem into Deep Learning problem
  6. Business objectives and constraints
  7. Performance metric for Deep Learning
  8. Data Preprocessing & EDA
  9. Encoder Decoder based Sequence to Sequence model(This section contains explanation and implementation of the Train and Inference models)
  10. Attention Model (This section implements Attention model and shows how we can improve the existing Seq2Seq model)
  11. Evaluate the Model’s predicted answers using BLEU Score
  12. Conclusion
  13. Future Works
  14. My GitHub and LinkedIn
  15. References

1. Problem Statement:

Create an end-to-end method for automatically generating short email responses, called Smart Reply.

2. Problem Description:

Email is one of the most popular modes of communication on the Web. With the rapid increase in email overload, it has become increasingly challenging for users to process and respond to incoming messages. It can be especially time consuming to type email replies on a mobile device. An ML model can therefore help by automatically suggesting replies for an incoming email. This problem description is taken from the research paper mentioned below.

3. Research-Paper:

https://www.kdd.org/kdd2016/papers/files/Paper_1069.pdf

4. Data Preparation and Overview

The data for this case study is the Enron Email corpus dataset. Link → https://www.cs.cmu.edu/~./enron/

  • The data is present as a compressed file of about 1.7 GB
  • Unzipping the data shows that it contains email data for 150 users.
  • There is a separate folder for each user, which looks like this:
  • There are a total of 150 such folders.
  • Inside each of these user folders, there are subfolders like Sent Mail, Inbox, Documents, etc. We go into the Inbox subfolder, which contains plenty of email threads. Each thread contains mails and their responses.
  • From these email threads, we build the Question-Answer pairs. For example, the first email of a thread becomes the first question, and the 2nd email (its reply) becomes the answer, giving us the first Q-A pair. For the next Q-A pair, the 2nd email in the thread becomes the question and its reply (the 3rd email) becomes the answer. The same goes for the rest of the emails in the thread. Some email threads contain only one email; we discard such threads and focus only on threads that contain multiple mails and responses. Here's a snapshot of a thread which shows how to identify the Q-A pairs.
  • As shown in the above picture, we have to write code that extracts only the marked texts (in purple brackets) from the entire document.
  • In short, here is the logic I used to derive the useful text (questions and answers) from these documents. First, I looked for the right keyword to separate the emails and found that "-----Original Message-----" works best, so I wrote Python code to split on this keyword. Each of the resulting messages still contains unnecessary text like "From:", "Sent:", "To:", etc., so I split each one again on the keyword "Subject:", which gives the exact message in that email. Once I had all the clean messages, I looped over them in reverse to form the Q-A pairs. A minimal sketch of this parsing logic follows this list.
  • After extracting Question-Answer pairs from all the email threads for all the users, we finally save these Q-A pairs in an excel file. We get approximately 12K Q-A pairs.
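Here is a minimal sketch of that parsing logic. The maildir path, the helper name extract_qa_pairs and the output file name are illustrative assumptions, and real threads need a bit more header cleanup than shown here.

```python
import os
import pandas as pd

def extract_qa_pairs(thread_text):
    """Split one raw email thread into (question, answer) pairs."""
    # Split the thread into individual messages on the separator keyword.
    messages = thread_text.split("-----Original Message-----")
    # Keep only the body that follows the "Subject:" header of each message.
    bodies = []
    for msg in messages:
        body = msg.split("Subject:")[-1].strip()
        if body:
            bodies.append(body)
    # The thread is stored newest-first, so walk it in reverse:
    # each mail becomes a question and the next mail (its reply) the answer.
    bodies.reverse()
    return [(bodies[i], bodies[i + 1]) for i in range(len(bodies) - 1)]

qa_pairs = []
maildir = "maildir"  # root folder of the unzipped Enron corpus (assumed path)
for user in os.listdir(maildir):
    inbox = os.path.join(maildir, user, "inbox")
    if not os.path.isdir(inbox):
        continue
    for fname in os.listdir(inbox):
        fpath = os.path.join(inbox, fname)
        if os.path.isfile(fpath):
            with open(fpath, errors="ignore") as f:
                qa_pairs.extend(extract_qa_pairs(f.read()))

pd.DataFrame(qa_pairs, columns=["Sentence_1", "Sentence_2"]).to_excel("qa_pairs.xlsx", index=False)
```

Note that threads with a single email naturally produce zero pairs here, so they are discarded automatically.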

5. Mapping the problem into Deep Learning problem:

We can take this dataset of QA pairs, clean the data and perform data pre-processing techniques, tokenize the questions and answers using the Keras Tokenizer API and make the data ready for the Encoder-Decoder Seq2Seq architecture. Later, we will also add an attention Layer to improve the performance of the Encoder Decoder model.

6. Business objectives and constraints:

A. The objective of this Case Study is to build a Deep Learning model which trains on the Q-A pairs and becomes intelligent enough to predict suitable replies to incoming emails. This has to be a Sequence to Sequence based Deep Learning model.

B. The system should have a low latency (approx a few milliseconds to 1 second)

7. Performance metric for Deep Learning:

The loss used in the compile() method is sparse_categorical_crossentropy.
Later, during Post model Evaluation, we will use BLEU score to evaluate the quality of the predicted answer.

8. Data Preprocessing & EDA:

EDA will be used to help in data preprocessing. Data preprocessing will be used to make the Q-A pairs suitable for the Encoder decoder model.

8.1 - First read the csv file that we generated after extracting data from the original source.

Here Sentence_1 will be the Question column and Sentence_2 will be Answer column.

8.2 - Then drop those rows where either Sentence_1 or Sentence_2 is NaN
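A minimal sketch of steps 8.1 and 8.2, assuming the extracted pairs were saved as qa_pairs.csv with columns Sentence_1 (question) and Sentence_2 (answer):

```python
import pandas as pd

# Read the Q-A pairs extracted from the Enron corpus.
df = pd.read_csv("qa_pairs.csv")

# Drop the rows where either the question or the answer is missing.
df = df.dropna(subset=["Sentence_1", "Sentence_2"]).reset_index(drop=True)
print(df.shape)
```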

8.3 - Find if a sentence contains “.” or “?”

Find whether a sentence contains "." or "?". Then we limit Sentence_1 and Sentence_2 to the first occurrence of "." or "?". We have to do this because many questions and answers are huge, and this would increase the length of the Encoder-Decoder inputs. Here's the code snippet.
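Here is a small sketch of this truncation step (the helper name truncate_at_first_stop is my own):

```python
import re

def truncate_at_first_stop(text):
    """Keep the text only up to (and including) the first '.' or '?'."""
    match = re.search(r"[.?]", text)
    return text[:match.end()] if match else text

df["Sentence_1"] = df["Sentence_1"].apply(truncate_at_first_stop)
df["Sentence_2"] = df["Sentence_2"].apply(truncate_at_first_stop)
```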

8.4 - Perform text-preprocessing

Perform text-preprocessing like replace multiple spaces with single space, substitute “won’t” with “will not”, “can’t” with “cannot” etc.
Later, we will add a space around "." and "?", and while using the Tokenizer API we will exclude these two from the filter. Therefore both "." and "?" will also be treated as tokens.
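A minimal sketch of this preprocessing, assuming a simple regex-based cleanup function:

```python
import re

def preprocess(text):
    text = str(text).lower()
    # Expand a few common contractions.
    text = text.replace("won't", "will not").replace("can't", "cannot")
    text = re.sub(r"n't", " not", text)
    # Add a space around "." and "?" so the tokenizer treats them as separate tokens.
    text = re.sub(r"\.", " . ", text)
    text = re.sub(r"\?", " ? ", text)
    # Replace multiple spaces with a single space.
    text = re.sub(r"\s+", " ", text).strip()
    return text

df["Sentence_1"] = df["Sentence_1"].apply(preprocess)
df["Sentence_2"] = df["Sentence_2"].apply(preprocess)
```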

8.5 - Next we will check what is the size of Question and answers.

To check this, we will do a scatter plot with the X-axis as the index of the question and the Y-axis as the question length. We will do the same to analyze the answer length.

Analyze the Length of Questions
Analyze the Length of Answers

From the above 2 plots, we see that the majority of the questions and answers have a length less than 1000. So in the next step, we will limit their length to 1000. There are a handful of questions and answers with lengths around 10000 and 3000; it's better to remove them, as the padding length would otherwise increase dramatically.
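A sketch of the length analysis and the filtering step (I am assuming length is measured in characters here):

```python
import matplotlib.pyplot as plt

q_len = df["Sentence_1"].str.len()
a_len = df["Sentence_2"].str.len()

# Scatter plots: index on the X-axis, length on the Y-axis.
plt.scatter(range(len(q_len)), q_len, s=2)
plt.title("Question length"); plt.show()
plt.scatter(range(len(a_len)), a_len, s=2)
plt.title("Answer length"); plt.show()

# Keep only the pairs where both the question and the answer are shorter than 1000.
df = df[(q_len < 1000) & (a_len < 1000)].reset_index(drop=True)
```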

8.6 - Insert the start-of-sentence token <Prarambh> and the end-of-sentence token <Samaapt> into the answer.

This will help the model understand the start and end of a sentence. Initialize the Tokenizer with filters. The filters tell the tokenizer to filter out unnecessary symbols that we do not want the model to train on, but we exclude "." and "?" from them because we want the model to train on these two symbols. Additionally, keep the vocabulary size as a global variable. The vocabulary size that we get is 13521 words.
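A sketch of this step: the answers are wrapped with the start/end markers and the Keras Tokenizer is fitted with a filter that keeps "." and "?" (the exact filter string here is an assumption):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Wrap every answer with the start and end markers.
df["Sentence_2"] = df["Sentence_2"].apply(lambda a: "prarambh " + a + " samaapt")

# Filter out symbols we do not want to train on, but keep "." and "?" as tokens.
filters = '!"#$%&()*+,-/:;<=>@[\\]^_`{|}~\t\n'
tokenizer = Tokenizer(filters=filters)
tokenizer.fit_on_texts(df["Sentence_1"].tolist() + df["Sentence_2"].tolist())

VOCAB_SIZE = len(tokenizer.word_index) + 1  # +1 for the padding index 0
print(VOCAB_SIZE)
```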

8.7 - Let’s do some analysis on the frequencies of each word-

To do this we make use of “tokenizer.word_docs”. This will give a dictionary showing the frequency of each word e.g — “This”:21 , “book”:209, “we”:21 .
Now write a program to swap the key value pairs, so that all the words having the same frequency get aggregated.

The new dictionary that we get after swapping, will look like:

{1:[“publish”, “conclusion”, “abiding”……] , 2: [“wishful”, “thought”,….. ], …….}

In this dictionary, the words that have frequency = 1 are all present in the list that is mapped to the key 1. On calculating the length of that list, I found that there are 4458 words with a frequency of 1. We have the choice to mark these words as OOV (Out of Vocabulary) words. The tokenizer will assign a specific token to OOV words, and this helps build a more robust model that can deal with unknown words during the test phase.
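A small sketch of this frequency analysis using tokenizer.word_docs:

```python
from collections import defaultdict

# tokenizer.word_docs maps each word to the number of documents it appears in.
freq_to_words = defaultdict(list)
for word, freq in tokenizer.word_docs.items():
    freq_to_words[freq].append(word)

# Words that appear only once are candidates to be marked as OOV.
rare_words = freq_to_words[1]
print(len(rare_words))
```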

8.8 - Let’s define encoder_input_data , decoder_input_data , decoder_output_data for the Train model.

8.8.1 -Creation of encoder_input_data

Convert the question to a sequence using the Tokenizer and post-pad the sequence up to the max length of the question.
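A sketch of the encoder_input_data creation:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenized_questions = tokenizer.texts_to_sequences(df["Sentence_1"].tolist())
max_question_len = max(len(seq) for seq in tokenized_questions)

# Post-pad every question up to the maximum question length.
encoder_input_data = np.array(pad_sequences(tokenized_questions,
                                            maxlen=max_question_len,
                                            padding="post"))
print(encoder_input_data.shape, max_question_len)
```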

The printed shapes are →(11796, 154) 154

The model is trained on a given source and target sequence where the model takes both the source and a shifted version of the target sequence as input and predicts the whole target sequence. For example, one source sequence may be [1,2,3] and the target sequence [4,5,6].
Then the inputs and outputs to the model during training would be:

Input1: [‘1’, ‘2’, ‘3’] — This resembles Question to the model

Input2: [‘prarambh’, ‘4’, ‘5’, ‘6’] — This resembles the decoder_input_data

Output: [‘4’, ‘5’, ‘6’, ‘samaapt’] — This resembles decoder_output_data

8.8.2 -Create decoder_input_data

The printed shapes are: (11796, 164) 164

8.8.3 -Create decoder_output_data

The printed shapes are: (11796, 164) 164
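A sketch covering both 8.8.2 and 8.8.3: the decoder input is the full answer starting with "prarambh", and the decoder output is the same answer shifted one step to the left so that it ends with "samaapt".

```python
tokenized_answers = tokenizer.texts_to_sequences(df["Sentence_2"].tolist())
max_answer_len = max(len(seq) for seq in tokenized_answers)

# decoder_input_data: the answer starting with the "prarambh" token.
decoder_input_data = np.array(pad_sequences(tokenized_answers,
                                            maxlen=max_answer_len,
                                            padding="post"))

# decoder_output_data: the answer shifted left by one token, ending with "samaapt".
decoder_output_data = np.array(pad_sequences([seq[1:] for seq in tokenized_answers],
                                             maxlen=max_answer_len,
                                             padding="post"))
print(decoder_input_data.shape, decoder_output_data.shape)
```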

9. Encoder Decoder based Sequence to Sequence model

Now let's define 3 models: the Train model, the Inference Encoder model, and the Inference Decoder model.

9.1 Train Model:

→ 9.1.1 - First, let's define the architecture of the training encoder.

We declare what should be the input shape, Embedding layer details and Encoder LSTM details.

As can be seen, the training encoder takes input of shape (?, 154). The “?” here refers to the batch size, which is passed during the model.fit() operation; that's why we leave it as the unknown “?” here.

Note that the encoder input of shape (?, 154) is not passed directly as the LSTM input. First, these inputs have to be converted to 3D. To convert to 3D, we can either one-hot encode the tokens in the input sequence or pass them through an Embedding layer. The output of the Embedding layer is of shape (?, 154, 64), where 64 is the number of embedding nodes. Now, (?, 154, 64) is passed as input to the encoder LSTM.

There are 100 LSTM cells used in the encoder architecture. These 100 LSTM cells together form 1 LSTM layer. This LSTM layer will unroll itself for each input timestep, and the amount of unrolling in turn depends on the input dimensions. In this case, each input sequence is of shape (?, 154), so the LSTM layer unrolls 154 times in time. To keep the explanation simple, let's suppose that the batch size is 1. This means that the first unrolled step takes the 1st input token and the last unrolled step takes the last (154th) input token.

In the encoder LSTM declaration, return_state=True. Therefore, the LSTM returns three outputs. The 1st and the 2nd outputs are the same as the final hidden state, State_h, of shape (?, 100). The 3rd output is the final cell state, State_c, which is also of shape (?, 100).

final State_h and final State_C are used as initial states for the LSTM decoder
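A minimal Keras sketch of the training encoder described above (layer names are my own; the hyperparameters follow the text: 64 embedding nodes, 100 LSTM units):

```python
from tensorflow.keras.layers import Input, Embedding, LSTM

EMBEDDING_DIM = 64
LSTM_UNITS = 100

# Training encoder: (?, 154) token ids -> Embedding -> (?, 154, 64) -> LSTM.
encoder_inputs = Input(shape=(max_question_len,))
encoder_embedding = Embedding(VOCAB_SIZE, EMBEDDING_DIM)(encoder_inputs)
encoder_lstm = LSTM(LSTM_UNITS, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)

# Only the final states are carried forward to initialise the decoder.
encoder_states = [state_h, state_c]
```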

→ 9.1.2 - Now let's go to the training decoder architecture.

The decoder LSTM also unrolls for each timestep during the training phase. The decoder input is of shape (?, 164), so the LSTM layer unrolls 164 times. Note that the decoder input of shape (?, 164) is not passed directly as the LSTM input. First, these inputs have to be converted to 3D. To convert to 3D, we can either one-hot encode the tokens in the input sequence or pass them through an Embedding layer. The output of the Embedding layer is of shape (?, 164, 64), where 64 is the number of embedding nodes. Now, (?, 164, 64) is passed as input to the decoder LSTM.

The decoder LSTM's initial states (State_h and State_c) are initialized with the encoder's final State_h and final State_c.

The Decoder LSTM return_state=True and return_sequences=True.

Therefore, there are 3 outputs from the LSTM: the 1st is the output sequence of shape (?, 164, 100); the 2nd is the final hidden state, State_h, of shape (?, 100); the 3rd is the final cell state, State_c, which is also of shape (?, 100).

Note that the 2nd and 3rd outputs are not of any use in the later part of the model. The output sequence of shape (?, 164, 100) is passed as input to the final Softmax layer. The output from the Softmax layer is of shape (?, 164, Vocab Size). If the batch size = 1, then this will be (1, 164, Vocab Size).
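And a matching sketch of the training decoder plus the final Softmax (Dense) layer:

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

# Training decoder: (?, 164) token ids -> Embedding -> (?, 164, 64) -> LSTM.
decoder_inputs = Input(shape=(max_answer_len,))
decoder_embedding_layer = Embedding(VOCAB_SIZE, EMBEDDING_DIM)
decoder_embedding = decoder_embedding_layer(decoder_inputs)

decoder_lstm = LSTM(LSTM_UNITS, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)

# Final Softmax layer: (?, 164, 100) -> (?, 164, VOCAB_SIZE).
decoder_dense = Dense(VOCAB_SIZE, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

train_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
```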

9.2 Inference Models:

This will be the part of the Encoder-Decoder architecture used during the testing phase. During the inference/test phase, the Inference Encoder LSTM should unroll for each timestep, but the Inference Decoder LSTM should not unroll. The reason: the decoder input is of shape (1, 1), which looks like [ [3] ], where 3 is the token for “prarambh”. Since there is only one input token, unrolling is not required.

The Inference Encoder LSTM definition is the same as the Encoder LSTM defined during the training phase. The only difference here is that we define a separate model for the encoder, called the Inference Encoder model.

Ques: How does the Inference Encoder model work ?
Ans: The model takes the encoder inputs and gives encoder_states (the final State_h and State_c) as output; each of them has shape (?, 100).
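In code, this is a one-liner that reuses the training encoder's layers (a sketch):

```python
# Inference encoder: same layers as in training, wrapped in a separate Model
# that maps the padded question directly to the final [state_h, state_c].
inference_encoder = Model(encoder_inputs, encoder_states)
```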

Ques: How does the Inference Decoder model work ?
Ans: The Inference Decoder LSTM needs an initial State_h and an initial State_c, of shape (1, 100) each. The decoder receives these from the Inference Encoder's output, encoder_states. There are 3 outputs from the decoder: the 1st is the decoder output sequence of shape (1, 1, 100), the 2nd is the final hidden state State_h of shape (1, 100), and the 3rd is the final cell state State_c of shape (1, 100). The decoder output sequence of shape (1, 1, 100) is passed as input to the final Softmax layer, whose output is of shape (1, 1, Vocab_Size).

Now we need to create a Model of the above defined Inference Decoder architecture.

Just highlighting the part from the above Gist

To write this code, we need to understand what will be the input and output to the Model.

Remember that the Decoder Inference model has to be called again and again in a for loop so that in each iteration, a word is predicted at the final Softmax layer.

In each iteration of the for loop, the inputs to the Model can be :

1. Decoder initial input (1,1)

2. [ Decoder State_h, Decoder State_c ]

What must be the Output from the Inference Decoder Model in each iteration of the for loop?

The answer to this is: think about what values should be passed as input to the Inference Decoder model when the next iteration of the for loop runs.

This means that:

1. The decoder initial input has to be derived from the output of the final Softmax layer. This means that the final Softmax output has to be the 1st output from the Model.

2. The Decoder State_h has to be updated for the next iteration. So whatever is the State_h generated from the current iteration, becomes the input State_h for the next iteration. So the State_h has to be the 2nd Output.

3. The Decoder State_C has to be updated for the next iteration. So whatever is the State_C generated from the current iteration, becomes the input State_C for the next iteration. Both the new states are passed together as [State_h, State_c].
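Putting the above reasoning into code, here is a sketch of the Inference Decoder model (it reuses the embedding, LSTM and Dense layers from the training decoder sketched earlier):

```python
# State inputs that will be fed back into the decoder at every step.
decoder_state_input_h = Input(shape=(LSTM_UNITS,))
decoder_state_input_c = Input(shape=(LSTM_UNITS,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# Single-token input of shape (1, 1), e.g. [[3]] for "prarambh".
inf_decoder_inputs = Input(shape=(1,))
inf_decoder_embedding = decoder_embedding_layer(inf_decoder_inputs)

inf_decoder_outputs, inf_state_h, inf_state_c = decoder_lstm(
    inf_decoder_embedding, initial_state=decoder_states_inputs)
inf_decoder_outputs = decoder_dense(inf_decoder_outputs)  # (1, 1, VOCAB_SIZE)

# Outputs: the Softmax prediction first, then the updated states for the next iteration.
inference_decoder = Model(
    [inf_decoder_inputs] + decoder_states_inputs,
    [inf_decoder_outputs, inf_state_h, inf_state_c])
```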

9.3. Instantiate the 3 models and Compile() the train model

Here’s the Train Model Summary

For the embedding layer, the number of trainable params is 13521 * 64 = 865344

For the dense layer, the number of trainable params is : ( 13521 * 100 ) + 13521 = 1365621
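A sketch of the compile step (the optimizer and the accuracy metric here are assumptions):

```python
train_model.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
train_model.summary()
```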

9.4 Let’s fit the train model and save all the 3 models

Now let’s start the training for the train model for epochs=100, batch_size=64 and also provide Early stopping with monitor=’val_loss’ and patience=5.

After training is complete, save the train model, Inference Encoder, Inference Decoder model. During the testing phase , we will just load the saved Inference Encoder and Inference Decoder models
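A sketch of the training and saving step with the settings mentioned above (the validation split is an assumption):

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=5)

train_model.fit([encoder_input_data, decoder_input_data],
                decoder_output_data,
                batch_size=64,
                epochs=100,
                validation_split=0.2,
                callbacks=[early_stop])

# Save all three models; at test time only the inference models are loaded.
train_model.save("train_model.h5")
inference_encoder.save("inference_encoder.h5")
inference_decoder.save("inference_decoder.h5")
```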

Let’s look at the plots for all the 3 models

Graph for the train model
Graph for the Inference Encoder model
Graph for the Inference Decoder

9.5 Now let’s create TFLite models of the Inference Encoder and Inference Decoder

The tf-nightly pip package contains the latest TensorFlow build. This package is updated once every night and contains the latest features, bug fixes and improvements over the last stable TensorFlow release. For more info, refer to this Stack Overflow answer which explains the difference between tf-nightly and tensorflow.
The following snapshot is for Inference Encoder. We will do the same for the Inference Decoder model.

TFLite will create highly optimized versions of these models and produce a .tflite file. This file can be loaded onto edge devices (like a Raspberry Pi or a mobile phone) to do the prediction.
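A minimal conversion sketch for the Inference Encoder (the same lines are repeated for the Inference Decoder):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(inference_encoder)
tflite_encoder = converter.convert()

with open("inference_encoder.tflite", "wb") as f:
    f.write(tflite_encoder)
```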

This will print the TFLite inference encoder model’s size- File size: 0.969 MB. Before conversion to TFLite, the model’s size was 4 MB.
Similarly the TFLite inference decoder model’s size is just 4MB. But before conversion, it was 9 MB.

Let’s see the Input-Output tensor shapes of encoder

Output from the above code:

It's important to keep track of the shapes and types of the different inputs and outputs. E.g., the inference encoder expects the first input to be of shape (1, 154) and of type numpy.float32.
Conceptually, the first input of the inference encoder is the user question. This means the question has to be first converted into tokens using the Keras tokenizer and then padded to the maximum length of the question. To do this, a separate function has been defined.
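A sketch of inspecting the encoder's tensors and of the question-to-tokens helper (the name str_to_tokens follows the text, but its body here is an assumption):

```python
import numpy as np

interpreter_1 = tf.lite.Interpreter(model_path="inference_encoder.tflite")
interpreter_1.allocate_tensors()
print(interpreter_1.get_input_details())
print(interpreter_1.get_output_details())

def str_to_tokens(sentence):
    """Convert a raw user question into a padded (1, 154) float32 array."""
    seq = tokenizer.texts_to_sequences([preprocess(sentence)])
    padded = pad_sequences(seq, maxlen=max_question_len, padding="post")
    return padded.astype(np.float32)
```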

Let’s see the decoder’s input and output tensor shapes

Output of the above cell:

9.6 Define a helper function:

This helper function will decode the aggregated output from the decoder. This function will be called inside the main function- “Predict_answer()”

9.7 Now the main function — predict_answer() is defined:

This function first initializes the TFLite interpreters for the encoder and decoder models and gets their input and output shapes. After that, the str_to_tokens() function is used to convert the user question into tokens of shape (1, 154), and then the type is changed to numpy.float32 because the encoder input expects numpy.float32.
Now, allocate the tensors and set the input tensor to the input numpy array.
Invoke the interpreter and use interpreter.get_tensor() to make the prediction. Remember that we printed the input and output details for the inference encoder. In the output, there are two nodes which give outputs of shape (1, 100) and (1, 100) respectively. Therefore, interpreter.get_tensor() has to be called twice to get both predictions. These 2 predictions will serve as the initial states for the inference decoder.

The inference decoder model does its prediction word by word. This means that the inference decoder has to be called iteratively in a for loop. Notice that interpreter_2.set_tensor() has been called 3 times because, as we saw from the cell which printed the input and output details for the inference decoder, there have to be 3 inputs of shape (1, 100), (1, 100) and (1, 1).
There will be 3 predictions. The 1st will be of shape (1,1,VOCAB_SIZE). From this we will have to get the predicted token by doing argmax. This new token will serve as the 3rd input in the next iteration.
The 2nd and 3rd predictions will be used as the new hidden and cell states (1st and 2nd input )for the next iteration.
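Here is a condensed sketch of predict_answer() based on the description above. The ordering of the entries in get_input_details()/get_output_details() is assumed to match the shapes printed earlier and should be verified against your own converted models.

```python
def predict_answer(question):
    """Greedy decoding with the TFLite inference encoder and decoder (a sketch)."""
    interpreter_1 = tf.lite.Interpreter(model_path="inference_encoder.tflite")
    interpreter_2 = tf.lite.Interpreter(model_path="inference_decoder.tflite")
    interpreter_1.allocate_tensors()
    interpreter_2.allocate_tensors()
    enc_in, enc_out = interpreter_1.get_input_details(), interpreter_1.get_output_details()
    dec_in, dec_out = interpreter_2.get_input_details(), interpreter_2.get_output_details()

    # Run the encoder once to get the initial decoder states.
    interpreter_1.set_tensor(enc_in[0]["index"], str_to_tokens(question))
    interpreter_1.invoke()
    state_h = interpreter_1.get_tensor(enc_out[0]["index"])
    state_c = interpreter_1.get_tensor(enc_out[1]["index"])

    # Start decoding from the "prarambh" token (assumed to be fed as float32).
    target_token = np.array([[tokenizer.word_index["prarambh"]]], dtype=np.float32)
    decoded_words = []
    for _ in range(max_answer_len):
        interpreter_2.set_tensor(dec_in[0]["index"], state_h)
        interpreter_2.set_tensor(dec_in[1]["index"], state_c)
        interpreter_2.set_tensor(dec_in[2]["index"], target_token)
        interpreter_2.invoke()
        probs = interpreter_2.get_tensor(dec_out[0]["index"])    # (1, 1, VOCAB_SIZE)
        state_h = interpreter_2.get_tensor(dec_out[1]["index"])  # (1, 100)
        state_c = interpreter_2.get_tensor(dec_out[2]["index"])  # (1, 100)

        token_id = int(np.argmax(probs[0, -1, :]))
        word = tokenizer.index_word.get(token_id, "")
        if word == "samaapt" or not word:
            break
        decoded_words.append(word)
        target_token = np.array([[token_id]], dtype=np.float32)
    return " ".join(decoded_words)
```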

9.8 Let’s call the predict_answer function and do predictions on a User Question.

Here is the output:

As we can see, the predicted answers are not good enough. The model is not even able to form English sentences. In the next section, let's add an Attention layer and check if we can improve the quality of the predicted answers.

10. Attention Model

For adding attention to the existing Encoder-Decoder model, I have taken reference from the blog by Thushan Ganegedara. The link to his blog on Attention: https://towardsdatascience.com/light-on-math-ml-attention-with-keras-dc8dbc1fad39

Image taken from Thushan Ganegedara’s blog on Attention

This architecture preserves the functioning of the General Seq2Seq Encoder Decoder architecture and attaches attention mechanism on top of it.

But before we proceed, save the attention.py file in Google Drive and load it in the notebook.

Now we can directly use the AttentionLayer Class defined in this attention.py file

10.1 Let’s define the Train and Inference models.

Let’s understand each of the models defined under define_model() function

10.1.1 -Encoder used in training.

Snippet from the above define_model function

The first parameter to the define_model function, n_input, is the maximum length of the question (i.e. 154), and the 2nd parameter, n_output, is the maximum length of the answer (i.e. 164). The 3rd parameter is the number of LSTM units.

The training encoder takes input of shape (?, 154). The “?” here refers to the batch size, which is passed during the model.fit() operation; that's why we leave it as the unknown “?” here.

Note that the encoder input of shape (?, 154) is not passed directly as the LSTM input. First, these inputs have to be converted to 3D. To convert to 3D, we can either one-hot encode the tokens in the input sequence or pass them through an Embedding layer. The output of the Embedding layer is of shape (?, 154, 64), where 64 is the number of embedding nodes. Now, (?, 154, 64) is passed as input to the encoder LSTM.

There are 100 LSTM cells used in the encoder architecture. These 100 LSTM cells together form 1 LSTM layer. This LSTM layer will unroll over each time step, and the amount of unrolling in turn depends on the input dimensions. In this case, each input sequence is of shape (?, 154), so the LSTM layer unrolls 154 times in time. This means that the first unrolled step takes the 1st input token and the last unrolled step takes the last (154th) input token.

In the Encoder LSTM declaration, the return_sequences=True and return_state=True . Therefore, the LSTM will return three outputs. The 1st is the LSTM output sequence which is of shape (?, 154, 100). The 2nd output is the final hidden state- State_h of shape (?, 100) . The 3rd output is the final Cell state- State_c which is also of shape (?,100).

Note that all these 3 outputs from the Encoder LSTM are super useful. We will see this later.

10.1.2 Now let's go to the training decoder architecture.
The decoder LSTM also unrolls over each time step during the training phase. The decoder input is of shape (?, 164), so the LSTM layer unrolls 164 times; here “?” is the batch size, which is declared later during the model.fit() operation. Note that the decoder input of shape (?, 164) is not passed directly as the LSTM input. First, these inputs have to be converted to 3D. To convert to 3D, we can either one-hot encode the tokens in the input sequence or pass them through an Embedding layer. The output of the Embedding layer is of shape (?, 164, 64), where 64 is the number of embedding nodes. Now, (?, 164, 64) is passed as input to the decoder LSTM.

The decoder LSTM's initial states (State_h and State_c) are initialized with the encoder's final State_h and final State_c.

Even for the Decoder LSTM, return_state=True and return_sequences=True.

Therefore, there are 3 outputs from the LSTM: the 1st is the output sequence of shape (?, 164, 100); the 2nd is the final hidden state, State_h, of shape (?, 100); the 3rd is the final cell state, State_c, which is also of shape (?, 100).

Note that the 2nd and 3rd outputs are not of any use in the later part of the model.

10.1.3 Now comes the Attention layer

Ques: What are the inputs to the attention layer?

Ans: The attention layer takes 2 inputs. The 1st is the encoder output sequence of shape (?, 154, 100), where “?” is the batch size. The 2nd input is the decoder output sequence of shape (?, 164, 100).

Ques: What is the output of the Attention layer ?

Ans: There are 2 outputs. 1st is the attention Context vector and the second is the attention weights.

Snippet taken from the “define_model” function

Note that for our Case study, only the 1st output is useful here.

This context vector (?, 164, 100) is concatenated with the decoder output sequence (?, 164, 100). The concatenated result of shape (?, 164, 200) is passed to the final Softmax layer, which generates the prediction sequence all at once. The prediction is of shape (?, 164, 13521), where 13521 is the VOCAB_SIZE.

In the training phase, the entire sentence is predicted all at once. And then weights of all the layers are updated, depending on the amount of error.

Ques: Where does Teacher forcing come into play ?

Ans: Teacher forcing happens in the training decoder architecture. Remember that we passed the decoder input of shape (?, 164). Think about why we pass the decoder input at all (these are the actual answer tokens). This is exactly what teacher forcing is: we give this as an input to tell the decoder LSTM what the actual answer looks like.
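Here is a flat sketch of the training part of define_model() with the AttentionLayer (the inference models are sketched in section 10.2). It reuses the sizes from earlier; the exact wiring inside the author's gist may differ slightly.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate
from tensorflow.keras.models import Model
from attention import AttentionLayer  # Thushan Ganegedara's attention.py

n_input, n_output, n_units = max_question_len, max_answer_len, 100

# Training encoder: now return_sequences=True, because attention needs all timesteps.
encoder_inputs = Input(shape=(n_input,))
enc_emb = Embedding(VOCAB_SIZE, 64)(encoder_inputs)
encoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)

# Training decoder, initialised with the encoder's final states (teacher forcing input).
decoder_inputs = Input(shape=(n_output,))
dec_emb_layer = Embedding(VOCAB_SIZE, 64)
decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb_layer(decoder_inputs),
                                     initial_state=[state_h, state_c])

# Attention over the encoder output sequence, then concatenate with the decoder outputs.
attn_layer = AttentionLayer()
attn_out, attn_weights = attn_layer([encoder_outputs, decoder_outputs])
concat = Concatenate(axis=-1)([decoder_outputs, attn_out])

# Final Softmax layer producing (?, 164, VOCAB_SIZE).
decoder_dense = Dense(VOCAB_SIZE, activation="softmax")
train_outputs = decoder_dense(concat)
train_model = Model([encoder_inputs, decoder_inputs], train_outputs)
```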

10.2 Let’s explain the Inference Encoder model.

This will be the Encoder-decoder architecture during the Testing phase.
During the inference/test phase, the Inference Encoder LSTM should unroll over each time step, but the decoder LSTM should not unroll. The decoder input is of shape (1, 1), which looks like [ [3] ], where 3 is the token for “prarambh”. Since there is only one input token, unrolling is not required.

The Inference Encoder LSTM definition is the same as the Encoder LSTM that was defined during the training phase. The only difference here is that we define a separate model for the encoder, called the Inference Encoder model.

Ques: How does the Inference Encoder model work ?

Ans: The model takes the Encoder Inputs of shape(1,154) and gives 3 outputs. Again, all 3 outputs are super useful.

Snippet taken from the “define_model” function

The shapes of these outputs are: output sequence (1, 154, 100), state_h (1, 100), state_c (1, 100).

Ques: How does the Inference Decoder model work ?

Ans: The Inference Decoder LSTM needs Initial State_h and State_c of shape (?, 100) and (?, 100) respectively.

Snippet taken from the “define_model” function

There are 3 outputs from the decoder: the 1st is the decoder output sequence of shape (1, 1, 100), the 2nd is the final State_h of shape (1, 100), and the 3rd is the final State_c of shape (1, 100).

The attention layer is clubbed together with the Decoder and this will be evident from the upcoming steps.

Now, the Encoder Output Sequence and Decoder output Sequence are passed as input parameters to the Attention layer. The output of the attention layer is the Context vector and the attention weights.

Again, we make use of the decoder output sequence: the attention context vector is concatenated with the decoder output sequence.

Snippet taken from the “define_model” function

The result is then passed to the softmax layer that we had defined during training.

Now we need to create a Model of the above defined Inference Decoder architecture(Attention layer has to be included in this model itself.)

Snippet taken from the “define_model” function

To write this syntax, we need to understand what will be the input and output to the Model.

Remember that the Decoder Inference model has to be called again and again in a for loop so that in each iteration, a word is predicted at the final Softmax layer.

In each iteration of the for loop, the inputs to the Model can be :

  1. Decoder initial input (1,1)
  2. Decoder State_h
  3. Decoder State_c
  4. The 1st input parameter to the Attention layer (i.e. the Encoder Output Sequence). We do not need to worry about the 2nd input parameter to the Attention layer (i.e. the Decoder Output Sequence), as it is generated during the process. But the Encoder Output Sequence must be fetched from the Inference Encoder Model's output.

Ques: What must be the Output from the Inference Decoder Model in each iteration of the for loop ?

The answer to this is: think about what values should be passed as input to the Inference Decoder model when the next iteration of the for loop runs.

This means that:

  1. The decoder initial input has to be derived from the output of the final Softmax layer. This means that the final Softmax output has to be the 1st output from the Model.
  2. The Decoder State_h has to be updated for the next iteration. So whatever is the State_h generated from the current iteration, becomes the input State_h for the next iteration. So the State_h has to be the 2nd Output.
  3. Similarly, the State_c has to be the 3rd output
  4. The 4th input to the model is the 1st input parameter of the attention layer (i.e. the Encoder Output Sequence). We do not need to worry about this, as it remains the same for all iterations of the for loop. Remember that we get this vector when we call the Inference Encoder Model; the 1st output of the Inference Encoder Model is the Encoder Output Sequence. A code sketch of this wiring follows below.
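Continuing the sketch from section 10.1 (the same layer objects are reused), this is roughly how the Inference Encoder and Inference Decoder with attention can be wired, following the input/output lists above:

```python
# Inference encoder: returns the full output sequence plus the final states.
inference_encoder = Model(encoder_inputs, [encoder_outputs, state_h, state_c])

# Inference decoder: one token at a time, with the attention layer inside the model.
dec_state_input_h = Input(shape=(n_units,))
dec_state_input_c = Input(shape=(n_units,))
enc_out_input = Input(shape=(n_input, n_units))  # encoder output sequence
inf_dec_inputs = Input(shape=(1,))

inf_dec_out, inf_h, inf_c = decoder_lstm(
    dec_emb_layer(inf_dec_inputs),
    initial_state=[dec_state_input_h, dec_state_input_c])
inf_attn_out, _ = attn_layer([enc_out_input, inf_dec_out])
inf_concat = Concatenate(axis=-1)([inf_dec_out, inf_attn_out])
inf_output = decoder_dense(inf_concat)  # (1, 1, VOCAB_SIZE)

inference_decoder = Model(
    [inf_dec_inputs, dec_state_input_h, dec_state_input_c, enc_out_input],
    [inf_output, inf_h, inf_c])
```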

Here’s what the Inference Encoder and Decoder Model looks like :

Self made drawing

For every user Question, the Encoder Model is called only Once. But the Decoder Model is called multiple times in a for loop

Ques: What happens after Training the Train model ?

Ans: After the train model has finished training, we get the best possible weights for the entire train architecture. The weights from the train model are then transferred to the inference encoder and the inference decoder, respectively.

10.3 Instantiate the 3 models and Compile() the train model

Here’s the summary of the Train model

10.4 Let’s fit the train model and save all the 3 models

Now let’s start the training for the train model for epochs=25, batch_size=64 , validation_split=0.20 and also provide Early stopping with monitor=’val_loss’ and patience=2.

10.5 Load all the 3 Saved models

Let’s look at the plots of all the 3 models.

Graph for the Train model
Graph for the Inference Encoder model
Graph for the Inference Decoder model

10.6 Getting the models ready for deployment — Build TFLite models

Now let’s build the TFLite models for the Inference Encoder and Inference Decoder.
This part of the blog is similar to what we saw earlier during the creation of TFLite models for the Seq2Seq architecture without attention.

The Inference Encoder TFLite model is of size 0.969 MB (before the conversion, it was 9 MB) and the Inference Decoder TFLite model is of size 3.636 MB (before the conversion, it was 14 MB)
As always, keep track of the input and output shapes and types for both the inference encoder and the inference decoder.

10.6.1 Let’s use the TFLite model and make predictions

Output of the above cell:

As we can see, the answers are much better than those of the model built without Attention. With Attention, the model is able to form better English sentences.

11. Evaluate the Model’s predicted answers using BLEU Score

BLEU Score is a way to compare the quality of machine generated text (also called Candidate text) with the Human answered text (also called Reference text). There can be 1 or more Reference text corresponding to a Candidate text. BLEU score was originally developed for translation but can be used for a variety of NLP tasks.
More detailed information about BLEU Score can be found at this link: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/

Let’s see how I used BLEU Score for this Case Study.

Let's first modify the predict_answer() function and incorporate the BLEU Score in it. I will have to make use of corpus_bleu because the candidate text (the machine-generated answer) will be compared with 3 different reference texts (human answers).
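For reference, here is a minimal sketch of the scoring call inside the modified predict_answer() (variable names are illustrative):

```python
from nltk.translate.bleu_score import corpus_bleu

# `human_answers` is the list of 3 reference answers and `model_answer` is the
# candidate produced by the model, both plain strings.
references = [[ref.split() for ref in human_answers]]  # one list of references per candidate
candidates = [model_answer.split()]
score = corpus_bleu(references, candidates)
print("BLEU Score:", score)
```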

Let’s write a for loop in which I will call the predict_answer() function 2 times. In each iteration of the for loop, the following operations take place :

  • The system prompts the user to enter the Question. This will be passed as parameter to the function as the raw data input (X)
  • The system prompts to enter the 1st , 2nd and 3rd Human answer to the above question. These 3 human answers will be combined as a List and passed to the function as the Target value (Y)
  • After taking the above two mentioned points as parameters, the function then calculates the model answer and the BLEU Score.
  • Additionally, I have also printed the reference and candidate used for calculating the above BLEU Score
  • Theoretically, the closer the score is to 1, the better the sentence is.

Output of the above code :

As we can see, the BLEU Score is 0.27 and 0.10 for the 1st and 2nd answer respectively.

12. Conclusion

In this case study, we have to remember that only the Seq2Seq model with Attention has been able to form English sentences. Also, given the complexity of the email corpus, the model could only learn to generate reasonable sentences, and the generated answers will definitely not be as good as human-generated answers. So it is highly likely that the BLEU scores will be less than 0.5 in most cases; the closer the score is to 1, the better.

We also saw that the answers generated by the Seq2Seq model with Attention were much better than the model without Attention. Therefore adding Attention layer does improve the Seq2Seq model’s ability to generate better answers.

We also saw how we can create TFLite models. A TFLite model is smaller in size and can be executed on edge devices like Android phones or embedded devices with limited memory and computing power. More information about TFLite can be found at https://www.tensorflow.org/lite/guide/get_started

13. Future Works

A. Improve the model architecture by allowing the model to save the model states. This will prevent the model from recalculating the states in case the model needs to predict a new answer based on a user question that’s an extension of the previous question. E.g- How are you ? || How are you doing today ?

B. Use Tensorflow Extended for model deployment. Use TFX to create and manage a production pipeline. Link- https://www.tensorflow.org/tfx

14. My GitHub and LinkedIn:

github — https://github.com/Pattrickps/Automated-Email-Generation-Using-Deep-Learning

LinkedIn — https://www.linkedin.com/in/pratiksen

15. References

Applied AI Course — https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course

Reference for learning Encoder-Decoder architecture — https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/

Reference for Attention model: https://github.com/thushv89/attention_keras

Reference for how the attention model can be used: https://github.com/thushv89/attention_keras/blob/master/src/examples/nmt/model.py

Reference for how to create TFLite models: https://github.com/bhattbhavesh91/tflite-tutorials/blob/master/tflite-part-1.ipynb

Bhavesh’s Youtube video on TFLite: https://www.youtube.com/watch?v=bKLL0tAj3GE&ab_channel=BhaveshBhatt

Reference for BLEU Score: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
