In the field of technology, experimentation and innovation are key aspects of our job. This is particularly true when it comes to developing AI-based solutions. At BBVA AI Factory we are committed to allocating specific time slots to experimenting with state-of-the-art technology and also to working on ideas and prototypes that can later be incorporated into BBVA’s portfolio of AI-based solutions. These are what we call innovation sprints.

In one such sprint, we asked ourselves how we could help financial advisors in their conversations with clients. BBVA managers, who help and advise clients in managing their finances, sometimes search for responses to the most common FAQs within pre-defined answer repositories. This confirmed to us the potential of developing an AI system that could suggest possible answers to client questions and let managers respond with a single click. The idea behind this system would be to save them time typing answers that do not require their expert knowledge, thus allowing them to focus on those that provide the most value to the client.

So we got down to work. Once the problem was defined, different ways of addressing it soon began to emerge. On the one hand, we thought of a system to search for the most similar question in the historical data to the one posed by the client, and, subsequently, to evaluate whether the answer given at the time is valid for the current situation. On the other hand, we also tried clustering the questions and suggesting the pre-established canonical answer for the cluster to which the question belongs. However, these solutions required too much inference time or were overly manual.
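To make the first of these ideas concrete, here is a minimal sketch of retrieving the most similar historical question. The article does not say which similarity measure was tried, so the bag-of-words cosine similarity used here, the function names and the sample history are all assumptions for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(question, history):
    """Return the (question, answer) pair whose question is closest to `question`."""
    query = Counter(question.lower().split())
    return max(
        history,
        key=lambda pair: cosine(query, Counter(pair[0].lower().split())),
    )


# Hypothetical historical data: (client question, manager answer)
history = [
    ("how do i block my card", "You can block it from the app."),
    ("what is the mortgage interest rate", "It depends on the product."),
]
best_question, best_answer = most_similar("i need to block my card", history)
print(best_answer)
```

Note that every incoming question is compared against the whole history, which hints at why inference time becomes an issue with a large archive, and a manager would still need to check that the retrieved answer remains valid.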

Eventually, the solution that proved most efficient, in terms of both inference time and the ability to suggest automatic answers to questions on different topics (without prior clustering), was the sequence-to-sequence model, also known as seq2seq.

What is seq2seq?

Seq2seq models take a sequence of items from one domain and generate another sequence of items from a different domain. One of their paradigmatic uses is the automatic translation of texts; a trained seq2seq model transforms a sequence of words written in one language into a sequence of words that maintains the same meaning in another language. The basic architecture of seq2seq consists of two recurrent networks (an encoder and a decoder), here of the Long Short-Term Memory (LSTM) type.

Figure 1. LSTM recurrent networks (encoder and decoder)

LSTM networks are a type of neural network in which each cell (hidden unit) processes, in order, one element of the sequence; in this case, the representation of a word. The peculiarity of these networks is that each cell keeps the relevant information from the previous cell while discarding the information that is not relevant for the following cells. In this way, the network learns not only from isolated data, but also from the information inherent to the sequence, which it accumulates cell by cell. This feature is especially significant in text, since word order is important for constructing syntactically and semantically correct sentences. Technically, it is also a great advantage, since it considerably reduces the computational cost. To learn more about LSTMs we recommend reading this post by Christopher Olah.
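The keep-or-discard behaviour described above can be sketched with the standard LSTM gate equations. This is a toy, scalar version (input size 1, hidden size 1) written in plain Python; the parameter names and values are made up for illustration and have nothing to do with the trained model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    `p` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        w, u, b = p[gate]
        return w * x + u * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate information from this step
    c = f * c_prev + i * g            # updated cell state (the "memory")
    h = o * math.tanh(c)              # hidden state passed to the next cell
    return h, c


# Toy parameters: forget gate wide open, input gate almost closed,
# so the previous memory is carried over nearly unchanged.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, p=params)
print(h, c)
```

With the forget bias set to a large negative value instead, the same cell would discard its previous state, which is the "forgetting" half of the mechanism.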

To illustrate the concept, let’s think of language models whose purpose is to predict what the most likely next word will be. For example: “Mary was late to ____ own birthday party”. In this case, an important thing to “remember” (information that passes from one cell to the next) is the gender of the last-mentioned subject, so that the network can determine that the word ‘her’ is more likely than ‘his’ to be the correct word.

As mentioned above, the seq2seq architecture is composed of two LSTM networks: an encoder and a decoder. Returning to the case of machine translation, the mission of the encoder network is to learn the structure of the given English sentences (input sequences), while the decoder does the same with the Spanish sentences (output sequences). The decoder also learns the relationship between both sequences. The result of the last cell of the encoder is a vector, called the thought vector. It stores the information of the previous cells and is therefore a mathematical representation of the English sentence, i.e., the input sequence. Finally, during training the decoder uses this vector together with the response, also encoded as a mathematical representation. In this way, the network learns the patterns that allow it to associate an input sequence with an output sequence.

Figure 2. Illustrative seq2seq network architecture drawing (training and inference)
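As a rough illustration of what the encoder and decoder receive during training, the sketch below turns one (input, output) pair into three padded integer sequences: the encoder input, the decoder input (shifted right with a start token) and the decoder target (ending with an end token). This shifted-target setup is the common way to train seq2seq models; the article does not detail the exact scheme used, so the token names, helper functions and example sentences are assumptions.

```python
START, END, PAD = "<start>", "<end>", "<pad>"


def build_vocab(texts):
    """Map every whitespace-separated token to an integer id."""
    vocab = {PAD: 0, START: 1, END: 2}
    for text in texts:
        for token in text.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_pair(source, target, vocab, max_len):
    """Return (encoder_input, decoder_input, decoder_target), padded to max_len."""
    src = [vocab[t] for t in source.split()][:max_len]
    tgt = [vocab[t] for t in target.split()][:max_len - 1]
    decoder_input = [vocab[START]] + tgt    # what the decoder sees at each step
    decoder_target = tgt + [vocab[END]]     # what it must predict at each step

    def pad(seq):
        return seq + [vocab[PAD]] * (max_len - len(seq))

    return pad(src), pad(decoder_input), pad(decoder_target)


# Hypothetical example pair
source, target = "where is my card", "it is on its way"
vocab = build_vocab([source, target])
enc_in, dec_in, dec_out = encode_pair(source, target, vocab, max_len=6)
print(enc_in, dec_in, dec_out)
```

At inference time there is no target sequence, so the decoder would start from the start token and feed each predicted word back in until it emits the end token.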

How have we applied seq2seq to build our answer suggestion system?

Although translation is one of the most obvious applications, the advantage of these models is that they are very versatile. Transferring their operation to our case, we could “feed” the encoder with the clients’ questions (sequence 1), and the decoder with the answers given by the managers (sequence 2).

This is exactly what we did. We selected from our history more than a million short conversations initiated by the client with their manager (no more than four messages). With this dataset, we applied standard preprocessing (lowercasing, removing punctuation marks) and discarded greetings and goodbyes using regular expressions. This step is useful because one of the hyperparameters of LSTM networks is the length of the sequence to be learned. By removing this non-relevant information, we can take better advantage of the learning capacity of the model for more diverse or variable sequences, as well as reduce the computational cost of learning longer sequences. Before the preprocessing phase, the data arrives already anonymized by our own NER (Named Entity Recognition) library, so that sensitive data such as name, phone, email or IBAN (and certain non-sensitive entities such as amounts, dates or time expressions) are masked and treated homogeneously. Consequently, the model gives them the same importance.
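A minimal sketch of this kind of preprocessing is shown below. The greeting and farewell patterns are hypothetical English examples, not the expressions used in the project, and the anonymization step is left out because it relies on an internal library.

```python
import re
import string

# Hypothetical patterns; a real system would need a much richer list.
GREETING_RE = re.compile(
    r"^(hello|hi|good (morning|afternoon|evening))\b[\s,!.]*"
)
FAREWELL_RE = re.compile(
    r"[\s,.!]*\b(thanks|thank you|regards|best regards|bye|goodbye)\b[\s,.!]*$"
)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(message):
    """Lowercase, drop a leading greeting and trailing farewell, strip punctuation."""
    text = message.strip().lower()
    text = GREETING_RE.sub("", text)
    text = FAREWELL_RE.sub("", text)
    text = text.translate(PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("Hello, how can I activate my new card? Thanks!"))
# prints "how can i activate my new card"
```

One design point to keep in mind: if masked entities are represented with placeholder tokens that contain punctuation, the punctuation step would have to leave those tokens untouched.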

Once the network is trained, we use some test questions (not included in the training) and manually evaluate the suggestion of our system. As we can see in the following example, the model is able to suggest a suitable answer to client questions.

Figure 3. Model suggests a suitable answer to client question

However, when the question is related to a client’s particular situation, the suggested answer is not entirely satisfactory. This is because the model does not take into account the specific context of each client.

For example, in the case shown below, the system proposes a response that could be correct but does not consider whether the card has in fact already been sent to the client. This contextual information is currently not taken into account by the model. One of the next steps in this research would be to get the model to learn how to generate the response message according to the current situation of the client in a specific case.

Figure 4. Model suggests answer that is not entirely satisfactory (lack of context)

Finally, there are also some cases in which the answer the model suggests to the manager is erroneous, as it is not related to the question asked by the client.

Figure 5. Sometimes the model suggests an erroneous answer

How do we validate our seq2seq model?

Evaluating the result of an automatically generated text is a complex task. On the one hand, it is not feasible to manually evaluate a relatively large test set of questions and answers. On the other hand, if we wanted to optimize the model, we would need some method to automatically measure the quality and appropriateness of the answers it suggests.

However, we can carry out a manual evaluation that allows us to ascertain, with a small set of messages and according to our criteria as clients, which answers are appropriate and which are not. We use two criteria for manual evaluation: answerability and correctness. The first of these qualitative metrics determines the degree to which a question can be answered, in order to check what the business impact would be; the second evaluates whether the answer suggested by the model is correct, which gives us an idea of the quality of our seq2seq model. With this evaluation we found that not all client questions can be solved with this approach at the moment. The main reason lies in the need to take into account contextual information, whether previously provided by the client, previously mentioned by a specific manager, or drawn from the client’s personal and financial information. Subsequently, we calculate the correlation between our manual evaluation and the scores given by sets of automatic metrics such as ROUGE, BLEU, METEOR or accuracy.
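To give an idea of what such automatic metrics measure, here is a simplified sketch of two of them: ROUGE-L (an F-measure based on the longest common subsequence of words) and clipped unigram precision, which is the core of BLEU-1 without its brevity penalty. These are plain-Python approximations written for illustration, not the implementations used in the experiment, and the example sentences are made up.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between a suggested answer and the reference answer."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref) if cand and ref else 0
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def unigram_precision(candidate, reference):
    """Clipped unigram precision (BLEU-1 without the brevity penalty)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[token]) for token, count in cand.items())
    return matched / total


suggested = "you can activate the card in the app"
reference = "you can activate your card in the app"
print(rouge_l_f1(suggested, reference), unigram_precision(suggested, reference))
```

Both scores only reward word overlap with the manager's real answer, which is why checking them against a manual judgement is a sensible step before relying on them.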

The following figure represents the correlation between the manual evaluation (false: wrong answer, true: correct answer) and the values of the automatic metrics. In a first analysis we observe that bleu_1, rouge_1 and especially rouge-L are the metrics that best align with the human criteria. This is important for the optimization of the seq2seq architecture and for the automatic evaluation of the system. Although this study would require further research, we consider it sufficient for this prototyping phase.

Figure 6. Boxplot grouped by correctness.
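The comparison between the manual judgement and the automatic metrics can be sketched as follows. The article does not state which correlation coefficient was used; this example assumes Pearson correlation against the binary correctness label (equivalent to the point-biserial correlation). The scores are toy values invented for the example, not results from the experiment.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


# Manual evaluation: 1 = correct answer, 0 = wrong answer (toy labels)
correct = [1, 1, 0, 1, 0, 0]

# Toy metric scores for the same six suggested answers
metric_scores = {
    "metric_a": [0.9, 0.7, 0.2, 0.8, 0.4, 0.1],
    "metric_b": [0.5, 0.4, 0.5, 0.6, 0.5, 0.4],
}

# Rank metrics by how well they agree with the human judgement
ranking = sorted(
    ((name, pearson(correct, scores)) for name, scores in metric_scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, r in ranking:
    print(f"{name}: {r:.3f}")
```

A metric that ranks high in such a comparison is a reasonable candidate to use as an automatic proxy when tuning the model.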

With these first tests performed with seq2seq (we have omitted some failed experiments and other decisions made along the way), we have been able to demonstrate the enormous potential of this technique in the context of client-manager communications at BBVA. A seq2seq-based system is able to suggest answers to simple client questions. However, for many other tasks, such as answering questions related to the specific context of a client, it is much better to opt for direct contact with our BBVA managers.

Natural Language Processing (NLP) has been one of the key fields of Artificial Intelligence since its inception. After all, language is one of the things that defines human intelligence. In recent years, NLP has undergone a new revolution similar to the one that took place 20 years ago with the introduction of statistical and Machine Learning techniques. This revolution is led by new models based on deep neural networks that facilitate the encoding of linguistic information and its re-use in various applications. With the emergence in 2018 of self-supervised language models such as BERT (Google), trained on massive amounts of text, an era is beginning in which Transfer Learning is becoming a reality for NLP, just as it has been for the field of Computer Vision since 2013.

The concept of Transfer Learning is based on the idea of re-using knowledge acquired by performing a task to tackle new tasks that are similar. In reality, this is a practice that we humans constantly engage in during our day-to-day lives. Although we face new challenges, our experience allows us to approach problems from a more advanced stage.

Most Machine Learning algorithms, particularly when supervised, can only solve the task for which they have been trained by examples. In the context of the culinary world, for example, the algorithm would be like a super-specialised chef who is trained to make a single recipe. Asking this algorithm for a different recipe can have unintended consequences, such as making incorrect predictions or incorporating biases.

The aim of using Transfer Learning is for our chef – the best at cooking ravioli carbonara – to be able to apply what s/he has learnt in order to make a decent spaghetti bolognese. Even if the sauce is different, the chef can re-use the previously acquired knowledge when cooking pasta (figure 1).

Figure 1. Applying Transfer Learning to Machine Learning models

This same concept of knowledge re-use, applied to the development of Natural Language Processing (NLP) models, is what we have explored in collaboration with Vicomtech, a Basque research centre specialised in human-machine interaction techniques based on Artificial Intelligence. Specifically, the aim of this joint work has been to learn about the applications of Transfer Learning and to assess the results offered by these techniques, as we see that they can be applicable to natural language interactions between BBVA clients and managers. After all, the purpose of this work is none other than to improve the way in which we interact with our clients.

One of the tasks we have tackled has been the processing of textual information in different languages. For this purpose, we have used public domain datasets. One of them is a dataset of restaurant reviews, generated for the SemEval 2016 academic competition, which includes reviews in English, Spanish, French, Russian, Turkish, Arabic and Chinese. The aim has been to identify the different aspects or characteristics mentioned (food, ambience or customer service, among others) in English, Spanish and French (Table 1).

Table 1. Distribution of texts, sentences and annotations in the SemEval 2016 datasets used

With this exercise we wanted to validate whether Transfer Learning techniques based on the use of BERT models were appropriate for adapting a multi-class classifier to detect aspects or characteristics in different languages. In contrast to this approach, there are alternatives based on translating the text to adapt it to a single language. We can do this by translating the information we will use to train the model, on the one hand, or by directly translating the conversations of the clients we want to classify. However, there are also problems and inefficiencies with these alternatives.

Taking the culinary example we mentioned at the beginning of this article, in our case we could consider the texts as representing the ingredients of the recipe. These datasets of information differ from one language to another (just as the ingredients vary according to the recipe). On the other hand, the ability acquired by the model to classify texts is knowledge that we can re-use in several languages; in the same way that we re-use the knowledge about how to cook pasta with different recipes.

In this experiment we started from a publicly available pre-trained multilingual BERT model and fine-tuned it on the restaurant dataset. The following figure shows the procedure (Figure 2).

Figure 2. We used a pre-trained multilingual BERT model and performed fine-tuning on the restaurant dataset.

The results obtained by adapting this model, trained with generic data, to the dataset of reviews in each language were similar to those reported in 2016 for the task in English, French and Spanish by more specialised models. This is consistent with the results of a variety of research on the ability of this type of model to achieve very good results.

Once a classifier for English text has been tuned, the Transfer Learning process is carried out by performing a second stage of fine-tuning with the second language dataset (Figure 3).

Figure 3. Performing a second stage of fine tuning with the second language dataset.

To measure the effectiveness of the process, we compare the behaviour of this classifier with the behaviour resulting from performing a single fine tuning stage starting from the multilingual base model.

The results (see Table 2) show that by starting with the model in English and using less data in the target language (in this case Spanish or French) we can achieve similar results to those obtained by adapting a model for each language. For example, in the case of Spanish, we would reach a similar performance if we start from the English model and add only 40% of the data in Spanish. In the case of French, on the other hand, the results begin to be equal when using the English model and 80% of the data in French. Finally, if we use all the available data, the results improve moderately when compared to the results we achieved when training only with data for each language. However, there is a marked improvement when using the English model for the other languages. It is important to bear in mind that these results will depend on the specific task to which they are applied.

Table 2. Classification metrics for test sets in the target languages (ES or FR) of a classifier trained with different combinations of source (EN) and target (ES or FR) data.
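The data side of this experiment can be sketched as follows: for each fraction of target-language data, a reproducible subsample is drawn and paired with the full source-language set for the two fine-tuning stages. The function names, the fixed seed and the placeholder examples are assumptions, and the fine-tuning itself is deliberately left out because the article does not describe its implementation.

```python
import random


def sample_fraction(examples, fraction, seed=0):
    """Return a reproducible random subset containing `fraction` of `examples`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    k = round(len(examples) * fraction)
    return random.Random(seed).sample(examples, k)


def two_stage_plan(source_data, target_data, fractions, seed=0):
    """Yield (fraction, stage-1 data, stage-2 data) for each experiment run."""
    for fraction in fractions:
        yield fraction, source_data, sample_fraction(target_data, fraction, seed)


# Placeholder labelled examples: (text, aspect label)
english = [(f"en review {i}", "food") for i in range(20)]
spanish = [(f"es review {i}", "food") for i in range(10)]

for fraction, stage1, stage2 in two_stage_plan(english, spanish, [0.2, 0.4, 1.0]):
    # Stage 1 would fine-tune the multilingual model on `stage1` (English),
    # stage 2 would continue fine-tuning on `stage2` (target-language subset).
    print(fraction, len(stage1), len(stage2))
```

Fixing the seed keeps the subsets comparable across runs, so differences in the classification metrics can be attributed to the amount of target-language data rather than to which examples happened to be sampled.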

These results are very encouraging from the point of view of an application to real problems, since they would indicate that the models are capable of using the knowledge acquired in one language to extrapolate it to another language, thus obtaining the same quality with less labelled data. In fact, one of the main obstacles when developing any NLP functionality in an industrial environment is having a large amount of quality data, and having to develop this for each given language. Therefore, requiring less labelled data is always a great advantage when developing functionalities.

The knowledge gained (or “transferred”) from this collaboration with Vicomtech will allow us to build more agile functionalities to help managers in their relationship with the client, hence reducing the development cycle of a use case in a language or channel other than the one in which it was originally implemented.