Sometimes, when modelling a process that involves several variables, we find that some of them dominate the others and exert far greater influence on the result. Take, for example, processes with declarative data, where the variable to be predicted is reported by the users themselves. A typical case for a continuous variable might be a home value estimator: the value can be any number, but the form asks the user to declare it as a range, for instance through a drop-down field with different options. In this case, the declared information can be expected to describe the variable to be predicted almost entirely. There are many reasons to use an estimator on data that is already known: to make a discrete variable continuous, to smooth a variable, to detect data entry errors, to refine the declared data to unify criteria, or simply to comply with a regulation.

The importance of a variable can be calculated in different ways depending on the type of model being fitted. For example, in a Linear or Logistic Regression it is usually related to the weight (coefficient) associated with each descriptor variable, while in the case of trees it is usually related to the reduction of the metric used to make the splits. There are also model-agnostic techniques; one of the best known is Permutation Feature Importance [1]. It is typically applied to models that are already trained, hard to interpret or highly non-linear, and it works by randomly breaking the relationship between a descriptor variable and the target variable and observing the resulting drop in model performance. In any case, regardless of how the importance of a variable is measured, the result represents in some way the dependence between the model output and the variable in question.
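The permutation idea can be sketched in a few lines. The following is a minimal, dependency-free illustration rather than a reference implementation: the toy model, the synthetic data and the function names are assumptions made for the example. In practice, a library routine such as scikit-learn's `permutation_importance` would normally be used.

```python
import random

def mse(model, X, y):
    """Mean squared error of `model` (a callable on one row) over a dataset."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when each column is shuffled independently."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between column j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical "already trained" model: it leans heavily on feature 0,
# slightly on feature 1 and ignores feature 2.
def model(row):
    return 3.0 * row[0] + 0.5 * row[1]

data_rng = random.Random(42)
X = [[data_rng.uniform(0, 1) for _ in range(3)] for _ in range(200)]
y = [3.0 * r[0] + 0.5 * r[1] for r in X]

scores = permutation_importance(model, X, y)
for j, score in enumerate(scores):
    print(f"feature {j}: MSE increase = {score:.3f}")
```

The dominant feature produces by far the largest error increase when shuffled, while the ignored feature produces none.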

In an industrialised process where the output of a model is used to make decisions, it is generally bad practice to rely heavily on a small number of variables, as these might go unreported (or be misreported) because of a technical failure. In such cases, the reliability of the prediction hinges on how strongly the model depends on those variables. The home value estimator, for example, is highly dependent on the data declared by the user. If the form’s format or data type is updated, that variable may be misreported, which would cause the estimator to generate wholly erroneous outputs.

Another important point is that a model that is highly dependent on one variable can easily be gamed, which poses a potential problem in the acceptance process. Going back to our previous example, let us now assume that the estimated home value provides access to credit, and that the estimator is highly dependent on the average income declared by the user. In other words, the higher the average income, the higher the value of the home. If this relationship is detected, the estimator can be tricked simply by inflating the declared average income. This will, in turn, lead to credit being granted in error.

For the above reasons, diluting the importance of dominant variables in a model makes it more robust against failures and attacks. The price, generally, is some loss of performance.

In the case of Linear Regression, it is sometimes possible to impose hard constraints on the coefficients in the optimisation process [2], as well as penalties on the magnitude of the coefficients. However, in the case of decision trees, there is no similar procedure available that applies to the model building algorithm.
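As an illustration of what a hard constraint on a coefficient looks like, the sketch below fits a least-squares model by projected gradient descent, capping each weight at an upper bound. It is a toy under stated assumptions: the function name, the learning rate, the synthetic data and the chosen bound are all hypothetical, and a dedicated bounded least-squares solver (for example `scipy.optimize.lsq_linear`) would be the usual choice in practice.

```python
import random

def fit_bounded_linear(X, y, upper_bounds, lr=0.1, n_iter=2000):
    """Least squares (no intercept) by projected gradient descent.

    After every gradient step each weight is clipped to its upper bound,
    which keeps the solution inside the feasible region.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        residuals = [sum(wj * xj for wj, xj in zip(w, row)) - t
                     for row, t in zip(X, y)]
        grad = [2.0 / n * sum(r * row[j] for r, row in zip(residuals, X))
                for j in range(d)]
        w = [min(wj - lr * g, ub) for wj, g, ub in zip(w, grad, upper_bounds)]
    return w

# Synthetic data in which the first variable dominates (true weights 4 and 1).
rng = random.Random(0)
X = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(100)]
y = [4.0 * a + 1.0 * b for a, b in X]

no_bound = float("inf")
w_free = fit_bounded_linear(X, y, [no_bound, no_bound])
w_capped = fit_bounded_linear(X, y, [2.0, no_bound])
print("unconstrained:", [round(w, 3) for w in w_free])
print("first weight capped at 2:", [round(w, 3) for w in w_capped])
```

With the cap in place, the second weight grows to compensate for part of the signal the first variable is no longer allowed to carry, which is the trade-off between dependence and error discussed above.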

A common alternative is to introduce random noise into the dominant variable, trying to strike a balance between the loss of predictive power and the variable’s contribution to the model. It would also be possible to randomly limit the subset of candidate variables at each split of the tree. This, however, has the disadvantage of leaving the order of selection, and the resulting reduction of the error metric, to chance. Another alternative would be to steer this selection with an error metric that avoids choosing over-contributing variables during the initial steps of the decision, thereby limiting their importance. However, this is not a simple procedure, as it implies knowing from the outset the extent to which each variable reduces the error.
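The noise-injection idea can be illustrated with a small experiment. This is only a sketch on synthetic data: the variable names, the noise levels and the use of Pearson correlation as a stand-in for "how much the variable tells us about the target" are assumptions made for the example.

```python
import random

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def add_noise(values, sigma, seed=0):
    """Return a copy of `values` with zero-mean Gaussian noise added."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]

# Synthetic dominant variable that almost fully determines the target.
rng = random.Random(1)
dominant = [rng.uniform(0, 10) for _ in range(500)]
target = [2.0 * v + rng.gauss(0.0, 1.0) for v in dominant]

for sigma in (0.0, 1.0, 3.0, 6.0):
    noisy = add_noise(dominant, sigma)
    print(f"sigma={sigma}: correlation with target = {pearson(noisy, target):.2f}")
```

The larger the noise, the less the variable can contribute, but the information it carried is simply lost rather than recovered from other variables, which is why the balance is hard to strike.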

This last solution nevertheless sounds reasonable, because the most important variables would only come into play during the later stages of the decision process. This addresses the problems outlined at the beginning and means that the resulting model would not be so dependent on this type of variable.

In our use case we only had one highly significant variable (~90% in error reduction). The rest of the variables were more or less balanced in terms of importance. We therefore decided to approximate the above solution with a simple procedure we refer to as ExtendedTrees. It consists of two steps: first, fit a regression tree with all variables except the one with the highest importance (the proxy variable); then, extend the leaves of that tree using only the variable left out in the first step.

Figure 1: Schematic diagram of the ExtendedTrees method. The orange nodes are the leaves of the first tree, trained with all variables except the proxy variable. Each of these nodes is, in turn, extended with new shallow trees using only the proxy variable. The output estimate is given on the leaves of the spanning trees (turquoise nodes).

The hyper-parameter set consisting of the depth of the first tree and the depths of the n extents (where n is the number of leaves of the first tree) allows us to control the contribution of the variable whose importance we wish to limit.
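The two steps can be sketched as follows. This is a dependency-free toy written for this explanation, not the implementation used in our project: the tiny greedy regression tree, the function names and the synthetic data are all assumptions, and a single extension depth is shared by every leaf for brevity.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (the split criterion)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def split(X, y, f, t):
    """Partition the data according to row[f] < t."""
    left = [i for i, r in enumerate(X) if r[f] < t]
    right = [i for i, r in enumerate(X) if r[f] >= t]
    return ([X[i] for i in left], [y[i] for i in left],
            [X[i] for i in right], [y[i] for i in right])

def fit_tree(X, y, depth, features):
    """Greedy regression tree (nested dicts) that may split only on `features`."""
    node = {"value": sum(y) / len(y)}
    if depth == 0 or len(set(y)) < 2:
        return node
    best = None
    for f in features:
        # Thresholds are taken at observed values for brevity.
        for t in sorted({r[f] for r in X})[1:]:
            _, y_left, _, y_right = split(X, y, f, t)
            score = sse(y_left) + sse(y_right)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return node
    _, f, t = best
    X_left, y_left, X_right, y_right = split(X, y, f, t)
    node.update(feature=f, threshold=t,
                left=fit_tree(X_left, y_left, depth - 1, features),
                right=fit_tree(X_right, y_right, depth - 1, features))
    return node

def extend(node, X, y, proxy, ext_depth):
    """Step 2: replace each leaf with a shallow tree on the proxy variable only."""
    if "feature" not in node:
        return fit_tree(X, y, ext_depth, [proxy])
    X_left, y_left, X_right, y_right = split(X, y, node["feature"], node["threshold"])
    node["left"] = extend(node["left"], X_left, y_left, proxy, ext_depth)
    node["right"] = extend(node["right"], X_right, y_right, proxy, ext_depth)
    return node

def fit_extended_tree(X, y, proxy, depth, ext_depth):
    """Step 1: fit without the proxy variable; step 2: extend the leaves."""
    others = [j for j in range(len(X[0])) if j != proxy]
    return extend(fit_tree(X, y, depth, others), X, y, proxy, ext_depth)

def predict(node, row):
    while "feature" in node:
        branch = "left" if row[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["value"]

def features_used(node):
    if "feature" not in node:
        return set()
    return ({node["feature"]} | features_used(node["left"])
            | features_used(node["right"]))

# Synthetic data: variable 0 plays the role of the dominant proxy variable.
rng = random.Random(0)
X = [[rng.uniform(0, 1) for _ in range(3)] for _ in range(120)]
y = [5.0 * r[0] + r[1] + rng.gauss(0.0, 0.1) for r in X]

plain = fit_tree(X, y, 3, [0, 1, 2])
extended = fit_extended_tree(X, y, proxy=0, depth=2, ext_depth=1)
print("root split, plain tree:", plain["feature"])
print("root split, extended tree:", extended["feature"])
```

In the plain tree the dominant variable is chosen at the root, whereas in the extended tree it can only appear below the leaves of the first tree, so its contribution is bounded by the two depth hyper-parameters.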

By using this approach in our problem we managed to reduce the importance of the proxy variable by ~45%, at the cost of increasing the original error by 3%.

California Housing dataset

To better understand this procedure, we show an example application on the California Housing dataset. It was built from the housing census, and each entry corresponds to a block group, the smallest geographical unit for which the US census publishes information (typically 600 to 3,000 inhabitants). In total, the dataset consists of 20,640 observations, 8 numerical descriptor variables and the target variable. The descriptors are: median income (MedInc), median age of dwellings (HouseAge), average number of rooms (AveRooms) and bedrooms (AveBedrms), population (Population), average occupancy of dwellings (AveOccup), latitude and longitude. The variable we are going to predict is the median value of the dwellings in a block group.

The following figure shows the importance of each descriptor variable after training a Decision Tree Regressor (DTR) of depth 4 (in blue) and an ExtendedTree (ExT) of initial depth 3 and extension depth 2 (in orange), in order to estimate the median value of the dwellings of each block group. In both cases, the cross-validation technique has been used to obtain the results.

Figure 2: Distribution of importance in a DecisionTreeRegressor (blue) vs ExtendedTree (orange). In this case they are shown normalised (sum to 1) and calculated as the decrease in error contributed by each.

The error metric used is the Mean Squared Error (MSE), which is very commonly used to evaluate the performance of regression models. In this case, both models have been configured so that their errors are as similar as possible (0.66 ± 0.04 vs. 0.67 ± 0.05), i.e. so that both estimators perform equally well in estimating the median value of the dwellings from the available variables. Such a configuration results in a 31-node tree in the case of the DTR and a 61-node tree in the case of the ExT. As we can see, the latter is considerably more complex than the former at equal error.
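For reference, an error reported as "mean ± deviation" over cross-validation folds can be computed along these lines. This is a generic sketch: the number of folds, the placeholder mean-predicting model and the synthetic data are assumptions, not the configuration used for the figures above.

```python
import random
from statistics import mean, stdev

def mse(y_true, y_pred):
    """Mean Squared Error between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def k_fold_mse(fit, X, y, k=5, seed=0):
    """One held-out MSE per fold; `fit(X, y)` must return a predict function."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(mse([y[i] for i in held_out],
                          [predict(X[i]) for i in held_out]))
    return scores

def fit_mean(X, y):
    """Placeholder model that always predicts the training mean."""
    m = mean(y)
    return lambda row: m

rng = random.Random(3)
X = [[rng.uniform(0, 1)] for _ in range(100)]
y = [2.0 * r[0] + rng.gauss(0.0, 0.1) for r in X]

scores = k_fold_mse(fit_mean, X, y)
print(f"MSE = {mean(scores):.2f} ± {stdev(scores):.2f}")
```

Any model with the same `fit` interface, such as a tree, could be plugged in instead of the placeholder.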

If we look at the importance of each variable in each model (represented by the height of the bar), we see that the DTR is ~80% dependent on the MedInc variable. The next most important variable is AveOccup, with just under 20%, and the rest of the variables barely influence the model's estimate. In the case of the ExT, the most important variable is still MedInc, but at just over 40%, with the importance more evenly distributed among the other variables.

It is worth highlighting the case of the HouseAge, AveRooms and Population variables. While in the DTR they have hardly any influence, in the ExT their importance is not negligible. This means that they are valuable variables for price estimation which, in the DTR, are overshadowed by MedInc. In short, the ExT is able to exploit the available variables more uniformly while maintaining the error, in exchange for a more complex structure.

In conclusion, depending on the nature of the data, ExtendedTrees can be a simple solution to the problem of high dependency in Decision Tree models. This mechanism can also be applied in the case of classification. Moreover, ExtendedTrees are easy to understand. Their output structure is the same as that of a Decision Tree, and they allow the contribution of each variable to be controlled, within limits, through the depths of the first tree and its extensions. Nevertheless, they can bring about an increase in error or complexity.

In the field of technology, experimentation and innovation are key aspects in our job. This is particularly true when it comes to developing AI-based solutions. At BBVA AI Factory we are committed to allocating specific time slots to experimenting with state-of-the-art technology and also to working on ideas and prototypes that can later be incorporated into BBVA’s portfolio of AI-based solutions. These are what we call innovation sprints.

In one such sprint, we asked ourselves how we could help financial advisors in their conversations with clients. The BBVA managers, who help and advise clients in managing their finances, sometimes search for responses to the most common FAQs within pre-defined answer repositories. This confirmed to us the potential of developing an AI system that could suggest to them possible answers to client questions and respond with a single click. The idea behind this system would be to save them time typing answers that do not require their expert knowledge, thus allowing them to focus on those that provide the most value to the client.

So we got down to work. Once the problem was defined, different ways of addressing it soon began to emerge. On the one hand, we thought of a system to search for the most similar question in the historical data to the one posed by the client, and, subsequently, to evaluate whether the answer given at the time is valid for the current situation. On the other hand, we also tried clustering the questions and suggesting the pre-established canonical answer for the cluster to which the question belongs. However, these solutions required too much inference time or were overly manual.

Eventually, the solution that proved to be the most efficient, in terms of both inference time and the ability to suggest automatic answers to questions on different topics (without carrying out prior clustering), was sequence-to-sequence models, also known as seq2seq models.

What is seq2seq?

Seq2seq models take a sequence of items from one domain and generate another sequence of items from a different domain. One of their paradigmatic uses is the automatic translation of texts: a trained seq2seq model transforms a sequence of words written in one language into a sequence of words that keeps the same meaning in another language. The basic seq2seq architecture consists of two recurrent networks (encoder and decoder) of the Long Short-Term Memory (LSTM) type.

Figure 1. LSTM recurrent networks (encoder and decoder)

LSTM networks are a type of Neural Network in which each cell (hidden unit) processes, in order, one element of the sequence; in this case, the representation of a word. The peculiarity of these Neural Networks is that they keep the relevant information from the previous cell, while discarding the information that is not relevant for the following cells. In this way, the network learns not only from isolated data, but also from the information inherent to the sequence, which it accumulates cell by cell. This feature is especially significant in text, since word order is important for constructing syntactically and semantically correct sentences. Technically, it is also a great advantage, since it considerably reduces the computational cost. To learn more about LSTMs we recommend reading this post by Christopher Olah.

To illustrate the concept, let’s think of language models whose purpose is to predict what the most likely next word will be. For example: “Mary was late to ____ own birthday party”. In this case, an important thing to “remember” (information that passes from one cell to the next) is the gender of the last-mentioned subject, so that the network can determine that the word ‘her’ is more likely than ‘his’ to be the correct word.
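The "keep or discard" behaviour comes from the gates of the LSTM cell. The sketch below implements one step of the standard LSTM update with scalar state, purely for illustration: the parameter values are hand-picked (not learned) to saturate the gates, and all names are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step with scalar input, hidden state and cell state.

    `params` maps each gate name to a tuple (w_x, w_h, b).
    """
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    forget = gate("forget", sigmoid)        # how much old memory to keep
    write = gate("input", sigmoid)          # how much new information to store
    candidate = gate("candidate", math.tanh)
    expose = gate("output", sigmoid)        # how much memory to pass on

    c = forget * c_prev + write * candidate
    h = expose * math.tanh(c)
    return h, c

def run(params, inputs, c0=1.0):
    """Feed a sequence through the cell, starting from cell state c0."""
    h, c = 0.0, c0
    for x in inputs:
        h, c = lstm_step(x, h, c, params)
    return h, c

# Hand-picked parameters: the forget gate is saturated open and the input
# gate saturated shut, so the cell state should survive the whole sequence.
remember = {"forget": (0.0, 0.0, 20.0), "input": (0.0, 0.0, -20.0),
            "candidate": (1.0, 0.0, 0.0), "output": (0.0, 0.0, 0.0)}
# Same cell, but with the forget gate saturated shut.
forget_all = dict(remember, forget=(0.0, 0.0, -20.0))

_, c_kept = run(remember, [0.5, -1.0, 2.0])
_, c_lost = run(forget_all, [0.5, -1.0, 2.0])
print(f"memory kept: {c_kept:.6f}, memory discarded: {c_lost:.6f}")
```

In a trained network the gate parameters are learned, so the cell decides from the data what is worth remembering, such as the gender of the subject in the example above.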

As mentioned above, the seq2seq architecture is composed of two LSTM networks: encoder and decoder. Returning to the case of machine translation, the mission of the encoder network is to learn the structure of the given English sentences (input sequences), while the decoder does the same with the Spanish sentences (output sequences). The decoder also learns the relationship between both sequences. The result of the last cell of the encoder is a vector, called the thought vector. It stores the information of the previous cells and is therefore a mathematical representation of the English sentence, i.e., the input sequence. Finally, for training, the decoder uses this vector together with the response, which is also encoded as a mathematical representation. During training, the network learns the patterns that allow it to associate an input sequence with an output sequence.

Figure 2. Illustrative seq2seq network architecture drawing (training and inference)
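To make the training inputs more concrete, the sketch below prepares one training example as token ids. It assumes a common convention known as teacher forcing, in which the decoder receives the target sequence shifted by one position; the text above does not state which convention was used, and the marker tokens, vocabulary builder and example sentences are hypothetical.

```python
START, END = "<start>", "<end>"

def build_vocab(sentences):
    """Map every token to an integer id, reserving ids for the two markers."""
    vocab = {START: 0, END: 1}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def training_example(source, target, src_vocab, tgt_vocab):
    """Return (encoder_input, decoder_input, decoder_target) as id lists."""
    encoder_input = [src_vocab[t] for t in source.split()]
    target_ids = [tgt_vocab[t] for t in target.split()]
    # The decoder sees the target shifted right and must predict the next token.
    decoder_input = [tgt_vocab[START]] + target_ids
    decoder_target = target_ids + [tgt_vocab[END]]
    return encoder_input, decoder_input, decoder_target

src_vocab = build_vocab(["the cat sleeps"])
tgt_vocab = build_vocab(["el gato duerme"])
enc_in, dec_in, dec_out = training_example(
    "the cat sleeps", "el gato duerme", src_vocab, tgt_vocab)
print(enc_in, dec_in, dec_out)
```

At inference time there is no target to shift, so the decoder starts from the start marker and the thought vector and feeds each predicted token back in until it emits the end marker.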

How have we applied seq2seq to build our answer suggestion system?

Although translation is one of the most obvious applications, the advantage of these models is that they are very versatile. Transferring their operation to our case, we could “feed” the encoder with the clients’ questions (sequence 1), and the decoder with the answers given by the managers (sequence 2).

This is exactly what we did. We selected from our history more than a million short conversations initiated by the client with their manager (no more than four messages). With this dataset, we performed classic preprocessing (lowercasing, removing punctuation marks) and discarded greetings and goodbyes using regular expressions. This step is useful because one of the hyperparameters of LSTM networks is the length of the sequence to be learned. By removing this non-relevant information, we can take better advantage of the learning capacity of the model for more diverse or variable sequences, as well as reduce the computational cost of learning longer sequences. Before the preprocessing phase, the data is anonymized using our own NER (Named Entity Recognition) library, so that sensitive data such as name, phone, email and IBAN (and certain non-sensitive entities such as amounts, dates or time expressions) are masked and treated homogeneously. Consequently, the model gives them the same importance.
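The preprocessing described above can be sketched with regular expressions as follows. The greeting and goodbye patterns are hypothetical English examples invented for this illustration, not the expressions used in production, and the anonymization step is left out because it relies on an internal library.

```python
import re
import string

# Hypothetical patterns; a real list would be longer and language-specific.
GREETINGS = re.compile(
    r"^(hello|hi|good morning|good afternoon)\b[\s,!.]*", re.IGNORECASE)
GOODBYES = re.compile(
    r"[\s,!.]*\b(thanks|thank you|regards|bye)[\s,!.]*$", re.IGNORECASE)
PUNCTUATION = str.maketrans("", "", string.punctuation)

def preprocess(message):
    """Drop a leading greeting and trailing goodbye, lowercase, strip punctuation."""
    text = message.strip()
    text = GREETINGS.sub("", text)
    text = GOODBYES.sub("", text)
    text = text.lower().translate(PUNCTUATION)
    return " ".join(text.split())  # collapse repeated whitespace

for raw in ("Hello! Can I get a new card? Thanks.",
            "Good morning, what is my IBAN?"):
    print(repr(preprocess(raw)))
```

Shorter sequences leave more of the fixed sequence length available for the part of the message that actually carries the question.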

Once the network is trained, we use some test questions (not included in the training) and manually evaluate the suggestion of our system. As we can see in the following example, the model is able to suggest a suitable answer to client questions.

Figure 3. Model suggests a suitable answer to client question

However, when the question is related to a client’s particular situation, the suggested answer is not entirely satisfactory. This is because the model does not take into account the specific context of each client.

For example, in the case shown below, the system proposes a response that could be correct but does not consider whether the card has in fact already been sent to the client. This contextual information is currently not taken into account by the model. One of the next steps in this research would be to get the model to learn how to generate the response message according to the client's current situation in each specific case.

Figure 4. Model suggests answer that is not entirely satisfactory (lack of context)

Finally, there are also some cases in which the answer the model suggests to the manager is erroneous, as it is not related to the question asked by the client.

Figure 5. Sometimes the model suggests erroneous answer

How do we validate our seq2seq model?

Evaluating the result of an automatically generated text is a complex task. On the one hand, it is not feasible to manually evaluate a relatively large test set of questions and answers. On the other hand, if we wanted to optimize the model, we would need some method to automatically measure the quality and appropriateness of the answers it suggests.

However, we can carry out a manual evaluation on a small set of messages that allows us to ascertain, according to our criteria as clients, which answers are appropriate and which are not. We use two criteria for this manual evaluation: answerability and correctness. The first of these qualitative metrics determines to what degree a question can be answered, in order to gauge the business impact; the second evaluates whether the answer suggested by the model is correct, which gives us an idea of the quality of our seq2seq model. With this evaluation we found that not all client questions can be solved with this approach at the moment. The main reason is the need to take into account contextual information: information previously provided by the client, comments previously made by a specific manager, or the client's personal and financial information. Subsequently, we calculate the correlation between our manual evaluation and the scores given by automatic metrics such as ROUGE, BLEU, METEOR or accuracy.

The following figure represents the correlation between the manual evaluation (false: wrong answer, true: correct answer) and the values of the automatic metrics. In a first analysis we observe that bleu_1, rouge_1 and especially rouge-L are the metrics that best align with the human criteria. This is important for the optimization of the seq2seq architecture and for the automatic evaluation of the system. Although this study would require further research, we consider it sufficient for this prototyping phase.

Figure 6. Boxplot grouped by correctness.
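To give an idea of what these automatic metrics measure, the sketch below computes simplified recall-only versions of ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence) and groups the scores by a manual correctness label. The example sentences and labels are invented for illustration and are not taken from our evaluation set; production work would normally rely on an established metrics package rather than this simplification.

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    """Share of reference unigrams that also appear in the candidate."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    return sum((cand & ref).values()) / sum(ref.values())

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l_recall(candidate, reference):
    """Like ROUGE-1 recall, but word order matters."""
    ref = reference.split()
    return lcs_length(candidate.split(), ref) / len(ref)

def mean_by_label(scores, labels):
    """Average metric value for each manual label (True = correct answer)."""
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    return {label: sum(v) / len(v) for label, v in groups.items()}

# Invented (suggested answer, manager's answer, manually judged correct?) triples.
examples = [
    ("your card was sent yesterday", "your card was sent yesterday", True),
    ("you can do it from the app", "you can do it in the app", True),
    ("the office opens at nine", "your card was sent yesterday", False),
]
scores = [rouge_l_recall(cand, ref) for cand, ref, _ in examples]
means = mean_by_label(scores, [ok for _, _, ok in examples])
print(means)
```

A metric aligns well with human judgement when the two groups are clearly separated, which is what the boxplot above shows for each metric.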

With these first tests performed with seq2seq (we have omitted some failed experiments and other decisions made along the way), we have been able to demonstrate the enormous potential of this technique in the context of client-manager communications at BBVA. A seq2seq-based system is able to suggest answers to simple client questions. However, for many other tasks, such as answering questions related to the specific context of a client, it is much better to opt for a direct contact with our BBVA managers.