Last December we had the opportunity to attend AWS re:Invent 2021 in person and learn first-hand about the new features announced by Amazon Web Services, especially those related to Machine Learning.

SageMaker Studio Lab

This is the free, simplified version of SageMaker Studio. It is a Machine Learning development environment that provides persistent storage (up to 15 GB) and compute capacity (CPU and GPU, with 16 GB of RAM) for free. It also allows integration with the rest of the AWS tools through the Python SDKs (both the SageMaker SDK and Boto3), making it easy to prototype at no cost and then use the rest of the AWS resources when necessary. It is also an additional option for learning and putting Machine Learning into practice, alongside Google Colab and the Kaggle notebooks. We recommend this post, where ingenious tests are carried out to compare the three environments.


SageMaker Studio Lab infographic

SageMaker Inference Recommender

This new service automatically runs multiple tests related to model inference and recommends, among other things, the compute instance type, the number of instances, and configuration parameter values for the containers hosting the inference code. This allows objective, evidence-based decisions to be made regarding the deployment of models, optimising costs and saving developers valuable time.

Repository examples

Inference Recommender infographic

SageMaker Serverless Inference

This new service integrates Lambda and SageMaker, and is designed to perform inference on applications with intermittent or unpredictable traffic. Prior to this service, the inference options in SageMaker were real-time inference, batch inference, and asynchronous inference. Now, the new Serverless Inference option is capable of automatically provisioning and scaling compute capacity based on request volume.

Repository examples

SageMaker Serverless Inference infographic

SageMaker Canvas

This is a visual, codeless tool designed for business analysts and other less technical profiles to be able to build and evaluate models, as well as make predictions about new data, through an intuitive user interface. In addition, models generated in Canvas can then be shared with data scientists and developers to make them available in SageMaker Studio.


SageMaker Ground Truth Plus

This is a new labelling service for creating high quality datasets for use in training Machine Learning models. Ground Truth Plus uses innovative techniques from the scientific community, such as active learning, pre-labelling and automatic validation. While SageMaker Ground Truth already existed as of 2019, allowing the user to create data labelling workflows and manage labelling staff, SageMaker Ground Truth Plus creates and manages these workflows automatically, without the need for user intervention.


SageMaker Training Compiler

This new feature of SageMaker automatically compiles model training code written in a Python-based Machine Learning framework (TensorFlow or PyTorch) and generates GPU kernels specific to those models. In other words, SageMaker Training Compiler converts models from their high-level language representation to hardware-optimised instructions, so that they will use less memory and less computation and, therefore, train faster.

Repository examples

EMR on SageMaker Studio

This new feature integrates EMR and SageMaker Studio. Prior to this feature, SageMaker Studio users had some ability to search for existing EMR clusters and connect to them, as long as these clusters were running in the same account as the SageMaker Studio session. However, users could not create clusters from SageMaker Studio; they had to configure them manually from EMR. In addition, being restricted to creating and managing clusters in a single account could become prohibitive in organisations working with many AWS accounts. Now, with this new feature, SageMaker Studio users can manage, create, and connect to Amazon EMR clusters from within SageMaker Studio, as well as connect to and monitor Spark jobs running on these clusters.

Repository examples

Beyond the announcements

In addition to the announcements of new features in SageMaker that took place at the event, there are other features that are not exactly new but which we had the opportunity to learn about during the event through presentations, workshops, or informal chats. Here we comment a bit on some of them.

RStudio on SageMaker

This feature is a result of the collaboration between AWS and RStudio PBC and was announced in November 2021. As a result of this collaboration, the RStudio development environment is now available in SageMaker, adding to the SageMaker Studio environment option, which is based on the JupyterLab project. With this addition, data scientists and developers now have the freedom to choose their preferred programming language and interface, switching between RStudio and Amazon SageMaker Studio as needed. All work developed in either environment (code, datasets, repositories and other artefacts) is synchronised through the underlying storage in Amazon EFS.

AWS Announcement
RStudio PBC announcement

Redshift ML

This service was announced in December 2020 and is an integration between Redshift and SageMaker. In essence, it allows users to train, evaluate, and deploy models from Redshift via SQL statements, using SageMaker as the backend.


Infographic on how Amazon Redshift ML works

AutoML in AWS

There are two initiatives related to automated Machine Learning (AutoML) in AWS: SageMaker Autopilot and AutoGluon from AWS Labs (video).

SageMaker Autopilot was announced in December 2019, so it has been on the market for some time now. It is AWS’s commercial solution for automated Machine Learning (AutoML) and can be used in several ways: fully on autopilot (hence the name) or with varying degrees of human guidance (without code, via Amazon SageMaker Studio, or with code, using one of the AWS SDKs).

A remarkable feature is that Autopilot automatically generates three notebook reports that describe the plan it has followed: one on the data exploration, one on the definition of the candidate models, and one on the performance metrics of the final models. Through these reports, the user has access to the complete source code for pre-processing and training the models, and so has the freedom to analyse how the models were built and to make modifications to improve performance. In other words, compared to other commercial AutoML solutions available on the market, Autopilot is a (mostly) white-box tool.

Autopilot has predefined strategies for algorithm selection depending on the type of problem and, for hyperparameter optimisation, it uses strategies based on random search and Bayesian optimisation (more information here). In other words, Autopilot follows strategies that fall within the paradigm known as Combined Algorithm Selection and Hyperparameter optimization (“CASH”), which broadly speaking consists of simultaneously searching for the best type of algorithm and its respective optimal hyperparameters.
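To make the CASH idea concrete, here is a minimal random-search sketch in plain Python. The “algorithms”, hyperparameter values and scoring function below are toy stand-ins of our own invention, not Autopilot’s actual search space or evaluation logic; the point is only that the algorithm choice and its hyperparameters are sampled jointly:

```python
import random

# Toy search space: each "algorithm" has one hyperparameter with a few values.
# These names and numbers are illustrative, not Autopilot's real space.
SEARCH_SPACE = {
    "linear": ("l2", [0.001, 0.01, 0.1, 1.0]),
    "tree": ("max_depth", [2, 4, 8, 16]),
    "knn": ("k", [1, 3, 5, 11]),
}

# Stand-in for a cross-validated score; a real CASH system would train
# and evaluate an actual model here.
SWEET_SPOT = {"linear": ("l2", 0.01), "tree": ("max_depth", 8), "knn": ("k", 5)}
BASE_SCORE = {"linear": 0.70, "tree": 0.80, "knn": 0.75}

def evaluate(algorithm, param_name, value):
    bonus = 0.1 if (param_name, value) == SWEET_SPOT[algorithm] else 0.0
    return BASE_SCORE[algorithm] + bonus

def cash_random_search(n_trials=50, seed=0):
    """Jointly sample (algorithm, hyperparameter) pairs and keep the best."""
    rng = random.Random(seed)
    best = (None, None, None, float("-inf"))
    for _ in range(n_trials):
        algo = rng.choice(sorted(SEARCH_SPACE))   # select the algorithm...
        name, values = SEARCH_SPACE[algo]
        value = rng.choice(values)                # ...and its hyperparameter
        score = evaluate(algo, name, value)
        if score > best[3]:
            best = (algo, name, value, score)
    return best

algo, name, value, score = cash_random_search()
print(f"best: {algo} with {name}={value} (score {score:.2f})")
```

A Bayesian-optimisation variant would replace the uniform sampling with a model of which regions of the space look promising, but the joint search over algorithm and hyperparameters is the defining trait of CASH.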

Repository examples

Autopilot infographic

AutoGluon is an open-source, research-oriented, production-ready project powered by AWS Labs that was released towards the end of 2019. In the words of one of its authors, AutoGluon automates Machine Learning and Deep Learning for applications involving images, text and tabular datasets, and simplifies work for both Machine Learning beginners and experts. It supports both CPU and GPU backends. It contains several modules:

  • Tabular prediction1
  • Text prediction
  • Image prediction
  • Object detection
  • Tuning (of hyperparameters, Python script arguments, and more)
  • Neural architecture search

A rather innovative and powerful feature of AutoGluon is that it introduces its text prediction and image prediction capabilities within its tabular prediction module, allowing you to have the option of Multimodal Tables (example and tutorial here). This is useful in use cases where you have simultaneous numerical, categorical, text, and image data, or any combination of these.

Speaking specifically of the tabular prediction module and its approach to AutoML, AutoGluon takes an alternative direction to the CASH paradigm (making it different from both SageMaker Autopilot and other open-source projects such as auto-sklearn and H2O AutoML) by ensembling and stacking models (from different families and different frameworks) in multiple layers, inspired by practices known to be effective in the Machine Learning community (e.g. in Kaggle competitions). To these ideas, AutoGluon adds other techniques:

  • A strategy to reduce variance in predictions and reduce the risk of overfitting (k-fold ensemble bagging, also called cross-validated committees)2. This process is highly parallelisable and is implemented in AutoGluon using Ray.
  • A strategy to optimally combine models in the last layer of the ensemble3.
  • A strategy to distil the complex model of the final ensemble into some simpler single model4 that mimics the performance of the complex model and is lighter (and therefore has lower latency in inference time).
AutoGluon’s multi-layer stacking strategy, using two stacking layers and n types of base models.
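The stacking-with-out-of-fold-bagging mechanics can be sketched in plain Python. The two base “learners” and the synthetic data below are toy stand-ins (AutoGluon trains real model families such as gradient-boosted trees and neural networks, and combines many more of them); the sketch only shows how out-of-fold predictions from one layer become input features for the next:

```python
import statistics

def oof_predictions(fit, X, y, k=5):
    """k-fold ensemble bagging (cross-validated committee): each row is
    predicted by the committee member that did NOT see it during training."""
    n = len(X)
    oof = [0.0] * n
    for f in range(k):
        held_out = set(range(f, n, k))
        train_idx = [i for i in range(n) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in held_out:
            oof[i] = predict(X[i])
    return oof

# Toy base learners: stand-ins for the real model families AutoGluon trains.
def fit_mean(X, y):
    m = statistics.mean(y)
    return lambda x: m

def fit_scaled_sum(X, y):
    scale = statistics.mean(y) / statistics.mean(sum(row) for row in X)
    return lambda x: sum(x) * scale

# Synthetic regression data: the target is a simple function of the features.
X = [[i, i % 3] for i in range(20)]
y = [i + (i % 3) + 5 for i in range(20)]

# Stacking layer 1: out-of-fold predictions become extra input features.
layer1 = [oof_predictions(fit_mean, X, y), oof_predictions(fit_scaled_sum, X, y)]
X2 = [row + [preds[i] for preds in layer1] for i, row in enumerate(X)]

# Stacking layer 2: a model retrained on [original features + layer-1 outputs].
final_oof = oof_predictions(fit_scaled_sum, X2, y)
mae = statistics.mean(abs(p - t) for p, t in zip(final_oof, y))
print(f"stacked out-of-fold MAE: {mae:.2f}")
```

Because each fold is trained independently, the loop over folds (and over base models) parallelises naturally, which is what AutoGluon exploits via Ray.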

Amazon Science post

In addition, AutoGluon can be used for both training and deployment in SageMaker:

AutoGluon in SageMaker example
AutoGluon in AWS Marketplace

Finally, AutoGluon can be integrated with Nvidia Rapids, specifically cuML (a feature announced at Nvidia GTC 2021 as a collaboration between AWS Labs and Nvidia) for further acceleration for GPU training:

Nvidia Developer blogpost
Video demo

One of the most stimulating aspects of our work is being able to witness, in real time, the development of new technologies, new approaches and the continuous -and vertiginous- improvement of Artificial Intelligence systems. Whether it is learning new solutions or being involved in improving one of them, work in Artificial Intelligence is constantly changing.

To keep up to date with all these developments, we can attend the multitude of Data Science conferences held every year around the world. Whether they are more generalist or more specialised, they are the ideal place to discover state-of-the-art solutions and share our learnings. For this reason, at BBVA AI Factory we have a special budget for attending any event or conference relevant to our work.

After this last year and a half of hiatus, many conferences are already planning their return to on-site events.

Domain specific conferences

Recsys 2021 was held last September in hybrid format (online and in-person)


ACM FAccT 2022

Where: Seoul, South Korea
When: June 21-25, 2022
Format: online and in-person
⎋ More information

The FAccT acronym stands for Fairness, Accountability and Transparency of socio-technical systems, and the conference gathers researchers and practitioners interested in the social implications of algorithmic systems. Systems based on AI and fuelled by big data have applications both in the public sector and in multiple industries such as healthcare, marketing, media, human resources, entertainment and finance, to name a few. To understand and evaluate the social implications of these systems, it is useful to study them from different perspectives, which is why this conference’s multidisciplinarity makes it unique. It gathers participants from computer science, the social sciences, law and the humanities, from both industry and academia, to reflect on topics such as algorithmic fairness, whether these systems contain inherent risks or potential bias, and how to build awareness of their social impacts.

Data + AI Summit 2022

Where: San Francisco, United States
When: June 27-30, 2022
Format: in-person (although some kind of hybrid participation is announced too)
⎋ More information

Formerly known as Spark Summit, the conference -organised by Databricks- offers a broad panorama of recent developments, use cases and experiences around Apache Spark and other related technologies. The variety of topics can be interesting for many roles in the Data Science and Machine Learning arenas (e.g., Data and ML Engineers, Data Scientists, Researchers, or key decision makers, to name a few), and it is always oriented towards big data and scalable ML pipelines. The presentations typically have a practical orientation, and there are also hands-on training workshops and sessions with the original creators of open-source technologies such as Apache Spark, Delta Lake, MLflow, and Koalas. The contents from the previous edition are available on demand for free here. After two editions in a digital format, this year the Summit will be held in San Francisco, although some kind of hybrid participation is also announced on the webpage.


RecSys 2022

Where: Seattle, United States
When: Sept. 18-23, 2022
Format: in-person
⎋ More information

RecSys is the premier international conference portraying recent developments, trends and challenges in the broad field of recommender systems. It also features tutorials covering the state of the art in this domain, workshops, special sessions for industrial partners from sectors such as travel, gaming and fashion, and a doctoral symposium. RecSys started with 117 people in Minnesota in 2007, and reaches its sixteenth edition this year (2022). One of the key aspects of this conference so far has been its good mix of academic and industrial participants and works.

General conferences on ML/AI

A moment during KDD 2019

AAAI Conference on Artificial Intelligence

Where: Vancouver, Canada
When: February 22 – March 1, 2022
Format: in-person
⎋ More information

The Association for the Advancement of Artificial Intelligence (AAAI) is a prestigious society devoted to advancing the understanding of the mechanisms underlying intelligent behavior and their embodiment in machines. Its conference promotes the exchange of knowledge between AI practitioners, scientists, and engineers. It explores advances in core AI and also hosts 39 workshops on a wide range of AI applications such as financial services, cybersecurity, health and fairness. Check out the 2021 conference and also a review of the 2020 edition.

Applied Machine Learning Days (AMLD)

Where: EPFL. Lausanne, Switzerland
When: March 26 – 30, 2022
Format: in-person
⎋ More information

Each year the conference consists of different tracks on different topics. It is a conference oriented towards the application of Machine Learning, so you can find very varied topics each year. It stands out for its good balance between academia and industry, its keynotes and, above all, its workshops: fully hands-on sessions held prior to the conference itself. Check out the one held in January 2020.

The Society for AI and Statistics

Where: Valencia, Spain
When: March 28 – 30, 2022
Format: in-person (still under discussion)
⎋ More information

Web description: “Since its inception in 1985, AISTATS has been an interdisciplinary gathering of researchers at the intersection of artificial intelligence, machine learning, statistics, and related areas.” And it is true. It is a mainly statistical conference with applications in the field of Machine Learning. It requires a good knowledge of statistics to understand the concepts discussed there and to exploit it to the fullest. The invited speakers are of a high level (many from the Gaussian side) and the organisers take great care in the choice of venue ;). Check out the latest proceedings.

KDD 2022

Where: Washington, United States
When: August 14-18, 2022
Format: in-person
⎋ More information

KDD is a research conference that has its origins in data mining, but its reach extends to applied Machine Learning, and nowadays it defines itself as “the premier data science conference”. More than other conferences on ML or AI, which are aimed at the academic research community, this one is especially appealing to people with “Data Scientist” as a job title. Its main differentiators are an applied data science track, a track of invited Data Science speakers, and hands-on tutorials. The bar is still very high technically, but a large fraction of the research comes from the real world, and corporate research has a high weight.

Within KDD, at BBVA AI Factory we have been actively involved in both the organisation and the program committee of the workshop on Machine Learning in Finance for the last two years, after participating in the KDD Workshop on Anomaly Detection in Finance in 2019. Read our report on the 2019 conference to get a better idea of this event!

Natural Language Processing (NLP) has been one of the key fields of Artificial Intelligence since its inception. After all, language is one of the things that defines human intelligence. In recent years, NLP has undergone a new revolution similar to the one that took place 20 years ago with the introduction of statistical and Machine Learning techniques. This revolution is led by new models based on deep neural networks that facilitate the encoding of linguistic information and its re-use in various applications. With the emergence in 2018 of self-supervised language models such as BERT (Google) – trained on massive amounts of text –  an era is beginning in which Transfer Learning is becoming a reality for NLP, just as it has been for the field of Computer Vision since 2013.

The concept of Transfer Learning is based on the idea of re-using knowledge acquired by performing a task to tackle new tasks that are similar. In reality, this is a practice that we humans constantly engage in during our day-to-day lives. Although we face new challenges, our experience allows us to approach problems from a more advanced stage.

Most Machine Learning algorithms, particularly when supervised, can only solve the task for which they have been trained by examples. In the context of the culinary world, for example, the algorithm would be like a super-specialised chef who is trained to make a single recipe. Asking this algorithm for a different recipe can have unintended consequences, such as making incorrect predictions or incorporating biases.

The aim of using Transfer Learning is for our chef – the best at cooking ravioli carbonara – to be able to apply what s/he has learnt in order to make a decent spaghetti bolognese. Even if the sauce is different, the chef can re-use the previously acquired knowledge when cooking pasta (figure 1).

Figure 1. Applying Transfer Learning to Machine Learning models

This same concept of knowledge re-use, applied to the development of Natural Language Processing (NLP) models, is what we have explored in collaboration with Vicomtech, a Basque research centre specialised in human-machine interaction techniques based on Artificial Intelligence. Specifically, the aim of this joint work has been to learn about the applications of Transfer Learning and to assess the results offered by these techniques, as we see that they can be applicable to natural language interactions between BBVA clients and managers. After all, the purpose of this work is none other than to improve the way in which we interact with our clients.

One of the tasks we have tackled has been the processing of textual information in different languages. For this purpose, we have used public domain datasets. This is the case of a dataset of restaurant reviews, generated for the Semeval 2016 academic competition, which includes reviews in English, Spanish, French, Russian, Turkish, Arabic and Chinese. The aim has been to identify the different aspects or characteristics mentioned (food, ambience or customer service, among others), in English, Spanish and French (Table 1).

Table 1. Distribution of texts, sentences and annotations in the SemEval2016 datasets used

With this exercise we wanted to validate whether Transfer Learning techniques based on the use of BERT models were appropriate for adapting a multi-class classifier to detect aspects or characteristics in different languages. In contrast to this approach, there are alternatives based on translating the text to adapt it to a single language: we can translate the information we will use to train the model, or directly translate the client conversations we want to classify. However, these alternatives have their own problems and inefficiencies: translation introduces errors that propagate to the classifier, and it adds an extra processing step with its associated cost.

Taking the culinary example we mentioned at the beginning of this article, in our case we could consider the texts as representing the ingredients of the recipe. These datasets of information differ from one language to another (just as the ingredients vary according to the recipe). On the other hand, the ability acquired by the model to classify texts is knowledge that we can re-use in several languages; in the same way that we re-use the knowledge about how to cook pasta with different recipes.

In this experiment we have started from a pre-trained multilingual BERT model in the public domain, and we have performed fine tuning on the restaurant dataset. The following figure shows the procedure (figure 2).

Figure 2. We use a pre-trained multilingual BERT model and performed fine tuning on the restaurant dataset.

The results obtained by adapting this model, trained with generic data, to the dataset of reviews in each language were similar to those reported in 2016 for the task in English, French and Spanish by more specialised models. This is consistent with a variety of research showing that this type of model can achieve very good results.

Once a classifier for English text has been tuned, the Transfer Learning process is carried out by performing a second stage of fine tuning with the second-language dataset (Figure 3).

Figure 3. Performing a second stage of fine tuning with the second language dataset.

To measure the effectiveness of the process, we compare the behaviour of this classifier with the behaviour resulting from performing a single fine tuning stage starting from the multilingual base model.
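The two-stage procedure can be illustrated with a tiny, self-contained sketch. The “trainer” below is a toy logistic-regression loop, not BERT, and the two “languages” are synthetic tasks sharing an underlying rule; everything here is a hypothetical stand-in meant only to show the mechanics of starting the second fine-tuning stage from the weights of the first:

```python
import math
import random

def train(data, w=None, epochs=150, lr=0.5):
    """Tiny logistic-regression trainer: a toy stand-in for BERT fine-tuning.
    Passing previously learned weights `w` starts a new fine-tuning stage."""
    dim = len(data[0][0])
    w = list(w) if w is not None else [0.0] * dim
    for _ in range(epochs):
        for x, label in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))          # avoid math.exp overflow
            p = 1.0 / (1.0 + math.exp(-z))
            w = [wi - lr * (p - label) * xi for wi, xi in zip(w, x)]
    return w

def accuracy(w, data):
    hits = sum((sum(wi * xi for wi, xi in zip(w, x)) > 0) == (label == 1)
               for x, label in data)
    return hits / len(data)

def make_task(shift, n, rng):
    """Synthetic 'language': the same underlying rule, but shifted inputs."""
    rows = []
    for _ in range(n):
        x = [rng.uniform(-1, 1) + shift, rng.uniform(-1, 1), 1.0]  # 1.0 = bias
        rows.append((x, 1 if x[0] + x[1] > shift else 0))
    return rows

rng = random.Random(0)
source = make_task(0.0, 400, rng)        # "English": plenty of labelled data
target_train = make_task(0.3, 40, rng)   # "Spanish": only a small labelled set
target_test = make_task(0.3, 200, rng)

w_scratch = train(target_train)                    # single-stage baseline
w_transfer = train(target_train, w=train(source))  # two-stage fine-tuning
print(accuracy(w_scratch, target_test), accuracy(w_transfer, target_test))
```

The comparison mirrors the evaluation described here: the same small target-language set, with and without a first fine-tuning stage on the source language.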

The results show (see Table 2) that, starting from the model fine-tuned in English and using less data in the target language (in this case Spanish or French), we can achieve results similar to those obtained by adapting a model for each language. For example, in the case of Spanish, we reach similar performance starting from the English model and adding only 40% of the Spanish data. In the case of French, the results begin to match when using the English model and 80% of the French data. Finally, if we use all the available data, the results improve moderately compared to training only with data for each language, with a marked improvement when re-using the English model for the other languages. It is important to bear in mind that these results will depend on the specific task to which they are applied.

Table 2. Classification metrics for test sets in the target languages (ES or FR) of a classifier trained with different combinations of source (EN) and target (ES or FR) data.

These results are very encouraging from the point of view of an application to real problems, since they would indicate that the models are capable of using the knowledge acquired in one language to extrapolate it to another language, thus obtaining the same quality with less labelled data. In fact, one of the main obstacles when developing any NLP functionality in an industrial environment is having a large amount of quality data, and having to develop this for each given language. Therefore, requiring less labelled data is always a great advantage when developing functionalities.

The knowledge gained (or “transferred”) from this collaboration with Vicomtech will allow us to build more agile functionalities to help managers in their relationship with the client, thus reducing the development cycle of a use case in a language or channel other than the one in which it was originally implemented.

In 2018, coinciding with the Football World Cup, a company ventured to forecast the probability of each team becoming champion -the original report is no longer available, but you can still read some posts in the media that covered the story-. Germany topped the list, with a 24% probability. As soon as Germany was eliminated in the group stage, the initial forecast was viewed as mistaken, and the anecdote circulated on social networks.

The problem wasn’t the model itself, of which no details were revealed beyond it being based on a simulation methodology; robust sports forecasting models are well known -BTW, for the occasion of the World Cup, BBVA AI Factory also created a visualization of player and team data-. Nor was the problem the report itself, since it never drew the conclusion that only Germany could win.

The main problem was the interpretation of the result by certain media and the wider public, who assumed that ‘Germany wins’ would prove right even though the numbers indicated otherwise: the probability was so fragmented that a 24% for Germany meant there was a 76% chance that some other team would win.

The human tendency to simplify: the “wet bias”

The fact that humans are not good at evaluating probability-based scenarios is well known to meteorologists. In 2002, a phenomenon called “wet bias” was unveiled: it was observed that meteorological services in some American media would deliberately inflate the probability of rain to be much higher than had actually been calculated. In his well-known book “The Signal and the Noise”, the statistician and data populariser Nate Silver delves into this phenomenon, attributing it to meteorologists’ belief that whenever the population sees a probability of rain that is too small -say 5%- it will interpret it directly as “it’s not going to rain” -and consequently will be disappointed 5% of the time-.

This suggests that humans tend to simplify information for decision making. The fact is that the 5% chance of rain, or the 24% chance that Germany would win the World Cup, should not be transformed into a black and white decision, but should be taken as information for analysing scenarios. Nate Silver, in his post “The media has a probability problem” or in his last talk at Spark Summit 2020, analyzes this limitation to build scenarios given some probabilities, illustrating it with examples of hurricane forecasting or the 2016 US elections. As Kiko Llaneras argues in his article “En defensa de la estadística” (in Spanish), every prediction has to fall on the improbable side sometime or other.

Designing algorithms correctly from scratch

Those of us who work with Machine Learning in the design of customer-oriented products believe that we should not reproduce that same error of taking the results of forecasts as absolute. It is up to us to properly understand what level of confidence a Machine Learning system has in the result it offers, and to accurately transmit it to the receivers of the information.

For example, if we want to design an algorithm to forecast the expenses that a customer will have and to inform them through the BBVA app, we are interested in being able to analyze how confident the algorithm is in each forecast, and perhaps discard the cases where we do not have high confidence.

Surprisingly, many forecasting algorithms are designed in such a way that they can induce a misinterpretation similar to the one we described in the case of the World Cup. This is because the estimate provided by a forecast model (for example, next month’s expenditure), which takes information observed in the past (expenditure from previous months), comes in the form of a single value. And we have already discussed what can happen when we reduce everything to the most likely value. It would be more interesting if the system were able to provide a range -the expenditure will be between 100 and 200 euros-, narrowing it when it is very certain -for example, for a recurring fixed expense- and widening it, case by case, when it is more uncertain -for example, during a holiday, when our expenditure is less predictable-.

At BBVA AI Factory we have worked on a research area, together with the University of Barcelona, to try to develop this type of algorithm using neural network forecasting techniques. This research had already been discussed in other posts and has resulted in publications, including one at the prestigious NeurIPS 2019 conference1.

Thanks to this research, we now have algorithms capable of providing forecasts that result in a range of uncertainty, or a mathematical distribution function, rather than a single value, which offers us more complete information.

Can we trust the black boxes? (Spoiler: Yes, with some tricks)

However, we have to face one more obstacle: oftentimes data science teams use models that they did not create themselves: models from others, from external code libraries or APIs, or from software packages. We have a forecasting system that is already in place -for example, next month’s expense estimate, or a balance estimate for the following few days- and for some good reason it cannot be replaced. So the question arises: can we design another algorithm that estimates how confident the first algorithm is, without having to replace or even modify it?

The answer is yes, and it is described in our recent article, “Building Uncertainty Models on Top of Black-Box predictive APIs“, published in IEEE Access and authored by Axel Brando, Damià Torres, José A. Rodríguez Serrano and Jordi Vitrià, from BBVA AI Factory and the University of Barcelona. We describe a neural network algorithm that transforms the prediction given by any existing system into a range of uncertainty. We distinguish two cases: one where we know the details of the system we want to improve, and one where that system is a black box, i.e. a system that we use to generate forecasts but which cannot be modified, and whose internals we do not know. This is a common real-life scenario, for example, when using software from a supplier.

This opens up the possibility of taking any available forecasting system that works by giving point estimates and, without having to modify it, “augmenting” it with the ability to provide a range of uncertainty, as schematically shown in the figure above. We have validated the system on banking forecasting problems and on electricity consumption prediction. We provide the link to the article so that other researchers, data scientists and anyone interested can consult the details.
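To give a flavour of the “wrap, don’t modify” idea, here is a deliberately simplified sketch. The article trains a neural network on top of the black box; the toy below instead learns an interval from the empirical quantiles of the black box’s own past errors. The forecaster, data and function names are all hypothetical stand-ins, but the key property is the same: the black box is only ever called, never changed:

```python
import statistics

def black_box_forecast(history):
    """Stand-in for an unmodifiable vendor system: it predicts the next
    value as the mean of the last three observations."""
    return statistics.mean(history[-3:])

def wrap_with_uncertainty(forecast, series, coverage=0.8):
    """Learn an interval from the black box's own past errors on a
    validation series, without touching the black box itself."""
    residuals = sorted(series[t] - forecast(series[:t])
                       for t in range(3, len(series)))
    lo = residuals[int(len(residuals) * (1 - coverage) / 2)]
    hi = residuals[int(len(residuals) * (1 + coverage) / 2) - 1]
    def wrapped(history):
        point = forecast(history)
        return point + lo, point, point + hi   # (lower, point, upper)
    return wrapped

# Synthetic monthly expense series: a stable base amount plus fluctuations.
series = [100 + (7 * t % 23) for t in range(60)]
wrapped = wrap_with_uncertainty(black_box_forecast, series)
low, point, high = wrapped(series)
print(f"forecast {point:.1f}, 80% interval [{low:.1f}, {high:.1f}]")
```

A recurring fixed expense would produce small residuals and hence a narrow interval, while erratic spending would widen it, which is exactly the case-by-case behaviour described above.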

The challenge: translating reliability into a human language

With this work, we have achieved the challenge of designing a forecasting system that provides extra information. However, the key question we raised at the beginning remains unanswered: if we build products based on Machine Learning, how do we transfer this information to the end user in a way that they understand that it is a useful estimate, but might present errors?

This is still an open issue. Recently, a presentation by Apple on product design with Machine Learning shed some light on this aspect, and suggested communicating uncertain information in terms of some amount that appeals to the user. Better to say “if you wait to book, you could save 100 euros”, than “the probability of the price going down is 35%”. The latter formula -the most commonly used- could give rise to the same interpretation problems that occurred with the case of Germany in the World Cup. If humans are not statistically minded animals, perhaps the challenge is in the language.