The conference was held in Vancouver, Canada from December 8-14, and was organized with tutorials, workshops, demos, presentations and poster sessions. In this article we summarize some relevant aspects of the event.

## Some numbers

With approximately **13,000** people registered, this was the largest edition of NeurIPS to date. There was a record number of **6,743** articles submitted (39% more than the previous year's edition), reviewed and commented on by a total of **4,543** reviewers, resulting in **1,428** accepted articles (i.e. an acceptance rate of 21.2%). There were 9 tutorials and 51 workshops on a variety of topics, such as reinforcement learning, Bayesian machine learning, imitation learning, federated learning, optimization algorithms, graph representation, computational biology, game theory, privacy, and fairness in learning algorithms.

In addition, there were a total of 79 official NeurIPS meetups in more than 35 countries across 6 continents. These meetups are local events at which the content of the live conference is discussed; most were held in Africa (28) and Europe (21). For a more detailed description of the topics covered in this edition of NeurIPS we recommend the following post.

> A fun little demonstration of the scale of #NeurIPS2019 – video of people heading into keynote talk. This is 9 minutes condensed down to 15 seconds, and this is not even close to all the attendees! pic.twitter.com/1VqAHZoqtj
>
> — Andrey Kurenkov (@andrey_kurenkov) December 12, 2019

## Industry presence

As outlined in this excellent post with statistics on the articles accepted for NeurIPS 2019, there was a significant presence of authors affiliated with large technology companies. For example, among the top-5 institutions with the highest number of accepted articles, Google (through Google AI, Google Research or DeepMind) and Microsoft Research appear together with MIT, Stanford University, and Carnegie Mellon University. In the top-20 the industry presence continues with Facebook AI Research, IBM Research, and Amazon. Below is a list of the sites of the companies with the greatest presence at the event, where each lists the articles, tutorials and workshops in which it participated.

- Google AI
- Microsoft Research
- Facebook AI Research
- IBM Research
- Amazon
- Intel AI
- Nvidia Research
- Apple

As expected, most of the research from these companies is still oriented towards language understanding, translation, speech recognition, visual and audio perception, etc. A large part of it is aimed at designing new variants of BERT-based language models and tools for understanding the representations they produce ^{1} ^{2} ^{3} ^{4}. For a more detailed description of several NeurIPS 2019 articles related to BERT and Transformers, we recommend this post.

Additionally, we noted a substantial amount of research from these companies (with or without the collaboration of academic institutions) in two fields: multi-agent reinforcement learning ^{5} ^{6} ^{7}, motivated by the data that can be obtained through the interaction of multiple agents in real scenarios on the Internet, and multi-armed bandits ^{8} ^{9} ^{10}, where the work focuses particularly on contextual bandits ^{11} ^{12}, motivated by applications related to personalization and web optimization.
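For readers less familiar with the bandit setting these papers build on, here is a minimal sketch of the classic UCB1 algorithm (our own toy illustration; the arm setup, reward simulation and names are invented, and real contextual-bandit systems are considerably richer):

```python
import math
import random

def ucb1(reward_probs, horizon, seed=0):
    """Minimal UCB1: pull the arm with the highest upper confidence bound."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [0] * n_arms    # pulls per arm
    values = [0.0] * n_arms  # empirical mean reward per arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:  # initialization: pull each arm once
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
        total_reward += reward
    return counts, total_reward

counts, _ = ucb1([0.2, 0.8], horizon=2000)
# the better arm (index 1) ends up pulled far more often than the worse one
```

Contextual bandits extend this by conditioning the arm choice on per-round features (e.g. a user profile), which is what makes them attractive for personalization.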

On the other hand, while the increasing industrial presence at this conference is enriching, it is also true that the unofficial, invitation-only events organized by the companies can divert focus from the scientific and inclusionary objectives of the conference.

## Trends

According to the analysis of the program chair of this edition of NeurIPS, Hugo Larochelle, the category that obtained the greatest increase in its acceptance rate with respect to the previous edition was neuroscience.

On the other hand, comparing the keywords of the accepted articles from 2019 and 2018, described in this excellent post by Chip Huyen, we can see which topics have gained or lost presence compared to last year.

Main findings:

- Reinforcement learning is gaining significant ground, in particular algorithms based on multi-armed bandits and contextual bandits. This trend can be seen in the high recurrence of words such as “bandit”, “feedback”, “regret”, and “control”.
- Recurrent and Convolutional Neural Networks seem to be losing ground. At least for recurrent networks, this is likely because, since NeurIPS 2017, attention is all we need.
- The words “hardware” and “privacy” are gaining ground, reflecting the fact that there is more work related to the design of algorithms that take into account hardware architectures, or issues related to privacy.
- Words like “convergence”, “complexity” and “time” are also gaining ground, reflecting the growing research in theory related to deep learning.
- Research related to graph representation is also on the rise, and Kernel-based methods seem to be enjoying a resurgence, as evidenced by positive percentage changes in the words “graph” and “kernel”, respectively.
- Interestingly, although the percentage change of “Bayesian” decreases, that of “uncertainty” increases. This could be because in 2018 there were numerous projects related to Bayesian principles not directly tied to deep learning (e.g. Bayesian optimization, Bayesian regression), while in 2019 there was a greater tendency to incorporate uncertainty, and probabilistic elements in general, into deep learning models.
- “Meta” is the word with the highest positive percentage change, reflecting the fact that meta-learning algorithms in their various variants are becoming popular (e.g. graph meta-learning, meta-architecture search, meta-reinforcement learning, meta-inverse reinforcement learning, etc).

## Award-winning articles

In this edition of NeurIPS there was a new category called the Outstanding New Directions Paper Award. According to the organizing committee, this award highlights work that distinguishes itself by establishing lines of future research. The winning paper was Uniform Convergence may be unable to explain generalization in deep learning. Broadly speaking, the article argues both theoretically and empirically that Uniform Convergence Theory cannot, on its own, explain the ability of deep learning algorithms to generalize, and therefore calls for the development of techniques that do not depend on this particular theoretical tool to explain generalization.

On the other hand, the Test of Time Award (a prize given to a paper presented at NeurIPS ten years earlier that has had a lasting impact in its field) went to Dual Averaging Method for Regularized Stochastic Learning and Online Optimization. Here, broadly speaking, Lin Xiao proposed a new algorithm for online convex optimization that exploits the structure of L1 regularization and comes with theoretical guarantees.
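Very roughly, the ℓ1 variant of that method keeps a running average of all past gradients and soft-thresholds it, which is what produces exact zeros (sparse weights) during online learning. A toy sketch of a single update, under a common simple choice of the auxiliary strongly convex term (variable names and parameter values are ours, purely for illustration):

```python
import math

def l1_rda_step(avg_grad, t, lam, gamma):
    """One l1-regularized dual-averaging update from the running average
    gradient `avg_grad` (a list), at step t, with l1 weight `lam`.

    Coordinates whose average gradient magnitude is below `lam` are
    truncated to exactly zero; the rest are shrunk towards zero.
    """
    scale = math.sqrt(t) / gamma  # step-size schedule beta_t = gamma * sqrt(t)
    w = []
    for g in avg_grad:
        if abs(g) <= lam:
            w.append(0.0)  # soft-threshold: small average gradients give exact zeros
        else:
            w.append(-scale * (g - lam * math.copysign(1.0, g)))
    return w

w = l1_rda_step([0.05, -0.8, 0.3], t=100, lam=0.1, gamma=5.0)
# -> [0.0, 1.4, -0.4]: the first coordinate is exactly zero, the others shrunk
```

The key contrast with plain subgradient methods is that the truncation is applied to the *average* gradient, so sparsity does not get destroyed by the noise of individual stochastic gradients.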

To check out the complete list of articles that have received recognition, please consult this post written by the organizing committee. For a more detailed discussion of some of the award-winning articles this post is strongly recommended.

## Tutorials

Tutorials are sessions lasting a couple of hours that present recent research on a particular topic, delivered by a leading researcher in the field. We had the opportunity to attend two of these:

- Deep Learning with Bayesian Principles. Emtiyaz Khan reviewed concepts from deep learning and Bayesian learning to establish similarities and differences between the two. Among the main differences, he argued that while deep learning takes a more empirical approach, using Bayesian principles forces you to state your hypotheses from the beginning. In this sense, Khan’s work focuses on closing the gap between the two. During the tutorial, he showed how, starting from Bayesian principles, it is possible to develop a theoretical framework in which the learning algorithms most used in the deep learning community (SGD, RMSprop, Adam, etc.) emerge as particular cases. In our opinion, the material presented in this tutorial is superb, and it is impressive to see how such optimization techniques can be deduced elegantly from probabilistic principles.
- Reinforcement Learning: Past, Present, and Future Perspectives. Katja Hofmann presented a fairly comprehensive overview and historical review of the most significant advances in the field of reinforcement learning. In addition, as a conclusion, Katja pointed out future opportunities in both research and real applications, with emphasis on the field of multi-agent reinforcement learning, especially in collaborative environments. We believe that this tutorial presents excellent material for those who want to get started in the study of reinforcement learning and for those who want to be aware of present and future lines of research in this field.

## Workshops

Workshops are sessions lasting several hours, organized as part of the conference and consisting of lectures and poster sessions, that seek to promote knowledge and collaboration in emerging areas. We mainly attended the following:

- Deep Reinforcement Learning: This workshop was one of the best attended and brought together researchers working at the intersection of reinforcement learning and deep learning. The poster sessions were very rich and rewarding and, as expected, there was a strong presence of researchers from DeepMind, Google Research and BAIR. We recommend this talk by Oriol Vinyals on DeepMind’s recent advances in StarCraft II using multi-agent reinforcement learning tools.
- Beyond First Order Methods in ML: This workshop started from the premise that second and higher order optimization methods are undervalued in applications to machine learning problems. During the workshop, topics such as second-order methods, adaptive gradient-based methods, regularization techniques, etc. were discussed. In particular, we recommend this talk by Stephen Wright, a prominent figure within the optimization community, on smooth and non-convex optimization algorithms.

Other workshops that we attended in part were Bayesian Deep Learning and The Optimization Foundations of Reinforcement Learning. We strongly recommend reviewing the material relating to these.

## Poster sessions

Here is a small selection of valuable articles that we saw during the poster sessions and want to share with you:

- Multilabel reductions: what is my loss optimising?. This paper reviews popular choices of target function for multi-label classification problems and argues that, although these have been empirically successful, we do not fully understand how they relate to the two most popular metrics for this type of problem: precision@k and recall@k. The article seeks to fill that gap, providing a formal justification for both ways of choosing the target function.
- PyTorch: An Imperative Style, High-Performance Deep Learning Library. This is the definitive recent PyTorch paper. Here the authors detail the principles that drove the implementation of PyTorch and how these are reflected in its architecture.
- rlpyt: A Research Code Base for Deep Reinforcement Learning in PyTorch. This presents the rlpyt project, which includes PyTorch implementations of the most common algorithms used in deep reinforcement learning. It also supports the interface of OpenAI Gym environments.
- The PlayStation Reinforcement Learning Environment (PSXLE). This presents the PSXLE project, a modified PlayStation emulator designed to be used as a new environment for evaluating reinforcement learning algorithms. It also supports the OpenAI Gym interface.
- Competitive Gradient Descent: This paper addresses certain gradient-descent applications from the point of view of competitive optimization and presents a new algorithm for finding the solution corresponding to the Nash equilibrium of a two-player game. This work is interesting because it relates to problems arising in reinforcement learning and deep learning (e.g. in GANs, where the generator and discriminator networks play a competitive game).
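For intuition on why such an algorithm is needed, consider the scalar bilinear game f(x, y) = xy: naive simultaneous gradient descent-ascent spirals away from the Nash equilibrium at (0, 0), while an update in which each player anticipates the other's move contracts towards it. A toy sketch (our own illustration, not the paper's code; the closed-form update below is specialized to this one scalar game):

```python
def gda_step(x, y, eta):
    """Simultaneous gradient descent-ascent on f(x, y) = x*y.

    x descends on f, y ascends on f; the iterates rotate and
    grow by a factor sqrt(1 + eta**2) each step (divergence).
    """
    return x - eta * y, y + eta * x

def cgd_step(x, y, eta):
    """Competitive-style update on f(x, y) = x*y.

    Each player best-responds to the other's anticipated move; for this
    bilinear game the update has a closed form and shrinks the iterates
    by a factor 1 / sqrt(1 + eta**2) each step (convergence to (0, 0)).
    """
    denom = 1.0 + eta ** 2
    return (x - eta * y) / denom, (y + eta * x) / denom

x = y = 1.0
for _ in range(200):
    x, y = cgd_step(x, y, eta=0.2)
# the competitive iterates spiral in towards the equilibrium (0, 0)
```

Running the same loop with `gda_step` instead shows the iterates spiralling outward, which is the GAN-training pathology the paper targets.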

## Conclusions

To sum up, NeurIPS 2019 was a huge and enriching, if overwhelming, event. The high number of attendees posed challenges for some aspects of the conference, for instance interacting properly with the authors during the poster sessions. Indeed, given the sheer number of papers and topics, it was impossible to strike an ideal balance between exploration and exploitation during the workshop talks and poster sessions; a random search seems to be the best strategy. Nevertheless, we still managed to discover papers and lines of research that have served as a source of intellectual growth and inspiration.

As we announced in this post, the Graph Analytics Team at the AI Factory delivers graph data assets and creates an internal software library to facilitate the use of graph algorithms at BBVA.

This article aims to summarize what we saw and the progress made in Graph Analytics during 2019. It can be considered a collection of resources and references that we find relevant for this topic, including open-source code repositories, commercial tools, scientific papers, conferences, workshops and other publications. All of this has inspired our recent work.

## Tracking adoption of graph technologies in businesses

The adoption of graph-related technologies by businesses follows a maturity curve which typically goes from an initial phase of using graphs in a single use case to an ideal situation in which the company successfully exploits graph data and tools on a recurrent basis. In this process we can distinguish four phases that mark the path towards the adoption of graphs.

Some players have already presented their own proposals for the phases of graph adoption, e.g. Neo4j’s talk at the Spark Summit, or the following article from Forbes.

## Some use cases and tools from data science conferences

Next, we describe technical updates and advances in Graphs, as presented at relevant conferences during the course of 2019.

### Graphs at Spark+AI Summit Europe 2019

The Spark + AI Summit Europe 2019 showcased some use cases of Graph Analytics. TigerGraph showed how they worked with China Mobile to detect phone-based scams using real-time graph analytics. Using a call graph of 600M users, they compute graph features that feed a machine learning model classifying each phone call as scam or non-scam in real time.

Another example comes from AstraZeneca. This biopharmaceutical company showed how they use a Knowledge Graph in the process of drug discovery, where graph analytics can reduce the huge costs and time usually required.

At this same conference, we found out about some changes planned for Spark 3. A new module based on Spark SQL DataFrames will be added to represent graphs using the Property Graph data model, which allows practitioners to represent graphs with different node and edge types and to attach properties to them. This provides considerable flexibility when representing graphs. Moreover, support for the Cypher query language is planned, which will allow expressive and efficient querying of property graphs.

### Graphs at KDD 2019

At KDD, Alibaba presented their AliGraph platform (in the financial domain, we saw this other example from Capital One). We saw articles proposing graphs to jointly model concepts and instances in a knowledge base, and many deep learning algorithms for graphs, including “OAG: Toward Linking Large-scale Heterogeneous Entity Graphs”, a deep learning algorithm for record linkage in large entity graphs, with a GitHub repository.

The MIT-IBM lab released The Elliptic Dataset, a large dataset of financial (bitcoin) transactions, and presented a paper on anti-money laundering at the KDD workshop on Anomaly Detection in Finance.

### Graphs Workshops in the top ML conferences

- NeurIPS 2019 Workshop about “Graph Representation Learning”
- ICLR 2019 Workshop about “Representation Learning on Graphs and Manifolds”
- ICML 2019 Workshop about “Learning and Reasoning with Graph-Structured Representations”

## Graphs in Recommender Systems

The relations between heterogeneous nodes in graphs have proven to be a rich source of information for recommender systems and entity representation. Some recent examples:

- **Collaborative Similarity Embedding for Recommender Systems**^{1}. In The World Wide Web Conference (pp. 2637-2643). In this paper, the authors present a “unified framework that exploits comprehensive collaborative relations available in a user-item bipartite graph for representation learning and recommendation.” To determine relations between these two types of entities, the authors make use of proximity relations that capture both explicit (user-item) and implicit (user-user and item-item) relations.
- **N2VSCDNNR: A Local Recommender System Based on Node2vec and Rich Information Network**^{2}. IEEE Transactions on Computational Social Systems, 6(3), 456-466. In this paper, the authors presented a “novel clustering recommender system based on node2vec technology and rich information network.” Specifically, their proposal (i) transforms bipartite graphs to the corresponding single-mode projections (i.e., user-item relations to user-user and item-item ones); (ii) learns node representations using node2vec; and (iii) links users’ and items’ clusters to obtain personalized recommendations.
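The implicit user-user and item-item relations both papers rely on come from one-mode projections of the bipartite graph. A minimal sketch of that projection (pure Python; function and variable names are our own):

```python
from collections import defaultdict
from itertools import combinations

def project(bipartite_edges, side="user"):
    """One-mode projection of a user-item bipartite graph.

    Two users become linked if they interacted with at least one common
    item (and symmetrically for items); the edge weight counts how many
    neighbors they share.
    """
    # group one side's nodes by their neighbors on the other side
    neighbors = defaultdict(set)
    for user, item in bipartite_edges:
        if side == "user":
            neighbors[item].add(user)  # users sharing this item get linked
        else:
            neighbors[user].add(item)  # items sharing this user get linked
    weights = defaultdict(int)
    for group in neighbors.values():
        for a, b in combinations(sorted(group), 2):
            weights[(a, b)] += 1
    return dict(weights)

edges = [("u1", "i1"), ("u2", "i1"), ("u2", "i2"), ("u3", "i2")]
user_graph = project(edges, side="user")
# -> {('u1', 'u2'): 1, ('u2', 'u3'): 1}; u1 and u3 share no item, so no edge
```

In N2VSCDNNR-style pipelines, node2vec would then be run on each projection to learn the user and item representations.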

## Other papers we found relevant

A review of the state of the art that provides a taxonomy grouping graph neural networks into four categories:

- graph recurrent neural networks
- graph convolutional neural networks
- graph autoencoders
- spatial-temporal graph neural networks

Another review that addresses the following topics:

- Hyperbolic Graph Embeddings
- Logics & Knowledge Graph Embeddings
- Markov Logic Networks Strike Back
- Conversational AI & Graphs
- Pre-training and Understanding Graph Neural Nets

Graph Inference using conditional probabilities: Variational Spectral Graph Convolutional Networks. The authors propose a “Bayesian approach to spectral graph convolutional networks (GCNs) where the graph parameters are considered as random variables. We develop an inference algorithm to estimate the posterior over these parameters and use it to incorporate prior information that is not naturally considered by standard GCN.”

## Tutorials and books

- Neo4j book with practical examples in Neo4j and Apache Spark.
- Dissemination video from Neo4j at the Spark Summit.
- Knowledge Graphs tutorial

## Commercial tools

- Amazon Neptune
- Tiger Graph: Native Parallel Graph Platform
- Graphext, a platform for graph data analysis and visualization (that you don’t need a strong technical background to use).

## Open Source Code Repositories

- Upenn GNNs
- Tutorial for Knowledge Graphs in linguistics
- Notebooks from a workshop at AMLD 2019 about Network Science, Spectral Graph Theory, Graph Signal Processing, and Machine Learning.
- A multi-agent simulator of anti-money laundering
- Open source library based on TensorFlow that predicts links between concepts in a knowledge graph. Developed by Accenture.
- Performance comparison of graph algorithms on GPUs
- Knowledge Graph open source tool: Blue Brain Nexus. A very interesting initiative, and an opportunity to put it into practice at a workshop at Applied Machine Learning Days 2019.

In this post, I would like to explain the topic of my work during my 2018 internship, continuing the research I did in 2017 and explained in another post. The problem we are trying to solve is joint classification and tag prediction for short texts.

## Tag prediction and classification

This machine learning problem arises in practical applications such as categorizing product descriptions on ecommerce sites, classifying names of shops in business datasets, or organizing titles and questions in online discussion forums. In applications related to banking, this problem can appear in transaction categorization engines, for instance as part of personal finance management applications, where the task is to assign a category, and possibly tags, to an account transaction based on the words appearing in it.

In all these cases, the text strings describing products, shops or post titles can normally be classified into a main category (e.g. Joan’s Coffee Shop belongs to the category “Bars and Restaurants”), but we might want to infer other informative tags (such as “coffee” or “take-away”) as well.

Predicting categorical outputs, or predicting an open set of tags from a short text, can be tackled with deep learning techniques. However, the two problems are not fully independent, since we may want to impose consistency between category and tags. The challenge, therefore, is to design a model able to jointly learn tags and categories, given some input sentences (and possibly some known tags).

Additionally, there are situations where explainability is required, and thus we need insight into why a short text was assigned a given class. In text classification, one example is pointing to the words that carried most of the information for the classification, typically achieved through attention models. Again, a challenge is how to add explainability to joint category and tag prediction. The final goal was to explore a neural network architecture that takes as input a sentence and, optionally, a set of known tags, and is able to fulfill the following three properties (we were not aware of any existing short-sentence classification method with all three):

- Classify the sentence, taking into account the observed words and tags
- Predict missing tags, based on observed words and the other tags
- Score the input words and tags by importance; i.e. quantify or explain how much each observed word or tag contributes to the decision.

The model I present here uses a first attention mechanism to build an embedding of the sentence in terms of the input words and tags. This embedding is used to score concepts from a fixed concept vocabulary. Finally, the embeddings of the scored concepts are pooled into a final sentence representation that can be used for classification. The concept scoring can be interpreted as a second attention mechanism, in which the input sentence is “reconstructed” in terms of a vocabulary of known concepts. An illustration of this network is depicted below:
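The two-stage attention pipeline described above can be sketched with plain dot-product attention (a simplified numpy sketch under our own assumptions; the dimensions, the query vector and all names are invented, and the real model is trained end to end rather than using random weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(token_embs, concept_vocab, query, class_weights):
    """Sketch of the two-attention pipeline:
    1) attention over input word/tag embeddings -> sentence embedding,
    2) score a fixed concept vocabulary with that embedding,
    3) pool the scored concept embeddings -> final representation -> class scores.
    """
    # 1) first attention: weight each input word/tag by similarity to a query
    alpha = softmax(token_embs @ query)       # importance of each input token
    sentence = alpha @ token_embs             # weighted sum = sentence embedding
    # 2) second attention: "reconstruct" the sentence over known concepts
    beta = softmax(concept_vocab @ sentence)  # concept scores = induced tags
    final = beta @ concept_vocab              # pooled concept representation
    # 3) linear classifier on the pooled representation
    return alpha, beta, final @ class_weights

rng = np.random.default_rng(0)
d, n_tokens, n_concepts, n_classes = 8, 5, 10, 3
alpha, beta, scores = classify(rng.normal(size=(n_tokens, d)),
                               rng.normal(size=(n_concepts, d)),
                               rng.normal(size=d),
                               rng.normal(size=(d, n_classes)))
# alpha explains token importance, beta gives induced concept/tag scores
```

The point of the sketch is the interpretability: `alpha` scores the observed words and tags, and `beta` scores the concept vocabulary, giving both explanations and induced tags for free alongside the class scores.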

Therefore, with this model we obtain classification, but also interpretability in terms of input words and tags, and induced tags. We have applied this model to the task of classifying Stack Overflow questions and to the industrial task of classifying names of shopping establishments in transactional data. A technical report of the internship can be found here.

At BBVA we have been working for some time to leverage our clients’ transactional data and Deep Learning models to offer a personalized and meaningful digital banking experience. Our ability to foresee recurrent income and expenses in an account is unique in the sector. This kind of forecasting helps customers plan budgets, act upon a financial event, or avoid overdrafts. All of this, while reinforcing the concept of “peace of mind”, which is what a bank such as BBVA aims to herald.

The application of Machine Learning techniques to predict an event as being recurrent, together with the amount of money involved, allowed us to develop this functionality. As a complement to this project, at BBVA Data & Analytics we are investing in research and development to study the feasibility of Deep Learning methods in forecasting^{1} . This has already been explained in the post “There is no such thing as a certain prediction”. The goal was not simply to improve the current system, but to generate new knowledge to validate these novel techniques.

As a result, we have observed that Deep Learning contributes to reducing forecasting errors. Nonetheless, we have also seen that there are still cases in which certain expenses are not predictable. Indeed, plain Deep Learning regression does not offer a mechanism to determine uncertainty and hence measure reliability.

Making good predictions is as important as detecting the cases in which those predictions have an ample range. Therefore, we would like to be able to include this uncertainty in the model. This would be useful not only for showing clients reliable predictions but also for prioritizing actions related to the results shown. This is why we are now researching Bayesian Deep Learning models that can measure uncertainty related to the forecast.

## Measure uncertainty to help clients

Detecting user behaviors that can be predicted with confidence requires an analysis of the concept of the **uncertainty of the prediction**. But what is the source of uncertainty? Although the clarification of this concept is still an open debate, in Statistics (and at times in Economics) it is classified into two categories: **aleatoric uncertainty** (i.e. the uncertainty due to the variability of the different possible correct solutions given the same information) and **epistemic uncertainty** (i.e. the uncertainty related to our ignorance of the best model to use to solve our problem, or even our ignorance vis-à-vis new kinds of data that we were not able to appreciate in the past).

From a mathematical point of view, we could try to find a function that, given certain input data (the logged transactions of a user), returns the value of the next transaction in the time series as accurately as possible. Nevertheless, there are limits to this approach: in our case, given the same information (past transactions), the outcomes are not necessarily the same. Two clients with identical past transactional behavior will not necessarily make similar transactions in the future.

The following figure visualizes the concept of uncertainty and tries to answer the question of what the value of the red dot would be during a certain time interval.

In this case, a model with or without uncertainty would predict the following outputs:

To approximate this function, Deep Learning algorithms are a good state-of-the-art solution, thanks to their capacity to fit very complex functions to large datasets. Built in the standard way, a Deep Learning model provides a “pointwise” estimate of the value to predict but does not reflect the level of certainty of that prediction. What we are looking for here is a **probability distribution over the possible predictions, obtained using Deep Learning algorithms**.

We can model two types of aleatoric uncertainty: *homoscedastic* uncertainty, which assumes a constant variance for every client, and *heteroscedastic* uncertainty, which accounts for the variance of each individual client given their transactional pattern.
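In practice, the heteroscedastic case is commonly trained by making the network output both a mean and a (log-)variance per sample and minimising the Gaussian negative log-likelihood. A minimal numpy sketch of that loss (our own illustration of the standard technique, not BBVA's production model; the data below is invented):

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Heteroscedastic regression loss: the model predicts mu and
    log(sigma^2) for each sample.

    Predicting the log-variance keeps sigma^2 positive and is numerically
    stable. High-variance samples get their squared error down-weighted,
    but pay a 0.5*log_var penalty, so the model cannot simply claim
    infinite uncertainty everywhere.
    """
    return np.mean(0.5 * log_var + 0.5 * (y - mu) ** 2 / np.exp(log_var))

y = np.array([1.0, 2.0, 3.0])
mu = np.array([1.1, 1.8, 3.5])
# homoscedastic: one shared variance; heteroscedastic: one per sample
homo = gaussian_nll(y, mu, np.zeros(3))
hetero = gaussian_nll(y, mu, np.array([-2.0, -2.0, 1.0]))
```

When `log_var` is held constant across samples (the homoscedastic case), the loss reduces, up to constants, to the usual mean squared error.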

In any case, modeling aleatoric uncertainty has proven more useful to us than modeling epistemic uncertainty. In fact, we have seen a clear improvement in our ability to obtain a measure of confidence for each prediction, especially because we can gauge the variance.

The type of uncertainty we have taken into account is extremely useful when creating a new model to forecast income and expenses. In the following plot, each point is a user, plotted by their real prediction error versus the predicted uncertainty score. We see that most yellowish points (the majority of clients) have both a low error and a low uncertainty score, which allows us to use this score as a confidence score.

The potential applications of this approach to business range from recommending measures to avoid overdrafts, creating shortcuts for the most common transactions, and improving financial health, to identifying unusual transactions in the transaction history. We could also give the client feedback on how certain our predictions of future occurrences are.

One question that naturally springs up when imagining what Artificial Intelligence (AI) can bring to the banking industry, and one that we get asked fairly often, is: can you predict people’s expenses? As is often the case, such a simple question is in fact only *apparently* simple. The prediction of personal financial transactions may range from estimating the amount of your next electricity bill (a simple problem, in most cases), all the way to guessing the time and amount of your next ATM withdrawal (a seemingly impossible task, in most cases).

With all its inherent difficulties, anticipating the behaviour of personal accounts is a challenge that holds a special place in any modern platform of *intelligent* banking services. For this reason, we have been investigating the best way to tackle this problem. This week we are presenting some results at the Workshop on Machine Learning for Spatiotemporal Forecasting, part of the NIPS 2016 conference. Our contribution is entitled Evaluating uncertainty scores for deep regression networks in financial short time series forecasting^{1}. Here is a walk-through:

## Keep it short

To use jargon common to both statistics and Machine Learning, we were facing a time-series regression problem, something that any trained statistician would normally not even blink at. However, one detail made our problem setting perceivably harder than usual: its *coarseness*. In the analysis of personal financial transactions, it often makes sense to aggregate data at a monthly timescale, since many important financial events only happen with a monthly cadence (think payroll), or a multiple thereof (think utility bills), or would introduce excessive noise in the data if considered at a finer level (think grocery shopping). Furthermore, we wanted to be able to predict expenses for *as many clients as possible*, including those with relatively short financial histories. This forced us to work with just a year’s worth of data at monthly aggregation. In other words, for each client and transaction category, we only had 12 historical values. We had to guess the 13th.

While this class of problems has been widely studied, especially for long series, some of the best-suited statistical methods for these prediction problems, such as Holt-Winters or ARIMA, tend to produce poor results on such short time series. This led us to look at our problem from a different perspective, that of Machine Learning.

## If it flies like a duck…

Instead of trying to predict the 13th value in our series solely by looking at each of them individually, as classical time series methods would normally do, we adopted a common underlying principle in many Machine Learning algorithms:

If it looks like a duck, swims like a duck and quacks like a duck, then it probably is a duck.

This disarmingly informal statement (sometimes referred to as the duck test) is a humorous phrasing of a common methodology among machine learners: if a *target* time series is sufficiently similar to *training* time series whose 13th value is known, then the behaviour observed in the training series should help us guess how the target series will evolve.

We therefore started experimenting with classic approaches that directly leverage this principle, such as Nearest Neighbour regression and Random Forest regression. Worryingly, under most of our error metrics these supposedly cleverer methods were losing badly to much more naive solutions, such as predicting the arithmetic mean of the series, or simply using the value 0 for all predictions (an effect of the sparsity of many financial time series). It was clear that those short 12-value sequences contained more structure than these methods were able to capture.
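The duck-test baseline is simple to sketch: predict a series' 13th value as the average of the known 13th values of its nearest training neighbours. A toy numpy version on synthetic data (the data-generating process, names and parameters here are all invented for illustration):

```python
import numpy as np

def knn_forecast(train_x, train_y, target, k=3):
    """Predict the 13th value of `target` (a 12-value series) as the mean
    of the known 13th values of its k nearest training series."""
    dists = np.linalg.norm(train_x - target, axis=1)  # Euclidean "looks like a duck"
    nearest = np.argsort(dists)[:k]
    return train_y[nearest].mean()

rng = np.random.default_rng(1)
# synthetic "clients": a per-client level plus noise; the 13th value
# stays near the same level
levels = rng.uniform(0, 100, size=200)
series = levels[:, None] + rng.normal(0, 1, size=(200, 12))
next_vals = levels + rng.normal(0, 1, size=200)

# forecast client 0 from the other 199 clients
pred = knn_forecast(series[1:], next_vals[1:], series[0])
# the prediction lands near client 0's own level
```

On real, sparse financial series this kind of baseline is exactly what underperformed for us; the sketch only illustrates the principle, not the result.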

In order to squeeze as much information as possible from our short series, in a collaboration with the University of Barcelona we developed a deep neural network which employed Long Short-Term Memory (LSTM) units in order to exploit their power to *remember* past information, a property that makes them useful tools to learn sequential structures.

The margin of improvement was evident. The next figure compares error and success metrics of different prediction methods, including our LSTM network.

## An odd duck

While greatly improved, the error-metrics values were still high. Higher than you’d want for any client-facing application ready to be released into the world.

An often underestimated implication of the duck-test approach is that if a target series doesn’t really have any other series that looks similar, many algorithms will still stubbornly output a prediction – only, it will be *unreliable*. It goes without saying that an *uncertainty score* (or *confidence* value) accompanying a prediction is a highly desirable feature for an algorithm, but one that deep networks, including ours, usually lack (a matter of very active research).

In order to estimate the uncertainty of our deep-network predictions, we turned our attention to a few quantities we could compute that were reminiscent of the concept of uncertainty. For example, while predicting from the nearest time series was not the most accurate method, the distance to the most similar-looking series can be considered a proxy for prediction uncertainty (duck test!), as can the reconstruction error of a marginalised denoising autoencoder, or distances between the series *representations* found in the last layer of our LSTM network (an idea that proved greatly successful in the field of Computer Vision). To further extend our list of uncertainty-estimation methods, we also tried a supervised approach, training a Random Forest to predict the prediction errors directly.

Finally, we applied the duck-test principle to prediction models, rather than to the data: If a target series gets similar predictions from models that have been partially corrupted by noise, this means that the original model is sufficiently confident of its knowledge of the series’s pattern. We implemented this approach by applying dropout noise on the trained network, hence bootstrapping 95% confidence intervals on our predictions.
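The dropout-at-test-time idea can be sketched as follows: keep dropout active after training, run many stochastic forward passes, and take percentiles of the resulting predictions as a confidence interval. A minimal numpy sketch on a toy linear "network" (all names, shapes and parameters are our own; the real setup applies this to a trained LSTM):

```python
import numpy as np

def mc_dropout_interval(x, weights, p_drop=0.2, n_samples=500, seed=0):
    """Monte Carlo dropout: re-apply random dropout masks to trained
    weights and bootstrap a 95% interval from the spread of the outputs."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_samples):
        mask = rng.random(weights.shape) >= p_drop        # drop ~p_drop of units
        preds.append(x @ (weights * mask) / (1 - p_drop))  # inverted-dropout scaling
    preds = np.array(preds)
    return preds.mean(), np.percentile(preds, 2.5), np.percentile(preds, 97.5)

x = np.ones(50)
w = np.full(50, 0.1)  # without dropout the prediction would be exactly 5.0
mean, lo, hi = mc_dropout_interval(x, w)
# the interval brackets the deterministic output; its width is the uncertainty
```

The width of `(lo, hi)` is the quantity of interest: a network whose prediction barely moves under corruption is confident about the series' pattern; one whose predictions scatter widely is not.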

To benchmark the confidence-estimation methods we used curves of Mean Absolute Relative Error (MARE) versus coverage (the fraction of non-rejected samples). Intuitively, methods producing better prediction-uncertainty estimates keep the MARE low for larger proportions of the test set. Note that this allows us to filter out the series the algorithm is not confident about, leaving a much cleaner dataset to use in production environments.
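Concretely, such a curve is built by sorting the test samples from most to least confident and recomputing the MARE over each growing most-confident fraction. A small numpy sketch (our own illustration; the toy data and names are invented):

```python
import numpy as np

def mare_vs_coverage(y_true, y_pred, uncertainty):
    """Sort samples from most to least confident, then report the mean
    absolute relative error (MARE) at every coverage level."""
    order = np.argsort(uncertainty)  # most confident first
    rel_err = np.abs(y_true - y_pred)[order] / np.abs(y_true)[order]
    n = len(order)
    coverage = np.arange(1, n + 1) / n            # fraction of samples kept
    mare = np.cumsum(rel_err) / np.arange(1, n + 1)  # running MARE
    return coverage, mare

y_true = np.array([10.0, 10.0, 10.0, 10.0])
y_pred = np.array([10.5, 9.0, 13.0, 6.0])
good_score = np.array([0.1, 0.2, 0.3, 0.4])  # uncertainty tracks the true error
cov, mare = mare_vs_coverage(y_true, y_pred, good_score)
# a good uncertainty score keeps MARE low until coverage approaches 1.0
```

With an uninformative (e.g. random) uncertainty score, the curve stays near the full-test-set MARE at every coverage level instead of starting low, which is exactly the gap the benchmark measures.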

We compared the different confidence calculation methods we implemented with simple rejection baselines, such as estimating uncertainty only by looking at the absolute value of predictions or by a completely random value. Perhaps unsurprisingly, given its supervised nature, random forest regression achieved the best confidence estimation. Among unsupervised methods, the best estimator we evaluated is the distance to the n-th neighbour in the space of embeddings produced by the last dense layer of the regression network. We found it interesting that the density in the network embedding space is a relatively good proxy of the uncertainty.