It’s not unusual for people to complain about the timeliness of macroeconomic statistics: governments revise GDP figures and unemployment rates well after the fact. When it comes to economic statistics, it seems, it’s tough to make predictions even about the past. Official indicators have other limitations too: the data behind them is coarse in geographic granularity and rather static when it comes to tracing changes over time.
A better way of doing this analysis was explored in a paper that won the Best Paper Award at the 2015 data science conference of the Academy of Science and Engineering. The paper, “Predicting Regional Economic Indices Using Big Data Of Individual Bank Card Transaction”, by a group of researchers from the MIT Senseable City Lab and co-authored by our Head of Urban Analytics, Juan Murillo, explored the hypothesis that common macroeconomic indices can be predicted from credit card transaction data collected by BBVA. If correlations with generally accepted economic indicators could be found, that would open the possibility of using this type of dataset for deeper economic analysis than we currently get: analysis that is more timely, more granular, and able to account for variables that today’s figures can’t capture.
We have created a dataset derived from transactions with BBVA credit cards and debit cards, plus all credit card transactions at BBVA point-of-sale terminals. This provides an embarrassment of riches with regard to economic data, which leads to the challenge of deciding which data maps well to predicting which economic indices. Six socioeconomic indicators were chosen because they are widely used, easy to measure, and good indications of quality of life:
- GDP
- housing price level
- unemployment rate
- crime rate
- percentage of higher education
- life expectancy
Preparing the BBVA data in a format that could be correlated with these economic indicators was an important first step. First, since BBVA has different market penetration in different areas, the data had to be adjusted to account for this and avoid bias. Second, we are dealing with data of many types: currency amounts, percentages, and simple counts. The data was normalized so that we could carry out meaningful comparisons.
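As a rough sketch of these two preparation steps, consider the following example. The figures and penetration rates are invented for illustration; the real adjustment in the study was naturally more involved:

```python
import numpy as np

# Hypothetical per-province figures: raw card spend and BBVA's market
# penetration (BBVA's share of card activity) in each province.
raw_spend = np.array([120.0, 340.0, 85.0, 210.0])   # e.g. millions of euros
penetration = np.array([0.30, 0.45, 0.15, 0.25])    # BBVA share per province

# Step 1: correct for uneven market penetration so provinces are comparable.
adjusted = raw_spend / penetration

# Step 2: z-score normalization so variables measured in different units
# (currency, percentages, counts) live on one common scale.
normalized = (adjusted - adjusted.mean()) / adjusted.std()

print(normalized.round(2))
```

After this kind of normalization every variable has zero mean and unit variance, which is what makes cross-variable comparison meaningful.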
Our work and methodology
With data normalized in this manner, logistic regression is a common supervised machine learning technique to apply. The purpose here is to find which of our variables can predict our indicators. We started with 35 variables, still a rather unruly number to work with. Many variables may appear to correlate with the quantity we want to predict, but in truth one variable may drive the prediction while others merely correlate with that one, introducing more noise than clarity. The team used Principal Component Analysis (PCA), a widely used dimensionality reduction technique, to reduce our 35 variables to 16 components, and later in the experiment they were able to reduce this to 6 components.
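The reduction step can be sketched in a few lines with scikit-learn. The matrix below is random stand-in data with the same shape as the study's inputs (52 provinces, 35 variables), since the real variables are not public:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical feature matrix: 52 provinces x 35 transaction-derived variables.
X = rng.normal(size=(52, 35))

# Reduce the 35 correlated variables to 16 principal components,
# discarding the directions that carry little variance.
pca = PCA(n_components=16)
components = pca.fit_transform(X)

print(components.shape)                       # (52, 16)
print(pca.explained_variance_ratio_.sum())    # share of variance retained
```

The `explained_variance_ratio_` attribute tells you how much of the original variance survives the reduction, which is the usual way to decide how many components to keep.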
After all this preparation, our team was ready for the fun part: seeing whether the hypothesis held. In supervised machine learning, data is separated into training and testing sets: the training set is used to fit the model, and the testing set is used to validate that the model works in new situations. The team chose 34 of Spain’s 52 provinces as the training set and the remaining 18 as the test set, in various combinations over several iterations.
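A minimal version of this split-and-fit procedure, again on synthetic stand-in data (a binary label such as unemployment above or below the national median is assumed purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical setup: 52 provinces, 16 PCA components each, and a binary
# label (e.g. unemployment above/below the national median).
X = rng.normal(size=(52, 16))
y = (X[:, 0] + 0.5 * rng.normal(size=52) > 0).astype(int)

# Hold out 18 of the 52 provinces for testing and train on the other 34,
# mirroring the split used in the experiment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=18, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out provinces
```

Repeating the split with different random states, as the team did over several iterations, guards against one lucky or unlucky partition of the provinces.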
The model worked very well, although not perfectly. Five of our six indicators showed a high degree of correlation; only the crime rate did not fit the model, apparently due to some outliers it simply could not account for. GDP, housing prices, unemployment, education, and life expectancy could all be predicted with our model.
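The kind of agreement being measured here is, at its simplest, a correlation between predicted and official values on the held-out provinces. The numbers below are invented to show the mechanics, not results from the paper:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical official values of one indicator across 18 test provinces,
# and predictions that track them with some noise.
official = np.array([2.1, 3.4, 1.8, 4.0, 2.9, 3.1, 2.5, 3.8, 1.9,
                     2.2, 3.6, 2.7, 3.0, 1.7, 4.2, 2.4, 3.3, 2.8])
predicted = official + np.random.default_rng(2).normal(0, 0.3, size=18)

# A Pearson correlation near 1 means the model's predictions move in
# lockstep with the official figures.
r, p_value = pearsonr(official, predicted)
print(round(r, 2))
```

An indicator like the crime rate, with a few extreme outlier provinces, would drag this correlation down even if most predictions were reasonable.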
OK, so now we know that we can reach conclusions we already knew, just in a more timely fashion. That raises the question: what can we learn that we couldn’t know before? The BBVA data possesses unique attributes that make possible insights not achievable with current indicators. Official statistics describe the situation at the provincial level, but with our approach it’s possible to carry out the analysis at the city or even neighborhood level. BBVA data can also be tracked by time of day, day of week, and time of year, allowing cyclical time-series analysis at a level of detail not available before.
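Because each transaction carries a timestamp, this cyclical slicing is straightforward. A toy example with a handful of made-up transactions, assuming a pandas DataFrame as the working representation:

```python
import pandas as pd

# Hypothetical transaction log; the real dataset records each card
# transaction with a timestamp, amount, and location.
tx = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-03-02 09:15", "2015-03-02 13:40", "2015-03-03 19:05",
        "2015-03-07 11:30", "2015-03-08 20:45", "2015-03-09 08:50",
    ]),
    "amount": [12.5, 48.0, 33.2, 75.9, 22.1, 9.8],
})

# Aggregating by hour of day and by day of week is the basis of the
# cyclical time-series analysis that official statistics cannot support.
by_hour = tx.groupby(tx["timestamp"].dt.hour)["amount"].sum()
by_weekday = tx.groupby(tx["timestamp"].dt.day_name())["amount"].sum()

print(by_hour)
print(by_weekday)
```

The same groupby pattern extends naturally to neighborhood-level keys once a location column is attached to each transaction.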
The Economist reported last year that Big Data has mostly improved the work of microeconomists, in turn raising the prestige of microeconomics relative to that of the traditionally more prominent macroeconomists, who have up to now tended to neglect advances in analytics. We believe our work is an important step in bringing Big Data to the macro level.