As data scientists, we love participating in initiatives outside the scope of our daily jobs. This gives us the chance to learn new things that are not directly related to our field of expertise and to take a fresh look at complex analytical problems. At the same time, these experiences allow us to collaborate with colleagues who normally work on separate projects, and hence to enrich our network of contacts. Last year, we participated for the first time in the FEIII 2018 challenge. The experience was so rewarding that we have decided to collaborate with the University of Maryland in organizing this year’s challenge as well.
What does FEIII stand for?
The Financial Entity Identification and Information Integration (FEIII) challenge is hosted by the workshop “Data Science for Macro-Modeling with Financial and Economic Datasets” (DSMM). This workshop is held in conjunction with the SIGMOD Conference, one of the best-known conferences in the field of database management.
Over the last few years, SIGMOD has broadened its scope to include the application of machine learning to database management problems, as well as end-to-end machine learning.
The DSMM workshop serves two purposes. Firstly, it aims to extract useful insights from financial data. On the one hand, there are multiple open data sources ready to be used for this purpose. On the other hand, there are multiple key industry players and government organizations interested in these insights. There is plenty of room to make important contributions to this field.
Secondly, the workshop aims to find the most appropriate methods for dealing with this task and to build a benchmark of different approaches. This way, the same methods can later be extrapolated to other data. This second purpose is very useful for BBVA Data & Analytics, where we deal with very different data in terms of privacy, language, and features. Nevertheless, we face similar challenges when it comes to cleaning and integrating separate data sources or building a financial knowledge graph.
DSMM tries to achieve its goal by gathering a community of people from both academia and industry and encouraging them to collaborate. This is where the FEIII challenge comes into play: it is a long-running challenge (lasting over a month), so that participants can go beyond preliminary approaches. However, organizing a challenge is by no means straightforward.
What are the difficulties in organizing the FEIII challenge?
One of the main difficulties when trying to put different people together to work towards the same goal is the data. Because of privacy policies, data cannot be easily shared. Therefore, as the starting point, the FEIII challenge tries to focus on public data.
This year we are lucky to count on Enigma to provide a terrific dataset, full of economic signals and analytic challenges. Only recently, Forbes described Enigma as a company providing free curated public data. Its ability to make rapid sense of this data and link it to private data has attracted some of the world’s leading companies, from BlackRock to PayPal.
The Dataset: U.S. Customs and Border Protection’s ‘Automated Manifest System’ (AMS)
This year the challenge is based on a comprehensive dataset of bill of lading header information from the U.S. Customs and Border Protection Agency’s Automated Manifest System (AMS), covering incoming U.S. shipments in 2018.
This dataset provides a fascinating look into U.S. commercial trade, and therefore into a huge part of world trade. It provides information on goods that arrive at U.S. ports on containerized shipping from all over the world. With more than 16 million records for the first half of 2018, it is also a test of your data processing skills.
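At this volume it usually pays to stream the file row by row instead of loading it whole. The following is a minimal sketch of that idea using only the Python standard library. The column names (`shipper_name`, `origin_country`, `port_of_unlading`) and the sample rows are hypothetical stand-ins for illustration, not the actual AMS schema or data.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column):
    """Stream rows one at a time and count the values seen in one column."""
    counts = Counter()
    for row in csv.DictReader(lines):
        # Normalize case and whitespace so "France" and "france" are merged.
        value = (row.get(column) or "").strip().upper()
        if value:
            counts[value] += 1
    return counts


# Tiny in-memory stand-in for the real file; names and rows are made up.
sample = io.StringIO(
    "shipper_name,origin_country,port_of_unlading\n"
    "Acme Wines,France,New York\n"
    "Bodega Sol,Spain,Miami\n"
    "ACME WINES,france,Newark\n"
)

top_countries = count_by_column(sample, "origin_country")
print(top_countries.most_common(2))  # [('FRANCE', 2), ('SPAIN', 1)]
```

With a real export, the same function can be fed an open file handle (for example `open(path, newline="", encoding="utf-8")`), so memory use stays flat regardless of the number of records.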
You might want to have a look at some of the insights that Ben Matheson has provided in this visualization.
The Challenge: Mapping trade
The AMS dataset is rich in both macroeconomic signals and microeconomic information on exporting companies. In order to please every data scientist at FEIII 2019, we have designed the following two tasks:
- A SCORED task that focuses on finding exporters for a given product and country. Such reference datasets have significant commercial value; for example, exporters are often target customers for a financial services company.
- An OPEN task that targets the creativity of the participants and may answer interesting questions such as:
  - Summary of trends; visualization of flows; outliers.
  - Given an industry sector, characterize the most significant products, sources, and ports.
  - Given a product, identify potential bottlenecks including sources and ports of entry.
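To make the SCORED task more concrete, here is a rough sketch of one naive baseline: filter records by country and a product keyword, then group shipper name variants under a normalized key. This is only an illustration under stated assumptions, not the organizers' method or scoring procedure. The record fields (`shipper`, `country`, `description`), the suffix list, and the sample records are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately short list of legal-form suffixes to strip.
LEGAL_SUFFIXES = {"SA", "SL", "LTD", "LLC", "INC", "CO", "GMBH", "SRL"}


def normalize_name(name):
    """Uppercase, strip punctuation, and drop trailing legal-form tokens."""
    cleaned = name.upper().replace(".", "")      # "S.A." becomes "SA"
    tokens = re.sub(r"[^A-Z0-9 ]", " ", cleaned).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def find_exporters(records, product_keyword, country):
    """Map each normalized shipper name to the raw variants that matched."""
    keyword = product_keyword.upper()
    target = country.upper()
    exporters = defaultdict(set)
    for rec in records:
        if rec["country"].upper() != target:
            continue
        # Plain substring match: simple, but "WINE" would also match "SWINE".
        if keyword not in rec["description"].upper():
            continue
        exporters[normalize_name(rec["shipper"])].add(rec["shipper"])
    return dict(exporters)


# Made-up records for illustration only.
records = [
    {"shipper": "Bodegas Rioja, S.A.", "country": "Spain", "description": "RED WINE 1200 CASES"},
    {"shipper": "BODEGAS RIOJA SA", "country": "SPAIN", "description": "wine, bottled"},
    {"shipper": "Olivar del Sur S.L.", "country": "Spain", "description": "OLIVE OIL"},
    {"shipper": "Chateau Nord", "country": "France", "description": "WINE"},
]

result = find_exporters(records, "wine", "Spain")
print(sorted(result))  # ['BODEGAS RIOJA']
```

Exact matching on a normalized key is the simplest form of record linkage. On real data, misspellings and abbreviations would likely call for fuzzy matching or clustering on top of it.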
Important dates
- March 10: Release Datasets.
- April 22: Abstract submission to DSMM Workshop.
- May 1: Early registration deadline for SIGMOD 2019 and DSMM.
- May 15: Scoring of participant solutions.
- May 31: Camera-ready short paper submission to DSMM Workshop.
- June 30: DSMM Workshop.
Still not sure about participating?
Have a look at the data! Try to find your favorite food or wine in the dataset browser provided by Enigma.
You will discover how real datasets challenge real data scientists. Interested in unsupervised data cleaning, graph analytics, record linkage, or collective text classification? Curious how they scale to millions of records?
We look forward to receiving your submission by May and to welcoming you to SIGMOD in Amsterdam!