What We Saw at ICDM 2017: Tackling Data Problems with Real World Examples

Maria Hernandez AI Factory was there

In November 2017, New Orleans, Louisiana, hosted the world's most important conference on data mining. The 17th edition of this event presented new work on graph analytics, time series patterns and recommender systems, among other disciplines.
Traditionally, this conference has focused on scalability and on practical applications of algorithms, and this year did not disappoint when it came to real-world examples. For BBVA Data & Analytics, where algorithms are applied to real, large-scale datasets to solve real-world problems efficiently, this was an opportunity to get a better sense of the challenges and solutions in a wide range of industries and disciplines.

The big numbers:

  • 778 papers submitted; a student was first author on more than 60% of them. 72 regular papers (9.3% acceptance rate) and 83 short papers were accepted, for a total acceptance rate of 19.9%.
  • 4 keynotes
  • 24 workshops

The conference is preceded by one day of workshops. This year, 11 full-day and 13 half-day workshops ran in parallel in a century-old hotel in the heart of the capital of jazz and of Cajun and Creole cuisine. The workshops included topics such as Sentiment Analysis, Network Analysis, Machine Learning and Spatiotemporal Data Mining, as well as Data Science applied to specific domains such as Human Capital Management, Healthcare, Cyber Security or Politics.

In the latter group, I found DSHCM (Data Science for Human Capital Management) quite interesting. Several talks showed the use of relational data and graph representations to analyze human interactions and to infer missing information from the connections between entities. As an invited speaker, Qi He from LinkedIn talked about how the LinkedIn Knowledge Graph has allowed them to analyze interactions between companies and people and to improve the underlying information.

In the High Performance Graph Data Mining and Machine Learning workshop, invited speaker Danai Koutra presented the work on graph mining done at the GEMS Lab at the University of Michigan. I recommend checking the lab's webpage to learn more about their projects, and in particular the work on summarizing graphs [1] and its time-evolving version [2]. These methods take a large network and extract common graph structures (cliques, stars, bipartite cores,…) from it. This makes it possible to represent the original network in terms of a vocabulary of well-known, interrelated structures and to present users with the most important information, helping them understand the macro structure of the graph.
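To make the idea of a "vocabulary" of structures a bit more concrete, here is a minimal sketch of how a small subgraph could be labeled as a clique or a star. It is only an illustration and not the method from [1] or [2]: the function name `classify_structure` and its inputs are hypothetical, and it assumes a simple undirected subgraph (no self-loops, every edge endpoint listed in `nodes`).

```python
def classify_structure(nodes, edges):
    """Label a small undirected subgraph as 'clique', 'star' or 'other'.

    Hypothetical helper for illustration only. Assumes a simple graph:
    no self-loops, and every edge endpoint appears in `nodes`.
    """
    nodes = set(nodes)
    # Store edges as unordered pairs so (a, b) and (b, a) count once.
    edge_set = {frozenset(edge) for edge in edges}
    n = len(nodes)
    if n < 3:
        return "other"

    degree = {node: 0 for node in nodes}
    for edge in edge_set:
        for node in edge:
            degree[node] += 1

    # A clique contains every possible pair of nodes.
    if len(edge_set) == n * (n - 1) // 2:
        return "clique"
    # A star has n - 1 edges, all of them attached to a single hub.
    if len(edge_set) == n - 1 and max(degree.values()) == n - 1:
        return "star"
    return "other"


print(classify_structure("abcd", [("a", "b"), ("a", "c"), ("a", "d")]))
```

A summarization method would go much further (finding candidate subgraphs in the first place and choosing which ones describe the network best), but the labeling step gives a feel for what the final summary is made of.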

The main conference was organized into four simultaneous tracks covering several currently relevant data science disciplines: Pattern and Text Mining, Recommender Systems, Sequences and Time Series, Deep Learning, Streaming and Online Learning, Clustering, and Graph Analytics. It also featured a panel on Ethics and Professionalism in the Age of Social Data.

Tracks aside, I saw four topics running through the talks: time series, graph analytics, scalability and open-source code released as libraries. Many of the works, regardless of the main paradigm they were addressing, used networks to model the problem.

Here is my list of papers that I find worth a look. They propose interesting solutions to very frequent problems in our field:

  • Discovery of Action Rules at Lowest Cost in Spark [3]. An action rule is a recommendation that, if followed, turns a variable from False (or an unwanted state) to True (or a desired state). The authors present a Spark algorithm that extracts action rules from a given dataset, taking into account that some actions are forbidden (they cannot be carried out) and that actions have different costs.
  • Spatio-Temporal Neural Networks for Space-Time Series Forecasting [4]. The authors present a neural network-based algorithm for time series forecasting that exploits the spatio-temporal relations among different series. This addresses a very frequent problem in time series forecasting: historical data is not available for a series, but there is a huge number of related series.
  • Scalable Hashing-Based Network Discovery [5]. In neuroscience and other fields, it is very common to build networks from correlated time series: each node represents a time series, and an edge is usually created when two time series are highly correlated. The brute-force algorithm for selecting those pairs is very inefficient. In this paper, the authors present a hashing-based method that builds the network much faster while producing a very similar outcome.

  • Dataset Construction via Attention for Aspect Term Extraction (ATE) with Distant Supervision [6]. The lack of annotated data is very frequently an issue when building statistical models. In particular, few datasets are available for supervised Aspect Term Extraction, and they cover only a few domains. This creates the need for new datasets to train the models. The authors have developed an algorithm that exploits annotated data to construct new annotated datasets for the ATE task. They use attention models to identify sentences that are very likely to contain relevant information for the task.
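
To illustrate the network-discovery problem described for [5], here is a minimal sketch that contrasts the brute-force approach with a hashing-based shortcut. Note that the hashing used here is generic random-hyperplane hashing, chosen for illustration; it is not the method proposed in the paper. All function names and parameters (`n_bits`, `n_tables`, `seed`) are my own assumptions, and the code assumes equal-length, non-constant series.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def znorm(series):
    """Rescale a series to zero mean and unit variance (must not be constant)."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    return [(x - mean) / std for x in series]


def correlation(x, y):
    """Pearson correlation of two equal-length series."""
    return sum(a * b for a, b in zip(znorm(x), znorm(y))) / len(x)


def brute_force_network(series, threshold):
    """Test every pair of series: the number of comparisons grows quadratically."""
    return sorted(
        (i, j)
        for i, j in combinations(range(len(series)), 2)
        if correlation(series[i], series[j]) >= threshold
    )


def hashed_network(series, threshold, n_bits=8, n_tables=4, seed=0):
    """Only test pairs of series that share a hash bucket in some table.

    Illustrative random-hyperplane hashing, not the algorithm from [5].
    Each bit is the sign of a projection onto a random direction, so highly
    correlated series tend to get the same signature.
    """
    rng = random.Random(seed)
    length = len(series[0])
    normed = [znorm(s) for s in series]
    candidates = set()
    for _ in range(n_tables):
        planes = [[rng.gauss(0, 1) for _ in range(length)] for _ in range(n_bits)]
        buckets = defaultdict(list)
        for idx, z in enumerate(normed):
            key = tuple(sum(p * v for p, v in zip(plane, z)) >= 0 for plane in planes)
            buckets[key].append(idx)
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    # Verify candidates exactly, so no edge below the threshold is ever returned.
    return sorted(
        (i, j) for i, j in candidates if correlation(series[i], series[j]) >= threshold
    )


wave = [math.sin(0.3 * t) for t in range(64)]
series = [wave, [2 * x + 3 for x in wave], [math.cos(0.3 * t) for t in range(64)]]
print(brute_force_network(series, 0.9))
print(hashed_network(series, 0.9))
```

The trade-off is the usual one for this kind of hashing: the hashed version can miss a few correlated pairs that never land in the same bucket, but it avoids comparing most of the unrelated ones, which is where the time goes when there are many series.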

Overall, I highly recommend this conference to anyone dealing with real-world data problems, as it is a good way to get to know the latest results on sophisticated and efficient large-scale data algorithms.

Hope you enjoy the readings!

[1] http://www.mlgworkshop.org/2016/paper/MLG2016_paper_29.pdf

[2] http://web.eecs.umich.edu/~dkoutra/papers/Timecrunch_KDD15.pdf

[3] http://ieeexplore.ieee.org/document/8215648

[4] http://ieeexplore.ieee.org/document/8215543/

[5] http://web.eecs.umich.edu/~dkoutra/papers/17_scalableNetDiscovery_ICDM.pdf

[6] https://arxiv.org/abs/1709.09220