Predicting the outcome of a single game, or of the whole competition taking place in Russia these days, is difficult: many unforeseeable, off-model events can occur in each of the 90-minute games played over these four weeks.
To help those who combine data, football instinct and an innate inclination to invest money unwisely in order to predict the exact score of a game in a betting pool (informally called a porra in Spain, or a polla in some parts of Latin America), we have developed a visualization that might help to understand the possible outcomes of different matches.
Many considerations play a role in the long and winding road to international football history. A single player (the cunning Maradona, the energetic Pelé or the elegant Johan Cruyff) can change the football fate of an entire country. A single mistake can also bring a very well-prepared squad to its knees against all odds: North Korea beat Italy in 1966 and finished second in its group.
We have used our client2vec algorithm, which creates embeddings to represent a client's behaviour and produces a categorization that goes beyond sociodemographic parameters. In this case, we estimated what percentage of each player's game is backward, central or forward by looking at the positions of players with a similar playing style. In doing so we leveraged client2vec’s ability to achieve a consistent balance between precision and recall. In this context, this means client2vec can help us accurately define a player’s position by spotting a relatively small number of similar players. Furthermore, we extracted the on-field behaviour of entire teams by combining the playing styles of their players (our analysis includes the entire roster and does not distinguish between the first team and the rest).
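The idea of reading a player's backward, central and forward shares off his most similar players can be sketched as follows. This is only a minimal illustration, not the client2vec implementation: the function names, the use of cosine similarity, the value of k and the toy two-dimensional embeddings are all assumptions made for the example.

```python
import math
from collections import Counter

POSITIONS = ("back", "center", "forward")


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def position_mix(target, labelled, k=3):
    """Share of each position among the k players most similar to `target`.

    `labelled` is a list of (embedding, position) pairs. Both the similarity
    measure and k are assumptions of this sketch.
    """
    ranked = sorted(
        labelled,
        key=lambda item: cosine_similarity(target, item[0]),
        reverse=True,
    )
    neighbours = ranked[:k]
    counts = Counter(position for _, position in neighbours)
    return {pos: counts.get(pos, 0) / len(neighbours) for pos in POSITIONS}


# Toy embeddings (hypothetical values, not real player data).
labelled_players = [
    ([1.0, 0.0], "forward"),
    ([0.9, 0.1], "forward"),
    ([0.5, 0.5], "center"),
    ([0.0, 1.0], "back"),
    ([0.1, 0.9], "back"),
]

mix = position_mix([0.8, 0.3], labelled_players, k=3)
```

With these toy values the three nearest neighbours are two forwards and one center player, so the target comes out as mostly forward with a central component rather than receiving a single hard label.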
We have trained player2vec (as we have renamed our algorithm) on features gathered over the last 3 years for every participant in the most important global football competition, including goals, dribbles, successful passes and minutes played. When used to assign a player to a single position, player2vec reaches an accuracy of around 70%, already an improvement over using the raw playing statistics for the same classification task. Past experience with client2vec showed us that the algorithm could successfully spot behavioural similarities even between data points whose labels differ, which is why we chose to evaluate how much of a player’s game was forward, central or backward rather than sticking to a simple classification task.
We have added a measure of competitiveness that allows us to gauge the performance of a player by the matches he has played in the last 3 years. If the player has been playing in high-level leagues (such as the Bundesliga, La Liga or the Premier League), he will have a higher competitiveness score than someone who has played in minor competitions. This score is far from ideal, because it treats all the teams in the same competition equally (playing for Numancia earns the same score as playing for FC Barcelona).
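One way such a score could be computed is as a minutes-weighted average of per-competition weights. The sketch below is an assumption about the general shape of the calculation, not the formula used for this story: the weights in `LEAGUE_WEIGHT`, the default weight and the function name are all hypothetical.

```python
# Hypothetical weights per competition; the real values are not given in the text.
LEAGUE_WEIGHT = {
    "Bundesliga": 1.0,
    "La Liga": 1.0,
    "Premier League": 1.0,
    "Minor league": 0.4,
}


def competitiveness(appearances, default_weight=0.2):
    """Minutes-weighted average of competition weights.

    `appearances` is a list of (competition, minutes_played) pairs covering
    the period of interest. The club is deliberately not an input, which is
    exactly the limitation described above.
    """
    total_minutes = sum(minutes for _, minutes in appearances)
    if total_minutes == 0:
        return 0.0
    weighted = sum(
        LEAGUE_WEIGHT.get(competition, default_weight) * minutes
        for competition, minutes in appearances
    )
    return weighted / total_minutes


# A player with 1800 minutes in a top league and 600 in a minor one.
score = competitiveness([("La Liga", 1800), ("Minor league", 600)])
```

Because only the competition enters the calculation, two players with the same minutes in the same league receive the same score whatever club they play for.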
To add a bit more dimensionality to our data, we have also extracted a measure of each player's value as perceived by the market. For this we used the estimated market values compiled by the German site Transfermarkt, one of the most popular sites of its kind, which holds statistics on thousands of players.
By aggregating the data of all their players (including those who are normally substitutes) we have been able to understand where on the field a team plays most comfortably. The data show that traditionally strong national teams combine a defensive-center component with a moderate forward component. Even the big football stars, such as Neymar, Messi or Ronaldo, are forward-leaning but retain an important midfield component.
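The aggregation step can be pictured as a plain average of the players' backward, central and forward shares over the whole roster. The text does not say how the players were combined, so the unweighted mean, the function name and the three toy players below are assumptions for illustration only.

```python
COMPONENTS = ("back", "center", "forward")


def team_profile(player_mixes):
    """Average each component over every player in the roster.

    `player_mixes` is a list of dicts with the keys in COMPONENTS. Starters
    and substitutes are treated alike (an unweighted mean is an assumption).
    """
    if not player_mixes:
        raise ValueError("roster is empty")
    n = len(player_mixes)
    return {c: sum(mix[c] for mix in player_mixes) / n for c in COMPONENTS}


# Hypothetical three-player roster.
roster = [
    {"back": 0.6, "center": 0.3, "forward": 0.1},
    {"back": 0.2, "center": 0.5, "forward": 0.3},
    {"back": 0.1, "center": 0.3, "forward": 0.6},
]

profile = team_profile(roster)
```

If each player's shares sum to one, the team profile does too, so it can be read as percentages in the same way as the per-team figures quoted in this story.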
In general, all the teams competing this summer for the title of world’s best replicate that defensive-center nature, but within that pattern there are remarkable differences.
Germany is one of the most balanced teams in this summer's world competition, and still this was not enough to avoid defeat. By analyzing its players' data and comparing it with that of rivals such as Mexico and Sweden, we can spot a lack of versatile defenders in Germany, even though it is one of the most expensive teams and faced less valuable, yet more evenly distributed, teams. We can also see an uncommon distribution of players along the base of the triangle onto which we project our three dimensions.
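Projecting three shares that sum to one onto a triangle is commonly done with barycentric (ternary) coordinates. The sketch below shows that standard mapping. Which corner of the visualization holds which component is not stated in the text, so the vertex layout here (back and forward on the base, center at the top) is an assumption, as are the names.

```python
import math

# Assumed layout: "back" and "forward" on the base, "center" at the apex.
VERTICES = {
    "back": (0.0, 0.0),
    "forward": (1.0, 0.0),
    "center": (0.5, math.sqrt(3) / 2),
}


def to_triangle(mix):
    """Map a back/center/forward mix to (x, y) inside an equilateral triangle.

    The shares are normalised first, so percentages or raw counts both work.
    """
    total = sum(mix[c] for c in VERTICES)
    if total <= 0:
        raise ValueError("mix must contain a positive total")
    x = sum(mix[c] / total * VERTICES[c][0] for c in VERTICES)
    y = sum(mix[c] / total * VERTICES[c][1] for c in VERTICES)
    return x, y


# A player with no central component sits on the base of the triangle.
on_base = to_triangle({"back": 0.5, "center": 0.0, "forward": 0.5})
# A perfectly balanced player sits at the centroid.
balanced = to_triangle({"back": 1, "center": 1, "forward": 1})
```

Under this layout, players gathered along the base are those with little or no central component, split only between backward and forward play.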
Spain has a squad that loves playing in the center of the field (with a 32% center component, the highest of any team). Russia is the most forward-leaning team (with a 35% forward component), similar to Belgium.
Mexico is one of the most balanced teams (26% forward and 25% center), meaning that it could adapt to a versatile way of playing, similarly to France or Uruguay.
At the defensive end of these three dimensions we find Brazil (with a 51% back component). This could mean that most of its players display a defensive-midfield style, while a few very effective players, such as Neymar, Coutinho or Jesus, lead the squad up front.
We filtered the results to the countries still in the competition, just as the final stretch is about to start: eight countries with different playing styles, whose defensive, central and offensive factors you can explore in the following interactive.
As you can see in the visualization above, teams sit higher on the Y axis the greater the total market value of their players. Taking this into consideration, the favourites should be Brazil or England. But money is not everything: reality has shown us that the value of the players means nothing if there is no cohesion or no smart leader behind the squad. The competitiveness score we have created for this story (which is based on the overall score of a country’s competitions as computed by FIFA) suggests that we should also consider countries like Uruguay.
Can data also point to the players who might make the difference? We think so. By looking at the forward, midfield or defensive character of a player and comparing him with the neighbors that have reached the top of the market, we can find some players who could be key to carrying their teams to the final stages of the World Cup.
Judging by their characteristics and the level of the competitions in which they play, a few players in this World Cup could be considered rising stars whose performance can take their teams beyond what is expected of them. Belgium's Michy Batshuayi and France's Florian Thauvin could become the secret drivers of their teams’ success.
We can also play a game of impossibles and imagine what would happen if we replaced a whole national team with players of similar playing styles. Would we get the same results in the same hypothetical game? We will never know.
Monreal: Dendoncker (Belgium); Azpilicueta: Fagner (Brazil); Busquets: Laxalt (Uruguay); Thiago Alcántara: Guardado (Mexico); Ramos and Nacho: hard to compare; Saúl: Paulinho (Brazil); Odriozola: Thauvin (France); Carvajal and Iniesta: hard to compare; Koke: Augusto (Brazil); Lucas Vázquez: Lemar (France); Rodrigo: Januzaj (Belgium); Isco: Dembélé (France); Asensio: Griezmann (France); Aspas: Rashford (England).
Disclaimer: This data story has been developed by a team formed by a Data Scientist (Leonardo Baldassini), a visual storyteller (Iskra Velitchkova) and a communicator (Jairo Mejía) with little to no knowledge of international football. We have only used data that could be potentially enriched to return better results. All these insights have been extracted exclusively from data analysis and for research purposes only.