The Science of Success
Through reading the paper “A public data set of spatio-temporal match events in soccer competitions” I landed on the page of the latest edition of the “Workshop on Machine Learning and Data Mining for Sports Analytics”, originally planned to take place in Ghent, Belgium. That’s the seventh edition, meaning it has been around since 2013. Compared to the more senior MIT Sloan Sports Analytics Conference – dating back to 2007 – this workshop seems to have a different focus. While MIT Sloan wants real impact on sport – along the same lines as “Moneyball”, the Book of Genesis of sport analytics, released in 2003 – the Ghent workshop seems to make the case for an academic impact of sport analytics. For example, quoting from the website, the workshop aims to show the benefits of machine learning with respect to the statistical techniques that are said to be dominant in the field (indeed, the ‘Statistics in Sports’ section of the American Statistical Association was introduced in 1992 and, by the way, the foundational Dixon & Coles model, dated 1997, is of course built on classical statistics: a Poisson model fitted by maximum likelihood). It also seems to suggest that football data can be used to generate insights that go beyond the specific domain: complex systems and human mobility, to quote the keywords used by one of the authors, Luca Pappalardo, who also speaks about sport analytics as a sub-field of an all-encompassing science of success. I would personally agree with that – keep reading. (Btw, I remember having animated discussions with my friend, the data scientist Franz Kiraly, on how to define success when I organised the Turing Data Study Group with PlayerLens.) It is also fascinating to see sport analytics picked up as a case study alongside migration, societal debates and poverty in the EU-funded SoBigData project. One of the features of sport analytics as a domain for data science is its immediately public dimension, compared with the other case studies of the project.
Data science is a science of success also in a sense that bleeds into its epistemological dimension. Knowledge in data science is legitimate as long as it’s successful, i.e. as long as it wins data science competitions and/or makes teams win games. How much better would it be if the examples used to illustrate the success of predictive analytics could stem from football rather than beers and diapers?
Where are these success examples by the way?
Last week I had the privilege of having a conversation with Emanuele Massucco, co-author with Luca Pappalardo of the open dataset on football data discussed in the paper. Emanuele is product developer at Wylab, a football analytics company that makes precision tagging its main selling point. Massucco describes their main competitor Opta, whose main client is the media, as less concerned with precision than with real-time tagging. The other difference is that Wyscout covers 380 football competitions across the world. Compared to the few major leagues covered by Opta, this makes the Wyscout database irreplaceable when it comes to scouting. In our conversation, asked why one should use Wyscout data, he puts beating the next opponent first, ahead of recruiting better talent. However, the examples he gives in the full hour of our conversation, taken from dashboards Wyscout recently produced for clubs, are all from the transfer market. I asked him why in that order, then. The other success stories that come to my mind are all from the transfer market too – see Billy Beane … have a look at Midtjylland as well. Providing a data-driven understanding of the game is top of the list not only in the talk of Massucco, who appropriately shows up in the conversation in a Wyscout tracksuit. It is also the top priority in the job description for an AI Scientist position recently advertised by Manchester City to work with their Data Insight and Decision Technology team. Key outcome number one is to research & develop AI models that will evolve the tactical principles utilised by teams across the City Football Group. Why is it so important for data science to be able to produce tactical nous?
As in our earlier conversation, Massucco also talks about data integration: tracking data, fitness data and ball event data; internal data and data from different providers. This confirms other evidence from my fieldwork that the idea of integration is well-rehearsed in industry parlance. Many in football analytics describe the relationship between data science and other aspects of football knowledge as complementary. However, I do not think data scientists are prepared to accept this. The fundamental principle of unsupervised machine learning is to see how far we can get without knowing anything about what’s in the data. That is why data science is poised to have more success in the transfer market or in betting, where data integration is not as big an issue. Integration is still a barrier on the football side of things as well. See for example the recent jobs advertised by Man City that I spoke about earlier: no mention of football knowledge, or of a need to come to terms with it, in either the essential or the desirable criteria. Integration indeed also means collaboration: working together on the data from different perspectives. The public imaginary solicited by the press is, and has been ever since the inception of sport analytics with sabermetrics, that of a computer that replaces the coach. That creates the opposite reaction: analytics gets derided by a real football man like Harry Redknapp when it is not effective. Beyond epistemological divides and the technical problems of synchronising IDs across databases, in an industry as uneven as football there are more mundane barriers to data integration. One of the match analysts I have been in conversation with, who worked for many years at a Serie B club (Serie B is the second tier of Italian football), says that in lower leagues, when the fitness coach or the video analyst leaves the club, all the data follows them.
This is one problem that top clubs (Juventus and Inter, to my knowledge) have addressed by creating their own internal research labs. I wonder whether this is a trend towards additional data secrecy, or whether the arrival of more data scientists will lead to the creation of more public football datasets.