
The best of both worlds

I have been listening to the new BBC podcast hosted by Steven Pinker, ‘Think Twice’. The podcast features Sig Mejdal, former NASA engineer and assistant general manager of the Baltimore Orioles baseball team. He talks about how algorithmic knowledge and expert knowledge can be combined in sport. Speaking of his experience in baseball, he says:

“I think any decision maker would say ‘I want the best of both worlds’ but then they stop talking. The interesting part is how exactly are you going to combine them.”

Golden Guts

He then tells the story of a model he developed to project college players’ performance in the major leagues. He recalls a time when he ran that model and, out of thousands of college players, out came a guy who played for the baseball team at Stanford University. Living just a few blocks away, Sig went to watch their game. On the pitch he noticed an undersized number 4 who looked as if he had been given a place in the squad only because Stanford had run out of scholarships. When he matched the name with the shirt number, he realised this was the player his model recommended. He describes how bad he felt about the embarrassment this result could provoke if the guy was not as good as the data suggested. He mentions the episode as an example of what he calls the ‘perfect dichotomy’ in sport, with scouts using their senses and scorers using data. Scorers, by the way, are those involved in the creation and maintenance of statistical data about baseball, who later became known as sabermetricians, i.e. those who, like Sig Mejdal, use the data of the Society for American Baseball Research (SABR) to evaluate (i.e. ‘meter’) players. He recounts that in the end, despite his data, he couldn’t convince the guts of the management to hire player number 4. What Sig Mejdal describes as guts (aka ‘golden guts’, cf. Thomas Davenport) are domain experts’ prior beliefs about what a baseball player should look like.

Clinical versus Actuarial Method

Mejdal continues by giving an interesting account of the epistemics of the sport industry and of how data-driven insight is used: “The industry I am in is a little different than the clinical versus actuarial method [….]”. I went on to look up what the clinical versus actuarial method was. It turns out to be something behavioural scientists have researched since 1954, when Paul Meehl published his classic work on the subject. Simply put, the clinical method is when the human expert decides (clinical because most of the experimental evidence comes from clinical studies, where the expert is a clinician), while the actuarial method means statistical methods, or any mechanical method, that ‘eliminate’ human judgement.
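To make the distinction concrete, here is a minimal sketch in Python. The cues, weights and numbers are all invented for illustration; they are not taken from Meehl’s work or from Mejdal’s model.

```python
import numpy as np

# Three hypothetical scouting cues for a prospect, each scaled to 0-1:
# college batting performance, plate discipline, age relative to the league.
cues = np.array([0.72, 0.55, 0.80])

def actuarial_score(cues, weights=np.array([0.5, 0.3, 0.2])):
    """Actuarial/mechanical rule: a fixed, pre-specified combination of the cues.
    Given the same cues it always returns the same number."""
    return float(weights @ cues)

def clinical_score(cues, expert_adjustment=0.0):
    """Clinical judgement: the expert weighs the same cues informally and may
    nudge the estimate up or down ('he just looks too small out there')."""
    return float(cues.mean() + expert_adjustment)

print(actuarial_score(cues))        # 0.685, every time
print(clinical_score(cues, -0.15))  # depends on how the expert feels that day
```

The point of the contrast is only that the mechanical rule is fixed in advance, while the clinical estimate carries a free, case-by-case adjustment.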

The world of broken legs

Mejdal’s account of coaches, managers and sporting directors wanting ‘the best of both worlds’ begins like that of many of my football data analyst informants. It also confirms what I found in my research about the workflow of sport data: data-driven results are returned to the domain expert (the coach or the director), who takes them into the decision room.

“[….] after we squeezed all we can out of the numbers the scout expertise still adds predictive ability and vice versa. It’s just an exercise in combining info to give what I think is the base rate and now it’s up to the decision maker to stray from there.”

However, he immediately argues that mechanical and clinical methods shouldn’t be combined:

“these are two different methodologies. You cannot both accept and reject a player.”

He cites results from the behavioural studies literature showing that human judgers who are given the algorithm’s prediction, and can combine it with their own judgement in any way they want, often do worse than the algorithm alone. (In this literature, a ‘broken leg’ is the rare piece of case-specific information, such as knowing that a habitual moviegoer has a broken leg tonight, that seems to justify overriding the statistical prediction.) “And that’s the world of broken legs. It’s going to be human nature to stray too much. And so what we have often done is keep track of these strays and these supposed broken legs and we look back retrospectively: if you are 1 for 10 with the broken leg sensing you may want to reconsider and just be more conservative in that, because I think we are going to undo much of the analytics if we are looking for the feeling in our gut to be satisfied before we make the decision.” Mejdal’s critical turn on the use of data in sport reveals what to me is the true nature of data scientists’ ambition of a ‘perfect dichotomy’ between data-driven and expert-driven analyses: human judgement shouldn’t be supported; it should be eliminated.
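A rough sketch of the retrospective bookkeeping Mejdal describes might look like the following: log every occasion a decision maker overrode the model’s base rate on a broken-leg hunch, then check how often the override actually beat the model. The records below are invented purely for illustration.

```python
# Each record notes one 'broken leg' override and whether, in hindsight,
# the override turned out better than the model's recommendation.
override_log = [
    {"case": "prospect A", "override_beat_model": False},
    {"case": "prospect B", "override_beat_model": True},
    {"case": "prospect C", "override_beat_model": False},
]

def override_hit_rate(log):
    """Fraction of broken-leg overrides that turned out better than the model."""
    if not log:
        return None
    return sum(rec["override_beat_model"] for rec in log) / len(log)

rate = override_hit_rate(override_log)
print(f"Overrides beat the model {rate:.0%} of the time")
# If this ends up looking like '1 for 10', Mejdal's advice is to override less.
```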

Clinical/Actuarial vs Accuracy/Interpretability

Listening to Sig Mejdal discuss the clinical vs actuarial method on the podcast reminds me of a related but separate debate in data science: the debate on accuracy versus interpretability. As famously observed by Breiman, a classical statistician turned data scientist, there seems to be a trade-off in machine learning between interpretability and accuracy. Traditional statistical methods such as linear regression provide a picture of the relation between response and predictor variables that can be understood by many outside the AI field. However, statistical methods are less accurate when it comes to identifying a good predictor.

Talking about a real-world example analysing the speed of trials based on court data systems, Breiman reports that when using decision trees, a highly interpretable form of machine learning, the results are immediately apparent to a non-expert audience (in his case an assembly of judges). According to Breiman, the interpretability of the decision trees lies in the fact that the assembly immediately started commenting on the algorithmic results, saying: ‘I knew those guys were dragging their feet’. In another note from the same paper, discussing the more inscrutable method of random forests, Breiman adds that in his experience, despite its predictive successes in the medical sector, doctors can interpret logistic regressions but not fifty decision trees hooked together in a random forest, so in a choice between accuracy and interpretability medical professionals will always go for interpretability.

The clinical vs actuarial judgement literature predates state-of-the-art machine learning applications for algorithmic judgement, but when read side by side with the current accuracy vs interpretability debate it raises an aspect that is too often forgotten. The point shouldn’t be whether simple methods such as linear regression are more or less accurate than more advanced AI, but which one helps human judgers to perform better. That is, if data science’s real priority weren’t simply to identify a good predictor …
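As a small, hedged illustration of the trade-off Breiman describes, the sketch below compares a logistic regression, whose coefficients a domain expert can read and argue with, against a random forest of many trees on a synthetic dataset. The data and any resulting scores are illustrative only; this is not Breiman’s court or medical data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 cases, 8 predictor variables.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

interpretable = LogisticRegression(max_iter=1000).fit(X_train, y_train)
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# One weight per predictor: the part a non-expert audience can narrate.
print("logistic regression coefficients:", interpretable.coef_.round(2))
print("logistic regression accuracy:", interpretable.score(X_test, y_test))

# 200 trees voting together: often a bit more accurate, much harder to narrate.
print("random forest accuracy:", black_box.score(X_test, y_test))
```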

How much improvement does human judgement need?

Looking for an alternative agenda to that of eliminating human judgement, and embracing the interpretability view of data science as helping human judgement, a different take on the ‘best of both worlds’ argument becomes possible. First, let’s have another look at Mejdal’s broken leg argument. Instead of seeing the glass as half empty: if data analytics could provide coaches and sporting directors with a base rate for their decision making, that could in part help human judgement. For example, a data-driven base rate can at least mitigate the well-known fallacy of human thinking that consists in the use of the anchoring and adjustment heuristic. The behavioural research literature suggests that human judgers start an estimate from an initial value (the anchor) and then adjust toward a final answer. The anchoring and adjustment heuristic has two problems. One is that people often use an irrelevant or weakly relevant initial value. Then there is the problem of insufficient adjustment, i.e. the world of broken legs Sig Mejdal is talking about. Intuitively, having a data-driven base rate would solve at least the first part of the fallacy, namely starting an estimate from an irrelevant value.

Mejdal’s point is that human adjustments to the base rate will “undo much of the analytics” and that “decision makers should be more conservative with the use of the broken leg sensing”. Perhaps his analysis does not take sufficiently into account the role of intuition and confidence in elite performance and sport. Using more analytics and being conservative are not always the best strategies, especially if we consider the role that overthinking plays in the widely recognised phenomenon of choking under pressure across a range of sports. Another sector-specific factor is competitiveness. Interestingly, Mejdal touches upon the fact that the innovation brought about by applying mathematical models to the game, which was initially thought to produce giant rewards, disappeared in a very short time once everyone else adopted it. Golden guts and heuristics are perhaps widely adopted in the elite sport business precisely because they are seen as less replicable than mathematical models, thus allowing clubs an advantage over their opponents.

Embracing the interpretability argument also means looking at the ‘best of both worlds’ position by admitting there is always a clinical element within mechanical judgement, a subjective element within the objective. Humans are inextricably entwined in developing algorithms, and in many cases provide the expert knowledge of which cues should be used. These considerations are not meant to undermine the benefit of mechanical models. Interestingly, there is evidence that even when a mechanical model is constructed from clinical ways of combining information (for example the above-mentioned anchoring and adjustment heuristic), it already outperforms human judgement for the simple fact of being consistent, returning the same decision each time. Human judgement, on the contrary, has a tendency to be inconsistent: when presented with the same information on different occasions, humans often draw different conclusions. In other words, when it comes to execution, taking the outputs of the mechanical model gives a better outcome.
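A minimal sketch of that consistency point, with entirely invented cues and ratings: fit a simple linear ‘model of the judge’ to an expert’s past ratings, then note that it returns the same answer every time it sees the same case, whereas the expert’s own ratings drift from one occasion to the next.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
cues = rng.uniform(size=(50, 3))                 # 50 past prospects, 3 cues each
true_weights = np.array([0.5, 0.3, 0.2])
# The expert roughly follows these weights, plus day-to-day noise.
expert_ratings = cues @ true_weights + rng.normal(scale=0.1, size=50)

# A linear 'model of the judge', fitted to the expert's own past judgements.
judge_model = LinearRegression().fit(cues, expert_ratings)

same_prospect = np.array([[0.7, 0.6, 0.8]])
print(judge_model.predict(same_prospect))  # same answer every time it is asked
print(judge_model.predict(same_prospect))  # ... and again
# The human expert, shown this prospect on two different days, may not agree
# with themselves; the model of the judge, by construction, always does.
```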
Rather than trying to eliminate human judgement, a better take on the ‘best of both worlds’ argument would be to admit that there exist simple mathematical models that improve clinical judgement by making it more consistent while at the same time, for the very reason of being based on it, providing a high degree of interpretability, i.e. human judgers recognise how the models ‘think’ and are therefore more inclined to use their results. Why look further? What is the bias in the mechanisation of clinical methods that needs to be resolved by inscrutable AI applications which fundamentally aim to eliminate the methods currently used to take decisions in their respective fields? How much improvement does human judgement need? If domain experts still struggle to absorb the advantages of simple models, how (and why) should they come to grips with state-of-the-art AI applications?
