The Difference Between Irrelevant Statistics, Context & Actionable Data

20th August, 2019.

Email: Sportsanalyticsadvantage@gmail.com


Think about the last cricket statistic that you read, or listened to, via the media.  I’d wager that, if you thought about it, you’d struggle to find any way to act upon that statistic.

Some of the numerous statistics discussed in the media might focus on the length of a player’s innings.  For example, player X’s innings was the fourth longest by an England number six against Australia on a rainy day.  Ok, so I made up the rainy day bit, but the rest certainly applies to regular statistics mentioned both in the media and on social media.  When I read a statistic like this, my immediate reaction is along the lines of ‘that’s nice’.  In reality, that’s all it is - it is utterly irrelevant - what future benefit can we derive from knowing this information?  The answer, of course, is very little.

Cricket is full of these largely meaningless statistics.  These even extend to the scorecards for matches - why, for example, are maiden overs recorded for a T20 bowler, when they’re as likely as finding a needle in a haystack?  Why are the minutes of a player’s batting innings recorded, when they have little relevance to, or even interest for, any reader of the scorecard?

Performing a survey of responses would almost certainly yield an answer relating to tradition, or ‘that’s how it’s always been done’.  And that’s it, in a nutshell.  Because someone, many years ago - probably hundreds of years ago - decided that a scorecard should comprise certain elements, that’s how they’ve continued to be.  

The problem with a lot of statistics is that they are either irrelevant - see above - or lack context.  In this recent article, when discussing his retention as India head coach, Ravi Shastri was quoted as saying 'Data is something that got crunched a lot.  For instance, over the last two years, India have won 71% of their matches across formats'.  

On the surface, this appears to be a relatively impressive percentage.  Certainly, if it were the win percentage of a franchise T20 head coach who worked solely in leagues featuring teams with equal budgets, it would be superb, worth mentioning, and unheard of over a decent sample size of data.  However, with the financial disparities between nations, and the vast talent pool that India boast, the statement lacks a little context.  From it, it is impossible to quantify whether this is actually a good or bad win percentage.  An ideal, hypothetical, contextual response perhaps would have been 'India have won 71% of their matches across formats - this is six percentage points above the 65% mean expected win percentage established by three independent statistical modelling companies'.  This would immediately establish that the win percentage was indeed excellent, and would provide concrete context relevant to the discussion.

A good example of a contextual stat would be the speed readings shown on the media for Jofra Archer in the Second Test against Australia, and how his spells and peak speeds read compared to other English fast bowlers in recent years.  Any viewer would be able to see he was bowling fast, but perhaps would be less sure about how fast.  Providing this data relative to previous fast bowlers Harmison, Flintoff and Finn, for example, allows long-term cricket followers to understand quite how fast he was bowling.  

However, while providing context, this still doesn't provide any real way of being able to act upon it - it's not particularly actionable.  

Recently, there has been a debate about whether Rishabh Pant or Shreyas Iyer should bat at four or five for India in ODIs.  Initially, Pant was backed by the team to bat at four, with Iyer at five.  Then around a week afterwards, coach Shastri was quoted as saying 'Shreyas Iyer, for instance, he is going to stay at No. 4'.  At the time Pant was given the nod to bat at four, I tweeted suggesting there was a decent argument to reverse these positions, while not being in favour of fixed batting positions in any case.

We can, in fact, use actionable data to make a better case for Iyer at four and Pant at five.  In 50-over cricket, a general strategy is to bat solidly (although not necessarily slowly) through the middle overs prior to attacking in the death overs.  Even England, currently the most attacking ODI batting unit in the world, have Ben Stokes at five (capable of playing a solid or spectacular innings), Jos Buttler batting at number six and often, another strong boundary-hitter in Moeen Ali at seven - they understand the value of excellent boundary-hitters coming in during the last 10-15 overs.  A look at Iyer and Pant's data, compared to the mean for World Cup teams against each other in ODIs from 1st January 2017 onwards - actionable and contextual data - suggests that there is a clear argument to bat Iyer at four and Pant at five (or even six), if we assume a fixed batting order:


Player                                         % Boundary Runs   Boundary Percentage   Non-Boundary Strike Rate
Rishabh Pant                                   57.64             13.08                 47.09
Shreyas Iyer                                   50.29             11.93                 59.72
World Cup Teams Overall (Against Each Other)   47.85             9.71                  50.96


This simple table shows that not only is Pant a better boundary hitter than Iyer (higher % of boundary runs, and higher boundary percentage), he's also far above the average ODI player for recent World Cup teams in matches against each other during the same time period.  If you were designing a cricket managerial computer game and had to give him a specific role it would surely be 'boundary hitting batsman'.  Pant's non-boundary strike rate - a measure of how good a player is at rotating the strike - is lower than the ODI average for World Cup teams, and considerably worse than Iyer's, again questioning his viability over Iyer in those rotation-heavy middle overs.  In short, this actionable and contextual data clearly indicates that Pant is better in a role where he's got a licence to hit boundaries (and less onus on him to rotate the strike), while Iyer - a high-level rotator - would be better in those middle overs.
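As a rough illustration of turning the table above into something a selector could act upon, the sketch below measures each player against the World Cup teams baseline.  The figures are copied from the table; the dictionary keys and the function names (diff_vs_baseline, better_at) are hypothetical names chosen for this example, and this is a minimal sketch rather than a description of any particular modelling approach.

```python
# Figures copied from the table above (ODIs from 1st January 2017 onwards).
# The key names are labels invented for this sketch.
BASELINE = {
    "boundary_runs_pct": 47.85,
    "boundary_pct": 9.71,
    "non_boundary_sr": 50.96,
}

PLAYERS = {
    "Rishabh Pant": {
        "boundary_runs_pct": 57.64,
        "boundary_pct": 13.08,
        "non_boundary_sr": 47.09,
    },
    "Shreyas Iyer": {
        "boundary_runs_pct": 50.29,
        "boundary_pct": 11.93,
        "non_boundary_sr": 59.72,
    },
}


def diff_vs_baseline(stats, baseline=BASELINE):
    """Return each metric minus the baseline value, rounded to 2 places."""
    return {m: round(stats[m] - baseline[m], 2) for m in baseline}


def better_at(metric, players=PLAYERS):
    """Return the name of the player with the higher value for a metric."""
    return max(players, key=lambda name: players[name][metric])


if __name__ == "__main__":
    for name, stats in PLAYERS.items():
        print(name, diff_vs_baseline(stats))
    print("Better boundary hitter:", better_at("boundary_pct"))
    print("Better strike rotator:", better_at("non_boundary_sr"))
```

A positive difference means the player is above the baseline for that metric, so the comparison carries its own context rather than quoting a raw number in isolation.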

As for a fixed batting spot, it really depends.  Pant would be great at number four if he's coming in at say, India being 200-2 with 15 overs left.  It would be logical that Iyer would be the better choice at four if he came in when India were 50-2 after ten overs - again illustrating why fixed batting orders are generally a bad idea.  Virat Kohli has suggested that number four and five are 'floating batting positions', an approach that makes sense.

Moving back to irrelevant statistics, Michael Lewis’ excellent book, Moneyball, referenced the influence of cricket in box score adaptation in baseball, and how this eventually caused a lack of understanding of how to assess the individual contribution of players.  In similar fashion, cricket scorecards have also failed to evolve, particularly in the shortest major format, T20.

Removing maiden overs and minutes of an innings from a T20 scorecard and player statistics databases would be just the start.  Including dot balls for batsmen and bowlers, and fours and sixes conceded by bowlers as well as hit by batsmen, would be much more useful.  So would batsman v bowler match-ups for each individual match as standard (so that they can be used as a guide to batsman abilities against various bowling types), plus phase data for batsmen and bowlers alike.

Also misleading is the failure to split records by standard of cricket.  For example, an English player’s first-class data will span Division One, Division Two, or both.  However, in the record books, a batsman averaging 40 exclusively in Division Two somehow looks the same as a rival batsman averaging 40 exclusively in Division One - there is no split in Divisions recorded.

This is like saying that a footballer who scored 30 goals in 40 matches playing in the English Championship and then scored 10 in 30 in the subsequent season in the English Premier League has scored 40 goals in 70 matches across two seasons - of course they have, but the majority were at a lower standard, and such data is utterly misleading when assessing future expectations in the higher league.  If even a computer game, such as Football Manager, can record such nuances across individual player data, why can’t cricket record books and websites?
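The footballer example can be worked through in a few lines.  The sketch below uses only the goals and matches quoted above; the record layout and the goals_per_match helper are hypothetical, chosen purely for illustration.

```python
# The two seasons quoted in the footballer example above.
seasons = [
    {"league": "Championship", "goals": 30, "matches": 40},
    {"league": "Premier League", "goals": 10, "matches": 30},
]


def goals_per_match(records):
    """Goals per match across whichever records are passed in."""
    goals = sum(r["goals"] for r in records)
    matches = sum(r["matches"] for r in records)
    return goals / matches


# Pooled record: 40 goals in 70 matches, which hides the change of standard.
pooled = goals_per_match(seasons)

# Split record: one rate per league, which is what the record books lack.
by_league = {r["league"]: goals_per_match([r]) for r in seasons}

if __name__ == "__main__":
    print(f"Pooled: {pooled:.2f} goals per match")
    for league, rate in by_league.items():
        print(f"{league}: {rate:.2f} goals per match")
```

The pooled figure sits well above the Premier League rate on its own, which is why it misleads when assessing future expectations in the higher league.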

More recently, some individuals, companies and the media have also started to discuss particular batsman v bowler match-ups.  This might be referenced as follows - ‘Batsman X has faced 40 balls against Bowler Y, scoring 50 runs and being dismissed four times’.  While this type of data is more useful than recording maiden overs for a T20 bowler, or a T20 batsman’s innings duration in minutes, assuming such patterns will persist still contains an element of danger, given the small sample sizes usually involved.  Many sample sizes used and quoted in the public domain would be unlikely to pass any form of confidence testing from a statistical perspective, while in some cases, the people giving the stats don't even mention the sample size of the data.
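To give a feel for how little 40 balls can tell us, the sketch below puts a 95% Wilson score interval around the dismissal rate in the match-up quoted above (four dismissals in 40 balls).  The Wilson interval is one common choice for a proportion and is an assumption of this example, as is treating every ball as an independent trial; the wilson_interval function is a hypothetical helper, not a reference to any specific confidence test used in the industry.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%).

    Assumes each trial (here, each ball faced) is independent, which is
    a simplification for cricket data.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z2 / (4 * trials ** 2)
    ) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # The match-up quoted above: four dismissals in 40 balls.
    low, high = wilson_interval(4, 40)
    print(f"Dismissal rate per ball: {low:.3f} to {high:.3f}")
```

Under these assumptions the plausible dismissal rate per ball runs from roughly 4% to roughly 23%, a range far too wide to base a selection or bowling plan on with any confidence.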

Perhaps more useful than an individual batsman v bowler match-up would be a discussion on how that batsman performed against a given bowling style.  We might be able to get a decent sample of a T20 regular batsman against right-arm pace, for example, although less common bowling types will often still suffer from small sample sizes.  In many cases, perhaps the best option is just to separately classify a player’s ability against spin or pace bowling - at least we can count on a fairly robust sample size of data for many players.

Compare the previous statement, ‘Batsman X has faced 40 balls against Bowler Y, scoring 50 runs and being dismissed four times’, with the following statement - ‘across various T20 leagues worldwide, Batsman X has faced 1000 balls against pace bowling with a strike rate of 150, and has faced 1000 balls against spin bowling with a strike rate of 100’.  

The benefit of the latter statement is that it offers actionable data from a reasonable sample size.  We can now start to consider that Batsman X has a clear strength against pace bowling - his strike rate is in excess of the worldwide mean for any T20 league - while having a weakness against spin bowling, with a 100 strike rate clearly being poor in any T20 league.

With this in mind, teams across different leagues around the world will be able to assess that particular player's value, which will vary based on the expected conditions.  Teams who can understand and use this data can immediately reduce the likelihood of a bad piece of recruitment, while improving their chances of making signings of higher expected value.


If this article has given you insight into the data that Sports Analytics Advantage can offer cricket franchises around the world in formulating team strategies, draft or auction plans, or any other work, please feel free to enquire at sportsanalyticsadvantage@gmail.com.
