Friday, January 30, 2015

Revisiting Stats stabilization


A warning before you start reading this: you can find a more polished version at Nylon Calculus (memo to myself: add the link here as soon as I have it). They also published another piece of mine and have a lot of other great stuff. But if I drew a Venn diagram of people who read my blog and people who read Nylon Calculus, I am pretty sure you know all this already...
This version has a few more (probably boring) details on why I find the previously used methods impractical. It also has a few more shiny plots, which in the end were not helpful for understanding. So, if you are here for the shiny plots, scroll down to the end. There is also an R script so that you can produce shiny plots yourself. You can find a GitHub repository for the R function that I wrote here.
Side note: One reason for this blog entry is that I'm starting to move from Matlab to R. If you find technical flaws in it, let me know. :)

Hello everybody,
Over the last few years there seems to have been one main way to estimate the stabilization of a stat ( http://nyloncalculus.com/2014/08/29/long-take-three-point-shooting-stabilize/ , http://www.fangraphs.com/blogs/stabilizing-statistics-interpreting-early-season-results/ , http://www.baseballprospectus.com/article.php?articleid=17659 ), based on the work of Prof. Dr. Pizza Cutter. While the work itself is technically sound, it has in my opinion several drawbacks. In short, the method is unnecessarily complicated, can easily be misleading, and is as a result impractical to use. In the following, I will explain these three points of critique while introducing a simpler and more practical method that works perfectly well for a certain kind of commonly measured data.

1. The method is unnecessarily complicated

This problem has actually been partially fixed along the way. The originating study "took each player's plate appearances and numbered them sequentially, from his first to his last. Then split them up into even-numbered and odd-numbered appearances". This is unnecessarily complicated in that you would need the actual sequential order of outcomes. The idea of the pairwise split is simply to compare two randomized subsamples against each other. So, you might as well say that a player who made 30% of 100 threes made the first 30 and missed the other 70, then randomly split those attempts into two subgroups, and you would get the same result.
In later studies, the whole process was made more technically sound and 'simplified' using the Kuder and Richardson Formula 21 (KR-21). I am not sure if 'simplified' is the correct term, because literature like this is not what I would call simple (even gorier vortexes ahead! http://www.real-statistics.com/reliability/kuder-richardson-formula-20/ http://www.education.uiowa.edu/docs/default-source/casma-technotes/technote02.pdf?sfvrsn=2 ). There is one additional complication with this method. The idea of KR-21 is to estimate the internal consistency of a given data set. Studies on stats stabilization turn this question around and increase the data set to see at how many observations per test object KR-21 reaches a value that is deemed '50% signal, 50% noise'. This can of course easily lower your number of test objects. Three pointers, for example, were found to stabilize at 750 attempts. Such a cutoff will most likely decrease the number of bad three point shooters in the sample, as those won't be allowed to freely fire away, and therefore make the group of examined players more homogeneous. This leads to the second problem of the method.
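For reference, KR-21 itself is a one-liner once you have each player's total makes over a fixed number of attempts. Here is a minimal sketch (in Python rather than the R I use elsewhere; the function name and the toy numbers are mine):

```python
def kr21(scores, k):
    """Kuder-Richardson Formula 21.

    scores: total number of 'yes' outcomes (e.g. made shots) per player,
            all measured over the same number of attempts k.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * var))

# A more spread-out group of shooters yields higher reliability:
homogeneous = [34, 35, 36, 35, 34, 36]    # everyone close to average
heterogeneous = [20, 50, 35, 25, 45, 35]  # same mean, wider spread
r_low = kr21(homogeneous, 100)
r_high = kr21(heterogeneous, 100)
print(r_low < r_high)  # True
```

Note how the reliability depends on the spread between players, not just on the number of attempts; this is exactly the homogeneity issue discussed above.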

2. The method can be easily misleading

Reading a few of the articles that use this method, I have the impression that the authors themselves are not 100% sure what the stabilization point of the KR-21/Pizza Cutter method implies. The 'number of attempts' value is easily misread as 'please wait approximately X attempts before you say anything about a specific player'. A more correct reading of the value is 'ON AVERAGE you need X attempts per player to reliably assess whether that player is above or below average'.
Let us use free throws and three pointers as an exemplary comparison. Player A scored on 160 of 200 FT attempts and on 80 of 200 three point attempts. We assume a league average of 75% for FT% and 35% for 3P%. The binomial distribution (which is the underlying assumption of KR-21) predicts that an average shooter would make 160 or more free throws less than 5.8% of the time, and 80 or more threes less than 8.1% of the time. Both numbers are very close to the typical 5% significance level (aka 'the value that statisticians bend their data to reach'). Yet, I am 100% certain that the free throw stabilization number will be somewhere around 100 attempts (or even less), while the aforementioned three pointers are only assumed to stabilize after around 750 attempts.
The reason is simple: Centers take a lot of free throws and are usually not very good at them. But three pointers are mostly taken by players who are quite good at them (insert Josh Smith joke here). So, while free throw percentages are all over the place, there are a lot of players who shoot three pointers with around 35% accuracy (or not at all). And while we don't need 750 shots to be certain that Kyle Korver is a good three point shooter, there will be a lot of players for whom even 2000 attempts won't give us any certainty. This leads us directly to the third problem.
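Those two tail probabilities are easy to check yourself; here is a quick sketch using only the Python standard library (the last digits may differ slightly from the numbers above, depending on rounding):

```python
from math import comb

def upper_tail(made, attempts, p):
    """P(X >= made) for X ~ Binomial(attempts, p)."""
    return sum(comb(attempts, k) * p**k * (1 - p)**(attempts - k)
               for k in range(made, attempts + 1))

# Player A: 160 of 200 free throws against a 75% average shooter,
# and 80 of 200 threes against a 35% average shooter
p_ft = upper_tail(160, 200, 0.75)
p_3p = upper_tail(80, 200, 0.35)
print(round(p_ft, 3), round(p_3p, 3))
```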

3. The method is impractical in reality

This follows mostly from the reasons explained in the last paragraph. The method wants to answer the question 'Can we say something with certainty about Player A?', but instead answers the question 'On average, how many attempts do we need to say something about a player?'. So, if your goal is to answer the question 'How similar are players?', then the method is not bad. But then it would probably be easier to just compute the KR-21 value for one season and use it to check internal consistency (its intended use). Instead, it is much more effective to use the general assumptions about a binomial distribution directly and assess every player individually.

4. Solution: Giving each player their own certainty value

So, let's shift our view a little bit. Our assumption is 'there is a yes/no situation with an average percentage of μ% yes', a typical binomial distribution. Our question is 'How likely is it that Player A, who has X yeses on Y tries, is the result of this distribution?'. This is a typical statistical question that we can easily estimate. If the percentage X/Y of yeses is below the average μ, we sum up all the probabilities of the binomial distribution that a player has X or fewer yeses. If the percentage is above μ, we sum up all the probabilities of X or more. This gives you a certainty value, which you have to multiply by two (long story short: because it is a two-tailed test http://en.wikipedia.org/wiki/One-_and_two-tailed_tests ).
This is easily computed: you only need the absolute numbers of attempts and makes for each player, plus an estimate of the average percentage μ. For the latter, there are several simple possibilities. You can either take the sum of all makes divided by the sum of all attempts, or you can take the median individual player percentage. The first might overvalue players who shoot a lot, and the second might overvalue players with only a few attempts, so in both cases it might be necessary to use a minimal threshold. On the plus side, you can easily include every player with at least 50 attempts in your calculation. Their measured percentage might be off from their real probability by up to 10 percentage points, but this should be distributed quite evenly in both directions. Furthermore, a player with fewer attempts has a much wider confidence interval, as I will show in the following.
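The per-player test described above can be sketched in a few lines (Python here rather than my R script, and the function names are mine):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_tailed_p(made, attempts, mu):
    """Probability that an average (mu) shooter produces a result at
    least as extreme as made/attempts, doubled for the two-tailed test."""
    if made / attempts <= mu:
        tail = sum(binom_pmf(k, attempts, mu) for k in range(made + 1))
    else:
        tail = sum(binom_pmf(k, attempts, mu)
                   for k in range(made, attempts + 1))
    return min(1.0, 2 * tail)

# Player A's threes from the example above: 80 of 200 against mu = 35%
p_value = two_tailed_p(80, 200, 0.35)
print(round(p_value, 3))
```

A player who shoots exactly the league average gets a p-value of 1, as it should be.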
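The two ways of estimating μ mentioned above look like this in practice (a Python sketch with made-up made/attempted pairs; the threshold value is the one suggested above):

```python
# made-up (made, attempted) pairs for illustration
players = [(160, 200), (30, 100), (410, 520), (45, 60), (12, 55)]

MIN_ATTEMPTS = 50  # minimal threshold to filter out tiny samples
eligible = [(m, a) for m, a in players if a >= MIN_ATTEMPTS]

# option 1: pooled percentage (can overweight high-volume players)
mu_pooled = sum(m for m, a in eligible) / sum(a for m, a in eligible)

# option 2: median individual percentage (can overweight low-volume players)
pcts = sorted(m / a for m, a in eligible)
mid = len(pcts) // 2
mu_median = pcts[mid] if len(pcts) % 2 else (pcts[mid - 1] + pcts[mid]) / 2

print(round(mu_pooled, 3), round(mu_median, 3))
```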

5. Result: comparing stabilization of free throw and three pointer numbers

In the following, I will simply show what this approach looks like for free throws and three pointers, and also highlight the limits of the whole thing. Let us start by looking at attempt counts and made percentages for both shot types during the 2013/14 regular season:




Both the correlation values and the linear fits include only players with more than 50 attempts (dotted vertical line). The dotted curves indicate the 2.5 and 97.5 percent probability bounds of the assumed binomial distributions. For free throws, 43.3% of players fall outside of this 95% confidence interval. For three pointers it is merely 10.9%. This underlines my assumption that you would need fewer than 100 attempts to stabilize free throw percentages. 10.9% is not that much, given that the binomial distribution alone would already produce 5% of outliers just by random chance. Two obvious factors reduce the number of outliers for three pointers. The first is that bad shooters are usually quickly asked to stop shooting. The second is that good shooters are usually asked or required to take more complicated shots (as shown here http://nyloncalculus.com/2015/01/19/quick-firing-kyle-korver-co/ ), something that is not the case for free throws. If we now want to see the p-value (the probability that an outcome could be due to random chance) for each three point shooter, we can use a slightly different plot:
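The dotted confidence curves are just binomial quantiles as a function of the number of attempts. A sketch of how they can be computed (standard library Python; `binom_quantile` is my own helper, not a library call):

```python
from math import comb

def binom_quantile(q, n, p):
    """Smallest k with P(X <= k) >= q for X ~ Binomial(n, p)."""
    cdf = 0.0
    for k in range(n + 1):
        cdf += comb(n, k) * p**k * (1 - p)**(n - k)
        if cdf >= q:
            return k
    return n

# 95% band for a league-average 75% free throw shooter with 100 attempts
lo = binom_quantile(0.025, 100, 0.75)
hi = binom_quantile(0.975, 100, 0.75)
print(lo / 100, hi / 100)  # lower and upper percentage bounds
```

Evaluating these two bounds over a grid of attempt counts produces exactly the funnel-shaped dotted curves in the plot.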
The horizontal dotted lines mark where there is only a 5% probability that the result is a random occurrence. So, if you want to read this plot for a specific player, you could say for example 'disregarding effects like differences in defensive pressure, there is an exp(-6.82)=0.11% probability that Josh Smith was in reality an average 3 point shooter during the 13/14 season', or 'disregarding effects like differences in defensive pressure, there is an exp(-12.9)=0.00025% probability that Kyle Korver was in reality an average 3 point shooter during the 13/14 season'. Neither statement needed 750 attempts.

6. Epilogue: Going full circle

Now, I hope I could convince you that this approach is more elegant and telling than the previous way of looking at stats stabilization. (I especially hope that Seth uses it for some of his studies, especially for uncontested team three point defense (Houston etc.) and shot contesting.) One elegant addition is that you can visually show things that are similar to KR-21. The idea is to compare the percentage distribution of your real players with a simulated distribution that assumes all players have the same average probability of yes, but their actual amounts of tries. The reason I simulate this distribution instead of solving it analytically is (apart from laziness) that different amounts of attempts lead to different shapes. If all players had 50 attempts, we would get a much broader distribution than if all players had 500 attempts. As the distribution is mixed, I simply simulate the random draw 100 times for every player and add the whole thing up into a histogram. Here are the results for free throws and three pointers:
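The simulation described above can be sketched like this (Python; the list of attempt counts is a made-up stand-in for the real per-player numbers):

```python
import random

random.seed(1)
MU = 0.35      # assumed league-average three point percentage
N_SIMS = 100   # simulated seasons per player

# stand-in for the real list of per-player attempt counts
attempts_per_player = [50, 80, 120, 200, 400, 600]

# draw each player's makes from Binomial(attempts, MU), 100 times each
simulated_pcts = []
for n in attempts_per_player:
    for _ in range(N_SIMS):
        made = sum(random.random() < MU for _ in range(n))
        simulated_pcts.append(made / n)

# this pooled list is what gets binned into the simulated histogram
print(len(simulated_pcts),
      round(sum(simulated_pcts) / len(simulated_pcts), 3))
```

Binning `simulated_pcts` into a histogram and overlaying it on the real percentage distribution gives exactly the comparison shown in the plots below.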





Unsurprisingly, the shapes of the real data and the simulation are much closer for three pointers than for free throws. I have no underlying technical formula that describes this similarity, but I guess it is a start :)
Cheers,

Hannes

P.S.: I have one more side note. In the case that your only information is made or missed, a binomial distribution works perfectly well. Other, more complicated examples that allow for more outcomes than yes or no could be implemented using bootstrapping (I guess...). Bootstrapping is not necessary for binomial distributions, because if you bootstrap a binomial distribution (which is more or less what Prof. Dr. Pizza Cutter did), you would simply get an estimate of the same binomial distribution (I have no proof for this, but it sounds reasonable to me...).
