# Revisiting Stats stabilization

A warning before you start reading: you can find a more polished version of this post at Nylon Calculus (memo to myself: add link here as soon as I get it). They also published another piece of mine and have a lot of other great stuff. But if I drew a Venn diagram of people who read my blog and people who read Nylon Calculus, I am pretty sure that you know all this already...

This version has a few more (probably boring) details on why I find the previously used methods impractical. It also has a few more shiny plots, which in the end were not helpful for understanding. So, if you are here for the shiny plots, scroll down to the end. There is also an R script so that you can produce shiny plots yourself. You can find a github for the R function that I wrote here.

Side note: One reason for this blog entry is that I'm starting to move from Matlab to R. If you find technical flaws in it, let me know. :)

Hello everybody,

over the last few years there seems to be one main way to estimate the stabilization of a stat (
__http://nyloncalculus.com/2014/08/29/long-take-three-point-shooting-stabilize/__, __http://www.fangraphs.com/blogs/stabilizing-statistics-interpreting-early-season-results/__, __http://www.baseballprospectus.com/article.php?articleid=17659__) based on the work of Prof. Dr. Pizza Cutter. While the work itself is technically sound, it has, in my opinion, several drawbacks. In short, the method is unnecessarily complicated, can easily be misleading, and is as a result impractical to use. In the following, I will explain these three points of critique while introducing a simpler and more practical method that works perfectly well for a certain kind of commonly measured data.

### 1. The method is unnecessarily complicated

This problem has actually already been partially addressed. The originating study “took each player’s plate appearances and numbered them sequentially, from his first to his last. Then split them up into even-numbered and odd-numbered appearances”. This is unnecessarily complicated in that you would need the actual sequential order of outcomes. The idea of the pairwise split is to compare two randomized subsamples against each other. So you could just as well say that a player who made 30% of 100 threes made the first 30 and missed the other 70, then randomly split those attempts into two subgroups and get the same result.
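To illustrate this equivalence, here is a minimal Python sketch (the author's own scripts are in R; the 30-of-100 shooter is hypothetical): because the split itself is randomized, the original order of outcomes cannot matter.

```python
import random

random.seed(42)

# Hypothetical shooter: 30 makes, 70 misses out of 100 attempts.
# The sequential order is irrelevant, so we may as well put all makes first.
outcomes = [1] * 30 + [0] * 70

def split_half_percentages(outcomes):
    """Randomly split the attempts into two halves and return each half's percentage."""
    shuffled = outcomes[:]      # shuffling before an even/odd split is the same
    random.shuffle(shuffled)    # as numbering a random sequence and splitting it
    half_a, half_b = shuffled[::2], shuffled[1::2]
    return sum(half_a) / len(half_a), sum(half_b) / len(half_b)

pct_a, pct_b = split_half_percentages(outcomes)
```

The two halves always contain 30 makes in total, so their average percentage is exactly 30% no matter how the attempts were ordered.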

In later studies, the whole process was made more technically sound and 'simplified' using the Kuder and Richardson Formula 21 (KR-21). I am not sure whether 'simplified' is the correct term, because literature like this is not what I would call simple (even gorier vortexes ahead: __http://www.real-statistics.com/reliability/kuder-richardson-formula-20/__, __http://www.education.uiowa.edu/docs/default-source/casma-technotes/technote02.pdf?sfvrsn=2__). There is one additional complication with this method. The idea of KR-21 is to estimate the internal consistency of a given data set. Studies on stats stabilization turn this question around and increase the data set to see at how many observations per test object KR-21 reaches a value that is deemed '50% signal, 50% noise'. This can of course easily lower your number of test objects. Three pointers, for example, were found to stabilize at 750 attempts. Such a threshold will most likely exclude bad three point shooters, as those won't be allowed to freely fire away, and therefore makes the sampled group of players more homogeneous. This leads to the second problem of the method.

### 2. The method can be easily misleading

Reading a few of the articles that use this method, I have the impression that the authors themselves are not 100% sure what the stabilization point of the KR-21/Pizza Cutter method implies. The 'number of attempts' value is easily misread as 'please wait approximately X attempts before you say anything about a specific player'. A more correct reading of the value is 'ON AVERAGE you need X attempts to reliably assess whether a player is above or below average'.

Let us use free throws and three pointers as an exemplary comparison. Player A scored on 160 of 200 FT attempts and on 80 of 200 three point attempts. We assume a league average of 75% for FT% and 35% for 3P%. The binomial distribution (which is the underlying assumption of KR-21) predicts that an average shooter would make 160 or more free throws less than 5.8% of the time and 80 or more threes less than 8.1% of the time. Both numbers are very close to the typical significance threshold of 5% (aka 'the value that statisticians bend their data to reach'). Yet I am 100% certain that the free throw stabilization number will be somewhere around 100 attempts (or even less), while three pointers are assumed to stabilize only after around 750 attempts.
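These tail probabilities can be checked directly from the binomial distribution. A small Python sketch (the 160-of-200 and 80-of-200 numbers are the hypothetical Player A from above):

```python
from math import comb

def binom_tail_upper(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Player A: 160 of 200 free throws against a 75% league average...
p_ft = binom_tail_upper(200, 160, 0.75)   # just under 6%
# ...and 80 of 200 threes against a 35% league average.
p_3p = binom_tail_upper(200, 80, 0.35)    # around 8%
```

Both tail probabilities sit close to the conventional 5% significance level, as claimed above.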

The reason is simple: Centers take a lot of free throws and are usually not very good at them. But three pointers are mostly taken by players that are quite good at them (insert Josh Smith joke here). So, while free throw percentages are all over the place, there are a lot of players that shoot three pointers with around 35% accuracy (or not at all). And while we don't need 750 shots to be certain that Kyle Korver is a good three point shooter, there will be a lot of players for whom even 2000 attempts won't be enough to have any certainty. This leads us directly to the third problem.

### 3. The method is impractical in reality

This is mostly for reasons already explained in the last paragraph. The method wants to answer the question 'Can we say something with certainty about Player A?', but instead answers the question 'On average we need 750 attempts to say something about a player'. So, if your goal is to answer the question 'How similar are players?', the method is not bad. But then it would probably be easier to just compute the KR-21 value for one season and use it to check the internal consistency (its intended use). Instead, it is much more effective to use the general assumptions about a binomial distribution directly and assess every player individually.

### 4. Solution: Giving each player their own certainty value

So, let's shift our view a little bit. Our assumption is 'there is a yes/no situation with an average percentage of μ% yes', a typical binomial distribution. Our question is 'How likely is it that Player A, who has X yes on Y tries, is the result of this distribution?'. This is a typical statistical question that we can easily estimate. If the percentage X/Y of yes is below the average μ, we sum up all the probabilities of the binomial distribution that a player has X or fewer yes. If the percentage is above μ, we sum up all the probabilities of X or more. This gives you a certainty value, which you have to multiply by two (long story short: because you have a two-tailed test, __http://en.wikipedia.org/wiki/One-_and_two-tailed_tests__).

This is easily computed; you only need the absolute numbers of attempts and makes for each player and an estimate of the average percentage μ. For the latter, there are several simple possibilities. You can either take the sum of all makes divided by the sum of all attempts, or you take the median individual player percentage. The first might overvalue players that shoot a lot and the second might overvalue players with only a few attempts, so in both cases it might be necessary to use a minimal threshold. On the plus side, you can easily include players with as few as 50 attempts in your calculation. Their measured percentage might be up to 10% off their real probability, but this should be distributed quite evenly in both directions. Furthermore, a player with fewer attempts simply gets a much wider confidence interval, which I will show you in the following.
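A sketch of this per-player certainty value in Python (the author's implementation is an R script; function and variable names here are my own):

```python
from math import comb

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_tailed_p(made, attempts, mu):
    """Two-tailed probability that a player with `made` of `attempts`
    was drawn from a binomial distribution with average percentage `mu`."""
    if made / attempts <= mu:
        # below average: sum P(X <= made)
        tail = sum(binom_pmf(attempts, k, mu) for k in range(0, made + 1))
    else:
        # above average: sum P(X >= made)
        tail = sum(binom_pmf(attempts, k, mu) for k in range(made, attempts + 1))
    return min(1.0, 2 * tail)   # doubled for the two-tailed test, capped at 1

p = two_tailed_p(80, 200, 0.35)   # Player A's threes from section 2
```

A player shooting exactly the league average gets a value of 1 (no evidence against being average), while Player A's 40% on 200 attempts still comes out clearly above the 5% threshold.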

### 5. Result: comparing stabilization of free throw and three pointer numbers

In the following, I will simply show what this approach looks like for free throws and three pointers, and also highlight the limits of the whole thing. Let us start by looking at all attempt counts and made percentages for both shot types during the regular season 2013/14:

Both values for correlation and linear fit include only players with more than 50 attempts (dotted vertical line). The dotted curves indicate the 2.5 and 97.5 percent probability bounds of the assumed binomial distributions. For free throws, 43.3% of players lie outside this 95% confidence interval. For three pointers, it is merely 10.9%. This underlines my assumption that you would need fewer than 100 attempts to stabilize free throw percentages. 10.9% is not that much, given that the theory of the binomial distribution would expect 5% of outliers just by random chance. Two obvious factors reduce the number of outliers for three pointers. The first is that bad shooters are usually quickly asked to stop shooting. The second is that good shooters are usually asked or required to take more complicated shots (as shown here: __http://nyloncalculus.com/2015/01/19/quick-firing-kyle-korver-co/__), something that is not the case for free throws. If we now want to see the p-value (the probability that an outcome could be due to random occurrence) for our three point shooters, we can use a slightly different plot.

The horizontal dotted lines mark where there is only a 5% probability that the result is a random occurrence. So, if you want to read this plot for a specific player, you could say, for example, 'disregarding effects like differences in defensive pressure, there is an exp(-6.82) = 0.11% probability that Josh Smith was in reality an average 3 point shooter during the 13/14 season', or 'disregarding effects like differences in defensive pressure, there is an exp(-12.9) = 0.00025% probability that Kyle Korver was in reality an average 3 point shooter during the 13/14 season'. Neither statement needs 750 attempts.
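For reference, converting the log p-values read off such a plot back into probabilities is a one-liner (the two values are the plot readings quoted above):

```python
from math import exp

p_smith = exp(-6.82)    # about 0.0011, i.e. roughly 0.11%
p_korver = exp(-12.9)   # about 2.5e-6, i.e. roughly 0.00025%
```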

### 6. Epilogue: Going full circle

Now, I hope I could convince you that this approach is more elegant and telling than the previous way of looking at stats stabilization. (I especially hope that Seth uses it for some of his studies, especially for uncontested team three point defense (Houston etc.) and shot contesting.) One elegant addition is that you can visually show things that are similar to KR-21. The idea is to compare the percentage distribution of your players with a simulated distribution that assumes all players have the same average probability of yes, but different numbers of tries. The reason I simulate this distribution instead of solving it analytically is that ~~I am lazy~~ different numbers of attempts lead to different shapes. If all players had 50 attempts, we would get a much broader distribution than if all players had 500 attempts. As the distribution is mixed, I simply simulate the random draw 100 times for every player and add the whole thing up into a histogram. Here are the results for free throws and three pointers:
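The mixed-distribution simulation described above can be sketched in Python like this (the attempt counts are made up; the author's version is an R script over the real player list):

```python
import random

random.seed(1)

# Hypothetical league: every player shoots mu on average,
# but with very different numbers of attempts.
mu = 0.35
attempts = [50, 80, 150, 300, 500]   # made-up attempt counts per player

def simulate_percentages(attempts, mu, runs=100):
    """Draw each player's makes from Binomial(n, mu) `runs` times
    and collect the resulting percentages for a histogram."""
    pcts = []
    for _ in range(runs):
        for n in attempts:
            made = sum(random.random() < mu for _ in range(n))
            pcts.append(made / n)
    return pcts

pcts = simulate_percentages(attempts, mu)
```

Players with few attempts contribute the wide part of the histogram and players with many attempts the narrow part, which is exactly why the mixed shape is easier to simulate than to derive.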

Unsurprisingly, the shapes of real data and simulation are much closer for three pointers than for free throws. I have no underlying technical formula that describes this similarity – but I guess it is a start :)

Cheers,

Hannes

P.S.: I have one more side note. In the case that your only information is made or miss, a binomial distribution works perfectly well. Other, more complicated examples that allow for more outcomes than yes or no could be implemented using bootstrapping (I guess...). Bootstrapping is not necessary for binomial distributions, because if you bootstrap a binomial distribution (which is more or less what Prof. Dr. Pizza Cutter did), you would simply get an estimate of the same binomial distribution (I have no proof for this, but it sounds reasonable to me...).
