Saturday, May 10, 2014

Bridging the gap, Part III - Throwing the kitchen sink

Hi everyone,

For now, Part III is going to be the final chapter of my Plus Minus / box score / tracking data comparison. Basically like 'Return of the Jedi', only with fewer Ewoks.
After looking at correlations between offensive and defensive Plus Minus and pairs of other statistics, I decided to look at overall Plus Minus and everything at the same time. Using 17 different statistics, I am going to look at all possible combinations to find the ones that predict Plus Minus best. 'All possible combinations' means in this case 2^17 - 1 = 131,071 = a shit load of possibilities. I also decided to add a new type of player to my previous groups of Point Guards, Wings & Bigs, which I call 'Very Bigs'. These are simply the Bigs who do not attempt any three-pointers. The idea is to get a more homogeneous group.
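To make the 2^17 - 1 figure concrete, here is a minimal Python sketch that walks through every non-empty subset of 17 statistics and counts them. The statistics are just numbered 0 to 16 here, and the helper name `all_subsets` is made up for illustration; this is not the code used for the analysis in this post.

```python
from itertools import combinations

N_STATS = 17  # number of statistics used in the post

def all_subsets(n):
    """Yield every non-empty subset of range(n) as a tuple of indices."""
    for size in range(1, n + 1):
        yield from combinations(range(n), size)

total = sum(1 for _ in all_subsets(N_STATS))
print(total)  # 131071
```

Each tuple could then serve as the list of predictor columns for one regression.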

To test all possible combinations, I have to consider the following: if I used straightforward multiple linear regression with 17 predictors and 52 Point Guards, I would get a super R² value (around 0.8), but basically I would simply be overfitting the whole system.
So, there is one important analytical trick I have to add to get results of any value: cross-validation. In my case, cross-validation simply means that I split my Point Guards into 5 groups and then use 4 of these groups to 'train' my regression model and the fifth group to test its predictive value. Then I rotate these sets so that each group is tested once against the four others. After I am finished with one round of these, I split my Point Guards into 5 different groups and repeat the whole thing. With this, I make sure that my results are not biased by the selected groups. It is good that computers have become so fast in recent years, because mine now had to do 131,071 (# of combinations) * 5 (# of cross-validation groups) * 4 (# of player types) * 2 (# of general repeats) = 5,242,840 regression calculations. Which nowadays probably takes around an hour.
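The splitting and rotating described above can be sketched in plain Python roughly as follows. This is a sketch under assumptions, not the code behind this post: the data is random noise (52 fake players, 17 fake statistics, a target with no real relationship to them), all function names are made up for illustration, and pooling the held-out predictions into one R² per repeat is just one possible way to score the test groups.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(beta, X):
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in X]

def r_squared(y, y_hat):
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def make_folds(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k groups."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_r_squared(X, y, k=5, repeats=2, seed=0):
    """Repeated k-fold CV: train on k-1 groups, predict the held-out group."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):  # new random grouping on every repeat
        y_true, y_pred = [], []
        for fold in make_folds(len(y), k, rng):
            held_out = set(fold)
            train = [i for i in range(len(y)) if i not in held_out]
            beta = fit_ols([X[i] for i in train], [y[i] for i in train])
            y_pred += predict(beta, [X[i] for i in fold])
            y_true += [y[i] for i in fold]
        scores.append(r_squared(y_true, y_pred))
    return sum(scores) / len(scores)

# Synthetic stand-in data (NOT the data from this post): pure noise.
rng = random.Random(42)
n_players, n_stats = 52, 17
X = [[rng.gauss(0, 1) for _ in range(n_stats)] for _ in range(n_players)]
y = [rng.gauss(0, 1) for _ in range(n_players)]

in_sample = r_squared(y, predict(fit_ols(X, y), X))
cross_validated = cv_r_squared(X, y)
print(f"in-sample R^2: {in_sample:.2f}, cross-validated R^2: {cross_validated:.2f}")
```

Because the fake target is pure noise, any in-sample fit here is overfitting by construction, and the cross-validated value should come out clearly lower. For real work a library routine for least squares would be the more sensible choice; the hand-rolled solver is only there to keep the sketch free of dependencies.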
What you will see in the following is that the combinations of statistics with the best predictions usually contain around 8 or 9 statistics. This is in part simply due to the fact that sizes 8 and 9 offer the most possible combinations. If I use only one predictor, I have 17 possible combinations. The same goes for 16 predictors (I can leave one of the 17 out). But the number of possible combinations that use 8 predictors is - I don't know right now, but it's definitely way higher than 17.
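For the curious, the number of combinations for each size is the binomial coefficient "17 choose k", which Python's standard library can compute directly. A small sketch (the variable names are just for illustration):

```python
from math import comb

N_STATS = 17  # number of statistics used in the post

# Number of distinct predictor sets for each size k = 1..17.
counts = {k: comb(N_STATS, k) for k in range(1, N_STATS + 1)}

for k, c in counts.items():
    print(f"{k:2d} predictors: {c:6d} combinations")

print(counts[8])             # 24310
print(sum(counts.values()))  # 131071
```

So 8 predictors (and likewise 9) give 24,310 possible combinations versus 17 for a single predictor, which means the larger sizes simply get many more chances to produce a lucky high score.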
So, ideally, I would like to pick combinations of around 5 predictors with good correlation values. But take a look at the results for yourself. The figures show, for each number of predictors and each predictor, the maximum R² value:

What we see is basically the following: box score numbers do show what comes with being a good point guard, things like scoring effectively and distributing the ball. In addition, grabbing contested rebounds and contesting opponents at the rim seem to be indicative. For wing players, things are similar, yet slightly less indicative, probably because that group is less homogeneous: it merges defensive specialists like Tony Allen with scoring machines like Kevin Durant (just like the first round of the playoffs). But I have to admit, at least with my data set, it gets pretty messy for Bigs.
I guess the general message is: as soon as you start to handle your data carefully, your predictions start to lose their boldness. Data analysis is a noisy thing. So don't trust anybody with a small data set and a bold announcement. Which also doubles as good advice when you meet a stranger in a bar ;)

I wish all of you a nice weekend,
Hannes
