Estimating Player Contribution in Hockey with Regularized Logistic Regression
PDF : Jensen_2013_Estimating_Player_1.pdf
Estimating Player Contribution in Hockey with Regularized Logistic Regression
Robert B. Gramacy, Matthew A. Taddy, Shane T. Jensen
(Submitted on 22 Sep 2012 (v1), last revised 12 Jan 2013 (this version, v2))

We present a regularized logistic regression model for evaluating player contributions in hockey. The traditional metric for this purpose is the plus-minus statistic, which allocates a single unit of credit (for or against) to each player on the ice for a goal. However, plus-minus scores measure only the marginal effect of players, do not account for sample size, and provide a very noisy estimate of performance. We investigate a related regression problem: what does each player on the ice contribute, beyond aggregate team performance and other factors, to the odds that a given goal was scored by their team? Due to the large-p (number of players) and imbalanced design setting of hockey analysis, a major part of our contribution is a careful treatment of prior shrinkage in model estimation. We showcase two recently developed techniques, for posterior maximization or simulation, that make such analysis feasible. Each approach is accompanied by publicly available software, and we include the simple commands used in our analysis. Our results show that most players do not stand out as measurably strong (positive or negative) contributors. This allows the stars to really shine, reveals diamonds in the rough overlooked by earlier analyses, and argues that some of the highest paid players in the league are not making contributions worth their expense. An exception is Pavel Datsyuk, who stands out as the league's very best, having a coefficient that is unmoved even after considering the strong team effect of his Red Wings.

PRNewswire story that links to a release from U of Chicago: http://www.prnewswire.com/newsrelea...10979841.html?
U of Chicago article: http://www.chicagobooth.edu/about/ne...30213hockey

Quote:

Interesting find. I'll give this a read if I can find the time.

This is one of my favorites; I actually mentioned it at the conference I was at this past week (primary assist to Fugu! :laugh: ).

You two gents will understand why this is my favorite part of this study:
Quote:

I'm sorry to burst all of your bubbles, but this study is a perfect example of academics who have the tools (i.e. knowledge of statistical analysis) but don't know anything at all about hockey and/or hockey analytics.
First problem is that the study focuses only on goals, which fails to account for the shooting percentage variance that we see in the NHL (see http://www.broadstreethockey.com/201...ageregression for evidence of SH% regression to the mean). Shot differential analysis with regressed SH% is far better than just looking at goal rates. Secondly, the article tries to say that Dwayne Roloson had an impact offensively (which of course we know is not true since he's a goalie). There are several other glaring simple hockey errors in the study as well. For a more elegant criticism of the referenced article, see http://nhlnumbers.com/2013/6/12/stee...ockeyproblems. Eric T wrote that criticism. He is, by the way, one of the leaders in the field of hockey analytics (he has had work featured at the Sloan sports conference). He writes for NHL Numbers and Broadstreethockey.com.
Quote:
I actually noticed the Roloson references right away and had a nice little laugh. But I have not had time to really look at the article in detail. Interesting that one of the papers you quote is about on-ice shooting percentage and regression to the mean. This is a theory that is regularly used to make claims that also ignore what really happens on the ice. The point is that without context hockey is an extremely difficult game to quantify. My general feeling is to be very skeptical about any such study, but I am happy to see what the authors have in mind.
Quote:
But SH% regression is not so much theory as fact. It has been proven (in the article I linked to in my first post) that shooting percentage does regress to the mean. I am no mathematician, but doesn't Eric's article provide pretty good evidence of SH% regression? That is of course not to say that there is no talent in SH%, but that some of the differences in player SH% are luck. I don't think SH% regression ignores what goes on in a game at all. Goals are such rare events that luck has a lot to do with each goal. A player can take the same shot 100 times and score 10 times, but of those 10 goals, one deflects off an opponent and in, another goes in because the goalie is screened, and a third might go in because the goalie is out of position. These three goals represent the goal scorer being lucky (not to say the shots weren't good shots, but the only difference between a goal and a save is often a random event). The exercise is a very valuable one, but only if the academics (who are very smart and know a ton about statistical analysis) are up to date with modern hockey analytics (which has been advanced mostly by bloggers). Also, if someone doesn't know much about hockey, which these guys clearly don't since they thought a goalie could influence offensive play, the application of advanced statistical analysis is not all that useful. I'm not trying to bash these guys, but we see the same thing in baseball sabermetrics. Much of the work, especially early on, was done by bloggers and non-academics. When academics got involved, rather than catching up on what they missed (the years of research and findings by others), they dove into the numbers. As a result, they end up answering problems that have already been solved or doing inadequate research. We see this even today in baseball sabermetrics.
Quote:
Overall shooting percentage doesn't vary much across the tens of thousands of shots taken across the NHL every year at even strength (iirc it's on the order of 50-60 thousand league-wide); however, it varies widely from year to year during the time a single player is on the ice (iirc somewhere on the order of 1,000 shots for first-line players). Clearly, whenever you increase the number of events you're looking at by 1.5-2 orders of magnitude, you'd expect the inherent randomness to calm down a bit, which is why shots are preferred to goals: there's roughly an order of magnitude more shots than goals.
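The order-of-magnitude argument above can be made concrete with the usual binomial standard-error formula. The shot totals below are just the rough figures quoted in this post (an ~8% shooting percentage, ~55,000 league-wide shots vs. ~1,000 on-ice shots), not official league counts:

```python
import math

def sh_pct_sd(p, n):
    """Standard deviation of an observed shooting percentage,
    treating each of n shots as an independent Bernoulli(p) trial."""
    return math.sqrt(p * (1 - p) / n)

# Illustrative numbers only, taken from the rough figures in the post.
league = sh_pct_sd(0.08, 55_000)   # league-wide even-strength sample
player = sh_pct_sd(0.08, 1_000)    # one player's on-ice sample
print(f"league-wide sd: {league:.4f}")
print(f"on-ice sd:      {player:.4f}")
print(f"ratio: {player / league:.1f}")   # sqrt(55) ~ 7.4x noisier
```

Because the standard deviation scales as 1/sqrt(n), a sample roughly 55 times smaller is about 7.4 times noisier, which is the whole case for preferring the larger shot sample over goals.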
Quote:
Even at the team level, a team can shoot at unsustainable rates for a full season. The LA Kings in the 2011-12 season shot at 7.49%. That is awful, but it was more a product of bad luck than anything. Their poor shooting percentage was one of the main reasons that the Kings barely made the playoffs. This season, with a very similar roster, the Kings shot above 9%. Unless you believe that all the Kings players got better at shooting, you have to realize that their SH% regressed back to the mean this season. So if luck plays a big part in a team's shooting percentage even at the team level, think about how much luck has to do with the variance of shooting percentages among players.
Quote:
You say that SH% regression is a fact. What is the fact that you are talking about? Generally speaking, the notion of "regression to the mean" is a broadly used term suggesting that extreme observations tend not to be repeated. Unfortunately, the term itself is rather vague without context, since it is often not possible to determine whether a particular piece of data is extreme without a careful understanding of the nature of your data set. In hockey, when you apply it to a specific player, you often destroy its applicability. I can't say that I follow every blogger, but as an Oilers fan I can tell you that there are a lot in our fan base who claim to use analytics (or, more often, advanced stats, which in my mind are a somewhat different animal). Regression to the mean is one such concept that is butchered frequently by many bloggers.
Quote:
Did you read the article I linked to by Eric T? He shows how shooting percentages regress to the mean. So what I mean by saying "SH% regression to the mean is fact" is that it has been shown to be true (which certain people in the blogosphere, see David Johnson, doubt for some reason). Now if the mathematicians on this board don't see Eric T's article as enough evidence, then I'm listening. You two know more about it than me. But it seems to me that Eric presents enough evidence to prove that SH% in the NHL regresses towards the mean. And the point is that if you agree with me that shooting percentage in the NHL will regress to the mean, then the study that started this thread is clearly not very well researched, since it uses strictly goal rates that fail to take into account the randomness of SH%.
Quote:
Regression towards the mean may not be intuitive, but that doesn't mean that it doesn't apply (it does every time our correlation isn't perfect, which is almost always).
Quote:
With all due respect =) The problem is probably not his understanding of maths so much as his understanding of this particular statistical phenomenon. Don't get me wrong, I don't hold it against him if he is mistaken on this particular point (many concepts in statistics are counterintuitive).
Quote:
You have actually illustrated my previous point in your own post by isolating Jagr vs. some generic player named Mister X. Why did you do this? How would you deal with a kid like Nail Yakupov? Is he Stamkos or Elmer Fudd?
Quote:
That said, the premise that he is putting forth is perfectly reasonable. But my objections lie in how the notion is interpreted by people ignorant of its limitations. Yes, I will agree with the broadly stated claim that shooting percentage in the NHL will "regress to the mean," at least in terms of the accepted meaning of the term. But as I have said above, I think that this in itself is virtually useless for saying much about an individual, especially in ignorance of the context. I have not yet read the study in question, so at this point I am neither going to praise it nor criticize it. I would however be happy to hear Taco's thoughts on this paper, as I am sure he is in a position to evaluate its merits.
Quote:
Let's say we have a population of players and that we've only been observing them for the first half of one season. We happen to know the correlation coefficient r for SH% between one half-season and the next and believe that this correlation is unchanged (yes, I'm mostly transcribing Eric T's post right now; stick with me, though). Now imagine that all these players are named something like X, Y, etc.; the point of this is that we don't know anything about them except for how they've done this season. Then, I'd contend, the best estimate of each individual's performance for the rest of the season assumes it regresses to the mean by as much as (1 - r) * abs(deviation from the mean). If we think that r is 0, then we basically state that all deviation is the product of luck and that the mean is the only thing that matters. A value between 0 and 1 assumes a component of skill, but with some variance due to luck. If anyone has any questions about that part, please ask. That's pretty much the basics of the concept.

Now, as it relates to the discussion at hand: let's say that we do have more information about the players under investigation. We know their record for previous seasons and whatever else we'd consider relevant. Obviously now we wouldn't use the same method that we used in the previous example where each player was labeled X, Y, Z, etc. Let's say one of these players is Jagr, who has shot at NHLavg+1% this half-season but has a career average of, say, NHLavg+5%. In this case, we wouldn't assume that his SH% would go down but rather the opposite, and this too would be an example of regression to the mean (where in this case the mean would be NHLavg+5% and the r-value would be the correlation coefficient between one Jagr half-season and the next). I haven't really followed the entire regression to the mean debate, so I might be misconstruing your position (and if so, I apologise).
From what little I've read, though, it does seem as if a lot of people are genuinely misunderstanding this. EDIT: Oh wait, Elmer Fudd :banghead: And also, the first example was written for lurkers who don't know about the concept; I didn't mean to imply that you don't understand this.
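For lurkers following along, the shrinkage rule described above fits in a few lines of Python. The function name and the sample percentages are made up purely for illustration:

```python
def regressed_estimate(observed, population_mean, r):
    """Estimate future performance by the rule in the post: keep
    only an r-sized fraction of the deviation from the mean, i.e.
    shrink toward the mean by (1 - r) * (observed - mean)."""
    return population_mean + r * (observed - population_mean)

# r = 0: all deviation is luck; the estimate collapses to the mean.
assert regressed_estimate(0.12, 0.08, 0.0) == 0.08
# r = 1: perfectly repeatable skill; the observation stands as-is.
assert regressed_estimate(0.12, 0.08, 1.0) == 0.12
# In between: partial shrinkage toward the mean.
print(round(regressed_estimate(0.12, 0.08, 0.4), 3))  # 0.096
```

The Jagr example is the same formula with a different `population_mean` (his own career average rather than the league's) and an r measured between his own half-seasons.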
A quick note on the players.
[Edit: All shooting percentage numbers include ES, PP and PK time. I just went to ESPN to grab them and didn't bother to pull ES-only numbers; however, they are very similar.]

Nail Yakupov was the #1 overall pick last year (2012). He shot at a 21.0% clip this year. Steven Stamkos was the #1 overall pick in 2008. In his rookie year he shot 12.7%, but since then hasn't had a year below 16.5% and has an overall career percentage of 17.2% over 1200+ shots. The point I believe Fourier was making is that you need additional information about players to make any sort of regression meaningful. Is Yakupov of the same breed as Stamkos, who can consistently shoot well, well above the league average? Or is he of the same breed as Jordan Staal, who had an absurdly lucky rookie year (22.1% [131 shots] compared to 11.2% [898 shots] since)? Without knowing that, you can't honestly analyze Yakupov by regressing his numbers, because you have no idea what you should be regressing towards.

Yes, the study uses purely goal-based metrics, which inherently adds a good chunk of 'luck' into the situation; however, simply saying "use shots instead" ignores the very real fact that shooting percentage *is* a skill. You can't just say "shooting percentage regresses to the mean, so we should use shots instead of goals." Yes, shooting percentage does tend to regress towards the mean, but the problem is establishing what that mean is (per player, not league-wide) and then controlling for changes in that mean (it's entirely possible for a player to improve his shooting, or to have his shooting deteriorate with age).
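One standard way to formalize "what mean should we regress toward" is empirical-Bayes shrinkage with a Beta prior. The sketch below is only an illustration of that idea, not anything from the paper: the goal/shot counts are hypothetical numbers chosen to roughly match the 21% rookie figure quoted above, and the prior settings are assumptions:

```python
def beta_binomial_shrink(goals, shots, prior_mean, prior_shots):
    """Posterior-mean shooting percentage under a Beta prior that is
    equivalent to `prior_shots` pseudo-shots taken at `prior_mean`.
    All prior settings used below are illustrative assumptions."""
    alpha = prior_mean * prior_shots
    beta = (1 - prior_mean) * prior_shots
    return (goals + alpha) / (shots + alpha + beta)

# Hypothetical rookie counts (~21%, like the Yakupov figure quoted above).
goals, shots = 31, 148

# Shrinking toward a league-average prior vs. a Stamkos-like prior
# gives very different answers, which is exactly the problem.
vs_league = beta_binomial_shrink(goals, shots, prior_mean=0.09, prior_shots=600)
vs_sniper = beta_binomial_shrink(goals, shots, prior_mean=0.17, prior_shots=600)
print(f"regressed toward league average:   {vs_league:.3f}")
print(f"regressed toward sniper-level avg: {vs_sniper:.3f}")
```

The gap between the two outputs is the quantitative version of "you have no idea what you should be regressing towards": the answer is dominated by which prior you pick, not by the rookie's own sample.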
Quote:
I wonder for example if Dave Semenko's SH% was higher while playing with #99 than it might have been while he was toiling in the bottom six. ;) 
Quote:

Always some good discussions around here. On the SH% thing: I just want to point out that not all shots are the same. As obvious as it may sound, players score at a much higher rate from the slot area (or close to the net) than on shots taken from the point. Simply counting shots, or looking at SH% alone, doesn't really cut it for me either.

Quote:
I agree. This is exactly the argument I use against expanding ice size. People assume that defenders will abandon their posts and just skate around the perimeter with the forwards, although at least in transition there is more room for better skaters to take advantage of the situation. This would just lead to tougher defensive systems from coaches (like the neutral zone trap). You're right that the shots need to be qualified. Specifically with Datsyuk, I'd like to hear what people think about his stats: using the exact same criteria for all players mentioned, why is he standing out?
Some thoughts on the paper...
It's true that regression to the mean is a real thing with shooting percentage, but as Fourier pointed out (and I'm going to state it differently, so please correct me if you disagree), each player is different (and is going to have their own long-term "real" shooting percentage that we'll never know). This would further be modified by the situations that a player is placed in and the teammates he plays with. There's also a significant amount of small-sample variability, which would regress to a player/situation's "mean" over the long term, but to say that everything regresses to the same mean is throwing the baby out with the bathwater.

The paper essentially drives towards a metric more refined than Corsi (which itself is a more refined version of plus-minus, although Jim Corsi never intended it as such when he developed it). I like the Bayesian logistic approach that they use; we've done similar techniques in predictive modeling efforts, and I'm typically happy with the results. They also ignore PP/SH/OT goals in the analysis (although they acknowledge that the modeling approach could handle them); this is a minor flaw in my mind, not a major one (and in that sense, they measure exactly what they say they're going to measure, and don't hide that fact: it's a proxy for even-strength performance). The use of a Laplacian prior distribution to guard against overfitting their solution is an interesting one (and something that I hadn't considered before). I want to think more about this, as it isn't immediately intuitive to me, but it seems to work well.

A lot of these modeling approaches rely upon the shot (and other) data from the league; while I prefer those approaches in general, they do suffer from a lack of consistency in recording, and until those kinks are controlled for, I like that the authors here don't rely upon it. I think that each approach could benefit from the other, and that there will be a meeting in the middle at some point.
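For anyone curious how the Laplacian prior plays out in practice: at the posterior mode it is equivalent to L1 (lasso) penalization, which is what zeroes out players without strong evidence and lets the stars stand out. The paper ships its own software and commands; the sketch below is a rough scikit-learn analogue on synthetic data instead. The +1/-1 on-ice coding mimics the paper's design idea, but the data, sizes, true coefficients, and the penalty setting C=0.1 are all made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the paper's design: each row is a goal, each
# column a player; +1 if that player was on the ice for the home team,
# -1 for the away team, 0 otherwise. y = 1 if the home team scored.
n_goals, n_players = 2000, 200
X = np.zeros((n_goals, n_players))
for i in range(n_goals):
    on_ice = rng.choice(n_players, size=12, replace=False)
    X[i, on_ice[:6]] = 1.0
    X[i, on_ice[6:]] = -1.0

# A handful of genuinely strong/weak players; everyone else is average.
beta = np.zeros(n_players)
beta[:5] = 1.0
beta[5:10] = -1.0
p = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = rng.binomial(1, p)

# L1-penalized fit: the mode of a posterior with a Laplace prior.
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
fit.fit(X, y)
n_nonzero = int((fit.coef_ != 0).sum())
print(f"{n_nonzero} of {n_players} players get a nonzero coefficient")
```

Most coefficients come back exactly zero, mirroring the paper's finding that most players do not stand out as measurably strong contributors.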
I'm not sure that I like how they incorporate goaltenders into the mix; traditionally, goaltenders haven't been measured well by plus-minus. On the other hand, you could make the same argument that forwards and defensemen have different roles on the ice, and to that end, they simply let the model fit what was there. I'd like to see the full breadth of data, to check whether there is a bias in the results towards/against certain positions or types of players (such as defensive forwards). In general, that's my least favorite part of the analysis: you can't see the calculations or the player-level results.

Regarding the "value for money" concept, I disagree with using "average" as the baseline (if I truly understand what was done here); average players definitely have positive value. This is borne out both by empirical evidence in terms of what teams offer average players, and in terms of certain teams' failings for the lack of such averageness at critical positions/times.

I've been exploring doing something similar to the MCMC approach, but using an agent-based (complex systems) approach to player evaluation. The authors note that a more realistic scenario would be looking at a game as a series of Poisson processes (with lambda based upon who was on the ice), and I believe that this would fall out readily from a CS-based approach.
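On that last point, the "game as a series of Poisson processes" idea is easy to prototype. This toy simulation uses made-up scoring rates and is only meant to show the structure (two independent Poisson goal processes whose rates depend on who is on the ice), not to estimate anything:

```python
import math
import random

random.seed(42)

def simulate_shift(rate_for, rate_against, minutes):
    """Simulate goals for/against during one on-ice stint, modeling
    scoring as two independent Poisson processes whose rates (goals
    per minute) depend on the skaters on the ice. Rates used below
    are made-up illustrative values, not estimates from the paper."""
    def poisson(lam):
        # Knuth's multiplication method, fine for small lambda.
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= random.random()
            if p <= limit:
                return k
            k += 1

    return poisson(rate_for * minutes), poisson(rate_against * minutes)

# A strong line tilts the rates in its favor (assumed numbers):
# 0.06 goals/min for vs. 0.04 against, over 1000 one-minute shifts.
gf = ga = 0
for _ in range(1000):
    f, a = simulate_shift(0.06, 0.04, 1.0)
    gf, ga = gf + f, ga + a
print(f"goals for {gf}, against {ga}")  # roughly 60 vs 40 on average
```

An agent-based version would replace the fixed rates with rates computed from the agents currently on the ice, which is why the Poisson framing falls out naturally from that approach.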
HFBoards.com, A property of CraveOnline, a division of AtomicOnline LLC ©2009 CraveOnline Media, LLC. All Rights Reserved.