Quote:
Originally Posted by Taco MacArthur
Some thoughts on the paper...
It's true that regression to the mean is a real thing with shooting percentage, but as Fourier pointed out (and I'm going to state it differently, so please correct me if you disagree), each player is different and has his own long-term "real" shooting percentage that we'll never know. That percentage is further modified by the situations a player is placed in and the teammates he plays with. There's also a significant amount of small-sample variability, which regresses to a player/situation's "mean" over the long term, but to say that everything regresses to the same mean is throwing the baby out with the bathwater.
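As a toy illustration of that point (my own sketch, not anything from the paper): simulate players with heterogeneous true percentages, and the hot starters regress toward their *own* means, not the league's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical league: each player has his own long-term "true" shooting
# percentage (unknowable in practice), drawn here around a 9% league mean.
n_players = 1000
true_pct = rng.normal(0.09, 0.02, n_players).clip(0.01, 0.25)

# Small early sample: 30 shots per player.
early = rng.binomial(30, true_pct) / 30
# Large later sample: 300 shots per player.
late = rng.binomial(300, true_pct) / 300

# "Hot" early shooters regress, but toward their OWN means, not the
# league's: their later percentage drops yet stays above league average.
hot = early > 0.15
print(early[hot].mean())  # inflated by small-sample luck
print(late[hot].mean())   # regressed downward...
print(late.mean())        # ...but still above the overall mean
```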
The paper essentially drives toward a metric more refined than Corsi (which is itself a more refined version of plus-minus, although Jim Corsi never intended it as such when he developed it). I like the Bayesian logistic approach they use; we've applied similar techniques in predictive modeling efforts, and I'm typically happy with the results.
They also ignore PP/SH/OT goals in the analysis (although they acknowledge that the modeling approach could handle them). This is a minor flaw in my mind, not a major one; in that sense, they measure exactly what they say they're going to measure and don't hide that fact. It's a proxy for even-strength performance.
The use of a Laplacian prior distribution to guard against overfitting their solution is an interesting one (and something that I hadn't considered before). I want to think more about it; it isn't immediately intuitive to me, but it seems to work well.
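One way to build intuition for this (a toy sketch of mine, not the paper's actual data or code): the MAP estimate under a Laplace prior is exactly L1-penalized ("lasso") logistic regression, and the L1 penalty zeroes out weakly supported coefficients rather than merely shrinking them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 200 observations, 10 "player" features, only the
# first three truly matter.
X = rng.normal(size=(200, 10))
true_w = np.array([1.5, -1.0, 0.8] + [0.0] * 7)
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_w))))

def fit(lam, steps=5000, lr=0.05):
    """MAP logistic fit: a Laplace(0, 1/lam) prior on each coefficient
    is equivalent to adding lam * |w| (an L1 penalty) to the loss.
    Proximal gradient (soft thresholding) handles the kink at zero."""
    w = np.zeros(10)
    for _ in range(steps):
        grad = X.T @ (1 / (1 + np.exp(-(X @ w))) - y) / len(y)
        w -= lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # prox step
    return w

w_flat = fit(lam=0.0)     # flat prior: every coefficient ends up nonzero
w_laplace = fit(lam=0.1)  # Laplace prior: spurious coefficients pushed to 0
print(np.sum(np.abs(w_flat) > 1e-8), np.sum(np.abs(w_laplace) > 1e-8))
```

The guard against overfitting is visible in the counts: the Laplace-prior fit keeps the real effects while discarding most of the noise terms.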
A lot of these modeling approaches rely upon shot (and other) data from the league. While I prefer these approaches in general, they do suffer from a lack of consistency in recording, and until those kinks are controlled for, I like that the authors here don't rely upon that data. I think each approach could benefit from the other, and that there will be a meeting in the middle at some point.
I'm not sure that I like how they incorporate goaltenders into the mix; traditionally, plus-minus hasn't measured goaltenders well. On the other hand, you could make the same argument that forwards and defensemen have different roles on the ice, and to that end, they simply let the model fit what was there. I'd like to see the full breadth of data, to check whether the results are biased toward/against certain positions or types of players (such as defensive forwards). In general, that's my least favorite part of the analysis: you can't see the calculations or the player-level results.
Regarding the "value for money" concept, I disagree with using "average" as the baseline (if I truly understand what was done here). Average players definitely have positive value; this is borne out both empirically, in what teams actually offer average players, and in certain teams' failings for the lack of such averageness at critical positions/times.
I've been exploring something similar to the MCMC approach, but using an agent-based (complex-systems) approach to player evaluation. The authors note that a more realistic scenario would treat a game as a series of Poisson processes (with lambda based upon who is on the ice), and I believe this would fall out readily from a CS-based approach.
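To make the "series of Poisson processes" idea concrete, here's a minimal sketch (my own toy numbers, nothing from the paper): chop a game into shifts and let the goal rate depend on which hypothetical line is on the ice.

```python
import numpy as np

rng = np.random.default_rng(2)

# Goal rates (lambda, goals per minute) for two hypothetical lines.
lambdas = {"top_line": 0.06, "checking_line": 0.02}

def simulate_game():
    """One 60-minute game as a sequence of 45-second shifts; within each
    shift, goals arrive as a Poisson count with mean lambda * length."""
    goals, t = 0, 0.0
    while t < 60.0:
        line = "top_line" if rng.random() < 0.5 else "checking_line"
        shift = 0.75  # minutes
        goals += rng.poisson(lambdas[line] * shift)
        t += shift
    return goals

games = [simulate_game() for _ in range(2000)]
print(np.mean(games))  # long-run scoring average for this team
```

With these made-up rates the long-run mean works out to roughly 0.04 goals/minute over 60 minutes, and swapping in per-player rather than per-line effects is exactly where the agent-based version would go.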

I have been wanting to run a model along the lines of...
log(rate) = OFF_TeamPlayerA + ... + OFF_TeamPlayerF - DEF_TeamPlayerA - ... - DEF_TeamPlayerF + PrevailingGoalRate
along with an appropriate offset adjustment and adjustments for the number of skaters on the ice.
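A sketch of what data in the shape of that model looks like (all numbers hypothetical, just to pin down the structure): indicator rows for which offensive and defensive skaters are on, offense terms entering with plus signs, defense terms with minus signs, and shift length as the offset on the expected count.

```python
import numpy as np

rng = np.random.default_rng(3)

n_players, n_shifts = 40, 5000
off_fx = rng.normal(0, 0.2, n_players)    # offensive player effects
def_fx = rng.normal(0, 0.2, n_players)    # defensive player effects
base = np.log(0.04)                       # prevailing goal rate, goals/min

OFF = np.zeros((n_shifts, n_players))
DEF = np.zeros((n_shifts, n_players))
length = rng.uniform(0.5, 1.5, n_shifts)  # shift length in minutes (offset)
for i in range(n_shifts):
    OFF[i, rng.choice(n_players, 5, replace=False)] = 1.0
    DEF[i, rng.choice(n_players, 5, replace=False)] = 1.0

# log(rate) = sum(OFF terms) - sum(DEF terms) + prevailing rate
log_rate = base + OFF @ off_fx - DEF @ def_fx
goals = rng.poisson(np.exp(log_rate) * length)

# Stacked design-matrix view, as a Poisson GLM would see it:
X = np.hstack([OFF, -DEF, np.ones((n_shifts, 1))])
w = np.concatenate([off_fx, def_fx, [base]])
print(goals.sum())  # total simulated goals across all shifts
```

Note the design choice baked in here: every shift row touches ten player columns at once, which is exactly why sampling the player terms jointly gets ugly and a conditional (one-at-a-time) scheme starts to look attractive.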
There will be SERIOUS issues fitting this model with MCMC and the rest, and it's likely I'd pursue a random-effects-type model. That's just going to be the way of life, as there's no real way to handle sampling all the player terms as a vector: each one conditioned on the rest, all the way through.
That being said, this doesn't rule out that it could compute fast-ish, as each conditional step would be a log-concave distribution, so an adaptive envelope can be placed over it for fast sampling (Gilks & Wild, 1992). MCMC methods tend to produce highly correlated draws; one way around that is to go very large.
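On the correlated-draws point, a quick illustration (not the adaptive rejection sampler itself, just a generic random-walk Metropolis chain on a standard normal target) shows why you need to "go very large":

```python
import numpy as np

rng = np.random.default_rng(4)

def metropolis(n, step=0.5):
    """Random-walk Metropolis targeting N(0, 1)."""
    x, chain = 0.0, np.empty(n)
    for i in range(n):
        prop = x + rng.normal(0, step)
        # Accept with prob min(1, pi(prop)/pi(x)) for a standard normal.
        if np.log(rng.random()) < 0.5 * (x * x - prop * prop):
            x = prop
        chain[i] = x
    return chain

def lag1(z):
    """Lag-1 autocorrelation of a series."""
    z = z - z.mean()
    return np.dot(z[:-1], z[1:]) / np.dot(z, z)

chain = metropolis(20000)
iid = rng.normal(size=20000)
print(lag1(chain), lag1(iid))  # chain autocorrelation far above iid's ~0
```

The effective sample size of the chain is a small fraction of its nominal length, which is exactly the "high correlation" tax; a log-concave conditional with an adaptive envelope sidesteps the random walk at each Gibbs step.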
There's also the issue that all the contrasts are dependent, absent any action from a prior in the Bayesian setting. That is, players are only identified relative to each other... same as teams.
Now, is this going to be "right"? Of course not, but I think it'll take things further than before.
Of course, this shouldn't really change the broad notions of how the game is played; it'll be more interesting to see what it draws out. On the other hand, large variances will be unavoidable.

edit: For the record, I haven't read the paper. Quick googling says this is in the machine learning field, so I suspect there may be some dimension-reducing techniques of some variety.
edit #2: A Poisson process is usually the easiest way to go. Frankly, this is forbiddingly complex to start. I'm unfamiliar with agent-based models; I've been told by our local expert that there isn't really a core, no fundamental law of _____. I don't have any interest in going into it as such. Of course, I think at heart this is stuff I might have REALLY been interested in, as I was hoping my studies would lead me into things that behave dynamically.
Nevertheless, my understanding is that agent models and statistics are in some ways difficult to square, as the paradigms are wholly different. One is data in search of a model, the other a model in search of data... and never the twain shall meet.