11-19-2012, 05:59 PM
#23
Quote:
 Originally Posted by barneyg Your math problem (Gretzky's effect on Anderson, Lemieux's effect on Rob Brown..) is the anti-stats people complaint in a nutshell -- "hockey's a team game, so individual stats have to be flawed and you can't predict them". My personal opinion is exactly the same if you replace "you can't" with "it's tough to".
I'm not really trying to predict future values of Y, although I would expect a good model to have good predictive value (once the values for the variables are input). I'm more interested in trying to level the ground for comparison. From what I've seen, the production of a Gretzky or Lemieux is usually not heavily influenced by linemate/teammate effects, while the production of their teammates often is heavily boosted by playing with such great players.

Basically, I'm trying to remove the "Gretzky/Lemieux/Orr/? had a great year and some other players came along for the ride" effect in an objective manner, since it could substantially influence observed Y.

Quote:
 Originally Posted by barneyg Not sure your model with the dummy variables would achieve your objective per se -- the Gretzky dummy would only be equal to 1 for observations pertaining to Gretzky... Coffey would only have the Coffey dummy.. Kurri, the Kurri dummy... if Anderson had been free-riding on all those guys you still wouldn't catch it.
In the model with dummy variables, I'm not trying to capture any teammate effects (at least not at this point). The dummy variable for each player was intended to filter out the effect of that player's quality on the observed Y (the individual's points that season), so that Gretzky's points at age A in season S wouldn't be expected to be the same as those of a "normal" first-line player at age A in season S.
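To make the setup concrete, here's a toy sketch of that dummy-variable regression with invented numbers (not real data, and only a season variable; age and the other variables would enter as additional columns the same way). Each row is one player-season, and the per-player dummies absorb that player's baseline quality so the season coefficient is estimated on level ground:

```python
import numpy as np

# (player, season_index, points) -- illustrative values only
rows = [
    ("Gretzky", 0, 205), ("Gretzky", 1, 208),
    ("Kurri",   0, 113), ("Kurri",   1, 135),
    ("Coffey",  0, 126), ("Coffey",  1, 121),
]

players = sorted({r[0] for r in rows})
# Drop the first player's dummy to avoid perfect collinearity with the intercept.
X = np.array([
    [1.0, season] + [1.0 if p == q else 0.0 for q in players[1:]]
    for (p, season, _) in rows
])
y = np.array([pts for (_, _, pts) in rows], dtype=float)

# Ordinary least squares fit
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
labels = ["intercept", "season"] + [f"dummy_{q}" for q in players[1:]]
for name, b in zip(labels, beta):
    print(f"{name:>14s}: {b:8.2f}")
```

The "season" coefficient here is the kind of level-ground number the model is after, while each dummy coefficient just reflects how far that player's baseline sits above the omitted player's.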

Quote:
 Originally Posted by barneyg You're basically looking for a "quality of teammates" measure.. instead of going all the way into a different model with thousands of observations (every player, every year), couldn't you come up with a measure of how "concentrated" Y is? i.e. Y is adjusted scoring of top N players, what is the % of those N players that play on the top X teams? i.e. if in 1983-84 you have 5 Oilers and 3 Bruins out of the top 21 players, that's a "top 2 team" concentration of 8/21 = 38%. Or maybe some measure of distance between the very top player (Gretz 205) and the best player not on the Oilers (Goulet 121)..
I think this is a very good idea, and exactly the kind of input I was hoping for to improve the model in an objective manner.
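That concentration measure is easy to operationalize. A minimal sketch using the 1983-84 example from the quote (5 Oilers and 3 Bruins among the top 21 scorers; the third team's count is invented padding):

```python
def top_team_concentration(team_counts, n_players, top_x=2):
    """Share of the top-N scorers who play on the top_x best-represented teams."""
    top = sorted(team_counts.values(), reverse=True)[:top_x]
    return sum(top) / n_players

# 1983-84 example from the quote: 5 Oilers + 3 Bruins out of the top 21
# (the NYI count is a made-up placeholder to round out the dict)
counts_1984 = {"EDM": 5, "BOS": 3, "NYI": 2}
print(round(top_team_concentration(counts_1984, 21), 2))  # -> 0.38
```

The same seasonal value could then be fed into the regression as an extra variable alongside the others.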

It appears you may have misunderstood the other potential model with dummy variables. I was planning (once workable) to include a large number of observations of Y (player-seasons), but still to limit it to top-tier players, since it's the effects on those players that we're generally most interested in (i.e. those are the players most frequently compared). Players like Rob Brown wouldn't be in that model. Players like Kurri, Coffey and possibly Anderson would be, but the Y values for each of those players often had similar "teammate effects" across multiple seasons, so I expected much of that to be captured in the dummy variable for that individual player. Again, my main objective was to use the coefficients for each season, and probably age, as the basis for further analysis. The individual player coefficients would be interesting, but their primary purpose would be to increase the accuracy of the coefficients for the variables in the other categories.

Quote:
 Originally Posted by barneyg I would rephrase the bolded as: I don't really care how these changes influenced each other, all I want is to find a way to make the model more stable with respect to the implied relationship between each of those variables and adjusted scoring. I think going back in time mitigates this problem. If that means dropping PPG or PPO because of data availability, so be it -- you can always compare the other coefficients for (say) 1950-2010 with the model for 1964-2010 which will include it.
My goal was to create the most accurate, objective model possible given the limits of the data. There seems to be a trade-off between a model that explains more of the variability in Y (measured by R^2) and one in which we are more confident in the accuracy of the (smaller amount of) variability that the model does explain. It seems you believe more confidence in the coefficients is better than a larger degree of explanation, but is this generally accepted as preferable, or is it more of a gray area?
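For reference, R^2 as used in this thread is just the share of Y's variance the fitted model explains. A minimal sketch with made-up numbers (not model output):

```python
import numpy as np

def r_squared(y, y_hat):
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Invented actual vs. fitted point totals, purely illustrative
y     = np.array([100.0, 120.0, 90.0, 135.0])
y_hat = np.array([105.0, 118.0, 95.0, 128.0])
print(round(r_squared(y, y_hat), 3))  # -> 0.915
```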

Quote:
 Originally Posted by barneyg R^2 went from 0.32 to 0.56, is it due to the addition of 1968-1976 or those parity indicators? I would assume both.
Note: B0 is the y-intercept. Values are given as (coefficient value, T-score), where T-score = (coefficient value / standard error) * N^0.5. Coefficient values shown in parentheses are negative.

1980-2012 (7 variables)
R^2 = .521
B0 = 84, 26
Bn = (.22), 2.9
Be = (1.1), 0.96
Bg = (2.3), 9.0
Bp = 2.7, 17
Bf = 98, 19
Ba = 9.0, 1.55
Bt = 1.6, 4.5

1968-2012 (7 variables)
R^2 = .589
B0 = 82, 72
Bn = (.19), 5.4
Be = 2.5, 2.7
Bg = (1.7), 13
Bp = 2.7, 19
Bf = 91, 23
Ba = (26), 7
Bt = 2.4, 8.6

1968-2012 (4 variables)
R^2 = .389
B0 = 106, 131
Bn = (.65), 21
Be = 3.4, 3.4
Bg = (2.3), 18
Bp = 2.8, 17

It seems to me that going back to 1968 substantially increases our confidence in the variables, while using 2-3 additional variables substantially increases how much of the variation in Y is explained. Doing both gives the highest value of R^2 while still yielding a relatively high degree of confidence in the variables. Xe is more borderline, but one reason may be that it needs refinement: it currently uses non-Canadian scoring forwards among the top N scoring forwards, so either "forwards" needs to be replaced with "scorers" or Y needs to be changed to "forwards". This was done due to the current availability of data, and I don't want to spend extra time on such calculations until the model is complete or nearly so.
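As a quick sanity check on the T-score convention defined in the note above the tables, the formula can be written directly (the numbers below are illustrative only, not taken from the model):

```python
import math

def t_score(coef, std_err, n):
    """T-score per the convention above: (coefficient / standard error) * sqrt(N)."""
    return (coef / std_err) * math.sqrt(n)

# e.g. a coefficient of 2.7 with a standard error of 0.95 over N = 45
# observations (made-up values):
print(round(t_score(2.7, 0.95, 45), 1))
```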