Using Regression to Adjust "Adjusted Points" for Top Tier Players '68-12
I ran a linear regression for '80 to '12 using data that I already had, as follows:
Y = avg. adjusted scoring of top N players (N = # teams in NHL)
Xn = Number of teams in NHL
Xg = Avg. GPG in NHL
Xe = % of top N forwards who were born outside Canada (Canadian trained players from Europe, such as Heatley & Nolan were considered Canadian)
Xp = % of total goals recorded as special teams (PP & SH) goals
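For anyone following along outside Excel, here is a minimal sketch of what a multi-variable least-squares fit involves. The data are made up purely for illustration (they are not the NHL figures from this post), the function name `ols` is hypothetical, and LINEST may use a different numerical method internally.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    X is a list of rows; include a leading 1 in each row for an intercept.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("regressors are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(k):
            if r != col:
                factor = a[r][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][k] for i in range(k)]


# Hypothetical data built so that y = 2 + 3*x1 - 1*x2 exactly.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [2 + 3 * u - v for u, v in zip(x1, x2)]
X = [[1, u, v] for u, v in zip(x1, x2)]

coeffs = ols(X, y)
print([round(c, 6) for c in coeffs])
```

Because the toy y values are an exact linear function of the inputs, the fit should recover the intercept and both slopes. Real season data will of course leave residuals.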
Using all 4 variables, the R-squared was 99.8% and the values for each X were as follows:
Using 3 variables (Xn excluded), the R-squared was 99.7% and the values of each X were as follows:
Both appear to be very solid models for predicting the avg. adjusted scoring of the top N players each season. The average for the 32 seasons was 88.95 adj. points with a standard deviation of 3.59. With 4 variables, the predicted Y had a mean of 88.87 with the avg. absolute value of the error being 3.13, and 21/32 seasons had errors of < 1 stdev. With 3 variables, the predicted Y had a mean of 88.71 with the avg. absolute value of the error being 3.86, and 18/32 seasons had errors of < 1 stdev.
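The two fit summaries quoted above (average absolute error, and the number of seasons whose error is under one standard deviation of Y) can be computed along these lines. This is only a sketch: the numbers below are hypothetical, the function name is made up, and using the sample standard deviation is an assumption since the post does not say which version was used.

```python
import statistics


def fit_summary(actual, predicted):
    """Return (mean absolute error, count of errors smaller than 1 SD of actual)."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mean_abs_error = statistics.fmean([abs(e) for e in errors])
    sd_actual = statistics.stdev(actual)  # sample SD (assumption)
    within_one_sd = sum(1 for e in errors if abs(e) < sd_actual)
    return mean_abs_error, within_one_sd


# Hypothetical values, not the real season data.
actual = [90.0, 85.0, 92.0, 88.0]
predicted = [89.0, 87.0, 92.0, 84.0]

mae, within = fit_summary(actual, predicted)
print(mae, within)
```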
It's important to note that in both models there was a positive coefficient for Xg (league GPG), meaning that as league scoring decreased, the model predicted avg. adj. points of the top N players to decrease as well (by ~7-8 points per 1.0 point drop in league scoring).
I did this rather quickly, since I was using data readily available to me. One small flaw is that Xe measures % non-Canadian forwards in top N forwards in points, rather than % non-Canadians in top N players. It would be better if this variable and Y were aligned completely, but given the relatively few defensemen who appear in the top N in scoring, I highly doubt there would be a major effect on the results. If anything, properly aligning Y & Xe may only strengthen the relationship between them, since in more recent years when avg. adj. scoring of top N players has increased, there have been substantially more Euro/US d-men (Lidstrom, Leetch, Zubov, Gonchar, etc.). Still, the actual % of the top N is quite small, so I expect any distortions were relatively minor.
For those who understand this type of study, I certainly welcome comments, suggestions and even follow-up studies which may expand, improve or verify the results. This is what I meant by identifying, analyzing and quantifying various factors that may affect the difficulty of top level players to score adjusted points in various seasons. It can be done, and I have taken a step in that direction. I look forward to others taking further steps forward, instead of steps backward using improper analysis and/or pure speculation.
If someone who understands LINEST in Excel and/or linear regression can help me with this, that would be great:
I cannot get coefficients to generate for a particular model using LINEST. I have successfully used LINEST for other models, so it's likely a problem with the model (and its variables). The model is as follows:
one group of X variables are discrete variables for season (let's say there are 32 seasons, so it's either 0 or 1 for each possible season... it will only have a value of 1 for 1/32 of those for each player-season)
the next group of X variables are discrete variables for the player's age (if we use age ranges of 18-40, then the value is either 0 or 1 for each of 23 possible ages... and again it will only have a value of 1 for 1/23 of those for each player-season)
the next group of X variables are discrete and are for the player himself (the value is either 0 or 1 for each of the Q players in the study... and again will only have a value of 1 for 1/Q of those for each player-season)
there are possible variables that I would like to add, but if I do, will wait until I am able to successfully generate coefficients for the model as it already stands
I stopped at ~ a dozen players with a total of ~170 player-seasons. I thought since the degrees of freedom are df = N - k - 1 = 170 - (32 + 23 + 12) - 1 = 170 - 67 - 1 = 102, that coefficients should generate, but I'm obviously missing something and my linear regression knowledge is relatively basic and quite rusty.
Can anyone tell me why the coefficients won't generate? I don't want to put substantial time into this if it's not going to work. Any help would be appreciated.
I'll try my best to get this back on track..
I don't think you can make a judgement on how solid those models are for predicting anything based on R-squared, as that Y series is probably fairly stable. If you regressed Y on a constant you'd get a pretty high R-squared too.
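One possible reading of this point, sketched with the summary figures quoted earlier in the thread (mean 88.95, standard deviation 3.59, 32 seasons): if R-squared is measured against zero instead of against the mean of Y, then even a model that predicts the same constant every season scores very high. As far as I know, that is how LINEST computes the total sum of squares when its const argument is FALSE, but whether the original fit was run that way is not stated in the thread. Treating 3.59 as a population standard deviation is also an assumption.

```python
mean_y = 88.95  # from the post
sd_y = 3.59     # from the post; treated as a population SD (assumption)
n = 32          # seasons

# A "model" that predicts mean_y for every season.
sse = n * sd_y ** 2                            # sum of squared residuals
ss_centered = n * sd_y ** 2                    # sum of (y - mean)^2
ss_uncentered = n * (mean_y ** 2 + sd_y ** 2)  # sum of y^2

r2_centered = 1 - sse / ss_centered      # measured against the mean
r2_uncentered = 1 - sse / ss_uncentered  # measured against zero

print(round(r2_centered, 4), round(r2_uncentered, 4))
```

So a constant-only prediction gets 0 on the usual (centered) definition but roughly 0.998 on the uncentered one, which is why a high R-squared on a stable series may say little by itself.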
My main takeaway:
Y is adjusted to 6 GPG (HR method), right? If your regression used all the players in the league, by definition you'd get Xg=0, because that's what the adjustment does. You have Xg>0 for the top 5% of players, that means the top guys are further away from the mean in high-scoring seasons than in low-scoring seasons. The adjustment may not bring down the top guys enough in high scoring seasons.
Am I correct?
No, Xn actually appeared significant to me (Bn was almost 4x SEn). The least significant appeared to be Xp with Bp ~1.5x SEp. What's strange is that the individual correlations were:
Xn = 7%, Xp = 47%, Xe = 27%, and Xg = (-10%)
I thought Xn was coincidentally capturing a lot of the other variables, so I wanted to see what the coefficients looked like without Xn as one of the variables.
Can you help with this question as well:
I know there must be a way to find a more correct estimate of the difficulty/quality of each season for top players to score points, and I believe using regression will likely yield the most correct estimate possible. Unfortunately, my skills with it are obviously limited, and no one else seems interested in pursuing this avenue, despite my previous suggestions to do so.
I have seen a measure such as 'incremental R-square' i.e. (new-old)/old where 'new' is the R2 from the full model and 'old' is the one with only an intercept or something.
In models with stable time series, people often calculate first differences i.e. Y'(1985) = Y(1985) - Y(1984) and run the model on those changes instead of the level. It changes the narrative because 'last year' is the benchmark for each observation but since your objective is often 'what makes Y change' it works. There's a deeper econometric(al?) reason that justifies first differences as well.
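A minimal sketch of the first-difference transformation described above. The values are hypothetical and the function name is made up; it simply differences each season against the previous one present in the data, so a missing season would need separate handling.

```python
def first_differences(series):
    """Year-over-year changes for a {year: value} mapping.

    Each year is compared with the previous year present in the mapping.
    """
    years = sorted(series)
    return {curr: series[curr] - series[prev]
            for prev, curr in zip(years, years[1:])}


# Hypothetical levels of Y, not the real data.
y_level = {1984: 88.0, 1985: 90.5, 1986: 89.0}

y_change = first_differences(y_level)
print(y_change)
```

The same function would be applied to each X series, and the regression would then be run on the changes instead of the levels.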
I'll try to find time to look at the other thread too.
Please do look at my other post if you have time at some point. It's not a whole thread, just a post with questions in a sticky thread titled "Ideas for Future Studies" on this (BTN) sub-forum. I think the model I was trying to create, or a similar, workable version of such, would be the best solution in the longer term, but obviously we need a model where coefficients will generate.
Using Regression to Study Factors which Influence Scoring for Top Tier Players
The variables mentioned shouldn't really be cross-correlated, except possibly to Xg (league GPG). The other potential correlations would seem to be more coincidental than anything.
I ran another regression from 1976-2012 with y-intercept (B) and same coefficients (Mn... Mg for Xn... Xg):
Mn = (0.75)
Me = 7.15
Mg = (1.65)
Mp = 61.8
R^2 = 0.32
All variables appear significant, with the lowest t-stat that for Xg at ~ 9 (N=36, Mg/SEg ~1.5).
There is a large cross-correlation between many of the variables:
Xn & Xe = 88% ... did NHL expand in response to Euro influx? I think this is largely coincidental.
Xn & Xg = (83%) ... did expansion make goal scoring decrease? This wouldn't be the expected observation IMO.
Xn & Xp = 45% ... I don't see a logical relationship between expansion and increased power plays, but it's possible.
Xe & Xg = (79%) ... did Euro influx cause goal scoring to decrease? Considering Euros were disproportionately scoring forwards, this seems odd, although talent compression tends to decrease scoring IMO.
Xe & Xp = 50% ... did Euro influx cause an increase in power plays? I don't see why, esp. as it contradicts other correlation(s).
Xg & Xp = (18%)... not much of a correlation, but why would power plays and goal scoring be negatively correlated?
BTW, in case I wasn't clear before, Y = avg. of top N players' adjusted points. This means Y is per 82 games, adjusted to 6.00 gpg league avg., and assist/goal ratio of 5/3.
My main concerns with this model are that it appears there may be important variables missing (given the low R^2) and that the variables are mostly cross-correlated. I don't think roster size is an issue during this period. What other variables may be missing? I think the cross-correlation of variables is largely coincidental, but can the variables be better defined to prevent this?
I think one factor that's not totally included in the variables is the quality of talent in the league. This is captured somewhat by % of non-Canadian forwards in top 1N (Xe), but doesn't tell you if a lot of great talent is in its peak/prime. For instance, the largest error between the predicted value and actual value is in 1996 when predicted is lower. One only has to look at the top 20+ players in scoring to see that season was full of quality forwards (although power plays increased substantially as well). Another possible variable is something for league parity (standard deviation of GF/GA ratio or GF only or GA only?).
I ran another regression from 1968-2012, and this time included two variables for parity: Xf & Xa, which are the standard deviations for each of team GF & GA, divided by the mean of team GF & GA.
The new results are:
B = 82
Mn = (0.44)
Mf = 87
Ma = (18)
Me = 8.1
Mg = (.74)
Mp = 42
R^2 = .56
All variables appear significant, with Xa having the lowest t-score of ~4.8 (Ma/SEa = .73, N^.5 = 6.6).
I think this model holds a lot of promise, with R^2 > .5, all variables significant, and I thought this was interesting as well:
Standard deviation of Y (avg. adjusted points of top N players) was 3.86, and only 3/44 predicted values of Y varied from the actual value by more than this (the highest deviation was ~1.6 std dev). Each of those three predicted values was lower than the actual value, possibly in part due to some of the best players being in the league and having strong seasons (Orr, Espo, etc. in '72... Lemieux, Jagr, etc. in '96 & '97).
Some of the variables may improve with further refinement. It may be useful to define a variable that will somehow capture the effect of having so much top talent in the league, but not exactly sure what the best and fairest way to do that might be. Any suggestions welcome.
For those not familiar with regression, this is what the model suggests at this stage (Y is avg. adjusted points of top N players, where N = number of teams in league):
For each additional team, Y decreases by .44
For each 1 % point increase in standard deviation of teams' GF, Y increases by .87
For each 1 % point increase in standard deviation of teams' GA, Y decreases by .18
For each 10 % point increase in % of non-Canadians in top N, Y increases by .81
For each .10 increase in league GPG, Y decreases by .07
For each 1 % point increase in special teams goals as % of total goals, Y increases by .42
I added one more variable. This variable is intended to capture some of the effect of "offensive powerhouses". A player like Orr, Gretzky or Lemieux may elevate the point totals of teammate(s), which would increase Y. A lack of parity in the league may also aid this process, but much of this should be captured in the parity variables (Xf & Xa). The variable is Xt and is defined as follows: the GF for the top 0.2N teams are added (for 21 teams, the top 4 teams, plus .2 * 5th team) and divided by 0.2N. This number is divided by league avg. GF, and one is subtracted from the result (this is the ratio to avg. by which the "powerhouses" differed from avg.). This result is then divided by Xf to scale the result based on parity (I thought this should help separate the variable from Xf and prevent much of the potential overlapping).
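The Xt definition above can be sketched as follows. The function name and the team totals are hypothetical, and using the population standard deviation for Xf is an assumption (the post does not say which version was used).

```python
import statistics


def powerhouse_index(team_gf, share=0.2):
    """Xt as described above: how far the top `share` of teams exceed the
    league-average GF, scaled by the parity measure Xf (SD / mean of team GF)."""
    n = len(team_gf)
    k = share * n                      # e.g. 4.2 teams in a 21-team league
    ranked = sorted(team_gf, reverse=True)
    whole = int(k)
    frac = k - whole
    top_total = sum(ranked[:whole])
    if frac > 0 and whole < n:
        top_total += frac * ranked[whole]   # partial credit for the next team
    league_avg = statistics.fmean(team_gf)
    ratio_above_avg = (top_total / k) / league_avg - 1
    xf = statistics.pstdev(team_gf) / league_avg  # population SD (assumption)
    return ratio_above_avg / xf


# Hypothetical 5-team league, so the "top 0.2N" is exactly one team.
xt = powerhouse_index([300, 250, 250, 250, 200])
print(round(xt, 3))
```

In this toy league the top team is 20% above average and Xf is about 0.126, giving an Xt of roughly 1.58.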
I ran another regression for '68-'12 and it increased R^2 to .583, with all variables again being significant. Values were as follows:
B = 78
Mn = (.24)
Mf = 91
Ma = (27)
Mg = (1.2)
Me = 4.3
Mp = 45
Mt = 2.7
Here are the value ranges for each variable from '68-'12:
Y: 82.6 to 99.7 (avg. 90)
Xn: 12 to 30 (avg. 23)
Xf: .078 to .21 (avg. .13)
Xa: .097 to .23 (avg. .14)
Xg: 5.14 to 8.03 (avg. 6.4)
Xe: 0 to .63 (avg. .27)
Xp: .214 to .385 (avg. .28)
Xt: .23 to 1.8 (avg. 1.3)
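As a rough sanity check, the 7-variable coefficients listed above can be applied to the average value of each variable. This sketch assumes the parentheses in the coefficient list denote negative values (consistent with how the earlier model was interpreted in the thread). All inputs are the rounded figures from the post, so the result is only approximate.

```python
# Rounded coefficients from the '68-'12 seven-variable model in the post.
intercept = 78
coefficients = {
    "Xn": -0.24, "Xf": 91, "Xa": -27, "Xg": -1.2,
    "Xe": 4.3, "Xp": 45, "Xt": 2.7,
}
# Rounded averages of each variable from the post.
averages = {
    "Xn": 23, "Xf": 0.13, "Xa": 0.14, "Xg": 6.4,
    "Xe": 0.27, "Xp": 0.28, "Xt": 1.3,
}

predicted_y = intercept + sum(coefficients[k] * averages[k] for k in coefficients)
print(round(predicted_y, 2))
```

This comes out at about 90.1, in line with the reported average Y of 90, which is what one would expect from a least-squares fit with an intercept evaluated at the means.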
The average of the absolute error between predicted and actual was 2.11 (2.14 in 6 variable model). This compared favorably to the standard deviation of 3.86 in actual Y values.
If the variances (square of standard deviations) are used for Xf & Xa, this actually increases the R^2 to .588, decreases the avg. absolute error to 2.08, and all variables remain significant.
I still believe one of the most important missing "variables" is the presence or absence of certain great players at different times. For instance, how does one measure the fact that Ovechkin, Malkin, Crosby, and Thornton were all mostly healthy and at/near the top of their game in 2008... but in 2011, these players were mostly injured and/or off their games? It seems that incorporating discrete variables is more difficult than I initially thought, so I'm not sure how to measure this aspect of each season.
I am majoring in Business Information Statistics at university. It's good to finally see multiple regression formulas being used for the greater good ;). I hope to learn enough from you to become more understanding and fluent in my replies. Keep it up dude, pretty interesting stuff :)
Y = b1*X(s1980) + ... + b32*X(s2011) + b33*X(a18) + ... + b55*X(a40) + b56*X(p1) + ... b67*X(p12)
where X(s1980)...X(s2011) are season dummy variables (also called indicator variables), X(a18)...X(a40) are player age dummies, and X(p1)...X(p12) are player (name) dummies. I'm not sure why you want those player dummies in there (p1..p12).
But to get back to your question, it's not a question of degrees of freedom. The regressors in your model must be linearly independent, and they aren't. For example, right now for every player you have X(s2011) = 1 - X(s1980) - X(s1981) - .... - X(s2010) i.e. the sum of those 32 dummy variables is 1... same thing for the other 2 types, the sum of all variables of the same type for a given player is 1.
A simple solution is to drop one of the dummies for each type, ie. drop X(s1980), X(a18), and X(p1). You will still get an error for some age coefficients if your sample doesn't have anyone playing up to age 40 or as early as 18 but the rest of the model should work.
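To illustrate why the full set of dummies fails and why dropping one per category helps, here is a small sketch that compares the number of columns with the rank of the design matrix. It uses a toy layout (3 seasons, 3 ages, 2 players, every combination observed once), which real player data would not satisfy, and all function names are hypothetical. The reduced design includes an explicit intercept column, which is what LINEST adds by default.

```python
from fractions import Fraction
from itertools import product


def matrix_rank(rows):
    """Rank via Gaussian elimination using exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot available in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


def one_hot(level, levels, drop_first=False):
    """0/1 dummy columns for one categorical value."""
    kept = levels[1:] if drop_first else levels
    return [1 if level == k else 0 for k in kept]


seasons = ["s1980", "s1981", "s1982"]
ages = ["a18", "a19", "a20"]
players = ["p1", "p2"]
observations = list(product(seasons, ages, players))  # toy data: 18 rows

# Every dummy included: within each category the columns sum to 1.
full = [one_hot(s, seasons) + one_hot(a, ages) + one_hot(p, players)
        for s, a, p in observations]
# Intercept plus all but the first dummy of each category.
reduced = [[1] + one_hot(s, seasons, True) + one_hot(a, ages, True)
           + one_hot(p, players, True) for s, a, p in observations]

full_cols, full_rank = len(full[0]), matrix_rank(full)
reduced_cols, reduced_rank = len(reduced[0]), matrix_rank(reduced)
print(full_cols, full_rank, reduced_cols, reduced_rank)
```

When the rank is lower than the column count, there is no unique set of coefficients, which is the situation described above. In the reduced design the two numbers match, and each remaining coefficient is read relative to the dropped baseline (first season, youngest age, first player).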
As for your Xt variable definition, I'm assuming you mean "the GF for the top N/5 teams"? "top 2N" would mean the top 42 teams for a 21-team league, doesn't make sense..
Let me suggest: (-> means leads to)
90s expansion -> the trap -> decreased scoring
can't trap on special teams -> increased % of PP scoring (vs total scoring)
If you don't like "the trap" you can substitute with "video analysis" or something. That would explain all correlations except those involving Euros. Then,
salary increases (late 80s) -> Euro influx
fall of the Iron Curtain (89-91) -> Euro influx
You can't really operationalize all this stuff into regressions but if you used data points from 1960 onward (therefore including the first expansions) all those correlations would decrease significantly (I would assume early-mid 70s expansion and the addition of the WHA teams led to increased scoring in the NHL, which would reverse that negative Xn/Xg relationship).
The reason for the p (player) dummy variables was to filter out the effect of the player's general value/skill, in order to produce accurate coefficients for the other variables (age, season).
Each observation of Y is one player's season. So '80 Gretzky would be: 117 = b1*X(s1980) + b34*X(a19) + b56*X(p1), where p1=Gretzky. I was hoping to simultaneously measure the effects of age and different seasons on top players' scoring, so that it could be applied to any top player in any season.
If I understand you correctly, you suggest that eliminating one dummy variable in each category will solve that problem, but I don't see how that would change the fact that the variables within each category are not independent. BTW, in the simplified (small scale) model, I still had 3+ observations for each dummy variable. IOW, there were at least 3 observations at each age, at least 3 for each season, and at least 3 for each player.
It sounds to me like this type of model just isn't possible.
My main hypothesis, regarding parity in this model, is that the more team parity there is, the more difficult it is generally for the top tier of players to exceed the average in % terms. IOW, there's not as much running up the score with meaningless points when teams are closely bunched together in terms of ability.
With Xt, I was trying to capture the "teammate/linemate" effect in an objective manner, as well as how easy it was for teams to become offensive juggernauts (even after factoring out general parity). How does one factor in the fact that Gretzky or Lemieux, e.g., may have significantly elevated one or more teammates' point totals? Also, how does one factor in the presence of phenoms such as Gretzky and Lemieux in the league? I'm just not sure how to capture these effects objectively. Eliminating such "outliers" from the Y population is changing the Y population subjectively... and where do you draw the line and stop eliminating "outliers"?... and what about their teammates? This is one reason I was trying to build the model with the dummy variables, because it would bypass this problem.
My hunch is that the "Gretzky, Lemieux & friends" effect is going to be much larger than the injury factor. It's really difficult to measure why Ovechkin went from top of the league to just another very good scorer (and the effects on Backstrom, Semin, etc.). He wasn't injured, he's still in his prime. I don't think it's easy to objectively measure that.
Salaries may have given incentive for even more Euros to join the NHL, but the fall of the Iron Curtain seems the main reason, as it allowed in the NHL a large pool of players that were previously prevented from playing there.
The most recent regressions were run from '68 to '12, but PP data goes back to '64, so I will go back that far when I next run the model.
The problem is that a lot of these changes happened in a short time: increased PPOs starting in the late 80s... salaries are ever-increasing... fall of Iron Curtain in early 90s... expansion beginning in early 90s and continuing during decade... large Euro/Russian influx in early & mid 90s... scoring decreasing in mid-90s... increased parity from the mid-90s... increased use of defensive systems and better/larger goalie equipment, etc.
Some of these would be expected to show substantial correlation and do. Some would be expected, but show a much smaller correlation or even one opposite in direction to that expected. Some wouldn't be expected to have a large correlation, yet do. It's not easy to objectively determine how these changes are influencing each other, even if we agree that the effect on the model is to make it more sensitive to additional variables.
Thanks for your insight on the model and its related components. It appears I'm a bit over my head here, which is especially difficult when tackling this as a solo project, so your help is much appreciated.
"Linear independence" is quite a bit more permissive than "independence" as usually defined in probability theory. You can have heavily correlated variables -- this creates other problems, but they will still be considered linearly independent as long as one of those variables isn't completely determined by a linear combination of the others.
That said, I would assume that going all the way back to 1964 is going to mitigate some of your problem because you'll be adding years where expansion led to increased scoring (reducing the correlation between Xn and Xg), while keeping the % of Euros relatively constant (reducing correlation of Xe with everything else).
I substituted PPO/game for PPG as % of total goals. This didn't seem to affect the model too much (R^2 rose slightly to .589). I think this is better, since it prevents a potential correlation between decreased scoring and PPG as % of goals (as you pointed out), for whatever reason.
I also ran the regression back to 1964, the first year for which I have data for power plays. This decreased R^2 to .48, which is a substantial reduction. I think the problem is that with only 6 teams, Y is based on the average of only 6 players, and other variables are based on calculations for only 6 teams. It may be best to use a model for '68-present (or perhaps an even shorter span if it's found to be much more reliable), and a separate model for the O6. For now, I will concentrate on '68-present, since there is more/better data available for those seasons, and there are more interesting and reliable changes in the variables.
Not sure your model with the dummy variables would achieve your objective per se -- the Gretzky dummy would only be equal to 1 for observations pertaining to Gretzky... Coffey would only have the Coffey dummy.. Kurri, the Kurri dummy... if Anderson had been free-riding on all those guys you still wouldn't catch it.
You're basically looking for a "quality of teammates" measure.. instead of going all the way into a different model with thousands of observations (every player, every year), couldn't you come up with a measure of how "concentrated" Y is? i.e. Y is adjusted scoring of top N players, what is the % of those N players that play on the top X teams? i.e. if in 1983-84 you have 5 Oilers and 3 Bruins out of the top 21 players, that's a "top 2 team" concentration of 8/21 = 38%. Or maybe some measure of distance between the very top player (Gretz 205) and the best player not on the Oilers (Goulet 121)..
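Both suggested measures are simple to compute. A sketch using the example from the post (5 Oilers and 3 Bruins among the top 21 scorers, Gretzky at 205 and Goulet at 121); the function name and the team labels are hypothetical.

```python
def top_team_concentration(player_teams, top_teams):
    """Share of the top-N scorers who play for one of the given teams."""
    on_top_teams = sum(1 for team in player_teams if team in top_teams)
    return on_top_teams / len(player_teams)


# Team of each top-21 scorer, following the example in the post.
player_teams = ["EDM"] * 5 + ["BOS"] * 3 + ["OTHER"] * 13

concentration = top_team_concentration(player_teams, {"EDM", "BOS"})
print(round(concentration, 2))

# Gap between the top scorer and the best scorer on another team.
gap = 205 - 121
print(gap)
```

Either number could then be added as one more X variable per season, in the same way as the parity variables.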
I think going back in time mitigates this problem. If that means dropping PPG or PPO because of data availability, so be it -- you can always compare the other coefficients for (say) 1950-2010 with the model for 1964-2010 which will include it.
R^2 went from 0.32 to 0.56, is it due to the addition of 1968-1976 or those parity indicators? I would assume both.
Are correlations still as high with the 1968-2012 data?
Basically, I'm trying to remove the "Gretzky/Lemieux/Orr/? had a great year and some other players came along for the ride" effect in an objective manner, since it could substantially influence observed Y.
It appears you may have misunderstood the other potential model with dummy variables. I was planning to (once workable) include a large number of observations of Y (player-season), but still limit it to top tier players, since it's the effects on those players that we're generally most interested in (i.e. those are the players most frequently compared). Players like Rob Brown wouldn't be in that model. Players like Kurri, Coffey and possibly Anderson would be, but the Y values for each of those players often had similar "teammate effects" in multiple seasons, so I expected much of that to be captured in the dummy variable for that individual player. Again, my main objective was to use the coefficient for each season, and probably age, as the basis for further analysis. The individual player coefficients would be interesting, but the purpose of their presence would be primarily to increase the accuracies of the coefficient for variables in other categories.
1980-2012 (7 variables)
B0= 84, 26
Bn= (.22), 2.9
Be= (1.1), 0.96
Bg= (2.3), 9.0
Bp= 2.7, 17
Bf= 98, 19
Ba= 9.0, 1.55
Bt= 1.6, 4.5
1968-2012 (7 variables)
Bn= (.19), 5.4
Be= 2.5, 2.7
Bg= (1.7), 13
Bf= 91, 23
Ba= (26), 7
Bt= 2.4, 8.6
1968-2012 (4 variables)
B0= 106, 131
Bn= (.65), 21
Be= 3.4, 3.4
Bg= (2.3), 18
Bp= 2.8, 17
It seems to me that going back to 1968 substantially increases our confidence in the variables, while using 2-3 additional variables substantially increases how much of the variation in Y is explained by the variables. Doing both seems to give us the highest value of R^2, while still giving us a relatively high degree of confidence in our variables. Xe is more borderline, but one reason may be that it needs to be refined a bit (it uses non-Canadian scoring forwards in top N scoring forwards... forwards needs to be replaced with "scorers" or Y needs to be changed to "forwards"... this was done due to current availability of data, and don't want to spend extra time on such calculations until model is or is nearly complete).
I believe this is a good approach for studying various issues, but it seems to rarely be used for such... and I'm starting to understand why. :laugh:
These are the largest (absolute value of) correlations between variables in the 7 variable model run for 1968-2012:
Xn & Xe: 90%
Xf & Xa: 78%
Xn & Xf: (69%)
Xn & Xt: (65%)
Xe & Xf: (61%)
Xe & Xg: (56%)
Xn & Xa: (53%)
Xe & Xa: (50%)
Xe & Xp: 50%
Xg & Xt: 48%
Xn & Xp: 45%
Xn & Xg: (44%)
Xp & Xf: (42%)
Xp & Xa: (34%)
Xn & Xe are generally the most correlated variables. The largest correlations not involving either of these variables are bolded above. Running the model from 1968-2012 without those two variables, generates the following results:
B0= 77, 110
Bf= 102, 32
Ba= (32), 8.9
Bt= 3.3, 15
The correlations with Y in the 7 variable model ('68-12) were: