By The Numbers: Hockey Analytics... the Final Frontier. Explore strange new worlds, to seek out new algorithms, to boldly go where no one has gone before.

Using Regression to Adjust "Adjusted Points" for Top Tier Players '68-12

I ran a linear regression for '80 to '12 using data that I already had, as follows:

Y = avg. adjusted scoring of top N players (N = # teams in NHL)

Xn = Number of teams in NHL

Xg = Avg. GPG in NHL

Xe = % of top N forwards who were born outside Canada (Canadian trained players from Europe, such as Heatley & Nolan were considered Canadian)

Xp = % of total goals recorded as special teams (PP & SH) goals

Using all 4 variables, the R-squared was 99.8% and the values for each X were as follows:

Xn= 1.05
Xg= 6.77
Xe= 16.4
Xp= 49.4

Using 3 variables (Xn excluded), the R-squared was 99.7% and the values of each X were as follows:

Xg= 7.83
Xe= 39.5
Xp= 92.8

Both appear to be very solid models for predicting the avg. adjusted scoring of the top N players each season. The average for the 32 seasons was 88.95 adj. points with a standard deviation of 3.59. With 4 variables, the predicted Y had a mean of 88.87 with the avg. absolute value of the error being 3.13, and 21/32 seasons had errors of < 1 stdev. With 3 variables, the predicted Y had a mean of 88.71 with the avg. absolute value of the error being 3.86, and 18/32 seasons had errors of < 1 stdev.
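
The two fit summaries quoted above (average absolute error, and the number of seasons whose error is under one standard deviation of Y) are easy to recompute. This is a minimal sketch in Python; `fit_summary` is a hypothetical helper, the numbers are made-up placeholders rather than the thread's data, and it is an assumption that the population standard deviation is the one intended.

```python
from statistics import mean, pstdev

def fit_summary(actual, predicted):
    """Return (mean absolute error, number of errors below one std dev of actual)."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    sd = pstdev(actual)  # population std dev; swap in stdev() for the sample version
    within = sum(1 for e in errors if e < sd)
    return mean(errors), within

# Hypothetical per-season values, for illustration only.
actual = [90.0, 86.0, 94.0, 88.0]
predicted = [89.0, 88.0, 91.0, 88.5]
mae, within = fit_summary(actual, predicted)
```

Comparing the mean absolute error with the standard deviation of Y, as the post does, shows how much better the model does than simply guessing the average every season.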

It's important to note that in both models there was a positive coefficient for Xg (league GPG), meaning that as league scoring decreased, the model predicted avg. adj. points of the top N players to decrease as well (by ~7-8 points per 1.0 point drop in league scoring).

I did this rather quickly, since I was using data readily available to me. One small flaw is that Xe measures % non-Canadian forwards in top N forwards in points, rather than % non-Canadians in top N players. It would be better if this variable and Y were aligned completely, but given the relatively few defensemen who appear in the top N in scoring, I highly doubt there would be a major effect on the results. If anything, properly aligning Y & Xe may only strengthen the relationship between them, since in more recent years when avg. adj. scoring of top N players has increased, there have been substantially more Euro/US d-men (Lidstrom, Leetch, Zubov, Gonchar, etc.). Still, the actual % of the top N is quite small, so I expect any distortions were relatively minor.

For those who understand this type of study, I certainly welcome comments, suggestions and even follow-up studies which may expand, improve or verify the results. This is what I meant by identifying, analyzing and quantifying various factors that may affect the difficulty of top level players to score adjusted points in various seasons. It can be done, and I have taken a step in that direction. I look forward to others taking further steps forward, instead of steps backward using improper analysis and/or pure speculation.

If someone who understands LINEST in Excel and/or linear regression can help me with this, that would be great:

I cannot get coefficients to generate for a particular model using LINEST. I have successfully used LINEST for other models, so it's likely a problem with the model (and its variables). The model is as follows:

one group of X variables are discrete variables for season (let's say there are 32 seasons, so it's either 0 or 1 for each possible season... it will only have a value of 1 for 1/32 of those for each player-season)

the next group of X variables are discrete variables for the player's age (if we use age ranges of 18-40, then the value is either 0 or 1 for each of 23 possible ages... and again it will only have a value of 1 for 1/23 of those for each player-season)

the next group of X variables are discrete and are for the player himself (the value is either 0 or 1 for each of the Q players in the study... and again will only have a value of 1 for 1/Q of those for each player-season)

there are possible variables that I would like to add, but if I do, will wait until I am able to successfully generate coefficients for the model as it already stands

I stopped at ~a dozen players with a total of ~170 player-seasons. I thought that since the degrees of freedom are df = N - k - 1 = 170 - (32 + 23 + 12) - 1 = 170 - 67 - 1 = 102, coefficients should generate, but I'm obviously missing something, and my linear regression knowledge is relatively basic and quite rusty.

Can anyone tell me why the coefficients won't generate? I don't want to put substantial time into this if it's not going to work. Any help would be appreciated.

I ran a linear regression for '80 to '12 using data that I already had, as follows:

Y = avg. adjusted scoring of top N players (N = # teams in NHL)
Xn = Number of teams in NHL
Xg = Avg. GPG in NHL
Xe = % of top N forwards who were born outside Canada (Canadian trained players from Europe, such as Heatley & Nolan were considered Canadian)
Xp = % of total goals recorded as special teams (PP & SH) goals

Using all 4 variables, the R-squared was 99.8% and the values for each X were as follows:

Xn= 1.05
Xg= 6.77
Xe= 16.4
Xp= 49.4

Using 3 variables (Xn excluded), the R-squared was 99.7% and the values of each X were as follows:

Xg= 7.83
Xe= 39.5
Xp= 92.8

Both appear to be very solid models for predicting the avg. adjusted scoring of the top N players each season. The average for the 32 seasons was 88.95 adj. points with a standard deviation of 3.59. With 4 variables, the predicted Y had a mean of 88.87 with the avg. absolute value of the error being 3.13, and 21/32 seasons had errors of < 1 stdev. With 3 variables, the predicted Y had a mean of 88.71 with the avg. absolute value of the error being 3.86, and 18/32 seasons had errors of < 1 stdev.

It's important to note that in both models there was a positive coefficient for Xg (league GPG), meaning that as league scoring decreased, the model predicted avg. adj. points of the top N players to decrease as well (by ~7-8 points per 1.0 point drop in league scoring).

(...)

For those who understand this type of study, I certainly welcome comments, suggestions and even follow-up studies which may expand, improve or verify the results. This is what I meant by identifying, analyzing and quantifying various factors that may affect the difficulty of top level players to score adjusted points in various seasons. It can be done, and I have taken a step in that direction. I look forward to others taking further steps forward, instead of steps backward using improper analysis and/or pure speculation.

You report coefficients on each regressor but it's hard to really make sense of the results without the t-statistics. Given the insignificant drop in R-squared when you drop Xn I would assume that 1.05 coefficient is insignificant but I'd like to see the others.

I don't think you can make a judgement on how solid those models are for predicting anything based on R-squared, as that Y series is probably fairly stable. If you regressed Y on a constant you'd get a pretty high R-squared too.

My main takeaway:

Y is adjusted to 6 GPG (HR method), right? If your regression used all the players in the league, by definition you'd get Xg=0, because that's what the adjustment does. You have Xg>0 for the top 5% of players, that means the top guys are further away from the mean in high-scoring seasons than in low-scoring seasons. The adjustment may not bring down the top guys enough in high scoring seasons.

You report coefficients on each regressor but it's hard to really make sense of the results without the t-statistics. Given the insignificant drop in R-squared when you drop Xn I would assume that 1.05 coefficient is insignificant but I'd like to see the others.

No, Xn actually appeared significant to me (Bn was almost 4x SEn). The least significant appeared to be Xp with Bp ~1.5x SEp. What's strange is that the individual correlations were:

Xn = 7%, Xp = 47%, Xe = 27%, and Xg = (-10%)

I thought Xn was coincidentally capturing a lot of the other variables, so I wanted to see what the coefficients looked like without Xn as one of the variables.

Quote:

Originally Posted by barneyg

I don't think you can make a judgement on how solid those models are for predicting anything based on R-squared, as that Y series is probably fairly stable. If you regressed Y on a constant you'd get a pretty high R-squared too.

What's the best way to judge models with a relatively stable Y? Just look at the significance of each individual coefficient or is there a better way of judging/comparing models in such cases?

Quote:

Originally Posted by barneyg

My main takeaway:

Y is adjusted to 6 GPG (HR method), right? If your regression used all the players in the league, by definition you'd get Xg=0, because that's what the adjustment does. You have Xg>0 for the top 5% of players, that means the top guys are further away from the mean in high-scoring seasons than in low-scoring seasons. The adjustment may not bring down the top guys enough in high scoring seasons.

Am I correct?

Yes, you seem to understand the process and model well, and are correct in each case. One caveat is that Xg had a small, negative correlation (-10%), but when used as a variable in the models the coefficient became positive and was also the most significant in the 4-variable model (Bg ~7x SEg).

I know there must be a way to find a more correct estimate of the difficulty/quality of each season for top players to score points, and I believe using regression will likely yield the most correct estimate possible. Unfortunately, my skills with it are obviously limited, and no one else seems interested in pursuing this avenue, despite my previous suggestions to do so.

No, Xn actually appeared significant to me (Bn was almost 4x SEn). The least significant appeared to be Xp with Bp ~1.5x SEp. What's strange is that the individual correlations were:

Xn = 7%, Xp = 47%, Xe = 27%, and Xg = (-10%)

I thought Xn was coincidentally capturing a lot of the other variables, so I wanted to see what the coefficients looked like without Xn as one of the variables.

What are the correlations between those variables? I'd assume Xp is strongly correlated with them if it becomes the least significant variable once the others are included. That said, with 30 data points any t-stat above ~2 will be significant and the t-statistic on Bp seems to be 1.5*5.5 = 8.25 (square root of 30 = 5.5).
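
One point worth checking here: if the SE values are the coefficient standard errors that LINEST returns in the second row of its statistics output (an assumption about how they were obtained), then B/SE is already the t-statistic, and no extra multiplication by the square root of N is needed. A minimal sketch under that reading, with hypothetical helper names:

```python
def t_statistic(coef, std_err):
    """t-statistic for H0: coefficient = 0, given the coefficient's standard error."""
    return coef / std_err

def looks_significant(t, threshold=2.0):
    """Rule of thumb quoted in the thread: |t| above ~2 with ~30 observations."""
    return abs(t) > threshold

# Ratios quoted in the thread, expressed with SE = 1 for convenience.
t_bp = t_statistic(1.5, 1.0)  # Xp: "Bp ~1.5x SEp"
t_bn = t_statistic(4.0, 1.0)  # Xn: "almost 4x SEn" (rounded up here)
```

On this reading, Xn clears the ~2 threshold and Xp does not, which matches the earlier remark that Xp looked least significant.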

Quote:

Originally Posted by Czech Your Math

What's the best way to judge models with a relatively stable Y? Just look at the significance of each individual coefficient or is there a better way of judging/comparing models in such cases?

I'm not an econometrician (I don't play one on TV either), but I know that R-square doesn't mean squat when no intercept is included in the model. Was there one in yours?

I have seen a measure such as 'incremental R-square' i.e. (new-old)/old where 'new' is the R2 from the full model and 'old' is the one with only an intercept or something.
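
To see why the intercept matters so much, consider a model with no regressors at all that predicts the average Y for every season. As I understand Excel's documentation, when LINEST's third argument is FALSE the total sum of squares is the sum of squared y-values rather than squared deviations from the mean. The sketch below works through that arithmetic with the mean and standard deviation from the first post; the function name is hypothetical and the population variance is assumed.

```python
mean_y = 88.95  # average of Y over the 32 seasons (from the first post)
sd_y = 3.59     # standard deviation of Y (from the first post)

def r2_uncentered_constant(mean_y, sd_y):
    """R^2 of a constant-only fit when total SS is sum(y^2), not sum((y - mean)^2)."""
    # Per observation: residual SS = sd^2, uncentered total SS = sd^2 + mean^2.
    return 1 - sd_y**2 / (sd_y**2 + mean_y**2)

r2 = r2_uncentered_constant(mean_y, sd_y)  # roughly 0.998
```

Under those assumptions a constant alone scores about 99.8%, so an R-squared of that size from a no-intercept fit says little about the regressors.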

In models with stable time series, people often calculate first differences i.e. Y'(1985) = Y(1985) - Y(1984) and run the model on those changes instead of the level. It changes the narrative because 'last year' is the benchmark for each observation but since your objective is often 'what makes Y change' it works. There's a deeper econometric(al?) reason that justifies first differences as well.
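
A minimal sketch of the differencing step in Python. The series is a hypothetical placeholder, the helper names are illustrative, and the same transformation would be applied to every X before re-running the regression (one observation is lost at the start).

```python
def first_differences(series):
    """Raw year-over-year changes: Y'(t) = Y(t) - Y(t-1)."""
    return [b - a for a, b in zip(series, series[1:])]

def pct_differences(series):
    """Relative year-over-year changes: Y(t) / Y(t-1) - 1."""
    return [b / a - 1 for a, b in zip(series, series[1:])]

y = [88.0, 90.5, 89.0, 92.0]  # hypothetical Y values for four seasons
dy = first_differences(y)     # [2.5, -1.5, 3.0]
```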

I'll try to find time to look at the other thread too.

What are the correlations between those variables? I'd assume Xp is strongly correlated with them if it becomes the least significant variable once the others are included. That said, with 30 data points any t-stat above ~2 will be significant and the t-statistic on Bp seems to be 1.5*5.5 = 8.25 (square root of 30 = 5.5).

I haven't calculated cross-correlations between the variables, but would expect all 4 to have substantial correlations to one another (including negative correlations with Xg), because the changes in each variable are relatively contemporaneous (all of them tended to occur in the 90s... after the 80s and relatively constant after 90s).

Quote:

Originally Posted by barneyg

I'm not an econometrician (I don't play one on TV either), but I know that R-square doesn't mean squat when no intercept is included in the model. Was there one in yours?

I was trying to include an intercept [using LINEST in Excel with parameters (Y range, X ranges, false, true)], but the intercept calculated as zero. I'll have to check that again; perhaps the false should be a true (but I thought it said that parameter was B = 0, so I entered that as false).

Quote:

Originally Posted by barneyg

In models with stable time series, people often calculate first differences i.e. Y'(1985) = Y(1985) - Y(1984) and run the model on those changes instead of the level. It changes the narrative because 'last year' is the benchmark for each observation but since your objective is often 'what makes Y change' it works. There's a deeper econometric(al?) reason that justifies first differences as well.

I'll try to find time to look at the other thread too.

Yes, I thought about calculating differences first, as I did in an unrelated model. Do you think % (ratio) differences would be better than raw differences?

Please do look at my other post if you have time at some point. It's not a whole thread, just a post with questions in a stick thread titled "Ideas for Future Studies" on this (BTN) sub-forum. I think the model I was trying to create, or a similar, workable version of such, would be the best solution in the longer term, but obviously we need a model where coefficients will generate.

Using Regression to Study Factors which Influence Scoring for Top Tier Players

The variables mentioned shouldn't really be cross-correlated, except possibly to Xg (league GPG). The other potential correlations would seem to be more coincidental than anything.

All variables appear significant, with the lowest t-stat that for Xg at ~ 9 (N=36, Mg/SEg ~1.5).

There is a large cross-correlation between many of the variables:

Xn & Xe = 88% ... did NHL expand in response to Euro influx? I think this is largely coincidental.
Xn & Xg = (83%) ... did expansion make goal scoring decrease? This wouldn't be the expected observation IMO.
Xn & Xp = 45% ... I don't see a logical relationship between expansion and increased power plays, but it's possible.
Xe & Xg = (79%) ... did Euro influx cause goal scoring to decrease? Considering Euros were disproportionately scoring forwards, this seems odd, although talent compression tends to decrease scoring IMO.
Xe & Xp = 50% ... did Euro influx cause an increase in power plays? I don't see why, esp. as it contradicts other correlation(s).
Xg & Xp = (18%)... not much of a correlation, but why would power plays and goal scoring be negatively correlated?
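
The pairwise figures above are ordinary Pearson correlations between the regressor columns. A minimal sketch of how such a table could be produced; the three short columns are hypothetical placeholders, not the thread's data, and the helper names are illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def correlation_table(columns):
    """Map each unordered pair of column names to its correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

# Hypothetical season-level values, for illustration only.
columns = {
    "Xn": [21, 21, 24, 26, 30],
    "Xg": [7.4, 7.0, 6.5, 5.8, 5.3],
    "Xe": [0.10, 0.15, 0.30, 0.45, 0.50],
}
table = correlation_table(columns)
```

Any pair with a correlation near +1 or -1 is a candidate for the coefficient instability discussed later in the thread.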

BTW, in case I wasn't clear before, Y = avg. of top N players' adjusted points. This means Y is per 82 games, adjusted to 6.00 gpg league avg., and assist/goal ratio of 5/3.

My main concerns with this model are that it appears there may be important variables missing (given the low R^2) and that the variables are mostly cross-correlated. I don't think roster size is an issue during this period. What other variables may be missing? I think the cross-correlation of variables is largely coincidental, but can the variables be better defined to prevent this?

I think one factor that's not totally included in the variables is the quality of talent in the league. This is captured somewhat by % of non-Canadian forwards in top 1N (Xe), but doesn't tell you if a lot of great talent is in its peak/prime. For instance, the largest error between the predicted value and actual value is in 1996 when predicted is lower. One only has to look at the top 20+ players in scoring to see that season was full of quality forwards (although power plays increased substantially as well). Another possible variable is something for league parity (standard deviation of GF/GA ratio or GF only or GA only?).

Last edited by Czech Your Math: 11-16-2012 at 02:58 AM.

I ran another regression from 1968-2012, and this time included two variables for parity: Xf & Xa, which are the standard deviations for each of team GF & GA, divided by the mean of team GF & GA.
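
The parity measure described here is a coefficient of variation: the standard deviation of team totals divided by their mean. A minimal sketch, assuming the population standard deviation (the post does not say which version was used) and hypothetical team totals:

```python
from statistics import mean, pstdev

def parity(team_totals):
    """Std dev of team goal totals divided by their mean (coefficient of variation)."""
    return pstdev(team_totals) / mean(team_totals)

team_gf = [200, 220, 240, 260, 280]  # hypothetical goals-for totals for 5 teams
xf = parity(team_gf)                 # the same function applied to GA would give Xa
```

A lower value means teams are bunched more closely together, i.e. more parity.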

The new results are:

B = 82
Mn = (0.44)
Mf = 87
Ma = (18)
Me = 8.1
Mg = (.74)
Mp = 42

R^2 = .56

All variables appear significant, with Xa having the lowest t-score of ~4.8 (Ma/SEa = .73, N^.5 = 6.6).

I think this model holds a lot of promise, with R^2 > .5, all variables significant, and I thought this was interesting as well:

Standard deviation of Y (avg. adjusted points of top N players) was 3.86, and only 3/44 predicted values of Y varied from the actual value by more than this (the highest deviation was ~1.6 std dev). Each of those three predicted values was lower than the actual value, possibly in part due to some of the best players being in the league and having strong seasons (Orr, Espo, etc. in '72... Lemieux, Jagr, etc. in '96 & '97).

Some of the variables may improve with further refinement. It may be useful to define a variable that will somehow capture the effect of having so much top talent in the league, but I'm not exactly sure what the best and fairest way to do that might be. Any suggestions welcome.

For those not familiar with regression, this is what the model suggests at this stage (Y is avg. adjusted points of top N players, where N = number of teams in league):

For each additional team, Y decreases by .44
For each 1 % point increase in standard deviation of teams' GF, Y increases by .87
For each 1 % point increase in standard deviation of teams' GA, Y decreases by .18
For each 10 % point increase in % of non-Canadians in top N, Y increases by .81
For each .10 increase in league GPG, Y decreases by .07
For each 1 % point increase in special teams goals as % of total goals, Y increases by .42

Last edited by Czech Your Math: 11-16-2012 at 11:22 PM.

I added one more variable. This variable is intended to capture some of the effect of "offensive powerhouses". A player like Orr, Gretzky or Lemieux may elevate the point totals of teammate(s), which would increase Y. A lack of parity in the league may also aid this process, but much of this should be captured in the parity variables (Xf & Xa). The variable is Xt and is defined as follows: the GF for the top 0.2N teams are added (for 21 teams, the top 4 teams, plus .2 * 5th team) and divided by 0.2N. This number is divided by league avg. GF, and one is subtracted from the result (this is the ratio to avg. by which the "powerhouses" differed from avg.). This result is then divided by Xf to scale the result based on parity (I thought this should help separate the variable from Xf and prevent much of the potential overlapping).
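
The steps in that definition can be sketched as follows. The function name and sample totals are hypothetical, and the population standard deviation is assumed for Xf.

```python
from statistics import mean, pstdev

def powerhouse_index(team_gf, share=0.2):
    """Xt as described above: (avg GF of top 0.2N teams / league avg GF - 1) / Xf."""
    n = len(team_gf)
    k = share * n                     # e.g. 4.2 teams for a 21-team league
    ranked = sorted(team_gf, reverse=True)
    whole = int(k)
    top_sum = sum(ranked[:whole])
    if whole < n:
        top_sum += (k - whole) * ranked[whole]  # fractional share of the next team
    top_avg = top_sum / k
    league_avg = mean(team_gf)
    xf = pstdev(team_gf) / league_avg           # parity variable used for scaling
    return (top_avg / league_avg - 1) / xf

team_gf = [200, 220, 240, 260, 280]  # hypothetical 5-team league
xt = powerhouse_index(team_gf)
```

The division by Xf fails if every team has the same GF total, so real data would need a guard for that edge case.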

I ran another regression for '68-'12 and it increased R^2 to .583, with all variables again being significant. Values were as follows:

B = 78
Mn = (.24)
Mf = 91
Ma = (27)
Mg = (1.2)
Me = 4.3
Mp = 45
Mt = 2.7

Here are the value ranges for each variable from '68-'12:

Y: 82.6 to 99.7 (avg. 90)
Xn: 12 to 30 (avg. 23)
Xf: .078 to .21 (avg. .13)
Xa: .097 to .23 (avg. .14)
Xg: 5.14 to 8.03 (avg. 6.4)
Xe: 0 to .63 (avg. .27)
Xp: .214 to .385 (avg. .28)
Xt: .23 to 1.8 (avg. 1.3)
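
As a quick sanity check, the rounded coefficients can be applied to the rounded averages listed above. A least-squares fit with an intercept should predict roughly the average Y at the average X values. The dictionary layout and function name are just one illustrative way to organize the numbers.

```python
# Rounded coefficients from the 7-variable model (parentheses in the post mean negative).
intercept = 78
coef = {"Xn": -0.24, "Xf": 91, "Xa": -27, "Xg": -1.2, "Xe": 4.3, "Xp": 45, "Xt": 2.7}

# Rounded '68-'12 averages for each variable, as listed above.
avg = {"Xn": 23, "Xf": 0.13, "Xa": 0.14, "Xg": 6.4, "Xe": 0.27, "Xp": 0.28, "Xt": 1.3}

def predict(x):
    """Predicted Y for one season's variable values."""
    return intercept + sum(coef[name] * x[name] for name in coef)

y_at_avg = predict(avg)  # about 90.1, close to the average Y of 90
```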

The average of the absolute error between predicted and actual was 2.11 (2.14 in 6 variable model). This compared favorably to the standard deviation of 3.86 in actual Y values.

If the variances (square of standard deviations) are used for Xf & Xa, this actually increases the R^2 to .588, decreases the avg. absolute error to 2.08, and all variables remain significant.

I still believe one of the most important missing "variables" is the presence or absence of certain great players at different times. For instance, how does one measure the fact that Ovechkin, Malkin, Crosby, and Thornton were all mostly healthy and at/near the top of their game in 2008... but in 2011, these players were mostly injured and/or off their games? It seems that incorporating discrete variables is more difficult than I initially thought, so I'm not sure how to measure this aspect of each season.

Last edited by Czech Your Math: 11-19-2012 at 03:18 PM.
Reason: 0.2N mistakenly written as 2N

I am majoring in Business Information Statistics at university. It's good to finally see multiple regression formulas being used for the greater good. I hope to learn enough from you to become more understanding and fluent in my replies. Keep it up dude, pretty interesting stuff.

I cannot get coefficients to generate for a particular model using LINEST. I have successfully used LINEST for other models, so it's likely a problem with the model (and its variables). The model is as follows:

one group of X variables are discrete variables for season (let's say there are 32 seasons, so it's either 0 or 1 for each possible season... it will only have a value of 1 for 1/32 of those for each player-season)

the next group of X variables are discrete variables for the player's age (if we use age ranges of 18-40, then the value is either 0 or 1 for each of 23 possible ages... and again it will only have a value of 1 for 1/23 of those for each player-season)

the next group of X variables are discrete and are for the player himself (the value is either 0 or 1 for each of the Q players in the study... and again will only have a value of 1 for 1/Q of those for each player-season)

there are possible variables that I would like to add, but if I do, will wait until I am able to successfully generate coefficients for the model as it already stands

I stopped at ~a dozen players with a total of ~170 player-seasons. I thought that since the degrees of freedom are df = N - k - 1 = 170 - (32 + 23 + 12) - 1 = 170 - 67 - 1 = 102, coefficients should generate, but I'm obviously missing something, and my linear regression knowledge is relatively basic and quite rusty.

Can anyone tell me why the coefficients won't generate? I don't want to put substantial time into this if it's not going to work. Any help would be appreciated.

where X(s1980)...X(s2011) are season dummy variables (also called indicator variables), X(a18)...X(a40) are player age dummies, and X(p1)...X(p12) are player (name) dummies. I'm not sure why you want those player dummies in there (p1..p12).

But to get back to your question, it's not a question of degrees of freedom. The regressors in your model must be linearly independent, and they aren't. For example, right now for every player you have X(s2011) = 1 - X(s1980) - X(s1981) - .... - X(s2010) i.e. the sum of those 32 dummy variables is 1... same thing for the other 2 types, the sum of all variables of the same type for a given player is 1.

A simple solution is to drop one of the dummies for each type, ie. drop X(s1980), X(a18), and X(p1). You will still get an error for some age coefficients if your sample doesn't have anyone playing up to age 40 or as early as 18 but the rest of the model should work.
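
The dependency and the fix can be checked on a toy design matrix. The sketch below assumes an intercept column is present and uses a small, fully crossed set of hypothetical seasons, ages and players; real player-seasons are not fully crossed, so this only illustrates the point about the dummies summing to 1. All function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def matrix_rank(rows):
    """Rank via exact Gaussian elimination (Fractions avoid rounding issues)."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            if m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def one_hot(index, size, drop_first=False):
    """0/1 dummy vector; optionally omit the first category as the baseline."""
    vec = [1 if i == index else 0 for i in range(size)]
    return vec[1:] if drop_first else vec

def design(n_seasons, n_ages, n_players, drop_first):
    """Intercept plus season, age and player dummies for every combination."""
    return [[1]
            + one_hot(s, n_seasons, drop_first)
            + one_hot(a, n_ages, drop_first)
            + one_hot(p, n_players, drop_first)
            for s, a, p in product(range(n_seasons), range(n_ages), range(n_players))]

full = design(3, 3, 2, drop_first=False)    # 9 columns, but rank is only 6
reduced = design(3, 3, 2, drop_first=True)  # 6 columns, rank 6 (full column rank)
```

With all dummies kept, three columns are redundant (one per category), so no unique set of coefficients exists. Dropping one dummy per category leaves the same fitted values, with each remaining coefficient measured relative to the dropped baseline.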

I haven't calculated cross-correlations between the variables, but would expect all 4 to have substantial correlations to one another (including negative correlations with Xg), because the changes in each variable are relatively contemporaneous (all of them tended to occur in the 90s... after the 80s and relatively constant after 90s).

The main reason it may be important is that if the correlation between 2 regressors is very high, the coefficient on those 2 variables will be very sensitive to the use of the other variables included in the regression. In English: when 2 variables are closely related, the associations suggested by OLS could be strange or at least misleading.

Quote:

Originally Posted by Czech Your Math

I was trying to include an intercept [using LINEST in Excel with parameters (Y range, X ranges, false, true)], but the intercept calculated as zero. I'll have to check that again; perhaps the false should be a true (but I thought it said that parameter was B = 0, so I entered that as false).

I'm not familiar with LINEST but that 3rd argument should be TRUE if you want to look at R-square.

Quote:

Originally Posted by Czech Your Math

Yes, I thought about calculating differences first, as I did in an unrelated model. Do you think % (ratio) differences would be better than raw differences?

I don't have a good sense of which one is better (raw differences vs % differences). I think you want a Y that can easily be interpreted: do you want "change in adjusted scoring for the top players" or "% change in adjusted scoring for the top players"? Seems to me like raw differences might be better here.

I added one more variable. This variable is intended to capture some of the effect of "offensive powerhouses". A player like Orr, Gretzky or Lemieux may elevate the point totals of teammate(s), which would increase Y. A lack of parity in the league may also aid this process, but much of this should be captured in the parity variables (Xf & Xa). The variable is Xt and is defined as follows: the GF for the top 2N teams are added (for 21 teams, the top 4 teams, plus .2 * 5th team) and divided by 2N. This number is divided by league avg. GF, and one is subtracted from the result (this is the ratio to avg. by which the "powerhouses" differed from avg.). This result is then divided by Xf to scale the result based on parity (I thought this should help separate the variable from Xf and prevent much of the potential overlapping).

I'm not sure how to interpret those 3 extra variables. The R-square doesn't increase all that much when you add Xt but the coefficients on the main variables are all over the place. What's your rationale for a link between the concepts "parity" and "top player scoring"? Define that hypothesis, and then pick a construct that you think represents it best -- 3 variables makes it too tough to interpret IMO.

As for your Xt variable definition, I'm assuming you mean "the GF for the top N/5 teams"? "top 2N" would mean the top 42 teams for a 21-team league, which doesn't make sense.

Quote:

Originally Posted by Czech Your Math

I still believe one of the most important missing "variables" is the presence or absence of certain great players at different times. For instance, how does one measure the fact that Ovechkin, Malkin, Crosby, and Thornton were all mostly healthy and at/near the top of their game in 2008... but in 2011, these players were mostly injured and/or off their games? It seems that incorporating discrete variables is more difficult than I initially thought, so I'm not sure how to measure this aspect of each season.

How about a variable for % of games missed by the people that were included in Y the previous year? or % of games missed by the top x% of people with the highest points/game?

Quote:

Originally Posted by Czech Your Math

There is a large cross-correlation between many of the variables:

Xn & Xe = 88% ... did NHL expand in response to Euro influx? I think this is largely coincidental.
Xn & Xg = (83%) ... did expansion make goal scoring decrease? This wouldn't be the expected observation IMO.
Xn & Xp = 45% ... I don't see a logical relationship between expansion and increased power plays, but it's possible.
Xe & Xg = (79%) ... did Euro influx cause goal scoring to decrease? Considering Euros were disproportionately scoring forwards, this seems odd, although talent compression tends to decrease scoring IMO.
Xe & Xp = 50% ... did Euro influx cause an increase in power plays? I don't see why, esp. as it contradicts other correlation(s).
Xg & Xp = (18%)... not much of a correlation, but why would power plays and goal scoring be negatively correlated?

BTW, in case I wasn't clear before, Y = avg. of top N players' adjusted points. This means Y is per 82 games, adjusted to 6.00 gpg league avg., and assist/goal ratio of 5/3.

My main concerns with this model are that it appears there may be important variables missing (given the low R^2) and that the variables are mostly cross-correlated. I don't think roster size is an issue during this period. What other variables may be missing? I think the cross-correlation of variables is largely coincidental, but can the variables be better defined to prevent this?

The 'source' of the correlation ('coincidental' or otherwise) doesn't really matter -- as I previously wrote, multicollinearity makes the coefficients unstable and sensitive to the variables you include next: look at what happened with Mn, Mg and Me when you added Xt as a 7th regressor.

Let me suggest: (-> means leads to)
90s expansion -> the trap -> decreased scoring
and
can't trap on special teams -> increased % of PP scoring (vs total scoring)

If you don't like "the trap" you can substitute with "video analysis" or something. That would explain all correlations except those involving Euros. Then,

salary increases (late 80s) -> Euro influx
fall of the Iron Curtain (89-91) -> Euro influx

You can't really operationalize all this stuff into regressions but if you used data points from 1960 onward (therefore including the first expansions) all those correlations would decrease significantly (I would assume early-mid 70s expansion and the addition of the WHA teams led to increased scoring in the NHL, which would reverse that negative Xn/Xg relationship).

where X(s1980)...X(s2011) are season dummy variables (also called indicator variables), X(a18)...X(a40) are player age dummies, and X(p1)...X(p12) are player (name) dummies. I'm not sure why you want those player dummies in there (p1..p12).

Yes, you understand that model, which I was trying to build.

The reason for the p (player) dummy variables was to filter out the effect of the player's general value/skill, in order to produce accurate coefficients for the other variables (age, season).

Each observation of Y is one player's season. So '80 Gretzky would be: 117 = b1*X(s1980) + b34*X(a19) + b56*X(p1), where p1=Gretzky. I was hoping to simultaneously measure the effects of age and different seasons on top players' scoring, so that it could be applied to any top player in any season.

Quote:

Originally Posted by barneyg

But to get back to your question, it's not a question of degrees of freedom. The regressors in your model must be linearly independent, and they aren't. For example, right now for every player you have X(s2011) = 1 - X(s1980) - X(s1981) - .... - X(s2010) i.e. the sum of those 32 dummy variables is 1... same thing for the other 2 types, the sum of all variables of the same type for a given player is 1.

A simple solution is to drop one of the dummies for each type, ie. drop X(s1980), X(a18), and X(p1). You will still get an error for some age coefficients if your sample doesn't have anyone playing up to age 40 or as early as 18 but the rest of the model should work.

Thanks, this is very insightful. You are right, the variables within each category are not independent, since only one dummy variable in each category will have value = 1, and the rest 0, and so always sum to 1.

If I understand you correctly, you suggest that eliminating one dummy variable in each category will solve that problem, but I don't see how that would change the fact that the variables within each category are not independent. BTW, in the simplified (small scale) model, I still had 3+ observations for each dummy variable. IOW, there were at least 3 observations at each age, at least 3 for each season, and at least 3 for each player.

It sounds to me like this type of model just isn't possible.

The main reason it may be important is that if the correlation between 2 regressors is very high, the coefficient on those 2 variables will be very sensitive to the use of the other variables included in the regression. In English: when 2 variables are closely related, the associations suggested by OLS could be strange or at least misleading.

I can see why that could be important. However, many of the high cross-correlations seem to have been in large part due to contemporaneous changes in the trends of the variables which don't have much logical basis. IOW, in many cases I would have expected much less correlation or even correlation in the opposite direction.

Quote:

Originally Posted by barneyg

I don't have a good sense of which one is better (raw differences vs % differences). I think you want a Y that can easily be interpreted, do you want "change in adjusted scoring for the top players" or "% change in adjusted scoring for the top players". Seems to me like raw differences might be better here.

I think either one (raw or %) could be interpreted properly. In general, I think the raw difference model may be a bit easier to work with and may be more logical when looking at non-consecutive seasons.

I'm not sure how to interpret those 3 extra variables. The R-square doesn't increase all that much when you add Xt but the coefficients on the main variables are all over the place. What's your rationale for a link between the concepts "parity" and "top player scoring"? Define that hypothesis, and then pick a construct that you think represents it best -- 3 variables makes it too tough to interpret IMO.

I think Xf & Xa (parity of team GF & GA) are important, but I agree that Xt doesn't seem a crucial variable.

My main hypothesis, regarding parity in this model, is that the more team parity there is, the more difficult it is generally for the top tier of players to exceed the average in % terms. IOW, there's not as much running up the score with meaningless points when teams are closely bunched together in terms of ability.

With Xt, I was trying to capture the "teammate/linemate" effect in an objective manner, as well as how easy it was for teams to become offensive juggernauts (even after factoring out general parity). How does one factor in the fact that Gretzky or Lemieux, e.g., may have significantly elevated one or more teammates' point totals? Also, how does one factor in the presence of phenoms such as Gretzky and Lemieux in the league? I'm just not sure how to capture these effects objectively. Eliminating such "outliers" from the Y population is changing the Y population subjectively... and where do you draw the line and stop eliminating "outliers"?... and what about their teammates? This is one reason I was trying to build the model with the dummy variables, because it would bypass this problem.

Quote:

Originally Posted by barneyg

As for your Xt variable definition, I'm assuming you mean "the GF for the top N/5 teams"? "top 2N" would mean the top 42 teams for a 21-team league, doesn't make sense..

My mistake, 2N should have been listed as 0.2N.

Quote:

Originally Posted by barneyg

How about a variable for % of games missed by the people that were included in Y the previous year? or % of games missed by the top x% of people with the highest points/game?

Those are interesting ideas. The % of games missed by players in the previous season's "Y" would take a lot of time to compile. The % of games missed by the top x% of players seems more practical, although I would probably need to set a minimum number of games. I think it will probably be marginally useful (like Xt) or insignificant, but it's still worth using a practical variable for injuries and seeing if it adds to the model's significance.

My hunch is that the "Gretzky, Lemieux & friends" effect is going to be much larger than the injury factor. It's really difficult to measure why Ovechkin went from top of the league to just another very good scorer (and the effects on Backstrom, Semin, etc.). He wasn't injured, he's still in his prime. I don't think it's easy to objectively measure that.

Quote:

Originally Posted by barneyg

The 'source' of the correlation ('coincidental' or otherwise) doesn't really matter -- as I previously wrote, multicollinearity makes the coefficients unstable and sensitive to the variables you include next: look at what happened with Mn, Mg and Me when you added Xt as a 7th regressor.

You're right, but it seems unavoidable, unless some of those variables are re-defined.

Quote:

Originally Posted by barneyg

Let me suggest: (-> means leads to)
90s expansion -> the trap -> decreased scoring
and
can't trap on special teams -> increased % of PP scoring (vs total scoring)

If you don't like "the trap" you can substitute with "video analysis" or something. That would explain all correlations except those involving Euros. Then,

salary increases (late 80s) -> Euro influx
fall of the Iron Curtain (89-91) -> Euro influx

You can't really operationalize all this stuff into regressions but if you used data points from 1960 onward (therefore including the first expansions) all those correlations would decrease significantly (I would assume early-mid 70s expansion and the addition of the WHA teams led to increased scoring in the NHL, which would reverse that negative Xn/Xg relationship).

I can see how "the trap" could possibly, by itself, lead to increased PP goals as % of total goals. However, PP opportunities (PPO) were generally substantially lower until the mid-late 80s than they were after that point. Maybe I should define Xp as PPO/game instead of PPG/Total Goals?

Salaries may have given incentive for even more Euros to join the NHL, but the fall of the Iron Curtain seems the main reason, as it allowed into the NHL a large pool of players that were previously prevented from playing there.

The most recent regressions were run from '68 to '12, but PP data goes back to '64, so I will go back that far when I next run the model.

The problem is that a lot of these changes happened in a short time: increased PPOs starting in the late 80s... salaries are ever-increasing... fall of Iron Curtain in early 90s... expansion beginning in early 90s and continuing during decade... large Euro/Russian influx in early & mid 90s... scoring decreasing in mid-90s... increased parity from the mid-90s... increased use of defensive systems and better/larger goalie equipment, etc.

Some of these would be expected to show substantial correlation and do. Some would be expected, but show a much smaller correlation or even one opposite in direction to that expected. Some wouldn't be expected to have a large correlation, yet do. It's not easy to objectively determine how these changes are influencing each other, even if we agree that the effect on the model is to make it more sensitive to additional variables.

Thanks for your insight on the model and its related components. It appears I'm a bit over my head here, which is especially difficult when tackling this as a solo project, so your help is much appreciated.

Last edited by Czech Your Math: 11-19-2012 at 03:22 PM.

Thanks, this is very insightful. You are right, the variables within each category are not independent, since only one dummy variable in each category will have value = 1, and the rest 0, and so always sum to 1.

If I understand you correctly, you suggest that eliminating one dummy variable in each category will solve that problem, but I don't see how that would change the fact that the variables within each category are not independent. BTW, in the simplified (small scale) model, I still had 3+ observations for each dummy variable. IOW, there were at least 3 observations at each age, at least 3 for each season, and at least 3 for each player.

It sounds to me like this type of model just isn't possible.

It's definitely possible -- the only reason you aren't getting a solution is that you fell into the dummy variable trap. If you remove the 2011 dummy, none of the remaining dummies can be expressed as a linear combination of the others because it's no longer true that the sum of all dummies is equal to 1 for each observation (as the sum will be 0 for the 2011 observations).

"Linear independence" is quite a bit more permissive than "independence" as usually defined in probability theory. You can have heavily correlated variables -- this creates other problems, but they will still be considered linearly independent as long as one of those variables isn't completely determined by a linear combination of the others.
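To make the dummy variable trap concrete, here is a minimal sketch in Python (NumPy assumed). The toy data, the `dummies` helper and the variable names are all hypothetical and are not the thread's actual dataset; only season and player dummies are used to keep it small. It compares the rank of the design matrix with the number of columns, before and after dropping one dummy from each category.

```python
import numpy as np

def dummies(labels):
    """Build one 0/1 column per distinct label, in sorted label order."""
    levels = sorted(set(labels))
    return np.array([[1.0 if lab == lev else 0.0 for lev in levels]
                     for lab in labels])

# Hypothetical toy data: 3 players who each appear in 3 seasons (9 player-seasons).
seasons = [1980, 1981, 1982] * 3
players = ["p1"] * 3 + ["p2"] * 3 + ["p3"] * 3

S = dummies(seasons)   # 9 x 3 season dummies
P = dummies(players)   # 9 x 3 player dummies

# Full set: the season dummies sum to 1 in every row and so do the player
# dummies, so the 6 columns are linearly dependent.
X_full = np.hstack([S, P])
rank_full = np.linalg.matrix_rank(X_full)

# Drop one dummy from each category (here, the first column of each).
X_reduced = np.hstack([S[:, 1:], P[:, 1:]])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # number of columns vs. rank, full set
print(X_reduced.shape[1], rank_reduced)  # number of columns vs. rank, after dropping
```

When the rank is lower than the number of columns, OLS has no unique solution; when the two match, the coefficients can be estimated, even if the remaining columns are still correlated with each other.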

I can see why that could be important. However, many of the high cross-correlations seem to have been in large part due to contemporaneous changes in the trends of the variables which don't have much logical basis. IOW, in many cases I would have expected much less correlation or even correlation in the opposite direction.

As I wrote in the post that followed, it doesn't really matter whether those correlations make intuitive sense -- high correlations in the regressors can mess up the results big time.

That said, I would assume that going all the way back to 1964 is going to mitigate some of your problem because you'll be adding years where expansion led to increased scoring (reducing the correlation between Xn and Xg), while keeping the % of Euros relatively constant (reducing correlation of Xe with everything else).
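One common way to put a number on "high correlations in the regressors" is the variance inflation factor (VIF): regress each regressor on all the others and compute 1 / (1 - R^2). The sketch below is only an illustration in Python with NumPy. The data are synthetic (not Xn, Xg, Xe, etc.), and the `vif` helper is a hypothetical name written for this example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n x k).

    Each column is regressed on the remaining columns plus an intercept.
    """
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic regressors: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
print(np.corrcoef(X, rowvar=False).round(2))  # pairwise correlations
print(vifs.round(1))                          # one VIF per regressor
```

A large VIF for a variable means its coefficient has a large standard error and is likely to move around when other variables are added or removed, regardless of whether the correlation behind it makes intuitive sense.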

It's definitely possible -- the only reason you aren't getting a solution is that you fell into the dummy variable trap. If you remove the 2011 dummy, none of the remaining dummies can be expressed as a linear combination of the others because it's no longer true that the sum of all dummies is equal to 1 for each observation (as the sum will be 0 for the 2011 observations).

"Linear independence" is quite a bit more permissive than "independence" as usually defined in probability theory. You can have heavily correlated variables -- this creates other problems, but they will still be considered linearly independent as long as one of those variables isn't completely determined by a linear combination of the others.

I removed one dummy variable from each category, but the coefficients still won't generate. Perhaps I misunderstood your suggestion (I removed X(s1980), X(a40) and X(p12)... p12 was the dummy variable for Ray Bourque).

Quote:

Originally Posted by barneyg

As I wrote in the post that followed, it doesn't really matter whether those correlations make intuitive sense -- high correlations in the regressors can mess up the results big time.

That said, I would assume that going all the way back to 1964 is going to mitigate some of your problem because you'll be adding years where expansion led to increased scoring (reducing the correlation between Xn and Xg), while keeping the % of Euros relatively constant (reducing correlation of Xe with everything else).

I'm not sure how to bypass the problem of high correlation between many of the variables.

I substituted PPO/game for PPG as % of total goals. This didn't seem to affect the model too much (R^2 rose slightly to .589). I think this is better, since it prevents a potential correlation between decreased scoring and PPG as % of goals (as you pointed out), for whatever reason.

I also ran the regression back to 1964, the first year for which I have data for power plays. This decreased R^2 to .48, which is a substantial reduction. I think the problem is that with only 6 teams, Y is based on the average of only 6 players, and other variables are based on calculations for only 6 teams. It may be best to use a model for '68-present (or perhaps an even shorter span if it's found to be much more reliable), and a separate model for the O6. For now, I will concentrate on '68-present, since there is more/better data available for those seasons, and there are more interesting and reliable changes in the variables.

With Xt, I was trying to capture the "teammate/linemate" effect in an objective manner, as well as how easy it was for teams to become offensive juggernauts (even after factoring out general parity). How does one factor in the fact that Gretzky or Lemieux, e.g., may have significantly elevated one or more teammates' point totals? Also, how does one factor in the presence of phenoms such as Gretzky and Lemieux in the league? I'm just not sure how to capture these effects objectively. Eliminating such "outliers" from the Y population is changing the Y population subjectively... and where do you draw the line and stop eliminating "outliers"?... and what about their teammates? This is one reason I was trying to build the model with the dummy variables, because it would bypass this problem.

My hunch is that the "Gretzky, Lemieux & friends" effect is going to be much larger than the injury factor. It's really difficult to measure why Ovechkin went from top of the league to just another very good scorer (and the effects on Backstrom, Semin, etc.). He wasn't injured, he's still in his prime. I don't think it's easy to objectively measure that.

Your math problem (Gretzky's effect on Anderson, Lemieux's effect on Rob Brown..) is the anti-stats people complaint in a nutshell -- "hockey's a team game, so individual stats have to be flawed and you can't predict them". My personal opinion is exactly the same if you replace "you can't" with "it's tough to".

Not sure your model with the dummy variables would achieve your objective per se -- the Gretzky dummy would only be equal to 1 for observations pertaining to Gretzky... Coffey would only have the Coffey dummy.. Kurri, the Kurri dummy... if Anderson had been free-riding on all those guys you still wouldn't catch it.

You're basically looking for a "quality of teammates" measure.. instead of going all the way into a different model with thousands of observations (every player, every year), couldn't you come up with a measure of how "concentrated" Y is? i.e. Y is adjusted scoring of top N players, what is the % of those N players that play on the top X teams? i.e. if in 1983-84 you have 5 Oilers and 3 Bruins out of the top 21 players, that's a "top 2 team" concentration of 8/21 = 38%. Or maybe some measure of distance between the very top player (Gretz 205) and the best player not on the Oilers (Goulet 121)..
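The concentration measure described above could be computed along these lines. This is a rough sketch in Python, and it assumes "top X teams" means the X teams with the most players among the top N scorers (it could instead be defined by standings or goals for). The function name and the placeholder team labels are hypothetical; only the 5 Oilers / 3 Bruins / 21 players counts come from the example.

```python
from collections import Counter

def top_team_concentration(player_teams, top_x):
    """Share of the top-N scorers who play for the top_x most-represented teams."""
    counts = Counter(player_teams)
    top_counts = [c for _, c in counts.most_common(top_x)]
    return sum(top_counts) / len(player_teams)

# Counts from the example: 5 Oilers and 3 Bruins among the top 21 scorers.
# The other 13 entries are placeholders, one per hypothetical team.
teams = ["EDM"] * 5 + ["BOS"] * 3 + [f"other{i}" for i in range(13)]

share = top_team_concentration(teams, top_x=2)
print(round(share, 2))  # 8/21, i.e. about 0.38
```

Computed once per season, this gives a single extra regressor rather than three, which should make it easier to interpret than a set of parity variables.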

Quote:

Originally Posted by Czech Your Math

The problem is that a lot of these changes happened in a short time: increased PPOs starting in the late 80s... salaries are ever-increasing... fall of Iron Curtain in early 90s... expansion beginning in early 90s and continuing during decade... large Euro/Russian influx in early & mid 90s... scoring decreasing in mid-90s... increased parity from the mid-90s... increased use of defensive systems and better/larger goalie equipment, etc.

Some of these would be expected to show substantial correlation and do. Some would be expected, but show a much smaller correlation or even one opposite in direction to that expected. Some wouldn't be expected to have a large correlation, yet do. It's not easy to objectively determine how these changes are influencing each other, even if we agree that the effect on the model is to make it more sensitive to additional variables.

I would rephrase the bolded as: I don't really care how these changes influenced each other, all I want is to find a way to make the model more stable with respect to the implied relationship between each of those variables and adjusted scoring.

I think going back in time mitigates this problem. If that means dropping PPG or PPO because of data availability, so be it -- you can always compare the other coefficients for (say) 1950-2010 with the model for 1964-2010 which will include it.

R^2 went from 0.32 to 0.56, is it due to the addition of 1968-1976 or those parity indicators? I would assume both.

I removed one dummy variable from each category, but the coefficients still won't generate. Perhaps I misunderstood your suggestion (I removed X(s1980), X(a40) and X(p12)... p12 was the dummy variable for Ray Bourque).

Make sure you have no single dummy (ex: X(a39)) for which all values are exactly 0 (or 1, though that shouldn't be the case if you coded the data correctly.. the intercept should be the only variable equal to 1 for all observations).
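A quick way to run that check is sketched below in Python with NumPy. The tiny design matrix and the `degenerate_columns` helper are hypothetical and only meant to show the idea: flag any dummy that is 0 (or 1) for every observation, then compare the matrix rank with the number of columns.

```python
import numpy as np

def degenerate_columns(X, names):
    """Return names of columns that equal 0 for every row, or 1 for every row."""
    X = np.asarray(X, dtype=float)
    all_zero = np.all(X == 0.0, axis=0)
    all_one = np.all(X == 1.0, axis=0)
    return [name for name, z, o in zip(names, all_zero, all_one) if z or o]

# Hypothetical 4-observation design: nobody in the sample is aged 39.
names = ["a38", "a39", "a40"]
X = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
])

bad = degenerate_columns(X, names)
print(bad)                                   # dummies to drop before fitting
print(np.linalg.matrix_rank(X), X.shape[1])  # rank vs. number of columns
```

If the rank is still below the column count after the flagged dummies are removed, some other exact linear relationship among the remaining columns is still present.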

Quote:

Originally Posted by Czech Your Math

I'm not sure how to bypass the problem of high correlation between many of the variables.

I substituted PPO/game for PPG as % of total goals. This didn't seem to affect the model too much (R^2 rose slightly to .589). I think this is better, since it prevents a potential correlation between decreased scoring and PPG as % of goals (as you pointed out), for whatever reason.

I also ran the regression back to 1964, the first year for which I have data for power plays. This decreased R^2 to .48, which is a substantial reduction. I think the problem is that with only 6 teams, Y is based on the average of only 6 players, and other variables are based on calculations for only 6 teams. It may be best to use a model for '68-present (or perhaps an even shorter span if it's found to be much more reliable), and a separate model for the O6. For now, I will concentrate on '68-present, since there is more/better data available for those seasons, and there are more interesting and reliable changes in the variables.

Unless you're strictly in the prediction business, it's no big deal if you get a lower R^2 as long as you get more consistent estimates (i.e. reduced standard errors on the coefficients).
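For reference, here is a minimal sketch of how the standard errors on the coefficients can be obtained alongside R^2, so both can be compared across model versions. It is written in Python with NumPy; the data are synthetic, the sample size is arbitrary, and `ols_summary` is a hypothetical helper, not the software actually used in this thread.

```python
import numpy as np

def ols_summary(X, y):
    """OLS with an intercept: returns coefficients, standard errors and R^2."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = n - A.shape[1]                      # residual degrees of freedom
    sigma2 = resid @ resid / dof              # estimated error variance
    cov = sigma2 * np.linalg.inv(A.T @ A)     # coefficient covariance matrix
    se = np.sqrt(np.diag(cov))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coef, se, r2

# Synthetic data with known coefficients (intercept 90, slopes 3 and -2).
rng = np.random.default_rng(1)
n = 45  # arbitrary number of season-level observations
x = rng.normal(size=(n, 2))
y = 90.0 + 3.0 * x[:, 0] - 2.0 * x[:, 1] + rng.normal(scale=2.0, size=n)

coef, se, r2 = ols_summary(x, y)
print(coef.round(2), se.round(2), round(r2, 3))
```

Running this on two versions of a model (say, a shorter and a longer span of seasons) and comparing `se` for the shared variables shows whether the estimates became more or less precise, independently of what happened to R^2.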

Are correlations still as high with the 1968-2012 data?

Last edited by barneyg: 11-19-2012 at 07:02 PM.
Reason: rephrased the first sentence (double negative)

Your math problem (Gretzky's effect on Anderson, Lemieux's effect on Rob Brown..) is the anti-stats people complaint in a nutshell -- "hockey's a team game, so individual stats have to be flawed and you can't predict them". My personal opinion is exactly the same if you replace "you can't" with "it's tough to".

I'm not really trying to predict future values of Y, although I would expect a good model to have good predictive value (once the values for the variables are input). I'm more interested in trying to level the ground for comparison. From what I've seen, the production of a Gretzky or Lemieux is usually not heavily influenced by the linemate/teammate effect, while the production of their teammates often is heavily influenced by playing with such great players.

Basically, I'm trying to remove the "Gretzky/Lemieux/Orr/? had a great year and some other players came along for the ride" effect in an objective manner, since it could substantially influence observed Y.

Quote:

Originally Posted by barneyg

Not sure your model with the dummy variables would achieve your objective per se -- the Gretzky dummy would only be equal to 1 for observations pertaining to Gretzky... Coffey would only have the Coffey dummy.. Kurri, the Kurri dummy... if Anderson had been free-riding on all those guys you still wouldn't catch it.

In the model with dummy variables, I'm not trying to capture any teammate effects (at least not at this point). The dummy variables for each player were to filter out the effect of that player's quality on the observed Y (individual's points that season), so that Gretzky's points at age A in Season S wouldn't be expected to be the same as a "normal" first line player at age A in season S.

Quote:

Originally Posted by barneyg

You're basically looking for a "quality of teammates" measure.. instead of going all the way into a different model with thousands of observations (every player, every year), couldn't you come up with a measure of how "concentrated" Y is? i.e. Y is adjusted scoring of top N players, what is the % of those N players that play on the top X teams? i.e. if in 1983-84 you have 5 Oilers and 3 Bruins out of the top 21 players, that's a "top 2 team" concentration of 8/21 = 38%. Or maybe some measure of distance between the very top player (Gretz 205) and the best player not on the Oilers (Goulet 121)..

I think this is a very good idea, and exactly the kind of input I was hoping for to improve the model in an objective manner.

It appears you may have misunderstood the other potential model with dummy variables. I was planning to (once workable) include a large number of observations of Y (player-season), but still limit it to top tier players, since it's the effects on those players that we're generally most interested in (i.e. those are the players most frequently compared). Players like Rob Brown wouldn't be in that model. Players like Kurri, Coffey and possibly Anderson would be, but the Y values for each of those players often had similar "teammate effects" in multiple seasons, so I expected much of that to be captured in the dummy variable for that individual player. Again, my main objective was to use the coefficient for each season, and probably age, as the basis for further analysis. The individual player coefficients would be interesting, but the purpose of their presence would be primarily to increase the accuracy of the coefficients for variables in the other categories.

Quote:

Originally Posted by barneyg

I would rephrase the bolded as: I don't really care how these changes influenced each other, all I want is to find a way to make the model more stable with respect to the implied relationship between each of those variables and adjusted scoring.

I think going back in time mitigates this problem. If that means dropping PPG or PPO because of data availability, so be it -- you can always compare the other coefficients for (say) 1950-2010 with the model for 1964-2010 which will include it.

My goal was to create the most accurate, objective model possible, given the limits of the data. It seems there may be a trade-off between a model that explains more of the variability in Y (measured by R^2) and one that explains less of that variability but gives us more confidence in the accuracy of what it does explain. It seems you believe more accuracy is better than a larger degree of explanation, but is this generally accepted as preferable or more of a gray area?

Quote:

Originally Posted by barneyg

R^2 went from 0.32 to 0.56, is it due to the addition of 1968-1976 or those parity indicators? I would assume both.

Note: B0 is y-intercept, values given are (coefficient value, T-score), where T-score = (coeff. value / standard error) * N^.5

It seems to me that going back to 1968 substantially increases our confidence in the variables, while using 2-3 additional variables substantially increases how much of the variation in Y is explained by the variables. Doing both seems to give us the highest value of R^2, while still giving us a relatively high degree of confidence in our variables. Xe is more borderline, but one reason may be that it needs to be refined a bit (it uses non-Canadian scoring forwards in the top N scoring forwards... "forwards" needs to be replaced with "scorers", or Y needs to be changed to "forwards"... this was done due to current availability of data, and I don't want to spend extra time on such calculations until the model is complete or nearly so).

I am majoring in Business Information Statistics at university. It's good to finally see multiple regression formulas being used for the greater good. I hope to learn enough from you to become more understanding and fluent in my replies. Keep it up dude, pretty interesting stuff

Thanks for the encouragement. I'm a bit over my head here, but I find it interesting and it will help me learn more about both hockey and regression models, as well as refreshing what little knowledge I already had. Fortunately, barney has been very helpful in trying to keep me on the right path.

I believe this is a good approach for studying various issues, but it seems to rarely be used for such... and I'm starting to understand why.

Xn & Xe are generally the most correlated variables. The largest correlations not involving either of these variables are bolded above. Running the model from 1968-2012 without those two variables generates the following results: