I added one more variable, intended to capture some of the effect of "offensive powerhouses". A player like Orr, Gretzky or Lemieux may elevate the point totals of his teammates, which would increase Y. A lack of parity in the league may also contribute, but much of that should already be captured by the parity variables (Xf & Xa). The new variable is Xt, defined as follows: sum the GF of the top 0.2N teams (for 21 teams, that is the top 4 teams plus 0.2 * the 5th team's GF) and divide by 0.2N to get the average GF of the "powerhouse" teams. Divide this by the league avg. GF and subtract one; the result is the fraction by which the "powerhouses" exceeded the league avg. Finally, divide by Xf to scale the result for parity (I thought this should help separate the variable from Xf and prevent much of the potential overlap).
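A minimal sketch of the Xt calculation as I read the description above. The function name `powerhouse_xt` and its arguments are made up for illustration, and Xf is simply passed in rather than computed, since its definition isn't repeated in this post.

```python
import math

def powerhouse_xt(goals_for, xf, share=0.2):
    """Xt for one season: goals_for is a list of team GF, xf is that season's Xf."""
    n = len(goals_for)
    k = share * n                      # e.g. 4.2 "teams" when N = 21
    ranked = sorted(goals_for, reverse=True)
    whole = math.floor(k)              # full teams counted (4 for N = 21)
    frac = k - whole                   # partial weight on the next team (0.2)
    top_sum = sum(ranked[:whole])
    if frac > 0 and whole < n:
        top_sum += frac * ranked[whole]
    top_avg = top_sum / k              # avg GF of the "powerhouse" teams
    league_avg = sum(goals_for) / n
    # Fraction above league avg, scaled by the parity variable Xf
    return (top_avg / league_avg - 1) / xf
```

For example, with five teams at 300, 250, 250, 250 and 200 GF and Xf = 0.1, the top 0.2N is just the 300-GF team, which is 20% above the league avg. of 250, so Xt comes out at about 2.0.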
I ran another regression for '68-'12, which increased R^2 to .583, with all variables again significant. The values were as follows (parentheses indicate negative values):
B = 78
Mn = (.24)
Mf = 91
Ma = (27)
Mg = (1.2)
Me = 4.3
Mp = 45
Mt = 2.7
Here are the value ranges for each variable from '68-'12:
Y: 82.6 to 99.7 (avg. 90)
Xn: 12 to 30 (avg. 23)
Xf: .078 to .21 (avg. .13)
Xa: .097 to .23 (avg. .14)
Xg: 5.14 to 8.03 (avg. 6.4)
Xe: 0 to .63 (avg. .27)
Xp: .214 to .385 (avg. .28)
Xt: .23 to 1.8 (avg. 1.3)
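As a quick sanity check on the coefficients, here is a rough sketch that plugs the listed averages into the model. It assumes the model is the plain linear form Y = B + Mn*Xn + Mf*Xf + ... + Mt*Xt and that the values in parentheses are negative; the names `COEF`, `INTERCEPT` and `predict_y` are just for illustration. The coefficients and averages are the rounded ones from this post, so the result is only approximate.

```python
# Rounded coefficients from the post; parenthesized values entered as negatives
INTERCEPT = 78
COEF = {
    "Xn": -0.24,
    "Xf": 91,
    "Xa": -27,
    "Xg": -1.2,
    "Xe": 4.3,
    "Xp": 45,
    "Xt": 2.7,
}

def predict_y(x):
    """Predicted Y for one season, given a dict of variable values."""
    return INTERCEPT + sum(COEF[name] * x[name] for name in COEF)

# Averages for '68-'12 as listed above
averages = {"Xn": 23, "Xf": 0.13, "Xa": 0.14, "Xg": 6.4,
            "Xe": 0.27, "Xp": 0.28, "Xt": 1.3}

print(round(predict_y(averages), 1))  # 90.1
```

That lands right around the avg. Y of 90, which is what you'd expect from a linear regression evaluated at the variable averages.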
The average absolute error between predicted and actual Y was 2.11 (vs. 2.14 in the 6-variable model). This compares favorably with the standard deviation of 3.86 in the actual Y values.
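For anyone who wants to run the same comparison, a small sketch of the two numbers. The Y values below are made up purely to show the calculation, and the sample standard deviation is used here as an assumption (the population version would be slightly smaller).

```python
import statistics

def mean_abs_error(actual, predicted):
    """Average of |actual - predicted| over all seasons."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up values for illustration only, not real season data
actual = [88, 90, 92]
predicted = [89, 90, 90]

mae = mean_abs_error(actual, predicted)
spread = statistics.stdev(actual)   # sample standard deviation of actual Y

print(mae, spread)  # 1.0 2.0
```

An avg. absolute error well below the standard deviation of Y means the model does noticeably better than just guessing the avg. Y every season.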
If the variances (squares of the standard deviations) are used for Xf & Xa instead, R^2 actually increases to .588, the avg. absolute error decreases to 2.08, and all variables remain significant.
I still believe one of the most important missing "variables" is the presence or absence of certain great players at different times. For instance, how does one measure the fact that Ovechkin, Malkin, Crosby, and Thornton were all mostly healthy and at/near the top of their game in 2008... but in 2011, these players were mostly injured and/or off their games? It seems that incorporating discrete variables is more difficult than I initially thought, so I'm not sure how to measure this aspect of each season.
Last edited by Czech Your Math: 11-19-2012 at 03:18 PM.
Reason: 0.2N mistakenly written as 2N
