11-19-2012, 04:27 PM
Czech Your Math
Originally Posted by barneyg:
It's definitely possible -- the only reason you aren't getting a solution is that you fell into the dummy variable trap. If you remove the 2011 dummy, none of the remaining dummies can be expressed as a linear combination of the others because it's no longer true that the sum of all dummies is equal to 1 for each observation (as the sum will be 0 for the 2011 observations).

"Linear independence" is quite a bit more permissive than "independence" as usually defined in probability theory. You can have heavily correlated variables -- this creates other problems, but they will still be considered linearly independent as long as one of those variables isn't completely determined by a linear combination of the others.
I removed one dummy variable from each category, but the regression still won't produce coefficients. Perhaps I misunderstood your suggestion: I removed X(s1980), X(a40) and X(p12), where p12 was the dummy variable for Ray Bourque.
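As a rough illustration of the dummy variable trap with several categories at once, here is a minimal sketch in Python with NumPy. The data are synthetic and the names (`season`, `age`, `player`, `dummies`, `design`) are hypothetical stand-ins, not the actual variables from this model. It builds an intercept plus dummies for three categories and compares the column count to the matrix rank, with and without dropping one dummy per category.

```python
import numpy as np

n = 240
idx = np.arange(n)

# Synthetic category labels (assumption: a balanced layout in which every
# combination of levels occurs; this is NOT the real hockey data).
season = idx % 5          # 5 season groups
age = (idx // 5) % 4      # 4 age groups
player = (idx // 20) % 6  # 6 players


def dummies(labels, drop_first):
    """One 0/1 column per level; optionally drop the first level as baseline."""
    levels = np.unique(labels)
    if drop_first:
        levels = levels[1:]
    return (labels[:, None] == levels[None, :]).astype(float)


def design(drop_first):
    """Intercept column followed by the dummy blocks for all three categories."""
    cols = [np.ones((n, 1))]
    cols += [dummies(v, drop_first) for v in (season, age, player)]
    return np.hstack(cols)


X_full = design(drop_first=False)  # every dummy kept: falls into the trap
X_drop = design(drop_first=True)   # one dummy dropped per category

rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)

print("all dummies:     columns =", X_full.shape[1], " rank =", rank_full)
print("one dropped each: columns =", X_drop.shape[1], " rank =", rank_drop)
```

When the rank equals the number of columns, least squares has a unique solution. If the rank is still lower than the column count after one dummy per category has been dropped, some other exact linear dependence remains. One possibility worth checking is a regressor that takes a single value within each level of a dummy category (for instance, a league-wide figure that is the same for every observation in a season), since such a variable is an exact linear combination of that category's dummies and the intercept.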

Originally Posted by barneyg:
As I wrote in the post that followed, it doesn't really matter whether those correlations make intuitive sense -- high correlations in the regressors can mess up the results big time.

That said, I would assume that going all the way back to 1964 is going to mitigate some of your problem because you'll be adding years where expansion led to increased scoring (reducing the correlation between Xn and Xg), while keeping the % of Euros relatively constant (reducing the correlation of Xe with everything else).
I'm not sure how to bypass the problem of high correlation between many of the variables.
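One common way to at least measure how severe the correlation problem is, before deciding what to do about it, is the variance inflation factor (VIF): regress each regressor on all the others and compute 1 / (1 - R^2). The sketch below uses plain NumPy on synthetic data; the function name `vif` and the columns `a`, `b`, `c` are made up for illustration and are not the Xn, Xg, Xe series discussed here.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X.

    X holds the regressors only (no intercept column); an intercept is
    added to each auxiliary regression.
    """
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic example: b is nearly a copy of a, c is unrelated to both.
rng = np.random.default_rng(0)
n = 45
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)
c = rng.normal(size=n)
X = np.column_stack([a, b, c])

vifs = vif(X)
print("pairwise correlations:\n", np.round(np.corrcoef(X, rowvar=False), 2))
print("VIFs:", np.round(vifs, 1))
```

A frequently quoted rule of thumb treats a VIF above roughly 10 as a sign that a coefficient is poorly determined, though the cutoff is a convention rather than a hard rule. A high VIF does not say which variable to drop; it only flags which coefficients are hard to separate from the others.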

I substituted PPO/game for PPG as % of total goals. This didn't seem to affect the model too much (R^2 rose slightly to .589). I think this is better, since it prevents a potential correlation between decreased scoring and PPG as % of goals (as you pointed out), for whatever reason.

I also ran the regression back to 1964, the first year for which I have data for power plays. This decreased R^2 to .48, which is a substantial reduction. I think the problem is that with only 6 teams, Y is based on the average of only 6 players, and other variables are based on calculations for only 6 teams. It may be best to use a model for '68-present (or perhaps an even shorter span if it's found to be much more reliable), and a separate model for the O6. For now, I will concentrate on '68-present, since there is more/better data available for those seasons, and there are more interesting and reliable changes in the variables.
