11-19-2012, 04:57 PM
Originally Posted by Czech Your Math
I removed one dummy variable from each category, but the coefficients still won't generate. Perhaps I misunderstood your suggestion (I removed X(s1980), X(a40) and X(p12)... p12 was the dummy variable for Ray Bourque).
Make sure you have no single dummy (e.g. X(a39)) for which all values are exactly 0 (or 1, though that shouldn't happen if you coded the data correctly: the intercept should be the only variable equal to 1 for all observations).
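
If it helps, here is a rough sketch of that check in Python. It assumes the dummies are laid out as one row per observation and one column per dummy; the function name `constant_columns` and the toy data are made up for illustration and are not from your spreadsheet.

```python
def constant_columns(rows, names):
    """Return the names of columns whose value never changes across observations."""
    flagged = []
    for j, name in enumerate(names):
        values = {row[j] for row in rows}  # distinct values seen in column j
        if len(values) <= 1:               # all 0, all 1, or any other constant
            flagged.append(name)
    return flagged


# Hypothetical toy data: three dummies, four observations.
names = ["s1981", "a38", "a39"]
rows = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]

print(constant_columns(rows, names))  # a39 is 0 everywhere, so it gets flagged
```

Any dummy this flags adds no information (all 0) or duplicates the intercept (all 1), and either case can stop the coefficients from being computed. Run it on the dummies only; if you include the intercept column it will be flagged too, which is expected.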

Originally Posted by Czech Your Math
I'm not sure how to bypass the problem of high correlation between many of the variables.

I substituted PPO/game for PPG as % of total goals. This didn't seem to affect the model too much (R^2 rose slightly to .589). I think this is better, since it prevents a potential correlation between decreased scoring and PPG as % of goals (as you pointed out), for whatever reason.

I also ran the regression back to 1964, the first year for which I have data for power plays. This decreased R^2 to .48, which is a substantial reduction. I think the problem is that with only 6 teams, Y is based on the average of only 6 players, and other variables are based on calculations for only 6 teams. It may be best to use a model for '68-present (or perhaps an even shorter span if it's found to be much more reliable), and a separate model for the O6. For now, I will concentrate on '68-present, since there is more/better data available for those seasons, and there are more interesting and reliable changes in the variables.
Unless you're strictly in the prediction business, a lower R^2 is no big deal as long as you get more precise estimates (i.e. reduced standard errors on the coefficients).
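
For reference, the standard textbook OLS result for the variance of a single coefficient shows where those standard errors come from. The notation is my own, not from this thread, and it assumes the usual homoskedastic-error setup: \sigma^2 is the error variance, SST_j is the total sample variation in regressor x_j, and R_j^2 is the R^2 from regressing x_j on all the other regressors (not the R^2 of the model itself).

```latex
\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{SST_j \,(1 - R_j^2)},
\qquad
SST_j = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2
```

As R_j^2 approaches 1, i.e. as x_j becomes nearly a linear combination of the other variables, the variance of its coefficient blows up no matter how high the overall R^2 is.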

Are correlations still as high with the 1968-2012 data?
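
A quick way to check is to list every pair of explanatory variables whose correlation exceeds some cutoff. The sketch below is only an illustration: the 0.8 threshold is an arbitrary choice, the numbers are invented, and apart from PPO/game the variable names are hypothetical stand-ins for whatever is in the model.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def high_correlations(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs


# Invented numbers, one value per season; names other than PPO/game are hypothetical.
columns = {
    "ppo_per_game": [3.0, 3.5, 4.0, 4.5, 5.0, 5.5],
    "goals_per_game": [6.1, 6.4, 7.2, 7.4, 8.1, 8.3],
    "pct_european": [10.0, 2.0, 8.0, 3.0, 9.0, 4.0],
}

for a, b, r in high_correlations(columns):
    print(a, b, r)
```

Pairwise correlations only catch two-variable problems. A variable can still be close to a linear combination of several others, so treat a clean result here as a first pass rather than proof that collinearity is gone.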

Last edited by barneyg: 11-19-2012 at 07:02 PM. Reason: rephrased the first sentence (double negative)