11-19-2012, 03:16 PM
Czech Your Math
Originally Posted by barneyg
I'm not sure how to interpret those 3 extra variables. The R-square doesn't increase all that much when you add Xt but the coefficients on the main variables are all over the place. What's your rationale for a link between the concepts "parity" and "top player scoring"? Define that hypothesis, and then pick a construct that you think represents it best -- 3 variables makes it too tough to interpret IMO.
I think Xf & Xa (parity of team GF & GA) are important, but I agree that Xt doesn't seem a crucial variable.

My main hypothesis, regarding parity in this model, is that the more team parity there is, the more difficult it is generally for the top tier of players to exceed the average in % terms. IOW, there's not as much running up the score with meaningless points when teams are closely bunched together in terms of ability.

With Xt, I was trying to capture the "teammate/linemate" effect in an objective manner, as well as how easy it was for teams to become offensive juggernauts (even after factoring out general parity). How does one factor in the fact that Gretzky or Lemieux, e.g., may have significantly elevated one or more teammates' point totals? Also, how does one factor in the presence of phenoms such as Gretzky and Lemieux in the league? I'm just not sure how to capture these effects objectively. Eliminating such "outliers" from the Y population is changing the Y population subjectively... and where do you draw the line and stop eliminating "outliers"?... and what about their teammates? This is one reason I was trying to build the model with the dummy variables, because it would bypass this problem.

Originally Posted by barneyg
As for your Xt variable definition, I'm assuming you mean "the GF for the top N/5 teams"? "top 2N" would mean the top 42 teams for a 21-team league, doesn't make sense..
My mistake, 2N should have been listed as 0.2N.
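
To make the 0.2N cutoff concrete, here's a rough Python sketch. It only shows how the "top 0.2N teams" group would be picked; the function name, the rounding rule, and the choice to express the result as a ratio to the league-average GF are illustration-only assumptions, not necessarily the exact Xt in the model.

```python
def top_team_gf_ratio(team_gf, share=0.2):
    """Mean GF of the top share*N teams divided by the league mean GF.

    team_gf: list of goals-for totals, one entry per team, for one season.
    The ratio form and the rounding of share*N are assumptions for this sketch.
    """
    n = len(team_gf)
    k = max(1, round(share * n))  # e.g. 21 teams -> 4.2 -> 4 teams
    top = sorted(team_gf, reverse=True)[:k]
    return (sum(top) / k) / (sum(team_gf) / n)


# Made-up totals for a 21-team league: 4 high-scoring teams, 17 average ones.
example = [300] * 4 + [250] * 17
ratio = top_team_gf_ratio(example)
```

With 21 teams this picks the top 4, which is the "top N/5" group barneyg assumed.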

Originally Posted by barneyg
How about a variable for % of games missed by the people that were included in Y the previous year? or % of games missed by the top x% of people with the highest points/game?
Those are interesting ideas. The % of games missed by players in the previous season's "Y" would take a lot of time to compile. The % of games missed by the top x% of players seems more practical, although I would probably need to set a minimum number of games played. I think it will probably be marginally useful (like Xt) or insignificant, but it's still worth using a practical variable for injuries and seeing if it adds to the model's significance.
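
A rough sketch of how the second idea could be computed for one season. The input layout, the function name, and the default cutoffs (top 10%, 20-game minimum) are placeholders I made up for illustration; nothing here has been run on real data.

```python
def pct_games_missed_by_top(players, schedule_games, top_share=0.10, min_games=20):
    """Percent of scheduled games missed by the top scorers (by points/game).

    players: list of (points, games_played) tuples for one season.
    schedule_games: length of the schedule (e.g. 80 or 82).
    top_share and min_games are hypothetical defaults, not tuned values.
    Returns None if no player meets the games minimum.
    """
    # Keep only players with enough games for points/game to be meaningful.
    eligible = [(pts / gp, gp) for pts, gp in players if gp >= min_games]
    if not eligible:
        return None
    eligible.sort(reverse=True)  # highest points/game first
    k = max(1, round(top_share * len(eligible)))
    top = eligible[:k]
    # Clamp at zero in case a traded player exceeds the schedule length.
    missed = sum(max(0, schedule_games - gp) for _, gp in top)
    return 100.0 * missed / (k * schedule_games)
```

The minimum-games filter matters: without it, a player with a hot 10-game stretch would land in the top group and count as having missed most of the season.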

My hunch is that the "Gretzky, Lemieux & friends" effect is going to be much larger than the injury factor. It's really difficult to measure why Ovechkin went from top of the league to just another very good scorer (and the effects on Backstrom, Semin, etc.). He wasn't injured, and he's still in his prime. I don't think it's easy to objectively measure that.

Originally Posted by barneyg
The 'source' of the correlation ('coincidental' or otherwise) doesn't really matter -- as I previously wrote, multicollinearity makes the coefficients unstable and sensitive to the variables you include next: look at what happened with Mn, Mg and Me when you added Xt as a 7th regressor.
You're right, but it seems unavoidable, unless some of those variables are re-defined.
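
One way to put a number on how much two regressors overlap is the variance inflation factor. Below is a minimal Python sketch for the simplest case of just two regressors, where VIF = 1 / (1 - r^2) and r is their correlation. With seven regressors, the proper VIF for each variable uses the R-squared from regressing it on all the others, which this sketch does not do. The function names and the sample numbers are made up.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pairwise_vif(x, y):
    """VIF for the two-regressor case only: 1 / (1 - r^2).

    Blows up (division by zero) if the two series are perfectly correlated.
    """
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Made-up series standing in for two correlated regressors.
a = [1, 2, 3, 4, 5]
b = [2, 4, 5, 4, 5]
vif = pairwise_vif(a, b)
```

The higher the VIF, the more the variance of that coefficient is inflated by overlap with the other regressor, which is the instability barneyg describes.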

Originally Posted by barneyg
Let me suggest: (-> means leads to)
90s expansion -> the trap -> decreased scoring
can't trap on special teams -> increased % of PP scoring (vs total scoring)

If you don't like "the trap" you can substitute with "video analysis" or something. That would explain all correlations except those involving Euros. Then,

salary increases (late 80s) -> Euro influx
fall of the Iron Curtain (89-91) -> Euro influx

You can't really operationalize all this stuff into regressions but if you used data points from 1960 onward (therefore including the first expansions) all those correlations would decrease significantly (I would assume early-mid 70s expansion and the addition of the WHA teams led to increased scoring in the NHL, which would reverse that negative Xn/Xg relationship).
I can see how "the trap" could possibly, by itself, lead to increased PP goals as % of total goals. However, PP opportunities (PPO) were generally substantially lower before the mid-late 80s than they were afterward. Maybe I should define Xp as PPO/game instead of PPG/Total Goals?
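
To see how the two candidate definitions of Xp relate, here's a small Python sketch. It assumes PP conversion means PPG/PPO and that "per game" uses the same games count throughout; the function names and the numbers are made up for illustration.

```python
def xp_share(pp_goals, total_goals):
    """Xp as PP goals divided by total goals."""
    return pp_goals / total_goals


def xp_rate(pp_opportunities, games):
    """Alternative Xp: PP opportunities per game."""
    return pp_opportunities / games


def xp_share_decomposed(ppo_per_game, pp_conversion, goals_per_game):
    """The share rewritten as (PPO/game * PP conversion) / (goals/game)."""
    return ppo_per_game * pp_conversion / goals_per_game


# Made-up season: 80 games, 240 goals, 300 PP opportunities, 60 PP goals.
share = xp_share(60, 240)
rate = xp_rate(300, 80)
same_share = xp_share_decomposed(rate, 60 / 300, 240 / 80)
```

The decomposition shows that PPG/Total Goals moves with PPO/game, but also with PP conversion and overall scoring, so the share can rise even when opportunities are flat. PPO/game isolates the opportunity side.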

Salaries may have given incentive for even more Euros to join the NHL, but the fall of the Iron Curtain seems the main reason, as it allowed into the NHL a large pool of players that were previously prevented from playing there.

The most recent regressions were run from '68 to '12, but PP data goes back to '64, so I will go back that far when I next run the model.

The problem is that a lot of these changes happened in a short time: increased PPOs starting in the late 80s... salaries are ever-increasing... fall of Iron Curtain in early 90s... expansion beginning in early 90s and continuing during decade... large Euro/Russian influx in early & mid 90s... scoring decreasing in mid-90s... increased parity from the mid-90s... increased use of defensive systems and better/larger goalie equipment, etc.

Some of these would be expected to show substantial correlation and do. Some would be expected, but show a much smaller correlation or even one opposite in direction to that expected. Some wouldn't be expected to have a large correlation, yet do. It's not easy to objectively determine how these changes are influencing each other, even if we agree that the effect on the model is to make it more sensitive to additional variables.

Thanks for your insight on the model and its related components. It appears I'm in a bit over my head here, which is especially difficult when tackling this as a solo project, so your help is much appreciated.

Last edited by Czech Your Math: 11-19-2012 at 03:22 PM.