By The Numbers

Hockey Analytics... the Final Frontier. Explore strange new worlds, to seek out new algorithms, to boldly go where no one has gone before.

With the new season being announced, so too has the fantasy season begun.

I went and found the career stats, broken down by season, for each of the top 150 or so skaters (no goalies). I've been trying to run a regression analysis to figure out which picks will be the best ones.

I've been measuring the players' goals, assists and power play points (the only skater stat categories my league measures) against the adjusted stats for each category.

As you can see in the attachment, I've been getting an R^2 value of 0.90 to 0.99. I feel that this is way too high, but it makes sense due to the direct correlation of the raw stat to the adjusted stat.

I'm not sure whether I am measuring the correct variables and, if not, which ones I should change.

$ESG - Even strength goals, adjusted to a league scoring level of 200 ESG per team.
$ESA - Even strength assists, adjusted to a league scoring level of 200 ESG per team.
$PPP - Power play points, adjusted to a league scoring level of 70 PPG per team and a league-average number of power play opportunities.

My independent variables are Even strength goals, even strength assists and power play points.

Without knowing more about what you're doing, I would expect a high R^2, since you seem to be (essentially) translating what actually happened and using it to predict the same season.
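To illustrate why the R^2 comes out so high, here is a minimal Python sketch. The numbers are synthetic, and the adjustment formula (raw goals scaled to a 200 ESG league level) is only an assumption about how the adjusted stats are built, not the actual method behind the attachment.

```python
def r_squared(x, y):
    """R^2 of a simple least-squares line fitted to y against x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Synthetic values, invented for illustration only.
raw_esg = [20, 35, 15, 40, 28, 10]
league_esg_per_team = [210, 190, 205, 195, 200, 215]

# Assumed adjustment: scale each raw total to a 200 ESG-per-team league.
adjusted_esg = [g * 200 / lvl for g, lvl in zip(raw_esg, league_esg_per_team)]

r2 = r_squared(raw_esg, adjusted_esg)
print(f"R^2 = {r2:.3f}")
```

Because the adjusted stat is just the raw stat times a factor close to 1, the fit is nearly perfect and says nothing about next season.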

So I guess the follow-up question is this - how does this help you to predict who's going to do well in 2013?

What I'd recommend - if you prefer this style of approach - is to use season N variables (whichever you think might work, probably including age) as the independent variables, and use season N+1 goals/assists/points as the dependent variables.

And the solution for not knowing which variables will work? Try lots of things - that's half of the fun.
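A rough sketch of that setup in Python, assuming a hypothetical table of per-season rows (the players and numbers are invented). It pairs each season N with the same player's season N+1 and fits a one-variable line; age is carried along so it can be added as another X variable later.

```python
# Hypothetical rows: (player, season, age, goals). Values are illustrative only.
rows = [
    ("A", 2010, 22, 30), ("A", 2011, 23, 34), ("A", 2012, 24, 33),
    ("B", 2010, 29, 25), ("B", 2011, 30, 22), ("B", 2012, 31, 20),
]

def next_season_pairs(rows):
    """Pair each player's season N stats with the same player's season N+1 goals."""
    by_key = {(player, season): (age, goals) for player, season, age, goals in rows}
    pairs = []
    for (player, season), (age, goals) in sorted(by_key.items()):
        following = by_key.get((player, season + 1))
        if following is not None:  # the latest season has no N+1 yet
            pairs.append({"player": player, "season": season, "age": age,
                          "goals": goals, "next_goals": following[1]})
    return pairs

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

pairs = next_season_pairs(rows)
intercept, slope = fit_line([p["goals"] for p in pairs],
                            [p["next_goals"] for p in pairs])
print(len(pairs), "season pairs; slope =", round(slope, 3))
```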

I included age as my dependent variable, and the graphs I get really don't show me too much.

I can send them to you, maybe I'm doing something wrong.

Not too sure how I can use this data to help look forward. Would I want to analyze each individual player, and extrapolate from there? Or just look at groups of players and see which are the standouts?

If you include age, it should be an independent (X) variable.

What categories are counted in your fantasy league?

I tend to think this is sort of like building a laser device to measure shoe size... much more complicated than it needs to be. Probably you should focus on situations where you expect players to outperform their past seasons, due to being at peak age, getting increased playing time, or having better linemates than in the past. Remember that goal scorers tend to peak earlier (many in their early 20s and most by their mid 20s) than playmakers.

Yes, that's what I have it set as. The problem is that when I have 150 players' stats over the majority of their careers, it doesn't tell me too much when I look at it all at once.

It's goals, assists, power play points, and hits. And it's an 8 man league.

What I've figured I'll do is sort the players into groups of 20 or so of similar ranking, giving me a better idea of which players to look for in each draft round.

Ya I realize all of that. I just felt that there would be a way to better back up my intuition, and I feel that having some numbers behind it really adds to it.

It would be limited, but you could control for age (say 22-30), take a per-game average of the past three seasons per player and then look at the delta between last season and that average. You'd need to do some weeding out for injuries, etc., but for the most part you'd have a decent time finding guys who underperformed compared to expectations last season, who would then be "buy low" candidates on the draft board.
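A minimal sketch of that screen in Python. The data layout, the games-weighted average, and the 40-game cutoff used as an injury filter are all assumptions for illustration; the players and totals are invented.

```python
# Hypothetical history: player -> list of (season, age, games, points), oldest first.
history = {
    "Player A": [(2010, 25, 80, 70), (2011, 26, 82, 75), (2012, 27, 78, 55)],
    "Player B": [(2010, 24, 82, 60), (2011, 25, 80, 62), (2012, 26, 81, 66)],
    "Player C": [(2010, 33, 82, 80), (2011, 34, 80, 70), (2012, 35, 79, 60)],
    "Player D": [(2010, 24, 82, 65), (2011, 25, 20, 15), (2012, 26, 80, 50)],
}

def buy_low_candidates(history, min_age=22, max_age=30, min_games=40):
    """Return (player, delta) sorted so the biggest underperformers come first.

    delta = last season's points per game minus the three-season per-game average.
    """
    out = []
    for player, seasons in history.items():
        last3 = seasons[-3:]
        if len(last3) < 3:
            continue
        _, age, games, points = last3[-1]
        if not (min_age <= age <= max_age):
            continue  # control for age
        if any(g < min_games for _, _, g, _ in last3):
            continue  # crude injury filter (assumed threshold)
        total_points = sum(p for _, _, _, p in last3)
        total_games = sum(g for _, _, g, _ in last3)
        delta = points / games - total_points / total_games
        out.append((player, round(delta, 3)))
    return sorted(out, key=lambda item: item[1])

candidates = buy_low_candidates(history)
print(candidates)
```

A negative delta means last season was below the player's own three-year rate.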

You could also go to behindthenet.ca and just look for guys above a certain TOI/game who have a low PDO. Beyond looking in terms of value based on variance, I don't see a particularly good way of using statistical analysis to find good picks. Everything else is a dog's breakfast of assumptions about how the player will be used the coming season, who their linemates may be, their ozone starts, etc.
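The PDO screen could be sketched like this in Python, assuming you have copied the relevant columns into a list by hand. PDO is taken here as on-ice shooting percentage plus on-ice save percentage; the thresholds and all rows are hypothetical.

```python
# Hypothetical rows: (player, TOI per game in minutes, on-ice SH%, on-ice SV%).
skaters = [
    ("Player A", 19.5, 6.8, 90.9),
    ("Player B", 18.2, 9.5, 92.4),
    ("Player C", 11.0, 6.0, 90.0),
]

def low_pdo(skaters, min_toi=17.0, max_pdo=99.0):
    """Keep skaters with enough ice time whose PDO is at or below max_pdo."""
    result = []
    for name, toi, sh_pct, sv_pct in skaters:
        pdo = sh_pct + sv_pct
        if toi >= min_toi and pdo <= max_pdo:
            result.append((name, round(pdo, 1)))
    return sorted(result, key=lambda item: item[1])  # lowest PDO first

flagged = low_pdo(skaters)
print(flagged)
```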

Perhaps the most useful type of regression would be a time series. Basically, your dependent and independent variables are the same, except the independent variables are time lagged. For instance, for goals:

Y = B0 + B1*X1 + B2*X2 + ..., where X1 is the value at T-1, X2 is the value at T-2, and so on

I.e., Y could be goals in 2012, X1 is goals in 2011, X2 is goals in 2010, etc., all for the same player. Another Y would be goals in 2011, with X1 then being goals in 2010, X2 goals in 2009, etc.

You can try different combos, but I would guess doing separate studies for each category would work best. Rather than just use raw goals, using adjusted GPG is probably going to yield more useful coefficients (otherwise variability in games may affect them as much or more than skill level).
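For anyone not using a spreadsheet, here is one way the lagged regression could be fitted in plain Python. It is a sketch under assumptions: the per-player adjusted GPG series are invented, and the fit is ordinary least squares via the normal equations.

```python
def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_lagged(series_by_player, lags=2):
    """Fit Y(T) = B0 + B1*Y(T-1) + ... + Bk*Y(T-k), pooling all players."""
    xs, ys = [], []
    for series in series_by_player.values():
        for t in range(lags, len(series)):  # the first `lags` seasons are lost
            xs.append([1.0] + [series[t - k] for k in range(1, lags + 1)])
            ys.append(series[t])
    size = lags + 1
    xtx = [[sum(r[i] * r[j] for r in xs) for j in range(size)] for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(xs, ys)) for i in range(size)]
    return solve(xtx, xty)

# Hypothetical adjusted GPG by season, oldest first (invented numbers).
adjusted_gpg = {
    "Player A": [0.55, 0.62, 0.48, 0.58, 0.51],
    "Player B": [0.30, 0.41, 0.35, 0.44],
    "Player C": [0.70, 0.52, 0.66, 0.60, 0.49],
}

b0, b1, b2 = fit_lagged(adjusted_gpg)
print(f"Y = {b0:.3f} + {b1:.3f}*X1 + {b2:.3f}*X2")
```

Running a separate fit per category (goals, assists, power play points) just means calling `fit_lagged` once per set of series.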

I did a quick, simple study as an example, using (when possible) Y seasons of 2008-2012 for each of several players (Crosby, Malkin, Ovechkin, Stamkos, St. Louis, H.Sedin, Kovalchuk, Thornton, and Iginla):

Adjusted Total GPG: Y = .128 + .406*X1 + .33*X2
Adjusted Total APG: Y = .292 + .531*X1 + .078*X2

(Y = per-game metric in Year T, X1 = same metric in Year T-1, X2 = same metric in Year T-2).
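To show how these equations would be applied, here is a small Python sketch. The coefficients are the ones quoted above; the input values are hypothetical, and the function name is just for illustration.

```python
# Coefficients from the quick study above: (intercept, weight on T-1, weight on T-2).
COEFFS = {
    "gpg": (0.128, 0.406, 0.33),
    "apg": (0.292, 0.531, 0.078),
}

def project(metric, last_season, two_seasons_ago):
    """Projected per-game rate for year T from the rates in T-1 and T-2."""
    b0, b1, b2 = COEFFS[metric]
    return b0 + b1 * last_season + b2 * two_seasons_ago

# Hypothetical player: 0.60 adjusted GPG last season, 0.50 the season before.
projected_gpg = project("gpg", 0.60, 0.50)
print(round(projected_gpg, 4))
```

Keep in mind the sample behind these coefficients was nine star players, so the intercepts in particular would not carry over to depth players.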

The further back you lag the series, the more observations you lose, and the more likely it is that those further lagged variables will be insignificant. Also the Y-intercept (e.g. .128 for GPG in the above example) is going to vary with skill level, so you may have to either group players by general skill level in each category, or not use a Y-intercept.

For GPG, lagged independent variables such as shots/game or Sh% might also be useful.

Last edited by Czech Your Math: 01-11-2013 at 05:36 PM.

I understand the concept behind this, but I do not know how to execute it in Excel. Given that my draft is in 4 hours, I don't think I'm going to have enough time to do this.

I did go ahead though and sort players into groups of similar skill level/draft position. The groups range in size from 4-8 players. I also have a draft strategy, where each round I am picking from a corresponding group. I have 6 groups for forwards, 2 for defensemen and 2 for goalies.

This strategy gives me a good grasp of which players to look out for, and when. The only really big issue is that the groups aren't organized by position, so I have to monitor that on the fly so that I don't draft 4 centremen and only 1 winger (our league has 2 spots each for C, LW and RW, plus 4 D, 2 Goalies and 1 Utility spot).

All in all I feel comfortable and confident going into this draft.

I think the main thing is to have a range for each player based on past performance. Using regression for this is probably more trouble than it's worth. However, for future reference, this is how you would construct a time series.

For each player included in the study, do the following:

Let's use Ovechkin's 7 seasons from '06-'12 as an example. Calculate the Y variable in the manner you believe is most reliable for your study. For goals, I might suggest using "adjusted GPG." Once you have calculated this, it will be your dependent (Y) variable and should be in one column. So Ovechkin's adjusted GPG for seasons '06-'12 might be in cells C1-C7.

Next, label your independent (X) variables, which will be time lagged from your Y variable. You might label them T-1, T-2, etc. You would then copy and paste cells C1-C6 into cells D2-D7 for variable T-1. You don't copy cell C7, because that would be his adjusted GPG for 2012, and couldn't be used until at least 2013. You don't copy anything into cell D1, because he has no data before 2006. For variable T-2, you would copy cells C1-C5 (or D2-D6) into cells E3-E7.

Again, for each season you lag, you lose one observation (i.e., if only using T-1, you lose Y for 2006; if also using T-2, you also lose Y for 2007). You don't want any gaps in your X variables (e.g., having a T-1 but no T-2), as this will affect your results. If you used only T-1 and T-2 as X variables and found T-2 to be insignificant, you would then recalculate the regression using only T-1.

Hope this makes some sense and may be useful to someone.
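The copy-and-paste steps above amount to shifting a column down by one row per lag. A minimal Python equivalent, using placeholder values rather than Ovechkin's real numbers, might look like this:

```python
def lag_columns(values, lags=2):
    """Return rows [Y, T-1, T-2, ...], with None where no earlier season exists."""
    rows = []
    for t, y in enumerate(values):
        lagged = [values[t - k] if t - k >= 0 else None
                  for k in range(1, lags + 1)]
        rows.append([y] + lagged)
    return rows

def complete_rows(rows):
    """Drop rows with gaps, i.e. the observations lost to lagging."""
    return [row for row in rows if None not in row]

# Placeholder adjusted GPG for seasons '06-'12 (column C in the spreadsheet).
adjusted_gpg = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

table = lag_columns(adjusted_gpg, lags=2)
usable = complete_rows(table)
for row in table:
    print(row)
print(len(usable), "usable observations")
```

With 7 seasons and 2 lags, the first two rows have gaps and are dropped, which matches losing the Y values for 2006 and 2007.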

Well, turns out my draft got postponed till tomorrow. Perfect, gives me time to try this out.

I use the previous three years' data in a similar way (adjusted depending on age, sh%, etc.).

The main use I find is to spot inconsistencies between your data, pre-draft rankings and actual draft order (i.e. in real time if you can organise your spreadsheet to allow you to do so). This hopefully allows you to pick up a few 'bargains'.
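One way to spot those inconsistencies, sketched in Python with invented inputs: rank players by your own projection, then compare against a pre-draft ranking. The names, projections and rankings here are all hypothetical.

```python
# Hypothetical inputs: your projected points and a published pre-draft rank.
projected_points = {"Player A": 80, "Player B": 72, "Player C": 65, "Player D": 60}
pre_draft_rank = {"Player A": 1, "Player B": 4, "Player C": 2, "Player D": 3}

def rank_gaps(projected, consensus):
    """Return (player, gap) with the largest positive gap first.

    gap = consensus rank minus your rank, so a positive gap marks a player
    you rate higher than the pre-draft list does (a possible bargain).
    """
    ordered = sorted(projected, key=projected.get, reverse=True)
    my_rank = {player: i + 1 for i, player in enumerate(ordered)}
    gaps = [(player, consensus[player] - my_rank[player])
            for player in ordered if player in consensus]
    return sorted(gaps, key=lambda item: item[1], reverse=True)

gaps = rank_gaps(projected_points, pre_draft_rank)
print(gaps)
```

During a live draft, drafted players could simply be deleted from both dictionaries before calling `rank_gaps` again.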