HFBoards

Introducing a new stat: Location Adjusted Expected Goals Percentage (http://hfboards.hockeysfuture.com/showthread.php?t=1490701)

Wesleyy 08-28-2013 04:04 PM

Introducing a new stat: Location Adjusted Expected Goals Percentage
 
http://hockeymetrics.net/introducing...ls-percentage/

Basically an improvement on Corsi: it takes shot location into account on top of shot quantity. It took me over a month of work to finish, and I would love to hear your feedback.

do0glas 08-28-2013 04:23 PM

unfortunately i can't view the heat maps at work.

just to help me understand better: you gave a percentage weight to spots on the ice based on how often goals are scored from there league-wide?

Wesleyy 08-28-2013 04:43 PM

Quote:

Originally Posted by do0glas (Post 70594705)
unfortunately i can't view the heat maps at work.

just to help me understand better: you gave a percentage weight to spots on the ice based on how often goals are scored from there league-wide?

right, the heatmap is just a visual representation of the percentages. It's pretty much what you would expect: the percentage drops the farther the location is from the slot.

do0glas 08-28-2013 04:58 PM

Quote:

Originally Posted by Wesleyy (Post 70595331)
right, the heatmap is just a visual representation of the percentages. It's pretty much what you would expect: the percentage drops the farther the location is from the slot.

okay,

im trying to wrap my head around it all.

is the LAEGAP an average percentage, so basically you took all of the heatmap data and gave every shot an average? or does each individual player get a percentage based on where they take the majority of their shots?


either way, i think you did a great job on this. im just trying to better understand it :dunce:

Doctor No 08-28-2013 05:15 PM

Wesleyy, thanks for putting this together. I intend to comment more later, but a quick read-through suggests that this is a great step forward.

Wesleyy 08-28-2013 06:44 PM

Quote:

Originally Posted by do0glas (Post 70595899)
okay,

im trying to wrap my head around it all.

is the LAEGAP an average percentage, so basically you took all of the heatmap data and gave every shot an average? or does each individual player get a percentage based on where they take the majority of their shots?


either way, i think you did a great job on this. im just trying to better understand it :dunce:

Essentially, the heatmap is a visual representation of the league-average shooting percentage at each position on the ice. EGF is the estimated "+" in +/- for each player if his shots at their respective locations have a league-average chance of going in. EGA is the "-". EG% is EGF/(EGF+EGA). LAEGAP is an estimate of the lowest plausible "true" EG%: the lower bound of a 95% confidence interval. The more you play, the narrower the interval and the higher your LAEGAP will be, assuming you stay perfectly consistent with your past performance throughout your ice time.
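
A minimal sketch of the definitions above, assuming each shot already carries a league-average goal probability for its location. The function names are hypothetical, and the interval is a plain normal-approximation interval on a proportion; that is an assumption, since the thread does not say which interval LAEGAP actually uses.

```python
import math

def expected_goals(shot_probs):
    """Sum the league-average goal probabilities of a list of shots."""
    return sum(shot_probs)

def eg_pct(egf, ega):
    """EG% = EGF / (EGF + EGA)."""
    return egf / (egf + ega)

def lower_bound(p, n, z=1.96):
    """Lower end of a normal-approximation 95% interval for a proportion p
    observed over n events (an assumed stand-in for the LAEGAP interval)."""
    return max(0.0, p - z * math.sqrt(p * (1.0 - p) / n))

# Hypothetical on-ice shots, each with a made-up league-average probability.
egf = expected_goals([0.10, 0.25, 0.05, 0.10])  # shots for
ega = expected_goals([0.05, 0.10, 0.10])        # shots against
pct = eg_pct(egf, ega)

# Same EG%, different sample sizes: more events give a higher lower bound.
small_sample = lower_bound(0.55, 50)
large_sample = lower_bound(0.55, 500)
```

With the same EG%, the larger sample yields the higher lower bound, which matches the point that more ice time narrows the interval.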

Ohashi_Jouzu 08-28-2013 09:26 PM

Really like where this is going, what it strives to do, and the apparent reliability. Good job, bro. Looking forward to everything this can turn into.

Cunneen 08-28-2013 11:45 PM

Unfortunately, I don't think the shot location data is even close to accurate enough for us to create metrics based off shot location. The data that the NHL tracks is just horrendous. Truly horrendous.

http://www.habseyesontheprize.com/20...m-shot-tracker

Doctor No 08-29-2013 12:07 AM

Quote:

Originally Posted by Cunneen (Post 70608303)
Unfortunately, I don't think the shot location data is even close to accurate enough for us to create metrics based off shot location. The data that the NHL tracks is just horrendous. Truly horrendous.

http://www.habseyesontheprize.com/20...m-shot-tracker

Reading that blog, I don't see the errors as being as large as the blogger claims.

Plus, the foundation of the blog's claim is that the other measurement is perfectly accurate (which it certainly can't be). The fact that the two measurements differ isn't 100% the fault of the NHL's tally.

Beyond that, let's suppose that the measurement is as inaccurate as the blogger claims - it's still pretty good. What's a better analytical tool; one that's slightly off (but still gets the general gist that shots closer to the net are better scoring chances) or one that treats all shots on goal as equally likely to produce a goal (as CORSI does)?

Wesleyy 08-29-2013 12:25 AM

Quote:

Originally Posted by Taco MacArthur (Post 70608939)
Reading that blog, I don't see the errors as being as large as the blogger claims.

Plus, the foundation of the blog's claim is that the other measurement is perfectly accurate (which it certainly can't be). The fact that the two measurements differ isn't 100% the fault of the NHL's tally.

Beyond that, let's suppose that the measurement is as inaccurate as the blogger claims - it's still pretty good. What's a better analytical tool; one that's slightly off (but still gets the general gist that shots closer to the net are better scoring chances) or one that treats all shots on goal as equally likely to produce a goal (as CORSI does)?

On top of that, I've put a lot of effort into correcting the data as best as I could. I think my method of correcting the recording bias in the NHL is the most accurate out of all the other ones I've read. Obviously it's not perfect, but over a large sample size it should provide a much better estimation than the raw data.

Cunneen, if you read the methodology post (I don't know if you've read it yet), in which I go into a lot more detail about how I attempt to solve the recording bias problem, you might change your mind regarding the data accuracy.

matnor 08-29-2013 05:41 AM

First off, let me just say that I think this is very interesting work and I think you have done a very good job! I have a couple of comments and suggestions which possibly could be used to improve your work further.

1) I read the methodology paper and I am a bit unsure how you correct for arena bias. It seems to me that you just remove the average error for each arena, but that seems a bit odd, as shots taken close to the net are unlikely to have the same error as those taken far from the net. Wouldn't this also mean that you record some shots as taken behind the net when they were actually taken right in front? Maybe I'm missing something here, I haven't really thought it through.

2) The weighting function you use seems perfectly fine but is arbitrarily chosen. If you are really interested, you can use a data-driven method to select the weighting function. My suggestion would be to use kernel regression with a cross-validation method to select the bandwidth. It might be that there isn't enough data to get a small enough bandwidth, but it could be worth trying. I can recommend using the np-package in R for this. I should say, this is a really technical comment that is by no means necessary, and I don't really know what your background is, but if you are interested in learning about nonparametric estimation techniques it might be fun to test :)

3) I'm not really sure I think the way you use the lower bound of the confidence interval to take care of the small sample issue is the best way. Just spitballing an idea here: what if you instead used a Bayesian method setting the prior shooting percentage to be 0? I'm not very familiar with Bayesian statistics but it seems to me that it could take care of the problem. Otherwise, I know that people often only want to show a single estimate but I think it's better to show the confidence interval to indicate just how uncertain the statistic is.

4) It would be really nice to see a scatterplot comparing your method with regular CORSI to get an impression of how important shot location is.

Anyway, these comments aren't that important and it seems that what you have done works perfectly fine, just throwing out some ideas. :)
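
The data-driven weighting in point 2 can be sketched roughly as follows. This is a simplified, one-dimensional illustration (goal probability as a function of distance only) using a Nadaraya-Watson kernel regression with leave-one-out cross-validation to pick the bandwidth. The data is synthetic, and all names and the candidate bandwidths are assumptions; it is not the np package's implementation, nor the method used in the article.

```python
import math
import random

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u)

def nw_estimate(x0, xs, ys, h, skip=None):
    """Nadaraya-Watson estimate at x0 with bandwidth h.
    `skip` leaves out one observation (used for cross-validation)."""
    num = 0.0
    den = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = gaussian_kernel((x0 - x) / h)
        num += w * y
        den += w
    return num / den if den > 0 else 0.0

def loocv_error(xs, ys, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    total = 0.0
    for i in range(len(xs)):
        pred = nw_estimate(xs[i], xs, ys, h, skip=i)
        total += (ys[i] - pred) ** 2
    return total / len(xs)

def select_bandwidth(xs, ys, candidates):
    """Pick the candidate bandwidth with the lowest leave-one-out error."""
    return min(candidates, key=lambda h: loocv_error(xs, ys, h))

# Synthetic shots: goal probability decays with distance (made-up curve).
rng = random.Random(0)
distances = [rng.uniform(0, 60) for _ in range(400)]
goals = [1 if rng.random() < 0.3 * math.exp(-d / 20) else 0 for d in distances]

candidates = [1.0, 2.5, 5.0, 10.0, 20.0]
best_h = select_bandwidth(distances, goals, candidates)
```

A real version would work on two-dimensional shot coordinates and far more data; the point here is only that the bandwidth is chosen by the data rather than fixed in advance.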

Ohashi_Jouzu 08-29-2013 09:53 AM

Quote:

Originally Posted by Taco MacArthur (Post 70608939)
Reading that blog, I don't see the errors as being as large as the blogger claims.

Plus, the foundation of the blog's claim is that the other measurement is perfectly accurate (which it certainly can't be). The fact that the two measurements differ isn't 100% the fault of the NHL's tally.

Beyond that, let's suppose that the measurement is as inaccurate as the blogger claims - it's still pretty good. What's a better analytical tool; one that's slightly off (but still gets the general gist that shots closer to the net are better scoring chances) or one that treats all shots on goal as equally likely to produce a goal (as CORSI does)?

Totally agree that it looks like a step in the right direction. Matnor also makes some interesting suggestions for fine-tuning.

Kershaw 08-29-2013 10:16 AM

Damn looks like a lot of work was put into it, I will continue to follow this. Great work and it is pretty challenging. And I agree that this is a step in the right direction.

blue425 08-29-2013 11:46 AM

Gave it a look and my brain melted after a few minutes. Damn fine work though.

I'll try again..

do0glas 08-29-2013 12:24 PM

Quote:

Originally Posted by Wesleyy (Post 70599361)
Essentially, the heatmap is a visual representation of the league-average shooting percentage at each position on the ice. EGF is the estimated "+" in +/- for each player if his shots at their respective locations have a league-average chance of going in. EGA is the "-". EG% is EGF/(EGF+EGA). LAEGAP is an estimate of the lowest plausible "true" EG%: the lower bound of a 95% confidence interval. The more you play, the narrower the interval and the higher your LAEGAP will be, assuming you stay perfectly consistent with your past performance throughout your ice time.

okay, i like that explanation better. maybe it was tougher since i couldnt view the heat map.

So Corsi is a possession metric rather than a shot metric; it just happens to use shots as the basis for measuring possession. Would you say this is more of a scoring metric? who are the guys that can sustain a consistently higher than average shot quality?

ive always thought that a passes-completed percentage in the offensive zone would be a more reliable possession metric, but no one tracks that data like they do for soccer.

LAEGAP seems like it would be great to have alongside Corsi (i.e., player x seems to really boost his team's possession on ice, but does he improve the actual goals percentage in a tangible way?), so it really complements Corsi rather than improves upon it, imo. so for someone like Tyler Kennedy, who seems to just take shots from anywhere... does his volume shooting really have a tangible effect on the ice, or is it just keeping the puck in the zone hoping for rebounds?

great stuff

VinnyC 08-29-2013 05:17 PM

Very interesting work. Thank you so much!

BTW, have you aggregated data from before 2011-12? I like how the results could be replicated between the last two seasons, but it would be even better to see how the metric stacks up over a larger selection of seasons. I've been flirting with some new statistics, and I've noticed the correlation from season to season can vary quite a bit.

EDIT:

To jump into the above conversation, I think the debate over whether shots taken, adjusted for location, make for a better metric than simple shot-taking ends up being more philosophical. For instance, a team can spend a lot of time in the offensive zone with the ultimate goal of producing lots of point shots, or it can attempt to produce relatively few shots from the slot. In that sort of scenario, you can say a location-adjusted metric would be better, since it accounts for the fact that the latter team is trying to generate higher-percentage plays than the former.

Badger Mayhew 08-29-2013 05:55 PM

The link says you're 17? I was not expecting such quality work to be done by somebody your age. Very impressive.

Ohashi_Jouzu 08-29-2013 07:36 PM

Quote:

Originally Posted by do0glas (Post 70619771)
ive always thought that a passes-completed percentage in the offensive zone would be a more reliable possession metric, but no one tracks that data like they do for soccer.

I've always kinda liked the idea of this, as well. Like many hockey playing Canadians, my summer sport/passion was soccer, and I've also wondered what kind of trends we'd see if average number of touches from possession to scoring play, or consecutive during possession in general, was tracked. It would be interesting if teams with the lowest average touches to create a scoring play from possession happened to be viewed as the more "potent" offenses, or if teams with the most touches per possession were seen as good "possession" teams.

VinnyC 08-29-2013 09:20 PM

Quote:

Originally Posted by Ohashi_Jouzu (Post 70632427)
I've always kinda liked the idea of this, as well. Like many hockey playing Canadians, my summer sport/passion was soccer, and I've also wondered what kind of trends we'd see if average number of touches from possession to scoring play, or consecutive during possession in general, was tracked. It would be interesting if teams with the lowest average touches to create a scoring play from possession happened to be viewed as the more "potent" offenses, or if teams with the most touches per possession were seen as good "possession" teams.

Problem with tracking passes is that hockey is much faster paced than soccer, and it's arguable if many common plays that end up with the puck going to another player can actually be called a "pass" (e.g. is a dump-in a pass? a deflection? a banked shot? a loose puck? a clearing attempt?)

Wesleyy 08-30-2013 01:19 AM

Quote:

Originally Posted by matnor (Post 70613139)
First off, let me just say that I think this is very interesting work and I think you have done a very good job! I have a couple of comments and suggestions which possibly could be used to improve your work further.

1) I read the methodology paper and I am a bit unsure how you correct for arena bias. It seems to me that you just remove the average error for each arena, but that seems a bit odd, as shots taken close to the net are unlikely to have the same error as those taken far from the net. Wouldn't this also mean that you record some shots as taken behind the net when they were actually taken right in front? Maybe I'm missing something here, I haven't really thought it through.

2) The weighting function you use seems perfectly fine but is arbitrarily chosen. If you are really interested, you can use a data-driven method to select the weighting function. My suggestion would be to use kernel regression with a cross-validation method to select the bandwidth. It might be that there isn't enough data to get a small enough bandwidth, but it could be worth trying. I can recommend using the np-package in R for this. I should say, this is a really technical comment that is by no means necessary, and I don't really know what your background is, but if you are interested in learning about nonparametric estimation techniques it might be fun to test :)

3) I'm not really sure I think the way you use the lower bound of the confidence interval to take care of the small sample issue is the best way. Just spitballing an idea here: what if you instead used a Bayesian method setting the prior shooting percentage to be 0? I'm not very familiar with Bayesian statistics but it seems to me that it could take care of the problem. Otherwise, I know that people often only want to show a single estimate but I think it's better to show the confidence interval to indicate just how uncertain the statistic is.

4) It would be really nice to see a scatterplot comparing your method with regular CORSI to get an impression of how important shot location is.

Anyway, these comments aren't that important and it seems that what you have done works perfectly fine, just throwing out some ideas. :)

1) I agree it's not perfect, and some points do end up on the other side of the goal line or blue line. Obviously the shots at a given arena do not all vary by the same distance, but I considered all the other options and decided that this would bring them closest to their actual locations. I've also attempted to ease this error by regressing the points. Since we can only really measure trends in recorder bias, I think the current method is good enough. A better solution could be to use visual anchors like the faceoff circles/dots, goal line, and blue line, instead of pos/neg x/y points, to correct the recording bias, based on the assumption that the recorders plot shot locations using those visual anchors. But I think even then I would have to regress the points to a certain extent, and the difference between that method and my current method would be marginal.

2) The 5-foot radius is partly arbitrary. I decided upon 5 ft for two reasons. One, because it is the approximate distance from a player's stick blade to his skate, so a recorder could technically have a 5-foot margin of error on either side depending on the player's handedness. Two, because it was larger than the largest distance bias any arena had (NYI, with -4.3 ft and 3.3 ft on the positive end). The 75% exponential weighting was definitely arbitrary though. I'm not familiar with non-parametric regression; my understanding is that it selects weights based on the amount of data points available? Since I am not familiar with it, I can't say for sure, but since, like I mentioned before, we can really only measure trends in recorder bias, an improved regression method will most likely have only a minute effect on the data while adding a whole lot of complexity to the stat.

3) I agree with posting the interval; I think just including the lower bound confused some people. I will probably update the tables to include the probability and their intervals when I have the time. As for the Bayesian interval, I think it's parallel to confidence intervals, and using one over another would be essentially a lateral move. As for setting the prior to 0, I have no idea what you mean by that, as, in my understanding, credible intervals rely completely on the prior to make an accurate prediction, so setting them all to zero would make it useless? Maybe I'm misunderstanding something from your post.

4) What would a Corsi vs LAEGAP plot prove? How is that a measurement of the importance of shot location? It is 100% certain that shooting at a historically high percentage location will have a higher chance of going in versus shooting at a historically low percentage location. Again, maybe I'm misunderstanding something.
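
One possible reading of the Bayesian suggestion in point 3, sketched under stated assumptions: put a Beta prior on the proportion and report the posterior mean, so that small samples are pulled toward the prior. A prior of exactly 0 is not a usable Beta prior, so the hypothetical `prior_mean` and `prior_strength` parameters below stand in for "a low starting value". None of this comes from the methodology post.

```python
def posterior_mean(successes, trials, prior_mean=0.05, prior_strength=20.0):
    """Posterior mean of a proportion under a Beta(a, b) prior.

    a = prior_mean * prior_strength and b = (1 - prior_mean) * prior_strength,
    so prior_strength behaves like a count of pseudo-observations.
    """
    a = prior_mean * prior_strength
    b = (1.0 - prior_mean) * prior_strength
    return (a + successes) / (a + b + trials)

# Same raw rate (30%), different sample sizes.
small = posterior_mean(3, 10)       # pulled strongly toward the prior
large = posterior_mean(300, 1000)   # stays close to the raw 30%
```

Like the lower bound of a confidence interval, this penalizes small samples, but it produces a single shrunken estimate rather than one end of an interval.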

Quote:

Very interesting work. Thank you so much!

BTW, have you aggregated data from before 2011-12? I like how the results could be replicated between the last two seasons, but it would be even better to see how the metric stacks up over a larger selection of seasons. I've been flirting with some new statistics, and I've noticed the correlation from season to season can vary quite a bit.

EDIT:

To jump into the above conversation, I think the debate over whether shots taken, adjusted for location, make for a better metric than simple shot-taking ends up being more philosophical. For instance, a team can spend a lot of time in the offensive zone with the ultimate goal of producing lots of point shots, or it can attempt to produce relatively few shots from the slot. In that sort of scenario, you can say a location-adjusted metric would be better, since it accounts for the fact that the latter team is trying to generate higher-percentage plays than the former.
I haven't compiled the data for seasons prior to 2011 yet. I am working on another article that needs the past seasons' data, so when I finish that I will update and post more correlation plots for the older seasons.

Whether adjusting for location makes for a better metric is actually not philosophical at all. It is certain (assuming the location data is not so inaccurate that it resembles pure randomness, which it isn't). I think what you mean is that one team might try to go for more point shots than high-percentage shots, and end up scoring more goals because they were able to take so many shots. This is actually the core idea of LAEGAP/EG%. For example, if team A has 20 shots at the blue line, where the average shot percentage is .10, and team B has 4 shots at a .25 location, team A will have a better EGF (2 vs 1) than team B. Team A is expected to score 2 goals, and team B is expected to score 1 goal.
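
The arithmetic in that example, written out as a small sketch (the function name is hypothetical; the shot counts and percentages are the ones from the post):

```python
def expected_goals_for(shots, avg_shot_pct):
    """EGF for a batch of shots taken from one location."""
    return shots * avg_shot_pct

team_a = expected_goals_for(20, 0.10)  # 20 blue-line shots at .10
team_b = expected_goals_for(4, 0.25)   # 4 shots from a .25 location

# If the two teams were playing each other, team A's share of expected goals:
team_a_eg_pct = team_a / (team_a + team_b)
```

Team A comes out ahead on expected goals despite the lower-percentage shots, purely on volume.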

Quote:

So Corsi is a possession metric rather than a shot metric; it just happens to use shots as the basis for measuring possession. Would you say this is more of a scoring metric?
Every statistic that anyone has come up with tries to predict one thing and one thing only: wins. Possession and Corsi are rated so highly because they have a strong correlation with winning, which means scoring more than your opponent. If you think about it logically, there are only 3 ways LAEGAP/EG% won't translate into wins:

1. the data is inaccurate.
2. players on the team have less than average shooting skills.
3. your goalie sucks

Corsi has all three exceptions, plus two:

1. your team shoots in low scoring areas (doesn't create good chances)
2. your team allows shots from high scoring areas (allows good chances against)

Quote:

who are the guys that can sustain a consistently higher than average shot quality?
That's something I am going to look at in the future. There's so much content and info to extract from this data and so many articles to write. I am pretty excited :laugh:

Quote:

The link says you're 17? I was not expecting such quality work to be done by somebody your age. Very impressive.
Ha thanks, I actually just turned 17 this month.

VinnyC 08-30-2013 02:16 AM

Quote:

I haven't compiled the data for seasons prior to 2011 yet. I am working on another article that needs the past seasons' data, so when I finish that I will update and post more correlation plots for the older seasons.
Good to know!

Just wondering, how do you compile the shot distance data? I recall last month there being a thread on just that, and I don't think anyone came up with a satisfactory answer.

Quote:

Originally Posted by Wesleyy (Post 70642101)
Whether adjusting for location makes for a better metric is actually not philosophical at all. It is certain (assuming the location data is not so inaccurate that it resembles pure randomness, which it isn't). I think what you mean is that one team might try to go for more point shots than high-percentage shots, and end up scoring more goals because they were able to take so many shots. This is actually the core idea of LAEGAP/EG%. For example, if team A has 20 shots at the blue line, where the average shot percentage is .10, and team B has 4 shots at a .25 location, team A will have a better EGF (2 vs 1) than team B. Team A is expected to score 2 goals, and team B is expected to score 1 goal.

My apologies if I wasn't clear enough; I'm not contesting that LAEGAP might be a better win predictor than Corsi differential, I was just arguing about their nature as possession predictors. Say a team has a neutral LAEGAP - it roughly tells us the team is as efficient in its offensive possessions as it is inefficient when it is on the defensive - or vice-versa. It doesn't say much about how much a team has either owned the puck or let it be controlled by opponents, however. Corsi has its own shortcomings as a possession statistic, but we can all agree that taking a shot implies having the puck, and allowing a shot against implies not having it.

Not sure if we're all concerned with that, though. :laugh:

Ohashi_Jouzu 08-30-2013 02:54 AM

Quote:

Originally Posted by VinnyC (Post 70635473)
Problem with tracking passes is that hockey is much faster paced than soccer, and it's arguable if many common plays that end up with the puck going to another player can actually be called a "pass" (e.g. is a dump-in a pass? a deflection? a banked shot? a loose puck? a clearing attempt?)

Simply the number of touches would be fine, and it wouldn't have to be 100% controlled along the way, either, as long as the opponent doesn't gain possession. Just something different from a possession time statistic separated into zones of the ice (although that might be nice, too), or similar, like you see in soccer.

I dunno, I haven't really thought about it that much. Just a random thought that occasionally comes up.

The Legend 08-30-2013 03:24 AM

Quote:

Originally Posted by Wesleyy (Post 70594085)
http://hockeymetrics.net/introducing...ls-percentage/

Basically an improvement of corsi, it takes shot location into account on top of shot quantity. Took me over a month of work to finish, would love to hear your feedback.

It's a smart step; but will need to eventually be adjusted for opposition. I assume that certain teams have defensive styles that will make shots from different locations "higher quality". This works in a balanced schedule with balanced lines.

Blue Blooded 08-30-2013 04:55 AM

I find this really interesting, great work!

One question though; it seems like you have based the metric on shots, wouldn't it have been better to base it on Corsi/Fenwick instead?

1. You'd get a larger sample size for each player.

2. Just because a shot from an area is more likely to go in, it doesn't mean that a shot attempt from there is more likely to do the same. There might be a higher frequency of blocks, or it might be harder to hit the net from that area.

Point #2 is likely pretty (or completely) insignificant. But there is a reason Corsi and Fenwick are preferred over shots, shouldn't you have used one of/both of them instead?

Devilsfan992 08-30-2013 02:11 PM

Great work! :handclap: Still surprised you're only 17 years old. This is Senior Year of College/Post-Grad work. I wish Pete Deboer could read this journal and realize he should be consistently starting Mark Fayne.


All times are GMT -5.