Hey all, I'm new to the By The Numbers forum, but as an avid sports bettor with a strong background in math, I've always been intrigued by the applications of statistical analysis for sports betting. My idea is to create a detailed multiple regression analysis for game-by-game outcomes using data from the last 3 seasons.

The aim of the analysis is to estimate each team's percent chance of victory. The design I have in mind is an output variable giving the home team's percent chance of victory, based on a series of independent (predictor) variables.

Thus far, the variables I would like to include:

- Season-to-date win%

- Rolling win% over the last 10 games

- Season-to-date starting goalie SVPCT

- Rolling save % over the last 10 games for said goalie

- Historical goalie SVPCT vs opponent

- Team days of rest

- Goaltender days of rest

- Travel distance from prior destination

- Rolling 10 game head to head win%

- Season-to-date Corsi or Fenwick

- Rolling 10 game Corsi or Fenwick

- Season-to-date team shooting %

- Rolling 10 game shooting %

I'd be open to adding other variables or removing any of the current ones. The 10-game window is currently an arbitrary number. I'm certain it can be optimized to produce the best correlation coefficient, but I think 10 is a good starting point, at least as arbitrary numbers go.
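For what it's worth, here is a minimal sketch of how a rolling feature like the last-10 win% could be computed. The function name `rolling_win_pct` and the sample results are made up for illustration. The one design choice worth noting is that each value only uses games played before the game being predicted, so the feature never includes the outcome itself.

```python
def rolling_win_pct(results, window=10):
    """Win% over the previous `window` games, computed before each game.

    `results` is a chronological list of 1 (win) / 0 (loss) for one team.
    Entry i uses only games before i, so it never sees the result it is
    meant to help predict. The value is None when no prior game exists.
    """
    out = []
    for i in range(len(results)):
        prior = results[max(0, i - window):i]
        out.append(sum(prior) / len(prior) if prior else None)
    return out


# Made-up results, for illustration only.
example = [1, 0, 1, 1, 0, 1]
for w in (3, 5):
    print(w, rolling_win_pct(example, window=w))
```

Making `window` a parameter keeps the 10 easy to vary later when comparing different window lengths.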

The objective is to run a multiple regression to find which variables are best at predicting a victory, and to test other variables to determine which combination produces the strongest r and r^2.
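As a rough sketch of the modelling step: since the output is meant to be a win probability, the example below swaps in logistic regression, which is not the plain linear multiple regression described above but keeps predictions between 0 and 1. Everything here is an assumption for illustration: the two features (home-minus-away season win% and home-minus-away rest days), the toy data, and the helper names `fit_logistic` and `predict`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, row):
    """Home win probability for one game; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], row)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the average log loss."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for row, target in zip(X, y):
            err = predict(w, row) - target
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w


# Toy data, invented for illustration: [win% difference, rest-day difference]
X = [[0.20, 1], [0.10, 0], [-0.05, 1], [0.15, -1],
     [-0.10, 0], [-0.20, -1], [0.05, 0], [-0.15, 1]]
y = [1, 0, 1, 0, 0, 0, 1, 0]  # 1 = home team won

weights = fit_logistic(X, y)
print("weights:", weights)
print("P(home win | +0.20 win%, +1 rest day):", predict(weights, [0.20, 1]))
```

With real data you would fit on earlier seasons and check the predicted probabilities against a held-out season, rather than judging the model on the games it was fitted to.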

My questions are as follows:

- Is there an easier way to obtain this data than to do it manually? I know I can get the season results from hockey-reference, but then I would have to go into each game manually to get the goalie #s. I have no idea whatsoever how to get the Corsi #s.

- Are there any additional variables that you would suggest as useful?

Thanks in advance, folks. As I said, I'm relatively new here, so I apologize if I'm missing things that are obvious to the rest of you.

Edited to add: Special teams #s would also likely be useful. I'm also aware that this is not going to be a perfect system. The idea is to find statistically significant discrepancies between Vegas odds and the winning percentages from the regression analysis. The simplest example would be a team with a standard -110 line (a break-even win rate of about 52.4%) showing up with a 75% winning percentage based on the regression model, with a strong r. In this case (depending on the exact r), it seems that we would have a positive ROI, which can be calculated on a per-game basis pretty easily.
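To make the -110 / 75% example concrete, here is a small sketch of the per-game arithmetic. The function names are my own, and the ROI figure assumes the model's probability is actually correct, which is exactly the part that needs testing.

```python
def implied_prob(american_odds):
    """Break-even win probability implied by American odds (vig included)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def expected_roi(model_prob, american_odds):
    """Expected profit per unit staked, taking model_prob at face value."""
    if american_odds < 0:
        payout = 100 / -american_odds   # profit per unit staked on a win
    else:
        payout = american_odds / 100
    return model_prob * payout - (1 - model_prob)


print(round(implied_prob(-110), 4))        # 0.5238
print(round(expected_roi(0.75, -110), 4))  # 0.4318
```

So a bet is only +EV when the model's probability clears the break-even rate, about 52.4% at -110, and the size of the edge is the gap between the two.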

Has anyone attempted to adjust possession numbers to be more forgiving of players who get tough zone starts and QoC? Or the opposite - to bring down to earth the numbers of players who had an easier time. A lot of times you see 2nd and 3rd pairing players who show up well in Corsi but it's because of how they've been utilized (look at Mike Green last season, for example).

I think the first step would be a "team neutral zone start %" stat. I'd be surprised if someone hasn't done this yet. On a bad team you can actually see everyone with at least x GP sitting at a max of 48-49% zone starts because of how bad the team is. That can make it look like a player was utilized in a defensive or balanced role, when 48-49% could actually be the highest on a bad team where most other players sit at 39-43%. Basically, 50% on Edmonton doesn't mean the same thing as 50% on Washington, not even close.
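One possible reading of a "team neutral" zone start stat is simply the player's zone start % minus the team's. The sketch below assumes that definition and uses made-up counts. The team figure here is pooled from the players' own starts, so it is weighted by deployment, which is an assumption and not an established formula.

```python
def zone_start_pct(oz_starts, dz_starts):
    """Offensive-zone share of non-neutral zone starts, in percent."""
    return 100.0 * oz_starts / (oz_starts + dz_starts)


def team_relative_zone_starts(players):
    """Map each player to (own ZS% minus pooled team ZS%).

    `players` is {name: (offensive-zone starts, defensive-zone starts)}.
    """
    team_oz = sum(oz for oz, _ in players.values())
    team_dz = sum(dz for _, dz in players.values())
    team_pct = zone_start_pct(team_oz, team_dz)
    return {name: zone_start_pct(oz, dz) - team_pct
            for name, (oz, dz) in players.items()}


# Hypothetical players on a weak team, counts invented for illustration.
weak_team = {"Player A": (245, 255), "Player B": (200, 300), "Player C": (195, 305)}
for name, rel in team_relative_zone_starts(weak_team).items():
    print(name, round(rel, 1))
```

On these numbers Player A sits at 49% raw, which looks balanced, but comes out well above the team's pooled rate, which is the effect described above.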

Some degree of correlation would have to be demonstrated between team neutral zone start % and relative QoC and Corsi. It's obvious that both zone starts and QoC relate directly to Corsi in a large sample, and I'm interested in which players really transcend that, which players really outperform the situations they were placed in. That's where I struggle: how do you adjust their Corsi? What factor do you use? I don't think this is as easy as an adjusted goals or adjusted sv% formula, and I feel like anything I come up with would be somewhat arbitrary. Is there a mathematically and statistically sound way to attempt this?
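One way to avoid picking an arbitrary factor is to let the data set it: fit a least-squares line of Corsi% on the usage measure, then treat each player's residual as Corsi above or below what his deployment predicts. The sketch below is only that idea under stated assumptions, with one predictor (team-relative zone start %) and invented numbers. QoC could be added as a second predictor in a multiple regression in the same way.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def usage_adjusted(x, y):
    """Residuals: actual Corsi% minus the Corsi% predicted from usage."""
    slope, intercept = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Invented numbers: team-relative zone start % and raw Corsi% per player.
rel_zs = [-6, -2, 0, 3, 5]
corsi = [47, 50, 49, 53, 52]

slope, intercept = fit_line(rel_zs, corsi)
print("Corsi points per point of relative ZS%:", slope)
print([round(r, 2) for r in usage_adjusted(rel_zs, corsi)])
```

A positive residual marks a player who outperformed his usage. The fitted slope is the "factor", and its size and significance in a large sample would tell you whether the adjustment is worth making at all.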
