03-18-2013, 04:43 AM
Cyclones Rock
Originally Posted by 24get
Here you have a basis for one game.
Go to the Boxscore and then click on Play-by-play on the right.

From this you can extract the info:
  • The trick is to extract the headers and keep the raw data (report scraping).
  • Then put it in a database.
  • Then regenerate the play-by-play from the database (to validate the process).
  • Then use the database to analyze the data.
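For anyone who wants to see the load-and-regenerate check in concrete terms, here is a minimal sketch. The pipe-delimited line format, the sample rows, and the names `parse_line`, `load`, and `regenerate` are all made up for illustration; the real NHL report layout is different and would need its own parser.

```python
import sqlite3

# Hypothetical raw lines: "period|clock|event_type|description".
# This layout is an assumption, not the actual NHL report format.
RAW_LINES = [
    "1|00:00|FAC|HOME won neutral zone faceoff",
    "1|00:34|SHOT|AWAY wrist shot",
    "1|01:12|GOAL|HOME snap shot",
]

def parse_line(line):
    """Split one raw line into typed fields."""
    period, clock, event_type, description = line.split("|", 3)
    return int(period), clock, event_type, description

def load(conn, lines):
    """Create the events table and insert the parsed lines in order."""
    conn.execute(
        "CREATE TABLE events ("
        "seq INTEGER PRIMARY KEY, period INTEGER, clock TEXT, "
        "event_type TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO events (period, clock, event_type, description) "
        "VALUES (?, ?, ?, ?)",
        [parse_line(line) for line in lines],
    )

def regenerate(conn):
    """Rebuild the raw lines from the database, in insertion order."""
    rows = conn.execute(
        "SELECT period, clock, event_type, description FROM events ORDER BY seq"
    )
    return ["|".join(str(value) for value in row) for row in rows]

conn = sqlite3.connect(":memory:")
load(conn, RAW_LINES)

# Validation step: the regenerated report should match the scraped input.
round_trip_ok = regenerate(conn) == RAW_LINES
```

If `round_trip_ok` is false for a real game, the parser or the table layout lost something, and any analysis built on top would be suspect.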
Best would be to get the raw data from the NHL (in table format).
All their game reports come from a database that is almost real-time (probably materialized views).

It is not trivial, but it is fairly simple. A few tables in a database (maybe 20) should do the job: teams, seasons, games, players, players-date, event-type, events, events-players, etc. (The NHL model seems to handle only two players per event at the moment, so events-players may not be needed.)
It represents significant work (probably hundreds of hours, not thousands).
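As a rough illustration of what a handful of those tables could look like, here is a sketch in SQLite. Every column name is an assumption; the table names follow the list above, except that players-date is left out because its exact contents are not spelled out.

```python
import sqlite3

# Hypothetical schema: column names and types are guesses for illustration.
SCHEMA = """
CREATE TABLE seasons (season_id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE teams   (team_id   INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE players (player_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE games (
    game_id      INTEGER PRIMARY KEY,
    season_id    INTEGER NOT NULL REFERENCES seasons(season_id),
    home_team_id INTEGER NOT NULL REFERENCES teams(team_id),
    away_team_id INTEGER NOT NULL REFERENCES teams(team_id),
    game_date    TEXT    NOT NULL
);
CREATE TABLE event_types (
    event_type_id INTEGER PRIMARY KEY,
    code          TEXT NOT NULL UNIQUE
);
CREATE TABLE events (
    event_id        INTEGER PRIMARY KEY,
    game_id         INTEGER NOT NULL REFERENCES games(game_id),
    event_type_id   INTEGER NOT NULL REFERENCES event_types(event_type_id),
    period          INTEGER NOT NULL,
    seconds_elapsed INTEGER NOT NULL
);
-- Link table: allows any number of players per event, each with a role.
CREATE TABLE event_players (
    event_id  INTEGER NOT NULL REFERENCES events(event_id),
    player_id INTEGER NOT NULL REFERENCES players(player_id),
    role      TEXT    NOT NULL,
    PRIMARY KEY (event_id, player_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables that were created, sorted by name.
table_names = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

Keeping a separate event_players link table costs little and avoids a redesign if more than two players per event ever need to be recorded.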

The thing is that each additional style of report you produce adds more time.
Validating the model also adds more work.

EDIT: BTW, I am not saying how it is done but how I would do it...
Thanks so much. I'm not a techie, so I'll just trust your estimate of hundreds of hours. A heady undertaking for an individual.

It looks as if the players for each event are listed, but when there are significant time gaps between events, it's not so easy to determine who was on the ice and who wasn't. There seems to be enough inherent error from this that, at the very least, the stats derived from it need to be interpreted directionally rather than in any absolute manner. To what extent, I wouldn't even hazard a guess at this point.
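One way someone could at least put a number on that uncertainty is to flag the stretches between consecutive events where the on-ice group cannot be trusted. The sketch below is only an illustration: the `Event` class, the sample data, and the 20-second threshold are all assumptions, not anything taken from the NHL reports.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    seconds_elapsed: int  # game clock, in seconds since the opening faceoff
    on_ice: frozenset     # jersey numbers listed for the event (made-up data)

def uncertain_spans(events, max_gap_seconds=20):
    """Return (start, end) spans where on-ice personnel is unclear.

    A span is flagged when the gap between two consecutive events exceeds
    the threshold, or when the listed players differ, since the change
    happened at an unknown moment inside the span.
    The 20-second default is an arbitrary assumption.
    """
    spans = []
    for prev, curr in zip(events, events[1:]):
        gap = curr.seconds_elapsed - prev.seconds_elapsed
        if gap > max_gap_seconds or prev.on_ice != curr.on_ice:
            spans.append((prev.seconds_elapsed, curr.seconds_elapsed))
    return spans

# Hypothetical events for part of one period.
EVENTS = [
    Event(0, frozenset({4, 9})),
    Event(10, frozenset({4, 9})),
    Event(55, frozenset({4, 9})),
    Event(60, frozenset({4, 17})),
    Event(120, frozenset({4, 17})),
]

spans = uncertain_spans(EVENTS)
uncertain_seconds = sum(end - start for start, end in spans)
total_seconds = EVENTS[-1].seconds_elapsed - EVENTS[0].seconds_elapsed
uncertain_share = uncertain_seconds / total_seconds
```

The share of flagged time gives a rough sense of how far any on-ice stat built from events alone should be trusted for a given game.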

Last edited by Cyclones Rock: 03-18-2013 at 04:53 AM.