First of all, I should explicitly credit the people who collected all of this information. Everything I do here is possible only because of their hard work in bringing it together. They are:
- 1992/1994 Michael Heneghan (bombchucker@hotmail.com)
- 1994/1995 Michael Heneghan, James Ross, Stephen Mulrine and Rob Monger
- 1995/1996 Michael Heneghan, Ken Fox, Stephen Mulrine and Rob Monger
- 1996/1997 Malcolm Hodgson (malc@hodgson13.freeserve.co.uk)
- 1997/1998 Ian King
- 1998/1999 RSSSF (rsssf-request@isfa2.com)
- 1999/2000 Ken Butler (butler@mscs.dal.ca), Stephen Mulrine (stephen@moroder.scotnet.co.uk) and Andreas Exenberger (Andreas.Exenberger@uibk.ac.at)
- 2000/2001 Ian King (soccerking@members.v21.co.uk) and Stephen Mulrine
- 2001/2002 Ian King (worldsoccer@members.v21.co.uk), Daniel Dalence Garcia (ddgarcia@terra.com.br) and Matthew Perry (matthewperry@ntlworld.com)
- 2002/2003 Ian King (worldsoccer@members.v21.co.uk), Matthew Perry (matthewperry@ntlworld.com) and Markus Soleim
- 2003/2004 Ian King (soccerking@members.v21.co.uk) and Matthew Perry (matthewperry@ntlworld.com)
- 2004/2005 Ian King (soccerking@members.v21.co.uk) and Matthew Perry (matthewperry@ntlworld.com)
- 2005/2006 Daniel Dalence (danielballack@terra.com.br), Gurgen Mahari (gmahari@arminco.com) and Ian King (soccerking@members.v21.co.uk)
- 2006/2007 Ian King (i-king@sky.com)
- 2007/2008 Ian King
Because so many different authors were involved, the format of the web page varies slightly from year to year. This presents a small challenge in regularizing the data for analysis. However, my experience tells me that there is no semi-structured data that Perl can't help you regularize.
Rather than downloading the web pages and saving them as HTML, I decided to use the Perl HTTP::Request module to download each page as a block of text and then use a few regular expressions to extract the relevant match information.
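A minimal sketch of that download step might look like the following. HTTP::Request only describes the request, so this assumes LWP::UserAgent (not named above) is used to send it. The helper names build_request and fetch_page and the example.com URL are placeholders of mine, not the real page address.

```perl
use strict;
use warnings;
use HTTP::Request;
use LWP::UserAgent;

# Build a GET request for one season's page.
sub build_request {
    my ($url) = @_;
    return HTTP::Request->new(GET => $url);
}

# Fetch the page and return its body as one block of text; die on failure.
sub fetch_page {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new(timeout => 30);
    my $response = $ua->request(build_request($url));
    die "Failed to fetch $url: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Placeholder URL for illustration only:
# my $text = fetch_page('http://www.example.com/season-1992-93.html');
```

Keeping the request construction separate from the fetch makes it possible to check the request without touching the network.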
I am mostly ignoring the league tables and the scorers (where they are included); I just want to extract the result of each match. Each result becomes a tuple consisting of the following key information:
- Match date
- Home Team
- Home Team Goals
- Away Team
- Away Team Goals
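As a rough sketch, one way to hold such a tuple in Perl is a hash with one key per field. The key names and the match_fields helper are my own choices for illustration; the sample values are the Arsenal v Norwich City result of 15 August 1992.

```perl
use strict;
use warnings;

# One match result; the key names are illustrative, not fixed by anything above.
my %match = (
    date       => '1992-8-15',
    home_team  => 'Arsenal',
    home_goals => 2,
    away_team  => 'Norwich City',
    away_goals => 4,
);

# Return the fields of a match record in a fixed order.
sub match_fields {
    my ($m) = @_;
    return @{$m}{qw(date home_team home_goals away_team away_goals)};
}

print join(',', match_fields(\%match)), "\n";
```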
For example, on the 1992/93 page the results look like this:
15 August 1992
Arsenal 2 Norwich City 4
Bould, Campbell Robins 2, Phillips, Fox
Chelsea 1 Oldham Athletic 1
Harford Henry
Coventry City 2 Middlesbrough 1
Williams, Smith Wilkinson
Crystal Palace 3 Blackburn Rovers 3
Bright, Southgate, Osborn Ripley, Shearer 2
....
whereas in the 1997/98 page the results look like this:
Round 1: [9th,10th Aug 97]
09/08/97 Barnsley West Ham United 1 2 18667
09/08/97 Blackburn Rovers Derby County 1 0 23557
09/08/97 Coventry City Chelsea 3 2 22686
09/08/97 Everton Crystal Palace 1 2 35716
09/08/97 Leeds United Arsenal 1 1 37993
09/08/97 Leicester City Aston Villa 1 0 20304
09/08/97 Newcastle United Sheffield Wednesday 2 1 36711
09/08/97 Southampton Bolton Wanderers 0 1 15206
which leads to Perl regular expressions a bit like this:
/^([A-Z][A-Za-z\.\s]+)\s+(\d+)\s+([A-Z][A-Za-z\.\s]+)\s+(\d+)/
for the 1992/93 page. With a bit of to-ing and fro-ing, it is possible to get the data into a CSV file (for testing) in the following format:
1992-8-15,Arsenal,2,Norwich City,4
1992-8-15,Chelsea,1,Oldham Athletic,1
1992-8-15,Coventry City,2,Middlesbrough,1
1992-8-15,Crystal Palace,3,Blackburn Rovers,3
1992-8-15,Everton,1,Sheffield Wednesday,1
1992-8-15,Ipswich Town,1,Aston Villa,1
1992-8-15,Leeds United,2,Wimbledon,1
which is precisely how we want it for further analysis.
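To make the to-ing and fro-ing a little more concrete, here is a hedged sketch of the two conversions. The first function applies the regular expression quoted above to the 1992/93 layout, carrying the most recent date line forward. The second handles the dated layout (lines beginning with something like 09/08/97). The function names, the month table, the caller-supplied list of club names and the two-digit-year pivot of 70 are all assumptions of mine rather than part of the original script.

```perl
use strict;
use warnings;

# Month names as they appear in date lines such as "15 August 1992".
my %MONTH = (
    January => 1, February => 2,  March    => 3,  April    => 4,
    May     => 5, June     => 6,  July     => 7,  August   => 8,
    September => 9, October => 10, November => 11, December => 12,
);

# Turn the text of a 1992/93-style page into a list of CSV lines.
sub results_to_csv {
    my ($text) = @_;
    my ($date, @csv);
    for my $line (split /\n/, $text) {
        # A date line sets the date for the matches that follow it.
        if ($line =~ /^\s*(\d{1,2})\s+([A-Z][a-z]+)\s+(\d{4})\s*$/ && $MONTH{$2}) {
            $date = "$3-$MONTH{$2}-" . ($1 + 0);
            next;
        }
        next unless defined $date;
        # Result lines: the regular expression quoted above.
        if ($line =~ /^([A-Z][A-Za-z\.\s]+)\s+(\d+)\s+([A-Z][A-Za-z\.\s]+)\s+(\d+)/) {
            my ($home, $hg, $away, $ag) = ($1, $2, $3, $4);
            s/\s+$// for $home, $away;    # drop padding the character class may keep
            push @csv, join ',', $date, $home, $hg, $away, $ag;
        }
    }
    return @csv;
}

# Convert one line of the dated layout into a CSV line.
# @teams is a caller-supplied list of club names (an assumption of this
# sketch), needed because the names themselves contain spaces.
sub dated_line_to_csv {
    my ($line, @teams) = @_;
    # Longest names first, so a longer name wins over a shorter prefix.
    my $names = join '|', map { quotemeta($_) }
                sort { length($b) <=> length($a) } @teams;
    return unless $line =~ m{^(\d\d)/(\d\d)/(\d\d)\s+($names)\s+($names)\s+(\d+)\s+(\d+)};
    my ($d, $m, $y, $home, $away, $hg, $ag) = ($1, $2, $3, $4, $5, $6, $7);
    my $year = $y >= 70 ? 1900 + $y : 2000 + $y;    # assumed two-digit-year pivot
    return join ',', "$year-" . ($m + 0) . '-' . ($d + 0), $home, $hg, $away, $ag;
}

my $page = join "\n",
    '15 August 1992',
    'Arsenal 2 Norwich City 4',
    'Bould, Campbell Robins 2, Phillips, Fox',
    'Chelsea 1 Oldham Athletic 1',
    'Harford Henry';
print "$_\n" for results_to_csv($page);

my $csv = dated_line_to_csv('09/08/97 Barnsley West Ham United 1 2 18667',
                            'Barnsley', 'West Ham United');
print "$csv\n";
```

The scorer lines in the sample are skipped because they either contain commas or have no score, but a scorer line that happened to look like "Name 2 Name 1" would slip through, so the output is still worth checking by eye.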