The first thing I needed to do was to extract the data from the web pages. I wanted to extract the data in a suitable format so that I could load it into a
MySQL database. MySQL is just a great database, particularly now that they have the administration and query tools.
First of all I should mention explicitly the people who have collected all of this information. Everything that I do is possible because of their hard work in collecting all of this information together. They are:
In all cases above the copyright is shared between the authors and the
Rec.Sport.Soccer Statistics Foundation.
Because of the variety of authors above the format of the web page for each year varies slightly. This presents a small challenge in regularizes this data for analysis. However my experience tells me that there is no semi-structured data that
Perl can't help you regularize.
Rather than downloading the web pages and saving as HTML I decided to use the Perl
HTTP::Request module to download the web page as a block of text and then use a few regular expressions to extract the relevant match information.
I am mostly ignoring the tables and scorers, where added and just want to extract the results of each match. This typically results in a tuple consisting of the following key information:
- Match date
- Home Team
- Home Team Goals
- Away Team
- Away Team Goals
Looking at the raw web pages the format of how they represent results varies. For example from the 1992/93 page the results look like this:
15 August 1992
Arsenal 2 Norwich City 4
Bould, Campbell Robins 2, Phillips, Fox
Chelsea 1 Oldham Athletic 1
Harford Henry
Coventry City 2 Middlesbrough 1
Williams, Smith Wilkinson
Crystal Palace 3 Blackburn Rovers 3
Bright, Southgate, Osborn Ripley, Shearer 2
....
whereas in the 1996/97 page the results look like this:
Round 1: [9th,10th Aug 97]
09/08/97 Barnsley West Ham United 1 2 18667
09/08/97 Blackburn Rovers Derby County 1 0 23557
09/08/97 Coventry City Chelsea 3 2 22686
09/08/97 Everton Crystal Palace 1 2 35716
09/08/97 Leeds United Arsenal 1 1 37993
09/08/97 Leicester City Aston Villa 1 0 20304
09/08/97 Newcastle United Sheffield Wednesday 2 1 36711
09/08/97 Southampton Bolton Wanderers 0 1 15206
which leads to Perl regular expressions a bit like this:
/^([A-Z][A-Za-z\.\s]+)\s+(\d+)\s+([A-Z][A-Za-z\.\s]+)\s+(\d+)/
for the 1992/93 page.With a bit of to-ing and fro-ing it is possible to get the data into a CSV file (for testing) in the following format:
1992-8-15,Arsenal,2,Norwich City,4
1992-8-15,Chelsea,1,Oldham Athletic,1
1992-8-15,Coventry City,2,Middlesbrough,1
1992-8-15,Crystal Palace,3,Blackburn Rovers,3
1992-8-15,Everton,1,Sheffield Wednesday,1
1992-8-15,Ipswich Town,1,Aston Villa,1
1992-8-15,Leeds United,2,Wimbledon,1
which is precisely how we want it for further analysis.