Tuesday 25 September 2007

Loading Premiership Data into MySQL

MySQL makes it nice and easy to load the CSV files in. First create a table as follows:

drop table if exists matches;

create table matches ( matchdate date not null,
hometeam varchar(25) not null,
homescore smallint not null,
awayteam varchar(25) not null,
awayscore smallint not null

create index id_matches on matches( matchdate, hometeam, awayteam);

It is then easy to load the date in using the load data command:

into table rsssf.matches
fields terminated by ','
ignore 8 lines
( matchdate, hometeam, homescore, awayteam, awayscore );

It is then straightforward to perform simple queries on the database. For example all the home derbies for Liverpool:

select matchdate, hometeam, homescore, awayscore, awayteam
from matches
where hometeam = "Liverpool" and awayteam = "Everton"
order by 1;

gives the result at the top of this post.

Monday 24 September 2007

Checking the premiership results

I decided to load the data from CSV files so that I could do some intermediate checks on the data before loading it.

I only did some rudimentary checks on each year separately and then all of the Premiership results together. The checks included:
  • Counting the total number of matches in the file - should equal (n*n)-n where n is the number of teams in the league - 20 post-1996, 22 prior to that
  • Counting the number of home and away games for each team and checking that they were equal
  • Depending on the year checking that they had played a multiple of 42 (pre 1996) or 38 games (1996 onwards)
  • Looking at the list of names and making sure that there were not synonyms - if so going back to my table of cleansing transformations

Clean Premiership Results

The clean CSV files can be downloaded from the following links:
There is also one zip archive that contains all of the above files.

All of this data remains copyright of the original authors and the RSSSF who are all credited in the files.

Correcting team names

Due to the number of different authors on the RSSSF site we also get a number of different spellings and punctuation associated with team names. Therefore as part of the extraction process I clean the team names so that they are all consistent across years using this following Perl associative array.

my %clean =
( "A Villa" => "Aston Villa", # eng99 page
"Blackburn" => "Blackburn Rovers", # eng02 page
"Blackburn R" => "Blackburn Rovers", # eng99 page
"Blackburn Rov" => "Blackburn Rovers", # eng98 page
"Bolton" => "Bolton Wanderers", # eng02 page
"Bradford" => "Bradford City", # eng01 page
"Charlton" => "Charlton Athletic", # eng01,eng02 page
"Charlton A" => "Charlton Athletic", # eng99 page
"Coventry" => "Coventry City", # eng00 eng01 page
"Coventry C" => "Coventry City", # eng00 eng01 page
"Derby" => "Derby County", # eng00 eng01 eng02 page
"Derby C" => "Derby County", # eng99 page
"Derby Co" => "Derby County", # eng99 page
"Derby Co." => "Derby County", # eng99 page
"Ipswich" => "Ipswich Town", # eng00 eng01 eng02 page
"Leeds" => "Leeds United", # eng00 eng01 eng02 page
"Leeds U" => "Leeds United", # eng00 eng01 eng02 page
"Leeds Utd" => "Leeds United", # eng99 page
"Leicester" => "Leicester City", # eng00 eng01 eng02 page
"Leicester C" => "Leicester City", # eng99 page
"Man City" => "Manchester City", # eng01 page
"Manchester U" => "Manchester United", # eng06 page
"Manchester Utd" => "Manchester United", # eng99 page
"Man Utd" => "Manchester United", # eng01 page
"Man. Utd." => "Manchester United", # eng01 page
"Man United" => "Manchester United", # eng01 page
"Middlesbro" => "Middlesbrough", # eng01 page
"Manchester" => "Manchester United", # eng00 page
"Newcastle" => "Newcastle United", # eng98 page
"Newcastle U" => "Newcastle United", # eng00 eng01 eng02 page
"Newcastle Utd" => "Newcastle United", # eng00 eng01 eng02 page
"Nottm Forest" => "Nottingham Forest", # eng99 page
"Nottingham F" => "Nottingham Forest", # eng99 page
"Sheff. Wed." => "Sheffield Wednesday", # eng00 page
"Sheffield Wed" => "Sheffield Wednesday", # eng00 page
"Sheffield W" => "Sheffield Wednesday", # eng00 page
"Sheffield W." => "Sheffield Wednesday", # eng00 page
"Tottenham" => "Tottenham Hotspur", # eng00 eng01 eng02 page
"Tottenham H" => "Tottenham Hotspur", # eng99 page
"West Ham" => "West Ham United", # eng00 eng01 eng02 page
"West Ham U" => "West Ham United", # eng99 page
"West Ham Utd" => "West Ham United", # eng99 page
"Wigan" => "Wigan Athletic" # eng06 page

The comments after the mapping shows the page that that name is used on. The names on the left hand side are the variants and the right hand side are the names that I am mapping to.

Getting the football results data

The first thing I needed to do was to extract the data from the web pages. I wanted to extract the data in a suitable format so that I could load it into a MySQL database. MySQL is just a great database, particularly now that they have the administration and query tools.

First of all I should mention explicitly the people who have collected all of this information. Everything that I do is possible because of their hard work in collecting all of this information together. They are:
In all cases above the copyright is shared between the authors and the Rec.Sport.Soccer Statistics Foundation.

Because of the variety of authors above the format of the web page for each year varies slightly. This presents a small challenge in regularizes this data for analysis. However my experience tells me that there is no semi-structured data that Perl can't help you regularize.

Rather than downloading the web pages and saving as HTML I decided to use the Perl HTTP::Request module to download the web page as a block of text and then use a few regular expressions to extract the relevant match information.

I am mostly ignoring the tables and scorers, where added and just want to extract the results of each match. This typically results in a tuple consisting of the following key information:
  • Match date
  • Home Team
  • Home Team Goals
  • Away Team
  • Away Team Goals
Looking at the raw web pages the format of how they represent results varies. For example from the 1992/93 page the results look like this:

15 August 1992
Arsenal 2 Norwich City 4
Bould, Campbell Robins 2, Phillips, Fox
Chelsea 1 Oldham Athletic 1
Harford Henry
Coventry City 2 Middlesbrough 1
Williams, Smith Wilkinson
Crystal Palace 3 Blackburn Rovers 3
Bright, Southgate, Osborn Ripley, Shearer 2

whereas in the 1996/97 page the results look like this:

Round 1: [9th,10th Aug 97]

09/08/97 Barnsley West Ham United 1 2 18667
09/08/97 Blackburn Rovers Derby County 1 0 23557
09/08/97 Coventry City Chelsea 3 2 22686
09/08/97 Everton Crystal Palace 1 2 35716
09/08/97 Leeds United Arsenal 1 1 37993
09/08/97 Leicester City Aston Villa 1 0 20304
09/08/97 Newcastle United Sheffield Wednesday 2 1 36711
09/08/97 Southampton Bolton Wanderers 0 1 15206

which leads to Perl regular expressions a bit like this:


for the 1992/93 page.With a bit of to-ing and fro-ing it is possible to get the data into a CSV file (for testing) in the following format:

1992-8-15,Arsenal,2,Norwich City,4
1992-8-15,Chelsea,1,Oldham Athletic,1
1992-8-15,Coventry City,2,Middlesbrough,1
1992-8-15,Crystal Palace,3,Blackburn Rovers,3
1992-8-15,Everton,1,Sheffield Wednesday,1
1992-8-15,Ipswich Town,1,Aston Villa,1
1992-8-15,Leeds United,2,Wimbledon,1

which is precisely how we want it for further analysis.

Predicting Football Results

My work involves data analysis and prediction of outcomes for a variety of customers including telecommunications: churn, financial services: up-sell and cross-sell, retailers: spend, spend, spend. So I think a lot about how you can apply data analysis to a number of different problems. In particular if there are large amounts of data then this makes it more intersting.

I have been familiar with the rec.sport.soccer.statistics.foundation for quite some time as they are regularly referenced by the one true daily tea-time football email - The Fiver. They collect an enormous amount of football statistical information and make it available on the internet. Sometimes this is just the final results for historical league and cup competitions, however for the English premiership they have the results for all games from 1992. This was when the First Division was renamed as the premiership.

I had been persuaded by some friends to join the Guardian Pick the Score competition and discovered that I was particularly bad at this. So I have decided to apply some of my knowledge of data analysis to apply it to this problem.

I haven't really done any research into this yet - I'm convinced that others will have done this already with varying degrees of success. how ever I think it should be fun so I plan to document my progress here on the blog.

Scientifc Marketer

A friend/colleague of mine has started a very interesting blog on scientific marketing techniques. In particular he focuses on the incremental effects of campaigns and removing negative effects.

If you want to read more see http://scientificmarketer.com