Adjusted Goals – creating a simple expected goals model
The capture of sophisticated data from football matches has burgeoned over recent years, enabling the creation of fascinating models to estimate the quality of chances created and conceded. These “Expected Goals” models use several pieces of information about each chance (e.g. distance, angle and type of shot) to assess the expected goals scored by each team in a particular match.
Expected Goals help interpret what happened in a match – e.g. did the actual score reflect the quality of chances created? They’ve also proved a strong predictor of a team’s future performance, better than other measures such as goal difference or points accumulated. A fantastic property of expected goals is that they simply encapsulate a match outcome in a common currency (goals), but take no account of the actual goals scored. So they fulfill a common desire to make better sense of the actual match score-line by revealing a team’s “underlying numbers”.
Expected goals models represent a great progression in the analysis of football and they’re still developing. But, for me, their biggest problem is accessibility of data. Full long term data to enable an assessment of chances created is not freely available (or I’ve just not found it). Expected Goals models tend to use Opta data, which (understandably due to costs of data collection) is not fully freely available. Unfortunately, this limits the development of models.
I don’t have access to Opta data (or the analytical skills to use it), but still wanted to create a model that:
- Helps assess a team’s attacking and defensive performance over the short-term and long-term.
- Works well at predicting future performance (or at least better than goal difference or points accumulated).
- Can be applied across different leagues in different countries.
- Is relatively simple to calculate, understand and replicate.
- Uses freely available and continually updated data.
The best data source I’ve found is the excellent www.football-data.co.uk, which carries data for most European leagues, goes back more than 20 years, is regularly updated and can be loaded easily into Excel as CSV files. These files contain pretty rich information, including: full time and half time scores, shots, shots on target, fouls, cards and corners.
For my model I’ve mainly concentrated on goals, shots and shots on target – key ingredients for team performance analysis and subject to plenty of previous debate and scrutiny. I couldn’t find much use for corners. I’m sure that cards and fouls are helpful to explain match outcomes, but I haven’t used these either.
Starting with the “Shots” figure – this should represent all the goal attempts for a particular team. And, using the other data, shots can be split into 3 separate elements that should each tell us something about the relative quality of a team’s goal attempts. That is:
- Total shots = goals + saved shots on target + shots off target
- Saved shots on target = shots on target – goals [i.e. all on-target shots saved or blocked]
- Shots off target = shots – shots on target
[Actually, this isn’t strictly true because own goals won’t count as shots – but as I don’t have the data I’ve ignored this]
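The split above can be sketched in a few lines of Python. This is only an illustration: the function name `split_shots` and the example match are my own assumptions, and own goals are ignored, exactly as in the note above.

```python
def split_shots(goals, shots_on_target, shots):
    """Split a team's total shots into the three elements described above.

    Own goals are ignored, so with real data `saved_on_target` can
    occasionally come out negative when an own goal is credited to a team.
    """
    saved_on_target = shots_on_target - goals  # on-target shots saved or blocked
    off_target = shots - shots_on_target       # shots that missed the target
    return goals, saved_on_target, off_target


# Hypothetical match: 2 goals from 6 shots on target and 15 shots in total.
parts = split_shots(goals=2, shots_on_target=6, shots=15)
print(parts)       # (2, 4, 9)
print(sum(parts))  # 15, the three elements add back up to total shots
```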
Intuitively, of the three, goals should represent the highest quality chances (a goal has actually been scored after all) followed by saved shots on target and finally shots off target. So, a way to calculate a simple expected goals estimate would be to apply an expectation factor to each of the 3 elements. So that:
Simple expected goals = X*goals + Y*saved shots on target + Z*shots off target
With X ≥ Y ≥ Z
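Substituting the definitions of saved shots on target and shots off target, and collecting terms, this simplifies to
Simple expected goals = A*goals + B*shots on target + C*shots
where A = X - Y, B = Y - Z and C = Z (so A, B and C are all non-negative when X ≥ Y ≥ Z). This form is easier to work with, because the data is available in these measures.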
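As a quick check on that simplification, here is a small Python sketch showing that the two forms give the same answer when A = X - Y, B = Y - Z and C = Z. The function names and the values of X, Y and Z are hypothetical, chosen only to make the arithmetic easy to follow; they are not the factors used in the model.

```python
def xg_from_elements(goals, shots_on_target, shots, x, y, z):
    """X*goals + Y*saved shots on target + Z*shots off target."""
    saved_on_target = shots_on_target - goals
    off_target = shots - shots_on_target
    return x * goals + y * saved_on_target + z * off_target


def xg_from_raw(goals, shots_on_target, shots, a, b, c):
    """A*goals + B*shots on target + C*shots."""
    return a * goals + b * shots_on_target + c * shots


def to_raw_factors(x, y, z):
    """Convert element factors (X, Y, Z) into raw-data factors (A, B, C)."""
    return x - y, y - z, z


# Hypothetical factors with X >= Y >= Z, and a hypothetical match.
x, y, z = 0.5, 0.1, 0.03
a, b, c = to_raw_factors(x, y, z)

via_elements = xg_from_elements(2, 6, 15, x, y, z)
via_raw = xg_from_raw(2, 6, 15, a, b, c)

print(round(via_elements, 2), round(via_raw, 2))  # 1.67 1.67
```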
What are the factors?
To determine usable factors for A,B and C I’ve tried to find values that correlate well with future points – over short, medium and long-term, and also perform well against single measures such as goals or points, and across different leagues in different countries.
However, the first hurdle to overcome is inconsistency in the year-on-year data. For the English Premier League something strange happened in 2013, when average shots on target fell from 14.24 to 8.92 per game. A 37% fall is unlikely to be due solely to a change in the pattern of play (e.g. better defending) and more likely to be caused by data issues. And, sure enough (after a quick search), it appears that the data source changed in 2013 and the fall is due to a revised definition of what counts as a shot on target.
[Table: Average shots on target per game | % difference from previous year]
This inconsistency makes it difficult to compare the performance of the model in different years. So to ensure that the weighting of Goal/Shots on target/Shots remains consistent year on year I’ve standardized the data. This consists of applying a factor so that the average for each season’s data equals the long-term average for the top European Leagues.
[Table: average values per game from seasons starting 2009 to 2014 (apart from EPL, where data is from the 2013 and 2014 seasons only, due to the data inconsistency). Columns: Top League | Goals | Shots on target | Shots]
I’ve modified all data using the averages in the table above. So for example in EPL season 2012/13 where averages per game were 2.80 goals, 14.24 shots on target and 25.14 shots – I’ve applied the following factors: 2.68/2.80, 8.90/14.24 and 25.72/24.14 respectively.
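The standardization step could be sketched roughly as follows. This is a minimal illustration, not the spreadsheet I actually use: the function names, dictionary keys and the example match are assumptions, and it is restricted to goals and shots on target for brevity (shots are handled in exactly the same way).

```python
def standardization_factors(season_avg, long_term_avg):
    """Factor per measure so the season average maps onto the long-term average."""
    return {m: long_term_avg[m] / season_avg[m] for m in season_avg}


def standardize(values, factors):
    """Apply the season's factors to a set of raw figures."""
    return {m: values[m] * factors[m] for m in values}


# EPL 2012/13 season averages and long-term averages quoted in the text.
season_avg = {"goals": 2.80, "shots_on_target": 14.24}
long_term_avg = {"goals": 2.68, "shots_on_target": 8.90}

factors = standardization_factors(season_avg, long_term_avg)

# Hypothetical raw match totals from that season.
match = {"goals": 3, "shots_on_target": 16}
print(standardize(match, factors))

# Sanity check: standardizing the season average recovers the long-term average.
recovered = standardize(season_avg, factors)
```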
Using the standardized data, factors that work well are A = 45.01%, B = 8.44%, C = 2.81%
So, for the averages in the table above:
A*goals + B*Shots on target + C*shots =
45.01%*2.68 + 8.44%*8.90 + 2.81%*25.72 = 2.68
These factors apply to the standardized averages. The equivalent factors for each league’s raw data differ slightly because of the data standardization described above. In Germany, for example, the factors are lower because goals and shots on target tend to be higher: the factors I’m currently using there are 43.09%, 7.96% and 2.79%.
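For anyone who wants to replicate the calculation, here is a minimal Python sketch of the formula using the factors above. The function name `adjusted_goals` and the example match are my own assumptions; the inputs are expected to be standardized figures.

```python
# Factors quoted in the text, for standardized data.
A, B, C = 0.4501, 0.0844, 0.0281


def adjusted_goals(goals, shots_on_target, shots, a=A, b=B, c=C):
    """A*goals + B*shots on target + C*shots."""
    return a * goals + b * shots_on_target + c * shots


# Long-term averages per game from the text: the factors reproduce average goals.
print(round(adjusted_goals(2.68, 8.90, 25.72), 2))  # 2.68

# Hypothetical team performance: 2 goals, 6 shots on target, 15 shots.
print(round(adjusted_goals(2, 6, 15), 2))  # 1.83
```

Passing `a`, `b` and `c` explicitly allows league-specific factors, such as the German ones, to be swapped in.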
Does it work?
I’ve used the factors above to test correlation with future points, and compared the results against other single measures of performance. I’ve tested correlation against points, because that’s what teams are actually playing to win.
I’ve called my derived measure Adjusted goals, rather than Expected goals – because actual goals scored are such a significant component of the measure, i.e. >45% is really too high to reflect the true average likelihood of a particular chance resulting in a goal. However, placing a high weighting on goals does have advantages for a performance rating model – which I’ll discuss later.
The results are as follows
The first graph shows the correlation coefficient between the first half of a season and the second for the top division in each of the big 5 European leagues. So, for example, in the English Premier League I’ve taken the average of each measure in the first 19 games and tested the correlation against points in matches 20-38. For Germany it’s the first 17 matches against 18-34, because there are only 18 teams. The results are averages across 6 seasons from 2009/10 to 2014/15.
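The half-season test can be sketched along these lines. It assumes a plain Pearson correlation coefficient and a simple dictionary layout (team name mapped to a list of per-match values in playing order); both are assumptions for illustration, and the three-team data set is invented purely to show the mechanics.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def split_half_correlation(measure_by_team, points_by_team, split):
    """Correlate each team's average measure over the first `split` matches
    with its points total over the remaining matches."""
    teams = sorted(measure_by_team)
    first_half = [sum(measure_by_team[t][:split]) / split for t in teams]
    second_half = [sum(points_by_team[t][split:]) for t in teams]
    return pearson(first_half, second_half)


# Invented 4-match "season" for three teams (split after match 2).
measure = {"A": [1, 1, 0, 0], "B": [2, 2, 0, 0], "C": [3, 3, 0, 0]}
points = {"A": [0, 0, 0, 0], "B": [0, 0, 3, 0], "C": [0, 0, 3, 3]}

r = split_half_correlation(measure, points, split=2)
print(r)  # 1.0 for this deliberately perfect toy example
```

For a 20-team league the call would use `split=19`, and `split=17` for Germany.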
It’s interesting how the different measures compare across different leagues. The English and Spanish leagues are by far the most predictable, with all measures correlating better than for France. Of the single measures, Shots on Target do well across the board, but Goals come into their own in Italy and France. This is likely to be due to different styles of play, something I’ll try to look at later. Points and shots generally perform poorly.
In all leagues adjusted goals works best, other than Germany where it’s second. I’ve also looked at shorter term performance. For example, how does data collected from the 6 matches preceding the second half of the season perform? (i.e. using 6 games rather than 19 to compare against points in the second half of the season).
As you’d expect, correlations are lower over the shorter term, but comparatively shots start to perform much better over this shorter period. This is to be expected, because shots are much more frequent than goals, so true underlying performance is revealed sooner. Pleasingly, however, adjusted goals also correlate better than the single measures using short-term data (other than for Germany, where shots on target just win out).
The averages for these measures, across these leagues are as follows:
Adjusted goals is better than other single predictive measures – both over the long and short term.
I also looked at how the lower English leagues performed (i.e. Championship, League 1, League 2 and Conference).
Note how low all the correlation coefficients are for the lower English leagues. 2012 was a particularly strange year in the Championship, where points correlation between the first and second half of the season was -3% (and other measures didn’t perform much better either). Factors other than underlying goal and shot performance have a much stronger influence in the lower leagues. I suspect one reason is that teams have limited resources, so are more influenced by shorter term factors such as injuries to key players (which may also help in explaining why the English Premier League displays the highest correlation). This is something I want to explore further.
However, even allowing for the smaller correlations in the lower leagues, comparatively Adjusted goals still performs best. So it still works as a measure for assessing a team’s performance (but in the context of understanding other factors too).
I’ve also compared longer term correlation from one season to next (for teams that remain in the division). The results are as follows.
The average season to season correlations are spectacular for the top English and Spanish leagues, perhaps reflecting how established the hierarchy of these leagues has become. Long-term correlation is also highest for all the top leagues, although not so for 3 of the 4 lower English leagues.
How can adjusted goals be used?
The adjusted goal model performs in the way I wanted. It appears a better predictor than any of the single measures over the short and long term, and works well across different leagues (although it tends to work better in higher quality leagues). I doubt it works as well as the best Expected Goals models, but I’m unable to compare against these.
It’s a simple estimate of expected goals, albeit with a relatively high weighting on actual goals scored. Even so, I think that adjusted goals do well at evaluating the outcome of a particular match.
The advantage of taking account of actual goals is that goals change matches. For example, if a team scores a goal to take the lead, they don’t need to score any more, as long as they don’t concede. So, once a team takes the lead, it makes sense to divert more resources to stopping their opponents creating chances. Some teams do this, and if they’re successful it reduces the number of chances for the rest of the match. Genuine expected goals models (as I understand them) take no account of whether a chance results in an actual goal, which means that teams that deliberately (and successfully) play defensively after taking the lead can be undervalued by true expected goals models.
I’m using adjusted goals to create long-term and short term adjusted goals ratings for each team’s attack, defence and difference (i.e. averages over different time spans). This enables a relative assessment of each team.
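A rating of this kind could be sketched as a simple average over the most recent matches. The window lengths (6 and 38 matches), the function names and the example figures below are all assumptions for illustration; the text does not fix the time spans.

```python
def recent_average(values, window):
    """Average of the most recent `window` values (or all of them if fewer)."""
    recent = values[-window:]
    if not recent:
        raise ValueError("no matches to average")
    return sum(recent) / len(recent)


def team_ratings(adj_for, adj_against, short=6, long=38):
    """Short- and long-term attack, defence and difference ratings.

    adj_for / adj_against: per-match adjusted goals scored and conceded,
    oldest match first. The default windows are hypothetical.
    """
    ratings = {}
    for label, window in (("short", short), ("long", long)):
        attack = recent_average(adj_for, window)
        defence = recent_average(adj_against, window)
        ratings[label] = {
            "attack": attack,
            "defence": defence,
            "difference": attack - defence,
        }
    return ratings


# Invented figures for one team's last four matches.
example = team_ratings([1.0, 2.0, 3.0, 2.0], [1.0, 1.0, 2.0, 2.0], short=2, long=4)
print(example["short"])  # {'attack': 2.5, 'defence': 2.0, 'difference': 0.5}
print(example["long"])   # {'attack': 2.0, 'defence': 1.5, 'difference': 0.5}
```

Comparing the short and long windows is what flags a possible change in attacking or defensive performance.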
Adjusted goal ratings indicate potential changes to long-term attacking, defensive or overall performance. They can’t determine the reasons for these changes (or whether they’re just random noise). But they can indicate where further analysis is needed – for example, have a new manager, new tactics, key injuries or new players affected performance? I can also use adjusted goal ratings for future prediction – either individual match outcome or longer-term table positions.
Examples of graphical analysis are:
EPL at 27/11/2015
Strong improvements in Spurs’ defence (at 27/11/2015)
I’ll write some more on the practical application of the adjusted goals model.
Sources I’ve referenced when developing the model
https://mcofa.wordpress.com/ Explains expected goals and tweets great Expected Goals maps after matches
https://jameswgrayson.wordpress.com/ Fantastic analysis, constructing a performance rating from simple data, and comparison against expected goals
http://www.football-data.co.uk data source for the model
There’s loads of other superb stuff on expected goals online.