What a diff'rence xG makes
One argument in favour of using xG in football analytics and punditry is that it gives a better idea of which teams are good and which teams are not. Supposedly, xG allows us to cut through some of the randomness of goals and get closer to seeing teams’ true strengths. I find this view pretty intuitive; however, intuition alone is not enough to make the argument.
In this post, I present an analysis evaluating this claim. By comparing a team strength model using goals with one that uses xG, I have attempted to estimate how much better xG makes our understanding of teams’ abilities. The code required to run the analysis is embedded in the post.
The team strength models will be the “Vanilla” and “xG” models described in the previous post.
Preparing the data
First, we have to fetch the data. I’ve put the data required for this analysis on GitHub, and we can read it in directly from the shortlink.
Nesting the data allows us to keep the data tidy, so that each row contains a single Premier League game, including a dataframe of all the shots in that game (contained in the shots column).
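The preparation step might look something like the following. The actual shortlink isn’t reproduced here, so the URL is a placeholder, and the nested column names are inferred from the output below:

```r
library(tidyverse)

# Read the shot-level data; "<shortlink-to-data>" is a placeholder for the
# GitHub shortlink used in the post
shots <- read_csv("<shortlink-to-data>")

# One row per game, with that game's shots nested into a `shots` list-column
games <-
  shots %>%
  nest(shots = c(-match_id, -date, -home, -away,
                 -hgoals, -agoals, -league, -season))

head(games)
```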

## # A tibble: 6 x 9
## match_id date home away hgoals agoals league season
## <int> <dttm> <chr> <chr> <int> <int> <chr> <int>
## 1 4749 2014-08-16 12:45:00 Manches… Swans… 1 2 EPL 2014
## 2 4750 2014-08-16 15:00:00 Leicest… Evert… 2 2 EPL 2014
## 3 4751 2014-08-16 15:00:00 Queens … Hull 0 1 EPL 2014
## 4 4752 2014-08-16 15:00:00 Stoke Aston… 0 1 EPL 2014
## 5 4753 2014-08-16 15:00:00 West Br… Sunde… 2 2 EPL 2014
## 6 4754 2014-08-16 15:00:00 West Ham Totte… 0 1 EPL 2014
## # ... with 1 more variable: shots <list>
To get the information from the xG data into the Dixon-Coles model, I’m using the approach established in my previous post. This method uses individual shot xG values to estimate the probability of different scorelines occurring. Each scoreline can then be fed into the model as if it were an individual game, weighted by its probability of occurring.
Previously, I simulated the games via Monte Carlo. However, Marek Kwiatkowski helpfully pointed out that, using this method, each team’s goals will follow a Poisson-Binomial distribution. This means that we can calculate the scoreline probabilities analytically, which is also faster than Monte Carlo simulation.
Handily, functions for working with the Poisson-Binomial distribution are available in the poisbinom R package, released to CRAN last year.
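As a sketch of the idea, with made-up shot xG values and assuming the two teams’ goal counts are independent, the scoreline probabilities for a single game can be computed with poisbinom’s dpoisbinom():

```r
library(poisbinom)

# Made-up shot xG values for a single game
home_xg <- c(0.7, 0.3, 0.1, 0.1)
away_xg <- c(0.2, 0.1)

# Each team's goal count follows a Poisson-Binomial distribution over its
# shots; assuming the two counts are independent, the scoreline
# probabilities are the outer product of the two marginal distributions
max_goals <- 5
home_probs <- dpoisbinom(0:max_goals, pp = home_xg)
away_probs <- dpoisbinom(0:max_goals, pp = away_xg)

scorelines <- outer(home_probs, away_probs)
dimnames(scorelines) <- list(hgoals = 0:max_goals, agoals = 0:max_goals)

round(scorelines, 3)
```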

## # A tibble: 6 x 6
## match_id home away hgoals agoals prob
## <int> <chr> <chr> <int> <dbl> <dbl>
## 1 4749 Manchester United Swansea 0 0 0.165
## 2 4749 Manchester United Swansea 1 0 0.335
## 3 4749 Manchester United Swansea 2 0 0.184
## 4 4749 Manchester United Swansea 3 0 0.0516
## 5 4749 Manchester United Swansea 4 0 0.00894
## 6 4749 Manchester United Swansea 5 0 0.00104
Comparing the models
With the data required to fit both the Goals and xG models, we can get to work comparing them.
We can compare the goals and xG models with a backtest. This means testing how well each model would have predicted games in the past. In other words, using only information available at the time, how well does the model perform over our historical data:
 For each game…
 Find preceding Premier League games within the last year
 Then, for each model…
 Fit the model on the preceding games
 Make a prediction for that game
 Calculate the prediction’s accuracy against the actual outcome
 Finally, aggregate the total accuracy for each model over all games
No teams play twice on the same day, so we can actually fit the models for each day, rather than for each game. This has the advantage of being quicker to run but is otherwise equivalent to going game-by-game.
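The procedure above can be sketched as a loop over match dates. Here `fit_model` and `predict_outcomes` are stand-ins for whichever model is being evaluated, not functions from any package:

```r
library(tidyverse)

# Schematic backtest: for each date on which games were played, fit each
# model on the previous year's games and predict that date's fixtures.
# `fit_model` and `predict_outcomes` are placeholders for the model being
# evaluated
backtest <- function(games, fit_model, predict_outcomes, window = 365) {
  game_dates <- unique(as.Date(games$date))

  map_dfr(game_dates, function(d) {
    training <- filter(games, as.Date(date) < d, as.Date(date) >= d - window)
    todays   <- filter(games, as.Date(date) == d)

    model <- fit_model(training)
    predict_outcomes(model, todays)
  })
}
```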
Find previous games
First, let’s find the previous games within a year of each game. We’ll use these games to fit a model as if we were at that point in time.
I’ve chosen a window of 1 year in the past for the models to fit on. This is a somewhat arbitrary choice; it seems likely that the historical window can be tweaked to improve the performance of the models. In other words, the models may perform better when fitted on games within 270 days of the last fixture, rather than 365.
However, that is a slightly different analysis. I also suspect that the optimal time window for the xG-based model and the Goals model will be different.
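In code, collecting these windows might look like the following; the column names are assumed from the data shown earlier:

```r
library(tidyverse)

# For each date on which games were played, store the IDs of matches
# completed within the previous 365 days
previous_games <-
  games %>%
  transmute(game_date = as.Date(date)) %>%
  distinct() %>%
  mutate(matches = map(game_date, function(d) {
    games %>%
      filter(as.Date(date) < d, as.Date(date) >= d - 365) %>%
      pull(match_id)
  }))
```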

## # A tibble: 6 x 2
## game_date matches
## <date> <list>
## 1 2015-08-22 <int [390]>
## 2 2015-08-23 <int [396]>
## 3 2015-08-24 <int [393]>
## 4 2015-08-29 <int [390]>
## 5 2015-08-30 <int [398]>
## 6 2015-09-12 <int [390]>
Fitting each model
I’m using the regista R package to fit the models. This is not available on CRAN but is freely available on GitHub, and can be installed in R like so:
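The installation step presumably looks something like this (regista is hosted under the Torvaney account on GitHub, as far as I’m aware; adjust the path if the repository has moved):

```r
# regista is not on CRAN, so install from GitHub via devtools
# install.packages("devtools")
devtools::install_github("torvaney/regista")
```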

The models I’ll be evaluating are as follows:
Dixon-Coles
 Vanilla Dixon-Coles model, using only goals to estimate team strength
Dixon-Coles xG
 Dixon-Coles model using xG values (via simulation)
Dixon-Coles xG (rho)
 Dixon-Coles model using xG values, with the rho dependence parameter taken from the vanilla Dixon-Coles model
The reasoning and methodology behind the models is explained in a bit more detail here.
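A sketch of the fitting step, using regista’s dixoncoles() function. The argument names follow the package’s documented interface as I recall it, and `training_games` / `scoreline_probs` are stand-in dataframes, so treat the details here as assumptions:

```r
library(regista)

# Vanilla model: fit directly on the observed goals
fit_vanilla <- dixoncoles(hgoals, agoals, home, away, data = training_games)

# xG model: each candidate scoreline enters as its own row, weighted by
# its estimated probability of occurring
fit_xg <- dixoncoles(hgoals, agoals, home, away,
                     data = scoreline_probs,
                     weights = prob)
```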

## # A tibble: 6 x 4
## game_date matches model fitted
## <date> <list> <chr> <list>
## 1 2015-08-22 <int [390]> Dixon-Coles <S3: dixoncoles>
## 2 2015-08-23 <int [396]> Dixon-Coles <S3: dixoncoles>
## 3 2015-08-24 <int [393]> Dixon-Coles <S3: dixoncoles>
## 4 2015-08-29 <int [390]> Dixon-Coles <S3: dixoncoles>
## 5 2015-08-30 <int [398]> Dixon-Coles <S3: dixoncoles>
## 6 2015-09-12 <int [390]> Dixon-Coles <S3: dixoncoles>
Making predictions
Next, we make predictions for each date with each model. While there are different types of predictions we could make about a football match, I’m sticking to match outcome (Home/Draw/Away).
This is by no means the best way to evaluate a soccer model; however it has a couple of advantages in this case. One is that it’s relatively easy to understand. Another is that public H/D/A predictions and closing odds are available online, which makes the model predictions easier to benchmark.
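Given a matrix of scoreline probabilities (home goals in rows, away goals in columns), collapsing it into H/D/A probabilities is straightforward. This helper is my own illustration, not part of regista:

```r
# Collapse a scoreline probability matrix into Home/Draw/Away probabilities.
# Rows index home goals (0, 1, 2, ...), columns index away goals, so home
# wins live in the lower triangle and draws on the diagonal
outcome_probabilities <- function(scorelines) {
  c(home = sum(scorelines[lower.tri(scorelines)]),
    draw = sum(diag(scorelines)),
    away = sum(scorelines[upper.tri(scorelines)]))
}
```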

## # A tibble: 6 x 5
## game_date matches model fitted predictions
## <date> <list> <chr> <list> <list>
## 1 2015-08-22 <int [390]> Dixon-Coles <S3: dixoncoles> <tibble [18 × 3]>
## 2 2015-08-23 <int [396]> Dixon-Coles <S3: dixoncoles> <tibble [9 × 3]>
## 3 2015-08-24 <int [393]> Dixon-Coles <S3: dixoncoles> <tibble [3 × 3]>
## 4 2015-08-29 <int [390]> Dixon-Coles <S3: dixoncoles> <tibble [24 × 3]>
## 5 2015-08-30 <int [398]> Dixon-Coles <S3: dixoncoles> <tibble [6 × 3]>
## 6 2015-09-12 <int [390]> Dixon-Coles <S3: dixoncoles> <tibble [21 × 3]>
Evaluating the models
We can evaluate the models’ predictions using the log loss metric.
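For H/D/A predictions, log loss is the negative mean log probability assigned to the outcome that actually happened. A minimal base-R version, with made-up example probabilities:

```r
# predicted_probs: one row per game, columns = (home, draw, away)
# outcome_index: which column actually occurred (1 = home, 2 = draw, 3 = away)
log_loss <- function(predicted_probs, outcome_index) {
  picked <- predicted_probs[cbind(seq_along(outcome_index), outcome_index)]
  -mean(log(picked))
}

# Two example games: a home win predicted at 0.5, then a draw predicted at 0.3
probs <- matrix(c(0.5, 0.25, 0.25,
                  0.4, 0.30, 0.30),
                nrow = 2, byrow = TRUE)
log_loss(probs, c(1, 2))  # ≈ 0.949
```

Lower is better: a model that assigned probability 1 to every actual result would score a log loss of zero.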

## # A tibble: 3 x 2
## model log_loss
## <chr> <dbl>
## 1 Dixon-Coles 0.594
## 2 Dixon-Coles xG 0.573
## 3 Dixon-Coles xG (rho) 0.574
Of course, these numbers don’t mean much on their own. What does a 0.02 difference in log loss actually mean?
To put these numbers into context, I’ve calculated the log loss for a few benchmark models. I haven’t shown the benchmarks inline, but the code to calculate them is available here.
Benchmark
 Assume all teams are the same strength and predict outcomes in line with historical frequencies (approximately H = 45%, D = 25%, A = 30%)
Closing odds
 Implied probabilities from Pinnacle closing odds (from football-data.co.uk). You’re probably not going to get too close to these with public models/data.
Market ratings
 Team strength estimates derived from historical closing odds. An explanation of this method and links to code can be found here.
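For the closing-odds benchmark, implied probabilities come from inverting the decimal odds and normalising away the bookmaker’s margin. This uses the simplest ("basic") normalisation, and the example odds are made up:

```r
# Convert decimal odds to implied probabilities, removing the overround
# by simple normalisation
implied_probabilities <- function(odds) {
  raw <- 1 / odds
  raw / sum(raw)
}

implied_probabilities(c(home = 2.10, draw = 3.40, away = 3.80))
```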
Comparing the predictions

Comparing the predictions, we see that the increase in predictive accuracy we get from using xG over goals is similar to the difference between using goals scored/conceded and using no team strength information at all.
However, this increase in predictive accuracy applies to computers, not humans. If we go back to the claim this post set out to examine, the implication is clearly that xG provides real value to humans trying to understand the game.
I think this is more of an open question; real people watching a game of football can pay attention to more than just the score. However, most fans, pundits, and analysts can’t watch every game in the season. In those cases, xG provides real and significant value over just looking at results.
While people can’t watch every game, and some still insist that “the table doesn’t lie”, I think there’s room for xG (or similar metrics) to provide helpful insight. How much extra value it provides may depend on how well you can incorporate information beyond results.