What a diff'rence xG makes

One argument in favour of using xG in football analytics and punditry is that it gives a better idea of which teams are good and which teams are not. Supposedly, xG allows us to cut through some of the randomness of goals and get closer to seeing teams’ true strengths. I find this view pretty intuitive; however, intuition alone is not enough to make the argument.

In this post, I present an analysis evaluating this claim. By comparing a team strength model that uses goals with one that uses xG, I attempt to estimate how much xG improves our understanding of teams’ abilities. The code required to run the analysis is embedded in the post.

The team strength models will be the “Vanilla” and “xG” models described in the previous post.

Preparing the data

First, we have to fetch the data. I’ve put the data required for this analysis up on GitHub, and we can read it in directly from the shortlink.

Nesting the data keeps it tidy: each row contains a single Premier League game, along with a dataframe of all the shots in that game (held in the shots column).

library(tidyverse)

games <-
  read_csv("https://git.io/fNmRy") %>%
  nest(side, xg, .key = "shots")

head(games)
## # A tibble: 6 x 9
##   match_id date                home     away   hgoals agoals league season
##      <int> <dttm>              <chr>    <chr>   <int>  <int> <chr>   <int>
## 1     4749 2014-08-16 12:45:00 Manches… Swans…      1      2 EPL      2014
## 2     4750 2014-08-16 15:00:00 Leicest… Evert…      2      2 EPL      2014
## 3     4751 2014-08-16 15:00:00 Queens … Hull        0      1 EPL      2014
## 4     4752 2014-08-16 15:00:00 Stoke    Aston…      0      1 EPL      2014
## 5     4753 2014-08-16 15:00:00 West Br… Sunde…      2      2 EPL      2014
## 6     4754 2014-08-16 15:00:00 West Ham Totte…      0      1 EPL      2014
## # ... with 1 more variable: shots <list>

To get the information from the xG data into the Dixon-Coles model, I’m using the approach established in my previous post. This method uses individual shot xG values to estimate the probability of different scorelines occurring. Each scoreline can then be fed into the model as if it were an individual game, with each scoreline weighted by its probability of occurring.

Previously, I simulated the games via Monte Carlo. However, Marek Kwiatkowski helpfully pointed out that, under this method, each team’s goals follow a Poisson-Binomial distribution. This means that we can calculate the scoreline probabilities analytically, which is also faster than Monte Carlo simulation.

Handily, functions for working with the Poisson-Binomial distribution are available in the poisbinom R package, released to CRAN last year.
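
To get a feel for what this gives us, here is a quick toy illustration (the xG values are made up):

# Three shots with made-up xG values
xgs <- c(0.1, 0.3, 0.5)

# P(0 goals), P(1 goal), ..., P(3 goals) for the team taking those shots
poisbinom::dpoisbinom(0:length(xgs), xgs)
## [1] 0.315 0.485 0.185 0.015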

add_if_missing <- function(data, col, fill = 0.0) {
  # Add column if not found in a dataframe
  # We need this in cases where a team has 0 shots (!)
  if (!(col %in% colnames(data))) {
    data[, col] <- fill
  }
  data
}

team_goal_probs <- function(xgs, side) {
  # Find P(Goals=G) from a set of xGs by poisson-binomial dist.
  # Use tidyeval to prefix column names with the team's side
  # prefixes: "h"ome / "a"way
  n_shots <- length(xgs)
  tibble(!!str_c(side, "goals") := 0:n_shots,
         !!str_c(side, "prob") := poisbinom::dpoisbinom(0:n_shots, xgs))
}

simulate_game <- function(shot_xgs) {
  shot_xgs %>%
    split(.$side) %>%
    imap(~ team_goal_probs(.x$xg, .y)) %>%
    reduce(crossing) %>%
    # If there are no shots, give that team a 1.0 chance
    # of scoring 0 goals
    add_if_missing("hgoals", 0) %>%
    add_if_missing("hprob", 1) %>%
    add_if_missing("agoals", 0) %>%
    add_if_missing("aprob", 1) %>%
    mutate(prob = hprob * aprob) %>%
    select(hgoals, agoals, prob)
}

simulated_games <-
  games %>%
  mutate(simulated_probabilities = map(shots, simulate_game)) %>%
  select(match_id, home, away, simulated_probabilities) %>%
  unnest() %>%
  filter(prob > 0.0001)  # Keep the number of rows vaguely reasonable

head(simulated_games)
## # A tibble: 6 x 6
##   match_id home              away    hgoals agoals    prob
##      <int> <chr>             <chr>    <int>  <dbl>   <dbl>
## 1     4749 Manchester United Swansea      0      0 0.165
## 2     4749 Manchester United Swansea      1      0 0.335
## 3     4749 Manchester United Swansea      2      0 0.184
## 4     4749 Manchester United Swansea      3      0 0.0516
## 5     4749 Manchester United Swansea      4      0 0.00894
## 6     4749 Manchester United Swansea      5      0 0.00104

Comparing the models

With the data required to fit both the Goals and xG models in place, we can get to work comparing them.

We can compare both the goals and xG models with a backtest. This means testing how well each model would have predicted games in the past. In other words, using only information available at that time, how well does the model perform over our historical data:

  • For each game…
    • Find preceding Premier League games within the last year
    • Then, for each model…
      • Fit the model on the preceding games
      • Make a prediction for that game
      • Calculate the prediction’s accuracy against the actual outcome
  • Finally, aggregate the total accuracy for each model over all games

No team plays twice on the same day, so we can fit the models for each day rather than for each game. This has the advantage of being quicker to run but is otherwise equivalent to going game-by-game.

Find previous games

First, let’s find the previous games within a year of each game. We’ll use these games to fit a model as if we were at that point in time.

I’ve chosen a one-year window of past games for the models to fit on. This is a somewhat arbitrary choice; it seems likely that the historical window could be tweaked to improve the models’ performance. For instance, the models might perform better when fitted on games within 270 days of the fixture being predicted, rather than 365.

However, that is a slightly different analysis. I also suspect that the optimal time window for the xG-based model and the Goals model will be different.
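
The code below keeps this choice in a single window_length variable, so trying a shorter window would only mean changing one line, for example:

window_length <- lubridate::days(270)  # hypothetical alternative; not evaluated in this post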

find_preceding_games <- function(game_date,
                                 all_games = games,
                                 period = lubridate::years(1)) {
  all_games %>%
    filter(date < game_date,
           date > (game_date - period)) %>%
    select(match_id) %>%
    mutate(game_date = game_date)
}

window_length <- lubridate::years(1)

match_lookup <-
  games$date %>%
  lubridate::as_date() %>%
  unique() %>%
  map_dfr(find_preceding_games, period = window_length) %>%
  group_by(game_date) %>%
  summarise(matches = list(match_id)) %>%
  ungroup() %>%
  filter(game_date > (min(game_date) + window_length))

head(match_lookup)
## # A tibble: 6 x 2
##   game_date  matches
##   <date>     <list>
## 1 2015-08-22 <int [390]>
## 2 2015-08-23 <int [396]>
## 3 2015-08-24 <int [393]>
## 4 2015-08-29 <int [390]>
## 5 2015-08-30 <int [398]>
## 6 2015-09-12 <int [390]>

Fitting each model

I’m using the regista R package to fit the models. This is not available on CRAN but is freely available on GitHub, and can be installed in R like so:

install.packages("devtools")
devtools::install_github("torvaney/regista")

The models I’ll be evaluating are as follows:

  • Dixon-Coles
    • Vanilla Dixon-Coles model using only goals to estimate team strength
  • Dixon-Coles xG
    • Dixon-Coles model using xG values (via simulation)
  • Dixon-Coles xG (rho)
    • Dixon-Coles model using xG values, with the rho dependence parameter taken from the vanilla Dixon-Coles model.

The reasoning and methodology behind the models are explained in a bit more detail here.

library(regista)

fit_model <- function(match_ids, weights, all_games = games) {
  all_games %>%
    factor_teams(c("home", "away")) %>%
    filter(match_id %in% match_ids) %>%
    dixoncoles(
      hgoal = hgoals,
      agoal = agoals,
      hteam = home,
      ateam = away,
      weights = weights,
      data = .
    )
}

transplant_param <- function(model1, model2) {
  model2$par["rho"] <- model1$par["rho"]
  model2
}

models <-
  match_lookup %>%
  mutate(
    # Use non-syntactic names in anticipation of `gather`
    `Dixon-Coles` = map(matches, fit_model, weights = 1),
    `Dixon-Coles xG` = map(matches, fit_model, weights = quo(prob), all_games = simulated_games),
    `Dixon-Coles xG (rho)` = map2(`Dixon-Coles`, `Dixon-Coles xG`, transplant_param)
  ) %>%
  gather(model, fitted, -game_date, -matches)

head(models)
## # A tibble: 6 x 4
##   game_date  matches     model       fitted
##   <date>     <list>      <chr>       <list>
## 1 2015-08-22 <int [390]> Dixon-Coles <S3: dixoncoles>
## 2 2015-08-23 <int [396]> Dixon-Coles <S3: dixoncoles>
## 3 2015-08-24 <int [393]> Dixon-Coles <S3: dixoncoles>
## 4 2015-08-29 <int [390]> Dixon-Coles <S3: dixoncoles>
## 5 2015-08-30 <int [398]> Dixon-Coles <S3: dixoncoles>
## 6 2015-09-12 <int [390]> Dixon-Coles <S3: dixoncoles>

Making predictions

Now we make predictions for each date with each model. While there are different types of prediction we could make about a football match, I’m sticking to match outcome (Home/Draw/Away).

This is by no means the best way to evaluate a football model; however, it has a couple of advantages in this case. One is that it’s relatively easy to understand. Another is that public H/D/A predictions and closing odds are available online, which makes the model predictions easier to benchmark.
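
Conceptually, turning scoreline probabilities into H/D/A probabilities just means summing probability over the relevant scorelines. Here is a rough sketch of the idea, applied to the raw xG-based scoreline probabilities in simulated_games rather than to the fitted models’ predictions (and not necessarily how regista’s scorelines_to_outcomes is implemented):

simulated_games %>%
  mutate(outcome = case_when(
    hgoals > agoals  ~ "home_win",
    agoals > hgoals  ~ "away_win",
    hgoals == agoals ~ "draw"
  )) %>%
  group_by(match_id, home, away, outcome) %>%
  summarise(prob = sum(prob)) %>%
  ungroup()
# Note: these probabilities won't sum to exactly one per match,
# because of the earlier filter on very small scoreline probabilities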

model_predictions <-
  models %>%
  mutate(predictions = map2(fitted, game_date, function(f, d) {
    newdata <-
      games %>%
      factor_teams(c("home", "away")) %>%
      filter(lubridate::as_date(date) == d) %>%
      mutate(prob = 1)

    newdata %>%
      mutate(pred = map(predict(f, newdata, type = "scorelines"), scorelines_to_outcomes)) %>%
      select(match_id, pred) %>%
      unnest()
  }))

head(model_predictions)
## # A tibble: 6 x 5
##   game_date  matches     model       fitted           predictions
##   <date>     <list>      <chr>       <list>           <list>
## 1 2015-08-22 <int [390]> Dixon-Coles <S3: dixoncoles> <tibble [18 × 3]>
## 2 2015-08-23 <int [396]> Dixon-Coles <S3: dixoncoles> <tibble [9 × 3]>
## 3 2015-08-24 <int [393]> Dixon-Coles <S3: dixoncoles> <tibble [3 × 3]>
## 4 2015-08-29 <int [390]> Dixon-Coles <S3: dixoncoles> <tibble [24 × 3]>
## 5 2015-08-30 <int [398]> Dixon-Coles <S3: dixoncoles> <tibble [6 × 3]>
## 6 2015-09-12 <int [390]> Dixon-Coles <S3: dixoncoles> <tibble [21 × 3]>

Evaluating the models

We can evaluate the models’ predictions using the log loss metric, which rewards confident, correct predictions and penalises confident, incorrect ones (lower is better).
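
For a rough sense of scale (made-up numbers): a model assigning a 60% chance to an outcome loses about 0.51 if that outcome happens, and about 0.92 if it doesn’t.

# Per-outcome loss when the predicted probability is 0.6
-log(0.6)      # ~0.51 if the outcome occurred
-log(1 - 0.6)  # ~0.92 if it did not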

dc_log_loss <-
  model_predictions %>%
  select(model, predictions) %>%
  unnest() %>%
  left_join(games, by = "match_id") %>%
  mutate(
    obs_outcome = case_when(
      hgoals > agoals  ~ "home_win",
      agoals > hgoals  ~ "away_win",
      hgoals == agoals ~ "draw"
    ),
    log_loss = ifelse(outcome == obs_outcome, -log(prob), -log(1 - prob))
  ) %>%
  group_by(model) %>%
  summarise(log_loss = mean(log_loss))

head(dc_log_loss)
## # A tibble: 3 x 2
##   model                log_loss
##   <chr>                   <dbl>
## 1 Dixon-Coles             0.594
## 2 Dixon-Coles xG          0.573
## 3 Dixon-Coles xG (rho)    0.574

Of course, these numbers don’t mean much on their own. What does a 0.02 difference in log loss actually mean?

To put these numbers into context, I’ve calculated the log loss for a few benchmark models. I haven’t shown the benchmark calculations inline, but the code to calculate them is available here.

  • Benchmark
    • Assume all teams are the same strength and predict outcomes in line with historical frequencies (approximately H = 45%, D = 25%, A = 30%). A rough sketch of this benchmark follows the list.
  • Closing odds
    • Implied probabilities from Pinnacle closing odds (from football-data.co.uk). You’re probably not going to get too close to these with public models/data.
  • Market ratings
    • Team strength estimates derived from historical closing odds. An explanation of this method and links to code can be found here.
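
For concreteness, here is a minimal sketch of the flat-probability benchmark (a rough inline sketch rather than the linked code; flat_probs is just a helper name), scored with the same per-outcome log loss used for the Dixon-Coles models above. It produces a one-row tibble in the same shape as dc_log_loss, ready for the bind_rows below:

flat_probs <- tibble(
  outcome = c("home_win", "draw", "away_win"),
  prob    = c(0.45, 0.25, 0.30)
)

benchmark <-
  games %>%
  transmute(
    match_id,
    obs_outcome = case_when(
      hgoals > agoals  ~ "home_win",
      agoals > hgoals  ~ "away_win",
      hgoals == agoals ~ "draw"
    )
  ) %>%
  crossing(flat_probs) %>%
  mutate(log_loss = ifelse(outcome == obs_outcome, -log(prob), -log(1 - prob))) %>%
  summarise(log_loss = mean(log_loss)) %>%
  mutate(model = "Benchmark")
# For a like-for-like comparison, restrict this to the same matches as the backtest above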

Comparing the predictions

bind_rows(
  dc_log_loss,
  market,
  benchmark,
  marketratings
) %>%
  ggplot(aes(x = reorder(model, log_loss), y = log_loss)) +
  geom_point(size = 3) +
  coord_flip() +
  labs(title = "Average log loss",
       subtitle = "Premier League 14/15 to 17/18",
       x = NULL,
       y = NULL) +
  theme_minimal()

Comparing the predictions, we see that the increase in predictive accuracy gained by using xG over goals is similar to the difference between using goals scored/conceded and using no team strength information at all.

However, this increase in predictive accuracy applies to computers, not humans. And if we go back to the claim that motivated this post, the implication is that xG provides real value to humans trying to understand the game.

I think this is more of an open question; real people watching a game of football can pay attention to more than just the score. However, most fans, pundits, and analysts can’t watch every game in the season. In those cases, xG provides real and significant value over just looking at results.

As long as people can’t watch every game, and some still insist that “the table doesn’t lie”, I think there’s room for xG (or similar metrics) to provide helpful insight. How much extra value it provides may depend on how well you can incorporate information beyond results.