Modelling the World Cup with regista

Recently, I started working on an R package for soccer analytics amd modelling, regista, and the upcoming World Cup seemed like aa good an opportunity as any to demo what regista does and how it can be used.

So, in this post, I'm going to go through using regista to estimate the strengths of international teams.

About regista

At the moment, functionality is limited to estimating team strengths. However, I'm working to expand and extend the package. If you have any ideas on things you'd like to see, file an issue on github or drop me a message on twitter.

regista is not currently available on CRAN but can be downloaded from github like so:

if (!require(devtools)) {
install.packages("devtools")
}
devtools::install_github("torvaney/regista")

The data

Mart Jürisoo has handily compiled some international soccer data over on Kaggle. It's free to download (you need a kaggle account but that's free as well) and goes back all the way to 1872.

library(tidyverse)

games <-
read_csv(here::here("data/kaggle_games.csv")) %>%
mutate(hfa = !neutral) # Whether to apply Home Field Advantage on a given game

# Preview the data (5 most recent games)
games %>%
select(date, home_team, away_team, home_score, away_score, hfa) %>%
top_n(5, date)

## # A tibble: 15 x 6
##    date       home_team   away_team   home_score away_score hfa
##    <date>     <chr>       <chr>            <int>      <int> <lgl>
##  1 2018-06-02 El Salvador Honduras             1          0 FALSE
##  2 2018-06-02 Thailand    China                0          2 TRUE
##  3 2018-06-02 England     Nigeria              2          1 TRUE
##  4 2018-06-02 Montenegro  Slovenia             0          2 TRUE
##  5 2018-06-02 Mexico      Scotland             1          0 TRUE
##  6 2018-06-02 Sweden      Denmark              0          0 TRUE
##  7 2018-06-02 Ireland     USA                  2          1 TRUE
##  8 2018-06-02 Belgium     Portugal             0          0 TRUE
##  9 2018-06-02 Austria     Germany              2          1 TRUE
## 10 2018-06-02 Niger       Uganda               2          1 TRUE
## 11 2018-06-02 Iceland     Norway               2          3 TRUE
## 12 2018-06-02 Zambia      Namibia              0          0 FALSE
## 13 2018-06-02 Lesotho     Swaziland            1          0 FALSE
## 14 2018-06-02 Latvia      Estonia              1          0 TRUE
## 15 2018-06-02 Kenya       New Zealand          2          1 FALSE

Of course, we don't want to use all the data from 1872. Instead we'll use matches from the past two years.

Two years may seem like a long time, but unfortunately there just aren't that many games in international football. Using two years worth of games allows us to build a passable number of games for most major international teams.

library(lubridate)
library(regista)

recent_games <-
games %>%
filter(date > lubridate::as_date("2016-06-01"))

We don't want to estimate strengths for teams with a really small number of games played, either. We only want teams with enough games played to get a decent estimate of their strength. For this reason, I've chosen to exclude teams with fewer than 20 games played.

teams <-
bind_rows(recent_games %>% select(team = home_team),
recent_games %>% select(team = away_team)) %>%
count(team) %>%
filter(n >= 20) %>%
.$team

filtered_games <-
recent_games %>%
filter(home_team %in% teams,
away_team %in% teams) %>%
# We can use a regista function to convert teamnames to factors
# that share the same levels:
factor_teams(c("home_team", "away_team"))

str_glue("Number of teams: {length(teams)}
Number of games: {nrow(filtered_games)}")

## Number of teams: 80
## Number of games: 615

The model

Then, we have to fit the model. At the moment, regista implements the Dixon-Coles model. The original paper explaining the model is available online.

I recommend reading the paper to fully understand what's going on under the hood. However, there are many online resources outlining the model available via search if the explanation provided doesn't work for you.

The most important thing to understand about the model for now is that each team's attack and defence are estimated separately (along with home advantage). This allows a variety of teams to be modelled more accurately.

For instance, Atlético Madrid and Tottenham Hotspur might be similarly good teams, but Atlético are more defensive (score fewer goals and concede fewer goals). The Dixon-Coles model allows these differences to be accounted for.

With that in mind, let's fit the model for the international teams over the past 2 years.

We're going to use the dixoncoles_ext function to fit the model, which allows more flexible model specification with formulae. This is necessary because of the variable home advantage in this dataset. However, a simpler version (dixoncoles) is available too.

# Use regista to fit the model
res <- dixoncoles_ext(
home_score ~ off(home_team) + def(away_team) + hfa + 0,
away_score ~ off(away_team) + def(home_team) + 0,
weights = 1,
data = filtered_games
)

# Parse the team parameter estimates with broom
estimates <-
broom::tidy(res) %>%
filter(parameter != "rho",
parameter != "hfa") %>%
mutate(value = exp(value))

Team strength

You can think of the resulting parameter estimates as the expected number of goals scored or conceded versus the average team in the dataset.

For example, we would expect a team with an attack parameter of 1.4 to score 1.4 goals (on average) versus the average defence (a defence parameter of 1).

So, what do the estimates look like?

The following plot shows the attack and defence estimates. Higher (top) attack parameters and lower (left) defence paameters are better.

estimates %>%
spread(parameter, value) %>%
ggplot(aes(x = def, y = off)) +
geom_hline(yintercept = 1, linetype = "dotted") +
geom_vline(xintercept = 1, linetype = "dotted") +
geom_label(aes(label = team)) +
scale_x_continuous(limits = c(0, 2)) +
theme_minimal() +
labs(title = "International team ratings",
subtitle = str_glue("From {today() - years(2)} to {today()}"),
x = "Defence",
y = "Attack")

There seems to be a rough correlation between attack and defence, which is encouraging; we broadly expect good attacking teams to be good defending teams as well. However, the output is still a bit noisy. Peru in particular stand out.

I think part of this is down to the data. International teams don't play each other that often, and when they do, they tend to play more games against teams in the same region. Ideally we want more 'cross-pollination' when estimating team strength.

Home advantage

We've also estimated a home advantage parameter. This is a simple multiplicative factor on goalscoring. In other words, teams score hfa times as many goals when playing at home:

hfa <- exp(res$par[["hfa"]])
str_glue("Home field advantage: ~{round(hfa, 2)}")

## Home field advantage: ~1.33

Problems and improvements

There are a few issues that it's worth highlighting here:

  • Extra time and penalties
    • The dataset used doesn't discriminate between 90 minute games and 120 minute games. This is clearly an issue; model assumes all games are the same length.
  • Time discounting
    • Games in the distant past are clearly not as relevant to estimating team strength as games played yesterday. In fact, a method for incorporating this is included in the original Dixon-Coles paper. The ability to add custom weighting based on time (or other factors) is available with regista::dixoncoles's weight parameter. One application of the game weighting is demonstrated here.
  • Friendly weighting
    • Friendly games should probably not be weighted as highly as competitive games. Again, this is perfectly feasible with regista, but has not been performed here for the sake of simplicity.

Wrap-up

I hope this has been interesting and/or useful. The regista package is in early development and will continue to improve. Any and all contributions are welcome, from bug reports to pull requests.

Thanks for reading.