Volleyball R package: `volleystat`

Posted Feb 2, 2024 Updated Feb 3, 2024

By Liam Drowley

Open Source Data: Overview

Datasets and variables

The volleystat package is an open-source statistics repository for the men’s and women’s German Volleyball Bundesliga (1st division), created by Viktor Bozhinov. The package contains the following datasets (dataframes): matches, matchstats, match_adresses, players, sets, staff, and team_adresses. Note that the package spells the two address datasets with a single d.

For example, here are some of the variables included in the players and sets dataframes:

Player data:

season_id	league_gender	team_id	team_name	lastname	firstname	height	gender	birthdate	shirt_number	position	nationality	player_id
1617	Men	1001	Solingen Volleys	Bevers	Lennart	179	male	1993-07-10	10	Libero	Germany	10
1617	Men	1001	Solingen Volleys	den Boer	Huib	191	male	1981-11-11	2	Setter	Netherlands	2
1617	Men	1001	Solingen Volleys	Gies	Oliver	197	male	1985-06-28	3	Outside spiker	Germany	3
1617	Men	1001	Solingen Volleys	Gosmann	Christian	191	male	1992-07-02	6	Universal	Germany	6
1617	Men	1001	Solingen Volleys	Horn	Maximilian	188	male	1996-08-07	4	Setter	Germany	4

Set data:

league_gender	season_id	match_id	match	team_id	team_name	set	set_duration	pt_set
Women	1314	2003	home	2001	Allianz MTV Stuttgart	1	23	18
Women	1314	2002	home	2002	Dresdner SC	1	23	25
Women	1314	2004	home	2008	USC Münster	1	27	22
Women	1314	2001	home	2005	Rote Raben Vilsbiburg	1	27	26
Women	1314	2005	home	2009	VC Wiesbaden	1	19	25

Data source

The data was collected from the official Bundesliga Volleyball website. It covers the games played from the 2013/14 to the 2018/19 season.

Specifically, the in-game match statistics were scraped directly from official game reports using the tabulizer package.

Example game report:

Accessing the data

Accessing the data is straightforward, as the datasets ship with the volleystat package itself. Once the package is installed, you can reference any dataset directly with the :: operator:

  
volleystat::matches
volleystat::matchstats
volleystat::players
# and so on...
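
To check which datasets an installed copy of the package actually ships, base R's data() can list them. This is only a minimal sketch: the helper name list_volleystat_datasets is hypothetical (not part of volleystat), and it returns an empty vector when the package is not installed.

```r
# Hypothetical helper: list the datasets bundled with volleystat.
# Returns character(0) if the package is not installed.
list_volleystat_datasets <- function() {
  if (!requireNamespace("volleystat", quietly = TRUE)) {
    return(character(0))
  }
  # data(package = ...) returns a "packageIQR" object; its $results matrix
  # has one row per dataset, with the dataset name in the "Item" column
  as.character(data(package = "volleystat")$results[, "Item"])
}

list_volleystat_datasets()
```

Any name it returns can then be used after the volleystat:: prefix, as in the lines above.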

One limitation of this package is that the data is scraped from the game reports (.pdf files). The package therefore only gains new data when its creator manually scrapes recent games and uploads them.

Possible research questions

As the volleystat package contains many variables, it supports a range of research questions, for example:

Is there a relationship between longer match/set times and player metrics?
- Using player metrics from the matchstats dataset and match_duration/set_duration from the matches/sets datasets respectively, we can test whether specific player metrics correlate with match or set length.
What player metrics best predict match outcome?
- We can create a new variable, match outcome, which can be used as a dependent variable in a predictive model using player metrics from matchstats as our predictor variables.
Does playing at home, time of match and spectator count influence match outcome?
- Variables for home or away, match time and spectator count can be extracted from the matches dataset, which can then be compared to a match outcome variable (created via data manipulation).
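
The match outcome variable mentioned above could, for instance, be derived from set scores. The following is a rough sketch on made-up data that borrows the column names of the sets dataframe shown earlier. It assumes every set has exactly one home row and one away row, and it uses base R so that it runs without extra packages.

```r
# Toy data in the shape of volleystat::sets (values are made up):
# one row per team per set, with the points that team scored in the set
sets_toy <- data.frame(
  match_id = rep(9001, 8),
  match    = rep(c("home", "away"), times = 4),
  team_id  = rep(c(1, 2), times = 4),
  set      = rep(1:4, each = 2),
  pt_set   = c(25, 20,  22, 25,  25, 18,  25, 23)
)

# Split into home and away rows and line them up set by set
home <- sets_toy[sets_toy$match == "home", ]
away <- sets_toy[sets_toy$match == "away", ]
by_set <- merge(home, away, by = c("match_id", "set"),
                suffixes = c("_home", "_away"))

# A set goes to whichever side scored more points in it
by_set$home_won_set <- by_set$pt_set_home > by_set$pt_set_away

# Count the sets won by each side per match
home_sets <- tapply(by_set$home_won_set, by_set$match_id, sum)
away_sets <- tapply(!by_set$home_won_set, by_set$match_id, sum)

match_outcome <- data.frame(
  match_id      = as.numeric(names(home_sets)),
  home_sets_won = as.vector(home_sets),
  away_sets_won = as.vector(away_sets)
)
# The side that won more sets won the match
match_outcome$home_win <- match_outcome$home_sets_won > match_outcome$away_sets_won

match_outcome
```

The resulting home_win column is one candidate for the dependent variable; joining it back to matchstats by match_id would be the next step.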

Unfortunately, the volleystat package does not include Hawk-Eye-like data containing the x,y coordinates of the ball, nor any LPS or optical tracking data for player positions on the court.

Without this data it is difficult or impossible to analyse parts of the game such as:

setter choice/selection
spike choice/selection
service choice/selection
defence formation on attack

Visualisations

This section demonstrates the types of visualisation that can be created by manipulating data from the volleystat package (all visualisations were made with the ggplot2 package). The code for the initial data wrangling can be viewed below.

Code: initial data wrangling

  
# Load required libraries
library(volleystat)
library(tidyverse)
library(kableExtra)
library(ggbump)
library(ggsci)

# Join matchstats & team_adresses from volleystat to know which team played which games
match_team_join <- left_join(volleystat::matchstats, 
                             volleystat::team_adresses, 
                             by = c('team_id' = 'team_id', 
                                    'season_id' = 'season_id')
) %>% 
  select(-c(league_gender.y, gym_adress, max_spectators, lon, lat))

# Calculate team statistics for each team grouped by season
team_summary_stats <- match_team_join %>%
  group_by(team_name, season_id, league_gender.x) %>% 
  summarise(pt_total = sum(pt_tot), # not using summarise(across) as I want to rename columns
            serv_total = sum(serv_tot),
            aces = sum(serv_pt),
            serv_error = sum(serv_err),
            rec_total = sum(rec_tot),
            rec_error = sum(rec_err),
            att_total = sum(att_tot),
            att_error = sum(att_err),
            att_block = sum(att_blo),
            att_pt = sum(att_pt),
            block_pt = sum(blo_pt)) %>% 
  mutate(team_name_abrv = abbreviate(team_name), # create abbreviated team_names for data viz
         season_id = case_when(season_id == 1314 ~ '2013/14',
                               season_id == 1415 ~ '2014/15',
                               season_id == 1516 ~ '2015/16',
                               season_id == 1617 ~ '2016/17',
                               season_id == 1718 ~ '2017/18',
                               season_id == 1819 ~ '2018/19'))

## `summarise()` has grouped output by 'team_name', 'season_id'. You can override
## using the `.groups` argument.

  
# Commented-out alternative to the case_when() above, converting season xxyy to xx/yy:
# season_id = str_replace(season_id, '(\\d)(\\d{2}$)', '\\1/\\2')

# Convert team_summary_stats to long format for data viz
team_stats_long <- team_summary_stats %>% 
  pivot_longer(cols = pt_total:block_pt,
               names_to = 'pt_metrics')

# Calculating season averages for each metric
season_avg_long <- team_stats_long %>% 
  group_by(season_id, pt_metrics, league_gender.x) %>% 
  summarise(avg = round(mean(value)))

## `summarise()` has grouped output by 'season_id', 'pt_metrics'. You can override
## using the `.groups` argument.
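
Two details of the wrangling above may be easier to see on a small example: why the joined table ends up with league_gender.x and league_gender.y, and how the season labels could be built without listing every season. This is a sketch on made-up rows. It uses base R's merge() as a stand-in for left_join() (both add .x/.y suffixes by default), and season_label is a hypothetical helper that assumes every season falls in the 2000s.

```r
# Toy stand-ins for matchstats and team_adresses (values are made up)
stats_toy <- data.frame(
  season_id     = c(1314, 1314, 1819),
  team_id       = c(2001, 2002, 2001),
  league_gender = "Women",
  pt_tot        = c(12, 7, 15)
)
teams_toy <- data.frame(
  season_id     = c(1314, 1314, 1819),
  team_id       = c(2001, 2002, 2001),
  league_gender = "Women",
  team_name     = c("Allianz MTV Stuttgart", "Dresdner SC",
                    "Allianz MTV Stuttgart")
)

# league_gender exists in both tables but is not a join key, so the joined
# table carries it twice, as league_gender.x and league_gender.y
joined <- merge(stats_toy, teams_toy,
                by = c("team_id", "season_id"), all.x = TRUE)
names(joined)

# Hypothetical helper: turn a numeric season_id such as 1314 into "2013/14"
season_label <- function(season_id) {
  sprintf("20%02d/%02d",
          as.integer(season_id %/% 100),
          as.integer(season_id %% 100))
}
season_label(c(1314, 1819))
```

This is why the pipeline drops league_gender.y and keeps working with league_gender.x afterwards.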

To add some context, a table of metric definitions is provided below.

Metric	Description
pt_total	Total number of points scored
serv_total	Total number of serves
aces	Total number of service aces
serv_error	Total number of service errors
rec_total	Total number of receptions from serve
rec_error	Total number of reception errors
att_total	Total number of attacks/spikes
att_error	Total number of attacking errors
att_block	Total number of attacks that were blocked
att_pt	Total number of successful attacks
block_pt	Total number of successful blocks

Men’s League

Here is a visualisation comparing the top 3 and bottom 3 teams of the 2018/19 men’s Volleyball Bundesliga season.

Code: additional data wrangling

  
top3_bottom3_1819_men <- team_stats_long %>% 
  filter(league_gender.x == 'Men',
         team_name %in% c('Berlin Recycling Volleys',
                          'VfB Friedrichshafen',
                          'Hypo Tirol AlpenVolleys Haching',
                          'VCO Berlin',
                          'TV Rottenburg',
                          'Helios Grizzlys Giesen'),
         season_id =='2018/19') %>% 
  # Add final seed to name for legend
  mutate(team_name = case_when(team_name == 'Berlin Recycling Volleys' ~ 'Berlin Recycling Volleys (1)',
                               team_name == 'VfB Friedrichshafen' ~ 'VfB Friedrichshafen (2)',
                               team_name == 'Hypo Tirol AlpenVolleys Haching' ~ 'Hypo Tirol AlpenVolleys Haching (3)',
                               team_name == 'Helios Grizzlys Giesen' ~ 'Helios Grizzlys Giesen (10)',
                               team_name == 'TV Rottenburg' ~ 'TV Rottenburg (11)',
                               team_name == 'VCO Berlin' ~ 'VCO Berlin (12)'),
         # Re-level for ordering in legend, weird ordering as legend is at the bottom
         team_name = factor(team_name, levels = c('Berlin Recycling Volleys (1)',
                                                  'Helios Grizzlys Giesen (10)',
                                                  'VfB Friedrichshafen (2)',
                                                  'TV Rottenburg (11)',
                                                  'Hypo Tirol AlpenVolleys Haching (3)',
                                                  'VCO Berlin (12)')),
         # Also re-level the abbreviated team names for the x-axis
         team_name_abrv = factor(team_name_abrv, levels = c('BrRV',
                                                            'VfBF',
                                                            'HTAH',
                                                            'HlGG',
                                                            'TVRt',
                                                            'VCOB'))) 

Code: data visualisations

  
ggplot(top3_bottom3_1819_men, aes(x = team_name_abrv, y = value, fill = team_name)) +
  geom_col() +
  facet_wrap(~pt_metrics, scales = 'free_y') +
  geom_hline(data = season_avg_long %>% filter(league_gender.x == 'Men', season_id == '2018/19'),
             aes(yintercept = avg, linetype = 'League Average'), linewidth = 0.7, col = '#54D421') +
  theme_bw() +
  theme(panel.grid.major.x = element_blank(),
        legend.position = 'bottom',
        legend.text = element_text(size = 11),
        legend.title = element_text(size = 13),
        axis.text = element_text(colour = 'black'),
        panel.background = element_rect(fill = '#f1f1f1')) +
  labs(fill = 'Team',
       linetype = '') + # change legend title
  ggtitle('Comparison of top 3 and bottom 3 teams for the 18/19 season, w/ league avg.') +
  ylab('Quantity') +
  xlab('') +
  scale_fill_manual(values = c("#375681", "#db571a", "#4972ab", "#E76E36", "#7092c2", "#EC895B")) +
  scale_linetype_manual(values = 5)

For nearly all of the metrics, the top 3 teams were above the league average, while the bottom 3 teams were mostly below it, which is to be expected. This visual might also give coaches some insight into the importance of some metrics compared to others. For instance, there appears to be a relationship between aces and serv_error: the top 3 teams have more service errors but also more aces, which may indicate that they take more risks when serving. This may be something the bottom 3 teams could incorporate into their game to try to be more successful.
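
Because the plot shows raw totals, teams that hit more serves will tend to have more aces and more errors anyway. One way to probe the risk/reward idea is to express both per serve before correlating them. The sketch below uses made-up totals (not real Bundesliga results) with the column names of team_summary_stats; with only a handful of teams, any correlation it produces would be suggestive at best.

```r
# Made-up season totals in the shape of team_summary_stats (not real results)
serve_toy <- data.frame(
  team_name  = c("Team A", "Team B", "Team C", "Team D", "Team E", "Team F"),
  serv_total = c(2100, 2050, 1980, 1700, 1650, 1600),
  aces       = c(150, 140, 120, 80, 75, 60),
  serv_error = c(340, 330, 300, 230, 215, 200)
)

# Express aces and errors per serve so teams with more serves are comparable
serve_toy$ace_rate   <- serve_toy$aces / serve_toy$serv_total
serve_toy$error_rate <- serve_toy$serv_error / serve_toy$serv_total

# Pearson correlation between the two rates across teams
ace_error_cor <- cor(serve_toy$ace_rate, serve_toy$error_rate)
ace_error_cor
```

Swapping serve_toy for the real 2018/19 rows of team_summary_stats would show whether the pattern in the plot survives once serve volume is accounted for.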

Women’s League

I thought I would change it up for the women’s league and explore the service metrics across all the seasons.

Code: additional data wrangling

  
serve_metrics_13to19 <- team_stats_long %>% 
  filter(league_gender.x == 'Women',
         pt_metrics %in% c('serv_total', 'aces', 'serv_error'),
         team_name_abrv %in% c('AMTS', # filter to only have teams across all seasons
                               'DrSC',
                               'LiBA',
                               'RtRV',
                               'SCPt',
                               'SSCS',
                               'USCM',
                               'VCWs',
                               'VIIT'))
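
The list of teams above is hardcoded. As an alternative, the teams present in every season could be found programmatically. This is a sketch on made-up team/season pairs; the abbreviations and the variable names team_seasons and ever_present are placeholders rather than anything from the package.

```r
# Made-up team/season pairs in the shape of team_summary_stats
team_seasons <- data.frame(
  team_name_abrv = c("AAAA", "AAAA", "AAAA",
                     "BBBB", "BBBB",
                     "CCCC", "CCCC", "CCCC"),
  season_id      = c("2013/14", "2014/15", "2015/16",
                     "2013/14", "2015/16",
                     "2013/14", "2014/15", "2015/16")
)

# Number of distinct seasons overall, and per team
n_seasons <- length(unique(team_seasons$season_id))
seasons_per_team <- tapply(team_seasons$season_id,
                           team_seasons$team_name_abrv,
                           function(s) length(unique(s)))

# Keep the teams that appear in every season
ever_present <- names(seasons_per_team)[seasons_per_team == n_seasons]
ever_present
```

On the real data, the same steps would be applied after filtering to the women's league, and ever_present could then take the place of the hardcoded vector in the %in% filter.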

Code: data visualisations

  
ggplot(serve_metrics_13to19,aes(x = season_id, y = value, col = team_name)) +
  geom_bump(aes(group = team_name), linewidth = 0.7) + # have to add group = team_name when x-axis categorical
  geom_point(size = 1) +
  facet_wrap(~pt_metrics, scales = 'free_y') +
  theme_bw() +
  theme(panel.grid.minor.x = element_blank(),
        legend.position = 'bottom',
        legend.text = element_text(size = 11),
        legend.title = element_text(size = 13),
        axis.text = element_text(colour = 'black'),
        panel.background = element_rect(fill = '#f1f1f1')) +
  labs(col = 'Team') +
  ggtitle("Service metric comparison for Women's Bundesliga from 2013/14 to 2018/19") +
  ylab('Quantity') +
  xlab('Season') +
  scale_color_d3()

As for the women’s visualisation, the teams vary considerably from season to season, which is somewhat expected given player transfers. To add a little more context, the table below shows where each team placed on the final ladder each season (only teams shown in the graph are kept in the table, which explains the missing values).

	Season
	2013/14	2014/15	2015/16	2016/17	2017/18	2018/19
1st	Dresdner SC	Dresdner SC	Dresdner SC	SSC Schwerin	Allianz MTV Stuttgart	Allianz MTV Stuttgart
2nd	VC Wiesbaden	Allianz MTV Stuttgart	SSC Schwerin	Allianz MTV Stuttgart	Dresdner SC	SSC Schwerin
3rd	Rote Raben Vilsbiburg	SSC Schwerin	Allianz MTV Stuttgart	Dresdner SC	SSC Schwerin	Dresdner SC
4th	Ladies in Black Aachen	VC Wiesbaden	USC Muenster	SC Potsdam	VC Wiesbaden	SC Potsdam
5th	SSC Schwerin	SC Potsdam	VC Wiesbaden	VC Wiesbaden	Ladies in Black Aachen	Rote Raben Vilsbiburg
6th	SC Potsdam	Ladies in Black Aachen	Rote Raben Vilsbiburg	Rote Raben Vilsbiburg	USC Muenster	Ladies in Black Aachen
7th	USC Muenster	USC Muenster	SC Potsdam	USC Muenster	SC Potsdam	USC Muenster
8th	-	Rote Raben Vilsbiburg	-	Ladies in Black Aachen	Rote Raben Vilsbiburg	VC Wiesbaden
9th	Allianz MTV Stuttgart	-	-	-	-	-
10th	-	-	Ladies in Black Aachen	-	-	-
11th	-	-	-	-	-	-
12th	-	-	-	-	-	-
13th	-	-	-	-	-	-


This post is licensed under CC BY 4.0 by the author.