While recently looking at a data set with a lot of missing data, I tried out a few different ways of quickly summarizing the missingness in the different variables. Here is a brief guide to the visualizations I found the most useful!
For this demonstration, we will borrow datasets from the package mice.
library(mice)
library(tidyverse)
library(patchwork) # Combining plots
library(showtext) # Font
library(ggthemes) # Color palettes (I use the canva palettes here)
library(naniar) # Upset plots for missing values in ggplot2
library(eulerr) # Euler diagrams, proportional to size
library(ggforce) # Ellipses in ggplot
library(gt) # Great tables
library(gtExtras) # Extra stuff for the tables
library(scales) # For wrapping axis text
library(paletteer) # color palettes (I used a fish one)
Our first data set is a set of various sleep characteristics of 62 mammals.
data(mammalsleep)
data <- mammalsleep
better_names <- c(Species = "species", Body_weight = "bw", Brain_weight = "brw",
Slow_wave_sleep = "sws", Paradoxical_sleep = "ps",
Total_sleep = "ts", Maximum_life_span = "mls",
Gestation_time = "gt", Predation_index = "pi",
Sleep_exposure_index = "sei", Overall_danger_index = "odi")
data <- data %>% rename(all_of(better_names))
names(data) <- gsub("_", " ", names(data), fixed=TRUE)
# Font
f1 <- "Open Sans"
font_add_google(name = f1, family = f1)
showtext_auto()
# Colors
pal <- paletteer_d("fishualize::Halichoeres_radiatus")
missing_row.plot <- data %>%
mutate(id = row_number()) %>%
gather(-id, key = "key", value = "val") %>%
mutate(isna = is.na(val)) %>%
ggplot(aes(key, id, fill = isna)) +
geom_raster(alpha=0.82) +
scale_fill_manual(name = "",
values = pal[c(1,3)],
labels = c("Present", "Missing")) +
scale_x_discrete(position = "top", labels = wrap_format(10), expand = c(0,0)) +
scale_y_continuous(breaks = c(1, seq(10, nrow(data), by = 10)), expand = c(0,0), trans = "reverse") +
labs(x = "",
y = "Row Number") +
theme_bw() +
theme(legend.position = "top",
text = element_text(size = 15, family = f1),
panel.grid = element_blank())
missing_row.plot
Next, I just want a table that summarizes the number of missing observations in each variable. For this, I will first make a data frame with the counts and percentages of missing in each variable:
missing_count <- data %>% is.na %>% as.data.frame() %>%
map_int(sum) %>% as.data.frame()
missing_count$variable = rownames(missing_count)
missing_count <- missing_count %>% rename(count = ".") %>%
mutate(percent = 100*count/nrow(data)) %>%
relocate(variable)
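As a cross-check, the same counts can be sketched in base R with colSums() on the logical matrix that is.na() returns. The toy data frame below is made up for illustration, and the names toy, n_missing and missing_summary are hypothetical, not part of the original code.

```r
# Hypothetical toy data standing in for the real data set
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, 3, 4),
  c = c(1, 2, 3, 4)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
n_missing <- colSums(is.na(toy))

missing_summary <- data.frame(
  variable = names(n_missing),
  count = as.integer(n_missing),
  percent = 100 * as.numeric(n_missing) / nrow(toy),
  row.names = NULL
)

# Keep only the variables that actually have missing values
missing_summary <- missing_summary[missing_summary$count > 0, ]
print(missing_summary)
```

This avoids the intermediate column named "." and the row-name juggling, at the cost of stepping outside the pipe.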
Then we can easily make this into a nice table using the gt package.
# Table -----
columns_with_missing <- missing_count %>%
filter(count > 0) %>%
dplyr::select(variable) %>%
as.matrix()
missing_table <- missing_count %>%
dplyr::filter(variable %in% columns_with_missing) %>% # select only columns that have missing data
rename(Variable = "variable", "# missing" = count, "% missing" = percent) %>% # rename columns
gt() %>% # table
gt_plt_bar_pct(column = "% missing", scaled = TRUE, fill = pal[4]) %>% # make the percentage column into a barchart
tab_style(style = cell_text( # change font
font = google_font(f1)),
locations = list(cells_column_labels(everything()),
cells_body(columns = c(1,2))))
missing_table
Variable | # missing | % missing |
---|---|---|
Slow wave sleep | 14 | |
Paradoxical sleep | 12 | |
Total sleep | 4 | |
Maximum life span | 4 | |
Gestation time | 4 | |
The row-plot and table are both great for getting a quick overview of the data and the number of missing values. But the table in particular tells us nothing about how the missingness interacts across variables: are many of the missing values in the same rows? We see this to some degree in the row-plot, but here we only have 62 observations. As the number of observations grows, it becomes harder to see when the missingness falls in the same row. To illustrate the overlaps in the missingness, I thought some kind of Venn diagram would be useful (I learned that a diagram that leaves out the empty overlaps is properly called an Euler diagram). I also wanted the sizes of the circles and overlaps to be proportional to the number of missing observations. I found what I wanted in the package eulerr. There is a built-in way to plot the resulting Euler diagrams, but I wanted to do it with ggplot2 for a bit more freedom. It wasn’t too hard to extract the necessary numbers from the eulerr object (with good help from this vignette), and for plotting the ellipses themselves I use the ggforce package.
# Euler plot ------
euler_mat <- data %>% is.na() %>% as.data.frame() %>%
dplyr::select(all_of(columns_with_missing[1:5])) # all_of() avoids the tidyselect warning about external vectors
euler_fit <- euler(euler_mat)
ellipses <- euler_fit$ellipses %>% mutate(variable = rownames(euler_fit$ellipses))
missing_euler <- ggplot(ellipses) +
geom_ellipse(aes(x0 = h, y0 = k, a = a, b = b, angle = phi, fill = variable), alpha = 0.5) +
scale_fill_manual(values = pal) +
coord_fixed() +
theme_void() +
theme(legend.title = element_blank())
missing_euler
I really like the way this looks, but unfortunately it isn’t exact, especially when there are so few observations. For example, the tiny un-overlapped sliver on the left makes it look as if some observation is missing the “Paradoxical sleep” measurement but not “Slow wave sleep”. However, the row-plot shows that the set of animals missing “Paradoxical sleep” is completely contained in the set of animals missing “Slow wave sleep”. The eulerr object gives us an overview of both the true counts in each set and the fitted values, and these could also be plotted on top of the circles, but I won’t do this here since there are so many intersections. Unfortunately, though the idea is fun, I don’t think this visualization will work very well in many cases.
euler_fit$original.values[1:15]
Set | Count |
---|---|
Slow wave sleep | 0 |
Paradoxical sleep | 0 |
Total sleep | 0 |
Maximum life span | 2 |
Gestation time | 3 |
Slow wave sleep&Paradoxical sleep | 9 |
Slow wave sleep&Total sleep | 2 |
Slow wave sleep&Maximum life span | 0 |
Slow wave sleep&Gestation time | 0 |
Paradoxical sleep&Total sleep | 0 |
Paradoxical sleep&Maximum life span | 0 |
Paradoxical sleep&Gestation time | 0 |
Total sleep&Maximum life span | 0 |
Total sleep&Gestation time | 0 |
Maximum life span&Gestation time | 1 |
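To check a containment claim like the one above exactly, rather than reading it off the fitted ellipses, one option is to cross-tabulate the missingness indicators of two variables. This is a minimal sketch on made-up values; the vectors sws and ps are hypothetical stand-ins, not the actual mammalsleep columns.

```r
# Made-up values; NA marks a missing measurement
sws <- c(NA, NA, NA, 2.1, 6.3, NA)
ps  <- c(NA, NA, 1.0, 0.8, 2.0, NA)

# 2 x 2 table: rows = is ps missing?, columns = is sws missing?
overlap <- table(ps_missing = is.na(ps), sws_missing = is.na(sws))
print(overlap)

# TRUE if every observation missing ps is also missing sws
ps_within_sws <- all(is.na(sws)[is.na(ps)])
print(ps_within_sws)
```

A zero in the cell where ps is missing but sws is not means one set is fully contained in the other, which is exactly what an area-proportional fit can blur.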
There is another option that solves the imprecision of the Euler plot. I was a little skeptical of it at first because I don’t think it is completely self-explanatory. In most contexts, getting an overview of the missingness is something you want to do quick and dirty, and if visualizations are necessary you want them to be super intuitive. But this one is more precise than the Euler diagram and also shows the interactions, so for an audience that is already familiar with it (and maybe with some helpful annotations), I think it can be really useful. The plot is called an upset plot, and it can be used as an alternative to Venn diagrams in other settings than just visualizing missingness. I use the implementation from the library naniar (which has several other useful functions for these types of things!).
gg_miss_upset(data, sets.bar.color = pal[1], main.bar.color = pal[4])
The horizontal bars on the left show the total number of missing values for each of the variables, and the vertical bars show the number of observations missing in each intersection. My main complaint here is that although the documentation says that it returns a ggplot visualization, I don’t seem to be able to edit it in my typical ggplot ways. To change the colors I instead had to use the arguments from UpSetR::upset.
For the project that motivated me to write this post, I made my own upset plot in ggplot2, but it is kind of hard-coded and the code is specific to that data. The approach itself wasn’t too complicated, though. Basically, you convert the data to a data frame of the same size, but with true/false values indicating whether each value is missing or not. Then it takes some manipulation to get the counts for each of the sets and intersections, and the plots themselves are just standard bar charts. I also just used a dot-plot for the table part of it, and then combined them all using patchwork (the package). If I’m able to generalize the code I will happily share it at a later point.
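The counting step described above can be sketched in base R roughly as follows. This is not the code from that project, just a minimal illustration on a made-up data frame; the names toy, is_missing, pattern, pattern_counts and set_sizes are all hypothetical.

```r
# Hypothetical toy data standing in for a real data set
toy <- data.frame(
  x = c(NA, NA, 1, 2, NA),
  y = c(NA, 5, NA, 2, NA),
  z = c(1, 2, 3, 4, 5)
)

# Step 1: same-sized TRUE/FALSE matrix, TRUE where a value is missing
is_missing <- is.na(toy)

# Step 2: label each row by the set of variables it is missing
pattern <- apply(is_missing, 1, function(row) {
  if (any(row)) paste(colnames(is_missing)[row], collapse = " & ") else "none"
})

# Step 3: count rows per pattern, largest first (heights of the vertical bars)
pattern_counts <- sort(table(pattern), decreasing = TRUE)

# Totals per variable (lengths of the horizontal bars)
set_sizes <- colSums(is_missing)

print(pattern_counts)
print(set_sizes)
```

From here, pattern_counts and set_sizes can each be turned into a data frame and passed to an ordinary bar chart, with the dot-plot built from the same pattern labels.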
As another example, let’s look at this data set from mice with self-reported height and weight data from two studies, containing 2060 observations. A description of the data can be found with ?mice::selfreport.
data("selfreport")
data <- selfreport
# Colors
pal <- paletteer_d("fishualize::Lutjanus_sebae")
missing_row.plot <- data %>%
mutate(id = row_number()) %>%
gather(-id, key = "key", value = "val") %>%
mutate(isna = is.na(val)) %>%
ggplot(aes(key, id, fill = isna)) +
geom_raster(alpha=0.82) +
scale_fill_manual(name = "",
values = pal[c(1,4)],
labels = c("Present", "Missing")) +
scale_x_discrete(position = "top", labels = wrap_format(10), expand = c(0,0)) +
scale_y_continuous(breaks = c(1, seq(100, nrow(data), by = 100)), expand = c(0,0), trans = "reverse") +
labs(x = "",
y = "Row Number") +
theme_bw() +
theme(legend.position = "top",
text = element_text(size = 15, family = f1),
panel.grid = element_blank())
missing_row.plot
missing_count <- data %>% is.na %>% as.data.frame() %>%
map_int(sum) %>% as.data.frame()
missing_count$variable = rownames(missing_count)
missing_count <- missing_count %>% rename(count = ".") %>%
mutate(percent = 100*count/nrow(data)) %>%
relocate(variable)
# Table -----
columns_with_missing <- missing_count %>%
filter(count > 0) %>%
dplyr::select(variable) %>%
as.matrix()
missing_table <- missing_count %>%
dplyr::filter(variable %in% columns_with_missing) %>% # select only columns that have missing data
rename(Variable = "variable", "# missing" = count, "% missing" = percent) %>% # rename columns
gt() %>% # table
gt_plt_bar_pct(column = "% missing", scaled = TRUE, fill = pal[4]) %>% # make the percentage column into a barchart
tab_style(style = cell_text( # change font
font = google_font(f1)),
locations = list(cells_column_labels(everything()),
cells_body(columns = c(1,2))))
missing_table
Variable | # missing | % missing |
---|---|---|
hm | 803 | |
wm | 803 | |
prg | 1657 | |
edu | 1257 | |
etn | 1257 | |
bm | 803 | |
# Euler plot ------
euler_mat <- data %>% is.na() %>% as.data.frame() %>%
dplyr::select(all_of(columns_with_missing[1:5])) # note: keeps only the first five of the six columns with missing values
euler_fit <- euler(euler_mat)
ellipses <- euler_fit$ellipses %>% mutate(variable = rownames(euler_fit$ellipses))
missing_euler <- ggplot(ellipses) +
geom_ellipse(aes(x0 = h, y0 = k, a = a, b = b, angle = phi, fill = variable), alpha = 0.5) +
scale_fill_manual(values = pal) +
coord_fixed() +
theme_void() +
theme(legend.title = element_blank())
missing_euler
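(Actually, only five sets are displayed here, while we have six variables with missing values, so this isn’t great. The code above keeps only the first five columns through columns_with_missing[1:5], so the sixth variable, bm, never reaches eulerr.)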
gg_miss_upset(data, sets.bar.color = pal[2], main.bar.color = pal[4])