1 Purpose
Examining the amount, nature, and patterns of missing data is crucial to handling it properly (McKnight et al., 2007; van Buuren, 2018; White et al., 2011). There are no hard and fast rules about how much missing data constitutes a serious problem, but more missing data usually means a greater risk of biased results if your analysis relies on listwise deletion. You have to judge whether the amount and nature of the missing data in your study warrant the effort required to implement sophisticated solutions such as multiple imputation (MI) or full information maximum likelihood (FIML) estimation. It is best to make such judgments after a thorough assessment of the missingness. In short, you should diagnose the missing data situation before you decide whether and how to treat the problem.
This post discusses a few simple things one should do along the way and provides R code that I might use in a Quarto script to do them, along with commentary on the results obtained for an example data set.
In larger or more complex data sets this sort of assessment can be lengthy, so this post is by no means comprehensive. It’s just demonstrating some essential summaries that you should examine. For more extensive guidance on assessing missing data issues, see books by McKnight et al. (2007), Enders (2022), and Graham (2013).
Tip
All the example code operates on a data frame. Naturally, changing the data set fed into the code will change the results. Think carefully about which data set makes the most sense to use. You may need to subset a large data file by selecting the cases and variables relevant to your planned analyses before applying these techniques.
2 R Packages for Examining Missing Data
There are several R packages designed for dealing with missing data that have functions useful at the assessment stage. For example, ggmice, mice, naniar, and VIM are ones I happen to have installed. It’s worth exploring them to see what tools you like. Below, I have used ggmice and mice in my examples.
3 Setup
Next we load R packages that we need to get additional functions.
```{r}
#| label: load-packages
#| message: false
library(devtools)     # for session_info()
library(rmarkdown)    # for pandoc_version()
library(knitr)        # for kable()
options(kableExtra.latex.load_packages = FALSE)
library(kableExtra)   # for kable_styling(), add_header_above(), etc.
library(tidyverse)    # loads dplyr, ggplot2, lubridate, stringr all at once.
                      # for %>%, filter(), mutate(), etc.
library(purrr)        # for map()
library(mice)         # for mice(), nic(), ncc(), and the boys data set.
library(modelsummary) # for modelsummary()
library(ggmice)       # for plot_pattern(), plot_corr()
library(quarto)       # for quarto_version()
```
4 Example
For the example code below, we’ll use the boys data (Fredriks et al., 2000), which is bundled with the mice package (Van Buuren & Groothuis-Oudshoorn, 2011). These data come from a developmental study about growth in Dutch boys. The variables include age, region, weight, height, body mass index, head circumference, and three measures of pubertal development. For more details about the dataset, use the following command in the R console after loading the mice package: ?boys.
4.1 Amount of Missing Data
You should always examine the dimensions of your data set in terms of numbers of cases (rows), variables (columns), and values (rows × columns). Extracting those values from a data set is easy using functions such as nrow() and ncol() as shown below. The number of values is obtained by multiplying the numbers of cases and variables. This sets the context for examining how much of the data is missing. Additional functions from the mice package make it easy to extract numbers of complete cases (ncc()) and incomplete cases (nic()). Combining the sum(), colSums(), and is.na() functions gets us similar counts of the number of complete and incomplete variables and of the number of complete (non-missing) and incomplete (missing) values.
The chunk below generates Table 1, which shows a convenient summary of all those counts. I plan to eventually add functions to my piercer package (https://github.com/sjpierce/piercer) to automate generating this sort of summary if I can’t find an existing package that has one I like. That would reduce the amount of code required to get the table.
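Table 1: Numbers of Complete and Incomplete Cases, Variables, and Values

| Subset     | Cases (n) | Cases (%) | Variables (n) | Variables (%) | Values (n) | Values (%) |
|:-----------|----------:|----------:|--------------:|--------------:|-----------:|-----------:|
| Complete   |       223 |      29.8 |             1 |          11.1 |       5110 |       75.9 |
| Incomplete |       525 |      70.2 |             8 |          88.9 |       1622 |       24.1 |
| All        |       748 |     100.0 |             9 |         100.0 |       6732 |      100.0 |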
Tip
A value is just the datum for a specific case on a specific variable. It is complete (observed) when it is non-missing, and incomplete when it is missing.
A case is complete when all values on the row (across variables) are complete; it becomes incomplete if any variable in that row has a missing value.
Similarly, a variable is complete when all values in that column (across cases) are complete but it becomes incomplete if any case in that column has a missing value.
Here, we can see that there’s a reasonably large sample size (748) but there are only 9 variables. That yields a total of 6732 values in the data set.
We see different percentages of completeness depending on whether we look at cases, variables, or values. This is a cross-sectional dataset, so we don’t have to consider a time dimension but if we had a longitudinal study we might need to examine that as an additional dimension in the assessment. Similarly, this is not a multilevel study design but if it were, we might need to examine missingness at each level of the design.
The percent of incomplete cases is staggeringly high (70.2%), as is the percent of incomplete variables (88.9%). However, we can also see that the percent of missing values in the full data frame is much lower (24.1%), though it may still be large enough to cause some concern. We need to know more before we make decisions.
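As a minimal sketch of how counts like those in Table 1 can be computed, the chunk below uses only base R on a small made-up data frame. The object `toy` is a hypothetical stand-in, not the boys data, and the sketch assumes that `complete.cases()` yields the same complete-case count that `ncc()` from mice would report. Swapping `toy` for your own data frame should give the analogous summary.

```r
# Hypothetical toy data: 5 cases, 3 variables, 4 missing values.
toy <- data.frame(age = c(1.2, 3.4, 5.6, 7.8, 9.0),
                  hgt = c(80, NA, 110, 125, NA),
                  tv  = c(NA, NA, 3, 4, 8))

n_cases  <- nrow(toy)                      # all cases
n_ccases <- sum(complete.cases(toy))       # cases with no missing values
n_vars   <- ncol(toy)                      # all variables
n_cvars  <- sum(colSums(is.na(toy)) == 0)  # variables with no missing values
n_vals   <- n_cases * n_vars               # all values
n_ivals  <- sum(is.na(toy))                # missing values

# One row per subset; incomplete counts are totals minus complete counts.
counts <- data.frame(
  Subset    = c("Complete", "Incomplete", "All"),
  Cases     = c(n_ccases, n_cases - n_ccases, n_cases),
  Variables = c(n_cvars, n_vars - n_cvars, n_vars),
  Values    = c(n_vals - n_ivals, n_ivals, n_vals)
)

# Percentages use the totals from the "All" row as denominators.
counts$Cases_P     <- round(100 * counts$Cases / n_cases, 1)
counts$Variables_P <- round(100 * counts$Variables / n_vars, 1)
counts$Values_P    <- round(100 * counts$Values / n_vals, 1)
counts
```

Keeping the three units of analysis (cases, variables, values) side by side in one data frame makes it easy to see how different the percentages can be for the same data.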
4.2 Univariate Missingness
While Table 1 is a useful overview of the amount of missing data, we want to do a more granular examination as well. For example, we now know that there are 8 incomplete variables but the amount of missing data in each of those variables could be quite different. Maybe there’s only one missing value among hundreds of cases for one variable, but hundreds of missing values for another variable. That could be very important.
So, the next step is to examine univariate summaries of the amount of missing data for each variable. The chunk below constructs Table 2, which shows such a summary.
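Table 2: Univariate Missingness in Boys Growth Data, (N = 748)

| Position | Name | N_Valid | N_Missing | Pct_Missing |
|---------:|:-----|--------:|----------:|------------:|
|        1 | age  |     748 |         0 |         0.0 |
|        2 | hgt  |     728 |        20 |         2.7 |
|        3 | wgt  |     744 |         4 |         0.5 |
|        4 | bmi  |     727 |        21 |         2.8 |
|        5 | hc   |     702 |        46 |         6.1 |
|        6 | gen  |     245 |       503 |        67.2 |
|        7 | phb  |     245 |       503 |        67.2 |
|        8 | tv   |     226 |       522 |        69.8 |
|        9 | reg  |     745 |         3 |         0.4 |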
Inspection of Table 2 shows that age is the only variable with no missing data. There is only a small amount of missing data for basic anthropometric measurements (hgt, wgt, bmi, hc) and region (reg), but there is a large amount of missing data for the three measures of pubertal development (gen, phb, and tv).
The maximum percent missing observed in Table 2 for a single variable (69.8% for tv) can never be higher than the overall percent of incomplete cases from Table 1 (70.2%), because every case with a missing value on that variable is itself an incomplete case. However, the percent of incomplete cases can easily be higher than any of the percentages for individual variables because missing data may occur in different variables on different cases.
If our planned analyses only required the anthropometric variables and region, this data might be considered to have a fairly small amount of missing data. One could remove the irrelevant variables from the data set and then re-create Table 1 to get better estimates of how many incomplete cases would be dropped by listwise deletion. On the other hand, analyses requiring any of the pubertal development measures would require solving an extremely large missing data problem.
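The two ideas above, a per-variable summary and a recount of incomplete cases after dropping variables the analysis does not need, can be sketched in base R. The data frame `toy` and the helper `pct_incomplete()` are hypothetical names introduced only for illustration; `tv` stands in for a heavily missing variable.

```r
# Hypothetical toy data: 6 cases, 4 variables.
toy <- data.frame(age = c(1.2, 3.4, 5.6, 7.8, 9.0, 10.1),
                  hgt = c(80, 95, 110, 125, NA, 150),
                  wgt = c(11, 15, NA, 25, 33, 40),
                  tv  = c(NA, NA, NA, 4, 8, 12))

# Per-variable summary, one row per variable (percentages left unrounded).
by_var <- data.frame(
  Name        = names(toy),
  N_Valid     = colSums(!is.na(toy)),
  N_Missing   = colSums(is.na(toy)),
  Pct_Missing = 100 * colMeans(is.na(toy)),
  row.names   = NULL
)
print(by_var, digits = 3)

# Hypothetical helper: percent of rows with at least one missing value.
pct_incomplete <- function(d) 100 * mean(!complete.cases(d))

pct_all    <- pct_incomplete(toy)                         # all variables
pct_subset <- pct_incomplete(toy[, c("age", "hgt", "wgt")])  # without tv

# The largest single-variable percentage cannot exceed the percent of
# incomplete cases computed on the same data.
max(by_var$Pct_Missing) <= pct_all
```

In this toy example, dropping `tv` halves the percent of cases that listwise deletion would discard, which is the kind of comparison worth making before choosing a strategy.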
4.3 Multivariate Patterns of Missingness
After examining the univariate missingness, we need to shift our focus to considering the patterns of missing data across variables. These multivariate patterns of missingness can be quite useful in better understanding the nature of the missingness issues in your data. Figure 1 illustrates what I mean by missing data patterns. It is generated by the ggmice function plot_pattern(), which is a cleaner, better labeled alternative to the graphical output from the mice function md.pattern().
```{r}
#| label: fig-patterns
#| fig-cap: "Missingness Patterns for Dutch Boys Growth Study Data (748 boys, 9 
#|   variables, 1 time point) [@Fredriks-RN8696]"
plot_pattern(boys, square = FALSE, rotate = FALSE)
```
Figure 1: Missingness Patterns for Dutch Boys Growth Study Data (748 boys, 9 variables, 1 time point) (Fredriks et al., 2000)
Each row in the graph represents a distinct pattern of observed versus missing values across the whole set of variables. The left side is annotated with the number of cases (i.e., boys) showing each pattern, the right side tells you how many missing values there are on that row, and the bottom shows the total number of missing values for each variable. The variables are sorted from left to right in ascending order by number of missing values, so those on the right have the most missing data. The annotation below the color coding legend shows the total number of missing values. Some of the information in this figure overlaps with that in the tables above, but I find it useful to have the percentages in those tables along with the counts shown here.
The first row of Figure 1 shows that 223 boys have complete data. The rest of the plot shows that this data set has a lot of holes in it. For example, we can see that the most common pattern is that 437 boys (58.4%) have missing data for all 3 pubertal development measures but observed data on all the other measures. We can also see a number of patterns exhibited by only one boy each. The patterns with the highest frequencies are probably the most useful ones to look at in this type of plot. Beware that data sets containing both large samples and large numbers of variables can make plots like this hard to read. You may have to get creative about using these tools with subsets of the data in such cases.
Another thing we can see here is that body mass index is always missing whenever either height or weight is missing. That’s construct missingness caused by item missingness because BMI is a measure derived from both the height and weight variables. Solving any missing data for height and weight appropriately will let you solve missing BMI data as well. The mice package has features designed to handle preserving such deterministic relationships between variables during imputation.
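A dependency like this can also be confirmed with a direct logical check instead of reading it off the plot. The sketch below is illustrative only: `toy` is a hypothetical data frame, and it assumes height is recorded in cm and weight in kg so that BMI is weight divided by squared height in meters.

```r
# Hypothetical toy data: hgt in cm, wgt in kg.
toy <- data.frame(hgt = c(80, NA, 110, 125, NA),
                  wgt = c(11, 15, NA, 25, NA))

# Derived variable: NA propagates from either input.
toy$bmi <- toy$wgt / (toy$hgt / 100)^2

# Rows where at least one input is missing.
input_missing <- is.na(toy$hgt) | is.na(toy$wgt)

# TRUE if bmi is missing on every one of those rows.
bmi_follows_inputs <- all(is.na(toy$bmi[input_missing]))
bmi_follows_inputs

# Cross-tabulate input missingness against bmi missingness.
table(input_missing, bmi_missing = is.na(toy$bmi))
```

If the cross-tabulation shows any row where an input is missing but the derived variable is observed, the derived variable was probably computed or edited outside the data set and deserves a closer look.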
Inspect the patterns of missingness and try to make sense of them. That can reveal things like contiguous blocks of variables that are always missing together, which might be a result of things like participants skipping an entire section of a survey. In a longitudinal study where the data file is laid out in wide, multivariate format (repeated measures over time are represented by sets of columns), wave non-response or dropout might show up the same way.
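When a plot becomes hard to read, the same patterns can be tabulated as text. The following is a minimal base R sketch on a hypothetical data frame `toy`; it encodes each row as a string with 1 for observed and 0 for missing, which mirrors the coding used by md.pattern(), and then counts how often each string occurs.

```r
# Hypothetical toy data with three distinct missingness patterns.
toy <- data.frame(age = c(1.2, 3.4, 5.6, 7.8, 9.0, 10.1),
                  hgt = c(80, 95, 110, 125, NA, 150),
                  gen = c(NA, NA, NA, 2, 3, 4),
                  tv  = c(NA, NA, NA, 4, 8, 12))

# Encode each row as a string: "1" = observed, "0" = missing,
# in column order (age, hgt, gen, tv).
pattern <- apply(!is.na(toy), 1,
                 function(r) paste(as.integer(r), collapse = ""))

# Frequency of each pattern, most common first.
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts
```

Here the most frequent pattern is "1100", meaning `gen` and `tv` are missing together while `age` and `hgt` are observed, which is the kind of block that suggests a skipped section or an age-based assessment rule.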
4.4 Predictors of Missingness
Identifying predictors of missingness is another step in assessing the nature of the missing data. This stage of assessing missing data issues can become quite extensive if there are many variables. Simple bivariate analyses may suffice, but you can use multivariate methods too. Just make sure you use models appropriate to the types of variables (nominal, ordinal, interval, ratio, counts, etc.) being examined in any given analysis.
Observed variables that are associated with whether another variable is missing are evidence that your data are incompatible with the missing completely at random (MCAR) mechanism. If MCAR is ruled out, the mechanism must be either missing at random (MAR) or missing not at random (MNAR).
Suppose in this boys growth study, we wanted to check whether the pubertal development variables were more likely to be missing for the youngest boys. That might happen if data collection procedures specified assessing the pubertal variables only for boys above some threshold age. We could create a set of binary missingness indicators, one for each pubertal variable, coded 1 if the pubertal variable was missing and 0 if it was observed (this is the coding that is.na() produces in the model below). Then a logistic regression model could test whether age predicts the missingness indicator well. Table 3 shows that the odds of tv being missing decrease with each additional year of age (OR = 0.832, or roughly 17% lower odds per year). This is actionable information about a potential source of bias if missing data are not handled properly.
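```{r}
#| label: tbl-m1
#| tbl-cap: Logistic Regression Model Predicting Missingness of tv
#| warning: false
# Outcome is.na(tv) is TRUE (1) when tv is missing, FALSE (0) when observed.
m1 <- glm(is.na(tv) ~ age, family = binomial, data = boys)
options(modelsummary_factory_html = 'kableExtra')
# The test statistic for a binomial glm is a Wald z, so label it "z".
modelsummary(m1, exponentiate = TRUE, output = "kableExtra", 
             align = "lrrrrr", escape = FALSE,
             col.names = c("Term", "OR", "OR.LL", "OR.UL", "z", "p"),
             shape = term ~ model + statistic,
             gof_omit = "BIC|AIC",
             statistic = c("conf.int", "statistic", "p.value"))
```

Table 3: Logistic Regression Model Predicting Missingness of tv

| Term        | OR      | OR.LL  | OR.UL  | z       | p      |
|:------------|--------:|-------:|-------:|--------:|-------:|
| (Intercept) |  16.934 | 11.230 | 26.545 |  12.921 | <0.001 |
| age         |   0.832 |  0.806 |  0.858 | −11.441 | <0.001 |
| Num.Obs.    |     748 |        |        |         |        |
| Log.Lik.    |    −Inf |        |        |         |        |
| F           | 130.902 |        |        |         |        |
| RMSE        |    0.42 |        |        |         |        |

Note: All statistics are for model (1). OR = odds ratio; OR.LL and OR.UL = lower and upper confidence limits for the odds ratio.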
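To see why a model like this is informative, it can help to run it on data where the missingness mechanism is known. The sketch below simulates a hypothetical sample in which the probability of a missing value falls with age, then fits the same kind of logistic regression. Every number in it (sample size, age range, coefficients, seed) is an arbitrary assumption chosen for illustration, and the confidence limits are Wald limits from `confint.default()`, which may differ slightly from the limits a reporting package would show.

```r
set.seed(2025)  # arbitrary seed so the sketch is reproducible

# Simulate 500 hypothetical boys aged 0 to 20 years.
n   <- 500
age <- runif(n, min = 0, max = 20)

# Assumed missingness model: the probability that y is missing
# falls as age rises (log-odds = 2.5 - 0.2 * age).
p_missing <- plogis(2.5 - 0.2 * age)
y <- rnorm(n)
y[rbinom(n, size = 1, prob = p_missing) == 1] <- NA

# Missingness indicator: TRUE (1) = missing, FALSE (0) = observed.
fit <- glm(is.na(y) ~ age, family = binomial)

# Odds ratio per additional year of age, with Wald 95% confidence limits.
or_age <- exp(coef(fit)[["age"]])
or_ci  <- exp(confint.default(fit)["age", ])
c(OR = or_age, or_ci)
```

An odds ratio below 1 for age recovers the direction built into the simulation: older cases are less likely to have a missing value. Because missingness here depends only on an observed variable, the simulated data are MAR rather than MCAR.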
Another thought for this example data is that maybe the rate of missingness for tv varies across regions. We can check that with a contingency table and a chi-square test. It does not look like region is associated with missingness of tv in these data.
```{r}
#| label: tbl-reg-tv
#| tbl-cap: Contingency Table Region by Missingness of tv
xtabs(~reg + is.na(tv), data = boys) %>%
  kable(col.names = c("Region", "FALSE (n)", "TRUE (n)")) %>%
  kable_styling() %>%
  add_header_above(., header = c(" ", "Missing tv" = 2))
xtabs(~reg + is.na(tv), data = boys) %>%
  chisq.test()
```

Table 4: Contingency Table Region by Missingness of tv

| Region | Missing tv: FALSE (n) | Missing tv: TRUE (n) |
|:-------|----------------------:|---------------------:|
| north  |                    32 |                   49 |
| east   |                    52 |                  109 |
| west   |                    62 |                  177 |
| south  |                    60 |                  131 |
| city   |                    20 |                   53 |
Put some thought into what to examine and how to do it. You may not be able to examine everything possible in a given data set, so be judicious about looking for potential predictors of missingness. Use substantive knowledge to guide those decisions if possible.
5 Extension to Multilevel Data
Suppose you have a multilevel data structure, such as students (level 1) nested within schools (level 2), with variables measured at each level of analysis. It would be prudent to assess missing data in each of the following data sets and think about what you learn from that before deciding how to handle the missing data.
5.1 Level 2 Data Set
Start by using a level 2 data set containing one row per school. Apply the techniques discussed above to understand missingness for school-level measures. How many schools have complete versus incomplete data? Missing school-level measures will affect all students from those schools after we merge the two levels of data.
5.2 Level 1 Data Set
Then, use a level 1 data set containing one row per student. This data set should only contain student-level measures so you can get a pure look at how student-level missing data issues might affect your analysis.
5.3 Combined Dataset
Merge the level 2 measures into the level 1 data set to get a combined data set with one row per student that contains both student- and school-level measures. Repeat the missing data analysis on this combined data to learn how much the intersection of missingness across levels affects the number of complete cases. Comparing the number of complete cases between the combined data set and the level 1 data set will reveal how much missing school-level measures will exacerbate any issues observed in the level 1 data. Some students may have complete data only when considering student-level measures, but have incomplete data after merging in school-level measures. That will have consequences for your analysis.
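The three-step comparison described in this section can be sketched with base R. The data frames `schools` and `students`, and all their column names, are hypothetical and exist only to illustrate the bookkeeping.

```r
# Hypothetical level 2 data: one row per school.
schools <- data.frame(school_id  = c("A", "B", "C"),
                      enrollment = c(420, NA, 310))

# Hypothetical level 1 data: one row per student, student-level measures only.
students <- data.frame(student_id = 1:6,
                       school_id  = c("A", "A", "B", "B", "C", "C"),
                       score      = c(71, NA, 64, 80, 92, 77))

# Combined data: school-level measures merged onto each student row.
combined <- merge(students, schools, by = "school_id", all.x = TRUE)

# Complete cases at each stage of the assessment.
n_cc <- c(level2   = sum(complete.cases(schools)),
          level1   = sum(complete.cases(students)),
          combined = sum(complete.cases(combined)))
n_cc

# Students lost only because a school-level measure is missing.
n_cc[["level1"]] - n_cc[["combined"]]
```

In this toy example a single school with a missing measure removes both of its students from the complete cases, on top of the one student who was already incomplete at level 1.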
6 References
Fredriks, A. M., van Buuren, S., Burgmeijer, R. J. F., Meulmeester, J. F., Beuker, R. J., Brugman, E., Roede, M. J., Verloove-Vanhorick, S. P., & Wit, J.-M. (2000). Continuing positive secular growth change in the Netherlands 1955-1997. Pediatric Research, 47, 316–323. https://doi.org/10.1203/00006450-200003000-00006
Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03
White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399. https://doi.org/10.1002/sim.4067
7 Software Information
Software chain: qmd file > RStudio > Quarto > R > knitr > md file > Pandoc > html file.
Quarto 1.7.20 runs *.qmd files through R and knitr to produce *.md markdown files.
Pandoc 3.2 converts markdown files (*.md) to other formats, including LaTeX (*.tex) and HTML (*.html) among others.
This document was generated using the following computational environment and dependencies:
```{r}
#| label: show-version
# Get R and R package version numbers in use.
devtools::session_info()
```
The Quarto source file (*.qmd) for this document begins with the following front matter:

```yaml
---
title: "Assessing Missing Data"
description: "Examples of Assessing Missing Data."
date: "2025-03-29"
date-modified: "2025-03-29"
draft: false
categories: 
  - Quarto Tips
  - Missing Data
  - R
format: 
  html: 
    toc: true
    toc-depth: 3
    number-sections: true
    number-depth: 3
    code-fold: true
    code-tools: true
    code-line-numbers: false
    embed-resources: true
    anchor-sections: true
    link-external-icon: true
execute:
  eval: true
  echo: fenced
  output: true
  warning: true
  error: true
  include: true
---
```

The source chunk that generates Table 1 is:

```{r}
#| label: tbl-dataset-info
#| tbl-cap: Numbers of Complete and Incomplete Cases, Variables, and Values
N_cases  <- nrow(boys)                     # All cases
N_ccases <- ncc(boys)                      # Complete cases
N_icases <- nic(boys)                      # Incomplete cases
N_vars   <- ncol(boys)                     # All variables
N_cvars  <- sum(colSums(is.na(boys)) == 0) # Complete variables
N_ivars  <- sum(colSums(is.na(boys)) > 0)  # Incomplete variables
N_vals   <- N_cases*N_vars                 # All values
N_cvals  <- sum(!is.na(boys))              # Complete values (non-missing)
N_ivals  <- sum(is.na(boys))               # Incomplete values (missing)
tibble(Subset = c("Complete", "Incomplete", "All"),
       Cases = c(N_ccases, N_icases, N_cases),
       Cases_P = 100*Cases/N_cases,
       Variables = c(N_cvars, N_ivars, N_vars),
       Variables_P = 100*Variables/N_vars,
       Values = c(N_cvals, N_ivals, N_vals),
       Values_P = 100*Values/N_vals) %>%
  kable(., format = "html", digits = 1,
        col.names = c("Subset", rep(c("n", "%"), times = 3))) %>%
  kable_styling() %>%
  add_header_above(., header = c(" ", "Cases" = 2, "Variables" = 2, "Values" = 2))
```

The source chunk that generates Table 2 is:

```{r}
#| label: tbl-missing-by-var
#| tbl-cap: !expr paste0("Univariate Missingness in Boys Growth Data, (N = ", 
#|   nrow(boys), ")")
boys %>%
  list() %>%
  map_dfr(~tibble(Name = names(.x),
                  N_Valid = colSums(!is.na(.x)),
                  N_Missing = colSums(is.na(.x)),
                  Pct_Missing = 100*(N_Missing/nrow(boys)))) %>%
  rowid_to_column(., "Position") %>%
  kable(., format = "html", digits = 1) %>%
  kable_styling()
```