- You may need to run `install.packages()` for some of these:

`riskfactors` data

- `riskfactors`: The Behavioral Risk Factor Surveillance System (BRFSS) Survey Data, 2009.
- The data is a subset of the 2009 survey from BRFSS, an ongoing data collection program designed to measure behavioral risk factors for the adult population (18 years of age or older) living in households.
- Codebook: https://www.rdocumentation.org/packages/naniar/versions/0.6.0/topics/riskfactors

```
library(dplyr)  # provides %>%
library(naniar) # provides the riskfactors data
data(riskfactors) # load data
riskfactors <- riskfactors %>% janitor::clean_names()
```

- Explore your data by adding the data.frame to the following commands:
- Try: `dplyr::glimpse()`

- Try: `skimr::skim_without_charts()`

- Try: `naniar::miss_var_summary()`

- Try: `naniar::gg_miss_var()`

- Try: `naniar::vis_miss()`

- Try: `naniar::gg_miss_upset()`

- Try: `geom_miss_point()` with code like the following:

```
riskfactors %>%
  ggplot() +
  aes(x = height_inch, y = weight_lbs) +
  geom_miss_point()
```
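The exploration commands listed above all take the data.frame as their first argument, so each one can be called directly or at the end of a pipe. A minimal sketch for three of them, assuming `dplyr`, `naniar`, and `janitor` are installed (the object name `miss_tbl` is only illustrative):

```r
library(dplyr)
library(naniar)

data(riskfactors, package = "naniar")
riskfactors <- riskfactors %>% janitor::clean_names()

# Structure overview: one line per column with its type and first values
riskfactors %>% dplyr::glimpse()

# Per-variable missingness as a table (count and percent missing)
miss_tbl <- riskfactors %>% naniar::miss_var_summary()
miss_tbl

# The same information as a plot
riskfactors %>% naniar::gg_miss_var()
```

The remaining functions in the list can be tried the same way by swapping in the function name.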

- How can `geom_miss_point` plot missing data?
- Looking at your `geom_miss_point` plot, what are the approximate *x* and *y* lower boundaries dividing NA values from the rest of the data?

- Which functions did you like best? Why?
- How much missingness is there? A lot, a little?
- Which variables have issues of missingness?
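To put numbers on the last two questions, one option is to combine an overall percentage with a per-variable table. This is only a sketch; `overall_pct` and `vars_with_na` are illustrative names, not part of the lab.

```r
library(dplyr)
library(naniar)

data(riskfactors, package = "naniar")
riskfactors <- riskfactors %>% janitor::clean_names()

# Share of all cells in the data.frame that are missing, in percent
overall_pct <- naniar::pct_miss(riskfactors)
overall_pct

# Variables with at least one missing value, worst first
vars_with_na <- riskfactors %>%
  naniar::miss_var_summary() %>%
  filter(n_miss > 0) %>%
  arrange(desc(n_miss))
vars_with_na
```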

- Try `gg_miss_fct(x = riskfactors, fct = marital)`

- See other options: https://naniar.njtierney.com/articles/naniar-visualisation.html

- Create a new column `weight_imp` and impute weight with `naniar::impute_mean()`

- Visually compare `weight_imp` and `weight_lbs`
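A minimal sketch of this step, assuming the imputed column is built from `weight_lbs` and that overlaid density curves are an acceptable way to compare the two columns (other plots would work as well):

```r
library(dplyr)
library(ggplot2)
library(naniar)

data(riskfactors, package = "naniar")
riskfactors <- riskfactors %>% janitor::clean_names()

# Replace each missing weight with the mean of the observed weights
riskfactors <- riskfactors %>%
  mutate(weight_imp = naniar::impute_mean(weight_lbs))

# Solid line: observed weights only; dashed line: observed plus imputed
ggplot(riskfactors) +
  geom_density(aes(x = weight_lbs), na.rm = TRUE) +
  geom_density(aes(x = weight_imp), linetype = "dashed")
```

Because every missing value receives the same number, the dashed curve is expected to be more peaked around the mean than the solid one.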

- Now create a column `weight_num` and a missingness dummy `weight_NA`

`simputation` prefers *not* to work with integers, so let's first create a new column `weight_num`

```
riskfactors <- riskfactors %>%
  mutate(
    weight_num = weight_lbs %>% as.numeric(),
    weight_NA = ifelse(is.na(weight_lbs), "NA", "!NA")
  )
```

- Try `simputation::impute_median` by a factor:

- Now impute a median `weight_num` as a function of a categorical variable like `marital`

```
risk_imp <- riskfactors %>%
  simputation::impute_median(weight_num ~ marital)
```

- Visually review your imputed data with something like:

```
risk_imp %>%
  ggplot() +
  aes(y = weight_num, x = marital, color = weight_NA) +
  geom_jitter()
```
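Alongside the plot, a quick numeric check can show which medians were available for imputation and how many values were filled in. The sketch below repeats the setup so it runs on its own; `n_before` and `n_after` are illustrative names.

```r
library(dplyr)

data(riskfactors, package = "naniar")
riskfactors <- riskfactors %>%
  janitor::clean_names() %>%
  mutate(weight_num = as.numeric(weight_lbs))

risk_imp <- riskfactors %>%
  simputation::impute_median(weight_num ~ marital)

# Median observed weight within each marital group
riskfactors %>%
  group_by(marital) %>%
  summarise(median_weight = median(weight_num, na.rm = TRUE))

# Missing values before and after imputation
n_before <- sum(is.na(riskfactors$weight_num))
n_after <- sum(is.na(risk_imp$weight_num))
c(before = n_before, after = n_after)
```

If `n_after` is not zero, a likely cause is rows where `marital` itself is missing, since there is then no group median to use.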

- Try `simputation::impute_lm`:

- Now impute `weight_num` as a function of age and sex
- See intro to `simputation`: https://cran.r-project.org/web/packages/simputation/vignettes/intro.html

```
risk_imp <- riskfactors %>%
  simputation::impute_lm(weight_num ~ ...) # replace ... with your predictors
```

- View your results with something like:
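One possible sketch, assuming `age + sex` as the predictors (as suggested above) and reusing the `weight_NA` dummy to color the imputed points. The choice of a scatter plot faceted by sex is an assumption, not the lab's required plot.

```r
library(dplyr)
library(ggplot2)

data(riskfactors, package = "naniar")
riskfactors <- riskfactors %>%
  janitor::clean_names() %>%
  mutate(
    weight_num = as.numeric(weight_lbs),
    weight_NA = ifelse(is.na(weight_lbs), "NA", "!NA")
  )

# Impute missing weight_num from a linear model on age and sex
risk_imp <- riskfactors %>%
  simputation::impute_lm(weight_num ~ age + sex)

# Imputed points (weight_NA == "NA") should fall along the fitted line
risk_imp %>%
  ggplot() +
  aes(x = age, y = weight_num, color = weight_NA) +
  geom_point(alpha = 0.6) +
  facet_wrap(~sex)
```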

- Visually evaluate the imputation of `weight_num`

- What’s the difference between `impute_lm` and `lm`?
- Try other possible models for `weight_num` (that is, add additional controls that might be useful for predicting weight)
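To make the `impute_lm` versus `lm` comparison concrete, the sketch below fits the same formula both ways. It assumes the default `impute_lm` settings, which add no random residual; `fit`, `was_missing`, and `manual_pred` are illustrative names.

```r
library(dplyr)

data(riskfactors, package = "naniar")
riskfactors <- riskfactors %>%
  janitor::clean_names() %>%
  mutate(weight_num = as.numeric(weight_lbs))

# lm() returns a fitted model object; rows with missing weight are dropped
fit <- lm(weight_num ~ age + sex, data = riskfactors)
class(fit)

# impute_lm() returns the data.frame itself, with missing weight_num filled in
risk_imp <- simputation::impute_lm(riskfactors, weight_num ~ age + sex)
class(risk_imp)

# Predictions from the lm() fit for the rows that were missing
was_missing <- is.na(riskfactors$weight_num)
manual_pred <- predict(fit, newdata = riskfactors[was_missing, ])

# Under the default settings these are expected to agree with the imputed values
all.equal(unname(manual_pred), risk_imp$weight_num[was_missing])
```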

- Try to impute `health_poor` with `impute_lm`

- What variables might be good predictors?

- For now focus on `height_inch`, `weight_lbs` and `health_poor`. Subset the data to these three variables with something like:

```
risk2 <- riskfactors %>%
  select(weight_lbs, height_inch, health_poor, weight_NA)
```

- Also, importantly, `Amelia` is an older package and doesn’t like the modern form of a data.frame called a `tibble`, so we need to convert it back to a plain old data.frame

```
risk2 <- risk2 %>% as.data.frame()
```

- Looking at our subsetted data, what are reasonable upper and lower bounds for those variables?

```
summary(risk2$weight_lbs)
```

- Complete the bounds_matrix for `height_inch` and `health_poor`

- What do negative values of `health_poor` mean? See codebook: https://www.rdocumentation.org/packages/naniar/versions/0.6.0/topics/riskfactors
- How should you adjust your bounds for `health_poor`?
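`Amelia` expects `bounds` as a three-column matrix with one row per bounded variable: column position in the data, lower bound, upper bound. The sketch below shows only the shape. Every numeric bound in it is a placeholder assumption to be replaced with values you judge reasonable from `summary()` and the codebook, and the column positions assume the `select()` order used for `risk2` above.

```r
# Each row: column position in risk2, lower bound, upper bound.
# All bound values below are illustrative placeholders, not recommendations.
bounds_matrix <- matrix(
  c(
    1, 50, 600, # weight_lbs  (column 1)
    2, 36, 96,  # height_inch (column 2)
    3, 0, 30    # health_poor (column 3), assuming a count of days out of 30
  ),
  ncol = 3,
  byrow = TRUE
)
bounds_matrix
```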

- Impute with `Amelia`

```
library(Amelia)
# create a vector of variables for Amelia to keep in the data but leave out of the imputation model
ignore_vars <- c("weight_NA")
# impute missing data
risk_amelia_out <- amelia(
  x = risk2, # data set
  m = 5, # number of imputations, usually 5; in pol346, 1 is ok
  bounds = bounds_matrix, # tell amelia to bound some vars
  idvars = ignore_vars # vars to leave out of imputation
)
```

- Extract first imputation

```
## amelia uses multiple imputation and creates 5 imputations
## code below extracts imputation 1
risk_imp1 <- risk_amelia_out$imputations$imp1
dim(risk_imp1)
```

- Evaluate the multiple imputation

- Check visually
- Though we didn’t impute it, why might we be concerned about imputing a variable like `pregnant`?
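For the visual check of the multiple imputation, `Amelia` ships its own diagnostic plots. The sketch below is one way to use them and is self-contained, so it re-runs a simplified imputation without bounds or `idvars`; the seed value and the name `imp_means` are arbitrary choices for illustration.

```r
library(dplyr)
library(Amelia)

data(riskfactors, package = "naniar")
risk2 <- riskfactors %>%
  janitor::clean_names() %>%
  select(weight_lbs, height_inch, health_poor) %>%
  as.data.frame()

set.seed(346) # arbitrary seed so the imputations are reproducible
risk_amelia_out <- amelia(x = risk2, m = 5, p2s = 0) # p2s = 0 silences progress output

# Density of observed values versus density of the imputed values
compare.density(risk_amelia_out, var = "weight_lbs")

# Mean weight in each of the five completed data sets
imp_means <- sapply(risk_amelia_out$imputations, function(d) mean(d$weight_lbs))
imp_means
```

Large differences between the entries of `imp_means` would suggest the imputations are unstable and worth a closer look.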

- https://naniar.njtierney.com
- https://naniar.njtierney.com/articles/naniar-visualisation.html
- https://rmisstastic.netlify.app/posts/lecture/
- https://www.youtube.com/watch?v=z8IuuDe5oXs&t=1816s
- https://www.ias.edu/video/MissingDataWS/2020/0908-RodLittle
- https://rmisstastic.netlify.app
- https://cran.r-project.org/web/views/MissingData.html