---
title: "Creating Survey Design Objects"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Creating Survey Design Objects}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r}
#| label: setup
#| include: false

knitr::opts_chunk$set(
  collapse = TRUE,
  eval = FALSE,
  comment = "#>"
)
```

## Why Survey Weights Matter

NHANES uses a complex, multistage probability sampling design to select participants who represent the civilian, non-institutionalized U.S. population. Analyses that ignore the survey weights can produce biased estimates that do not generalize to that population. The `create_design()` function automates the calculation of appropriate weights when combining multiple NHANES cycles, following [CDC weighting guidelines](https://wwwn.cdc.gov/nchs/nhanes/tutorials/Weighting.aspx).

## Understanding NHANES Weights

NHANES provides three categories of sampling weights, each reflecting a different level of participation:

1. **Interview weights** (`wtint2yr`, `wtint4yr`): Used when all variables come from the household interview (demographics, questionnaires).
2. **Mobile Exam Center (MEC) weights** (`wtmec2yr`, `wtmec4yr`): Used when any variable requires a physical exam (laboratory tests, body measurements, DEXA scans).
3. **Fasting weights** (`wtsaf2yr`): Used when any variable requires fasting laboratory tests (glucose, insulin, lipids).

Each category covers a smaller subsample than the one before it: interview, then MEC, then fasting. When an analysis combines variables across categories, use the weight for the smallest subsample involved (the one with the lowest probability of selection). For example, if your analysis includes both demographics (interview) and body measurements (MEC), use MEC weights.
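
The selection rule can be sketched as a small helper. `choose_wt_type()` is a hypothetical function written for this illustration and is not part of {nhanesdata}; it returns the most restrictive weight type among the components an analysis uses, with the same labels that `create_design()` accepts for `wt_type`.

```{r}
#| label: choose-weight-sketch

# Hypothetical helper (not part of nhanesdata): pick the weight type for the
# most restrictive component used in an analysis.
choose_wt_type <- function(components) {
  # Ordered from the largest sample (interview) to the smallest (fasting)
  wt_levels <- c("interview", "mec", "fasting")
  stopifnot(length(components) > 0, all(components %in% wt_levels))
  wt_levels[max(match(components, wt_levels))]
}

# Demographics plus body measurements
choose_wt_type(c("interview", "mec"))

# Demographics, body measurements, and fasting glucose
choose_wt_type(c("interview", "mec", "fasting"))
```

The first call returns `"mec"` and the second returns `"fasting"`, matching the rule described above.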

## Weight Calculation Logic

CDC recommendations for combining cycles are based on the number of cycles present in your data, not the timespan covered. This distinction matters when you have gaps in your data.

### Early Cycles (1999-2002)

NHANES provides 4-year weights (`wtint4yr`, `wtmec4yr`) for the 1999-2000 and 2001-2002 cycles, while all subsequent cycles provide only 2-year weights. When combining multiple cycles:

* **Cycles 1999 or 2001**: Use 4-year weight × (2/n)
  The numerator is 2 because the 4-year weight represents two 2-year cycles.

* **Cycles 2003+**: Use 2-year weight × (1/n)

* **Denominator n**: Total number of cycles in your analysis

### Example Calculation

Combining 4 cycles (1999, 2001, 2003, 2005) with MEC weights:

* 1999 & 2001: `wtmec4yr * 2/4 = wtmec4yr * 0.5`
* 2003 & 2005: `wtmec2yr * 1/4 = wtmec2yr * 0.25`

If you excluded the 2003 cycle, you would have 3 cycles total, so:

* 1999 & 2001: `wtmec4yr * 2/3`
* 2005: `wtmec2yr * 1/3`

The key principle: **n is the number of cycles present**, not the timespan.
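
The arithmetic above can be reproduced in a few lines of {dplyr}. This is a minimal sketch on a toy data frame with made-up weight values, not the internal implementation of `create_design()`; the helper `combine_mec_weights()` and the column `wt_combined` are hypothetical names chosen for the illustration.

```{r}
#| label: weight-calc-sketch

library(dplyr)

# Toy data: one row per cycle, with made-up weight values.
# 4-year weights exist only for the 1999 and 2001 cycles.
toy <- data.frame(
  year = c(1999, 2001, 2003, 2005),
  wtmec4yr = c(1000, 1200, NA, NA),
  wtmec2yr = c(2000, 2400, 3000, 3200)
)

# Hypothetical helper: rescale weights by the number of cycles present
combine_mec_weights <- function(df) {
  n_cycles <- n_distinct(df$year)
  df |>
    mutate(
      wt_combined = if_else(
        year %in% c(1999, 2001),
        wtmec4yr * 2 / n_cycles, # 4-year weight counts as two cycles
        wtmec2yr * 1 / n_cycles
      )
    )
}

# Four cycles: multipliers are 2/4 and 1/4
combine_mec_weights(toy)

# Drop 2003: three cycles remain, so multipliers become 2/3 and 1/3
toy |>
  filter(year != 2003) |>
  combine_mec_weights()
```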

## Basic Usage

```{r}
#| label: load-packages
#| eval: true
#| message: false
#| warning: false

library(nhanesdata)
library(dplyr)
library(srvyr)
```

### Example 1: Interview Weights

When analyzing demographics and questionnaire data only:

```{r}
#| label: interview-example

# Load demographics data
demo <- read_nhanes("demo")

# Create design with interview weights
design_int <- create_design(
  dsn = demo,
  start_yr = 1999,
  end_yr = 2011,
  wt_type = "interview"
)

# Calculate weighted means
design_int |>
  summarize(
    mean_age = survey_mean(ridageyr, na.rm = TRUE),
    pct_female = survey_mean(riagendr == 2, na.rm = TRUE)
  )
```

### Example 2: MEC Weights

When including any examination or laboratory data:

```{r}
#| label: mec-example

# Load demographics and body measures
demo <- read_nhanes("demo")
bmx <- read_nhanes("bmx")

combined <- demo |>
  left_join(bmx, by = c("seqn", "year"))

# Use MEC weights because body measures require exam participation
design_mec <- create_design(
  dsn = combined,
  start_yr = 2007,
  end_yr = 2017,
  wt_type = "mec"
)

# Weighted BMI analysis
design_mec |>
  filter(!is.na(bmxbmi)) |>
  summarize(
    mean_bmi = survey_mean(bmxbmi, na.rm = TRUE),
    pct_obese = survey_mean(bmxbmi >= 30, na.rm = TRUE)
  )
```

### Example 3: Fasting Weights

When including fasting laboratory measurements:

```{r}
#| label: fasting-example

# Load demographics and fasting lab data
demo <- read_nhanes("demo")
glu <- read_nhanes("glu")

combined <- demo |>
  left_join(glu, by = c("seqn", "year"))

# Use fasting weights for glucose analysis
design_fast <- create_design(
  dsn = combined,
  start_yr = 2005,
  end_yr = 2015,
  wt_type = "fasting"
)

# Analyze fasting glucose
design_fast |>
  filter(!is.na(lbxglu)) |>
  summarize(
    mean_glucose = survey_mean(lbxglu, na.rm = TRUE)
  )
```

## Handling Edge Cases

### Non-Sequential Cycles

You can specify a wide year range even if some cycles are missing from your data. The function calculates weights based only on cycles actually present:

```{r}
#| label: gaps-example

# Data might be missing 2007-2010 cycles
# Weights calculated on cycles present, not timespan
design <- create_design(
  dsn = demo,
  start_yr = 1999,
  end_yr = 2017,
  wt_type = "interview"
)
```
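
To see which denominator applies, it can help to count the cycles actually present before building the design. The following is a minimal sketch on a toy data frame with made-up rows; only the `year` column convention is taken from this vignette.

```{r}
#| label: cycles-present-sketch

library(dplyr)

# Toy data spanning 1999-2017, with the 2007 and 2009 cycles absent
toy <- data.frame(
  seqn = 1:8,
  year = c(1999, 2001, 2003, 2005, 2011, 2013, 2015, 2017)
)

# Cycles present within the requested range
cycles_present <- toy |>
  filter(year >= 1999, year <= 2017) |>
  distinct(year) |>
  pull(year)

# Cycles implied by the timespan alone (odd start years)
cycles_in_span <- seq(1999, 2017, by = 2)

# Number of cycles present: this is n in the weight formulas
length(cycles_present)

# Cycles in the span that are missing from the data
setdiff(cycles_in_span, cycles_present)
```

Here the span covers ten cycles but only eight are present, so n is 8.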

### Participants Without Valid Weights

When creating a survey design, some participants may lack the weight variable needed for your analysis. This happens naturally in NHANES because not everyone completes every component.

**How `create_design()` handles this:**

* Participants without valid weights for your chosen weight type are **automatically filtered out** before creating the design object
* You'll see a message indicating how many participants were removed and why
* The message includes links to CDC guidance and this vignette for reference

Example message you might see:

```
Filtered out 150 participants without valid mec weights.
These participants were not in the subsample for this weight category.
Learn more:
  + CDC weighting guidance:
    https://wwwn.cdc.gov/nchs/nhanes/tutorials/Weighting.aspx
  + Survey design vignette: vignette('survey-design', package = 'nhanesdata')
```

**Zero weights** are different from missing weights:

* Participants with **zero weights** are retained in the design object
* These participants weren't selected for a particular subsample
* They're automatically excluded from analyses by the {survey} package
* This is the correct behavior per CDC guidelines
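
The distinction can be checked on your own data before calling `create_design()`. The following is a minimal sketch on a toy data frame with made-up values; it counts missing and zero MEC weights separately.

```{r}
#| label: weight-check-sketch

library(dplyr)

# Toy data: two missing weights, one zero weight, three positive weights
toy <- data.frame(
  seqn = 1:6,
  wtmec2yr = c(25000, NA, 0, 31000, NA, 18000)
)

weight_summary <- toy |>
  summarize(
    n_missing = sum(is.na(wtmec2yr)),
    n_zero = sum(wtmec2yr == 0, na.rm = TRUE),
    n_positive = sum(wtmec2yr > 0, na.rm = TRUE)
  )

weight_summary
```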

## Variance Estimation and Lonely PSUs

NHANES uses a stratified, multistage sampling design with Primary Sampling Units (PSUs) nested within strata. Variance estimation requires at least 2 PSUs per stratum. When subsetting data (e.g., filtering to diabetes patients only), you may create strata with only one PSU.

The `create_design()` function sets `options(survey.lonely.psu = "adjust")`, which handles this conservatively by centering single-PSU strata at the sample grand mean rather than the stratum mean. This approach:

* Prevents errors when encountering lonely PSUs
* Provides conservative variance estimates
* Follows best practices for subset analyses

For more details on lonely PSU handling, see Thomas Lumley's [{survey} package documentation](https://r-survey.r-forge.r-project.org/survey/exmample-lonely.html).
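
The effect of this option can be seen on a toy design. The sketch below uses {survey} directly with made-up data in which one stratum contains a single PSU. The column names `stratum`, `psu`, `wt`, and `y` are assumptions for the illustration, and this is not the code that `create_design()` runs internally.

```{r}
#| label: lonely-psu-sketch

library(survey)

# Toy data: stratum 1 has two PSUs, stratum 2 has only one (a lonely PSU)
toy <- data.frame(
  stratum = c(1, 1, 1, 1, 2, 2),
  psu     = c(1, 1, 2, 2, 1, 1),
  wt      = c(10, 12, 9, 11, 15, 14),
  y       = c(4.1, 5.3, 3.8, 4.9, 6.2, 5.7)
)

# Set the option before building the design; keep the old value to restore it
old_opts <- options(survey.lonely.psu = "adjust")

toy_design <- svydesign(
  ids = ~psu,
  strata = ~stratum,
  weights = ~wt,
  data = toy,
  nest = TRUE
)

# With "adjust", the single-PSU stratum is centered at the grand mean
# and still contributes to the variance estimate
est <- svymean(~y, toy_design)
est

# Restore the previous option value
options(old_opts)
```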

## Required Variables

The function validates that your dataset contains:

* `year`: NHANES cycle start year (odd years: 1999, 2001, 2003, ..., 2021)
* `sdmvpsu`: Primary sampling units
* `sdmvstra`: Sampling strata
* Appropriate weight variables based on `wt_type`:
  + Interview: `wtint2yr` (and `wtint4yr` if 1999/2001 cycles present)
  + MEC: `wtmec2yr` (and `wtmec4yr` if 1999/2001 cycles present)
  + Fasting: `wtsaf2yr`

These variables are automatically included in datasets loaded via `read_nhanes()`.
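
If you assemble a dataset by other means, a quick pre-check can confirm that these columns are present. `missing_design_vars()` below is a hypothetical helper written for this sketch; it is not the validation code used inside `create_design()`.

```{r}
#| label: required-vars-sketch

# Hypothetical helper: list required columns that are absent from `df`
missing_design_vars <- function(df, wt_type = c("interview", "mec", "fasting")) {
  wt_type <- match.arg(wt_type)

  # 4-year weights are only needed when 1999 or 2001 cycles are present
  has_early <- any(df$year %in% c(1999, 2001))

  wt_vars <- switch(
    wt_type,
    interview = c("wtint2yr", if (has_early) "wtint4yr"),
    mec = c("wtmec2yr", if (has_early) "wtmec4yr"),
    fasting = "wtsaf2yr"
  )

  setdiff(c("year", "sdmvpsu", "sdmvstra", wt_vars), names(df))
}

# Toy data with a 2001 cycle but no 4-year MEC weight column
toy <- data.frame(
  year = c(2001, 2003),
  sdmvpsu = 1:2,
  sdmvstra = 1:2,
  wtmec2yr = c(100, 200)
)

missing_design_vars(toy, "mec")
```

For this toy input the helper reports `"wtmec4yr"`, because a 2001 cycle is present but the 4-year MEC weight column is not.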

## Workflow Recommendations

1. **Load and combine datasets** using `read_nhanes()` and {dplyr} joins
2. **Preprocess variables** (recode, create derived variables)
3. **Create the design object** with `create_design()`
4. **Perform weighted analyses** using {srvyr} or {survey} functions

Recoding and deriving variables before design creation is strongly recommended, because modifying columns inside a design object is more cumbersome. Subpopulation restrictions are the exception: apply them with `filter()` on the design object rather than on the raw data, so that variance estimation still uses the full sampling design.
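
Step 2 might look like the following. This is a minimal sketch on a toy data frame that borrows NHANES column names (`ridageyr`, `bmxbmi`); the derived variable names and the age cut points are assumptions chosen for illustration.

```{r}
#| label: preprocess-sketch

library(dplyr)

# Toy data with made-up values
toy <- data.frame(
  seqn = 1:5,
  ridageyr = c(19, 34, 52, 67, 80),
  bmxbmi = c(22.4, 31.0, 27.8, NA, 35.2)
)

prepped <- toy |>
  mutate(
    # Derived indicator; stays NA where BMI is missing
    obese = bmxbmi >= 30,
    # Age bands: [18, 40), [40, 60), [60, Inf)
    age_group = cut(
      ridageyr,
      breaks = c(18, 40, 60, Inf),
      right = FALSE,
      labels = c("18-39", "40-59", "60+")
    )
  )

prepped
```

The resulting data frame would then be passed to `create_design()` as `dsn`.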

## Additional Resources

* [CDC NHANES Tutorials](https://wwwn.cdc.gov/nchs/nhanes/tutorials/default.aspx): Official guidance on survey weighting and variance estimation
* [CDC Weighting Module](https://wwwn.cdc.gov/nchs/nhanes/tutorials/Weighting.aspx): Specific information on combining survey cycles
* [{survey} package documentation](https://r-survey.r-forge.r-project.org/survey/): Comprehensive guide to complex survey analysis in R
* [{srvyr} package](https://github.com/gergness/srvyr): Tidyverse-friendly wrapper for {survey} package functions