Introduction

Survey microdata often comes with one or more columns of “weights,” calculated variables indicating how many people (or households) each respondent to the survey is supposed to represent.

The logic of using weights is very simple: you multiply each respondent by its weight. For example, imagine a made-up two-person survey asking if people preferred Coke or Pepsi, where Person A said “Coke” and Person B “Pepsi.” In an unrepresentative survey, we’d say the results were 50%-50%. But say this were representative, and based on demographics we said Person A represented 150 people and Person B 100 people. Now instead of 1 vote for Coke and 1 vote for Pepsi, we have 150 votes for Coke and 100 for Pepsi, or 55%.

Moving back into reality, we can look at available survey microdata of the sort produced by the American Community Survey, the Current Population Survey, and others. All these datasets, available from IPUMS, include weighting variables.

You can use Excel or online tools to handle this kind of weighted survey microdata. But especially for some of these larger datasets, command-line tools like R are powerful — and surprisingly easy to use.

Here’s how to work with survey data in R.

Setup

The only package we’ll need for this is the tidyverse:

library(tidyverse)

For this lesson, we’re going to work with data from the U.S. Energy Information Administration’s invaluable Residential Energy Consumption Survey (RECS) about a topic of recent interest: gas stoves.

A subset of this data is in your project file in R’s hyper-efficient RDS format, and you can load it in with a single line of code:

load("recs.rds")

Take a look at the data, which is 1,368 responses from the 2020 RECS — just those responses from Minnesota, Montana, North Dakota, South Dakota and Wisconsin.

There’s lots of data here, but I want to focus on just a few columns:

recs %>% select(state_name, state_postal, RANGEFUEL, RANGEINDT, NWEIGHT)
## # A tibble: 1,368 × 5
##    state_name   state_postal RANGEFUEL RANGEINDT NWEIGHT
##    <chr>        <chr>            <dbl>     <dbl>   <dbl>
##  1 Minnesota    MN                  -2        -2   6151.
##  2 South Dakota SD                   5         0   1998.
##  3 Minnesota    MN                   5         0   5349.
##  4 Minnesota    MN                  -2        -2   7120.
##  5 North Dakota ND                  13        -2   1156.
##  6 North Dakota ND                   5         0   1327.
##  7 North Dakota ND                   5         0   1508.
##  8 Wisconsin    WI                   5         0   6823.
##  9 Wisconsin    WI                   1        -2   6217.
## 10 Wisconsin    WI                   1        -2   5363.
## # … with 1,358 more rows

This involves a few different variables to get at the question we really want to answer: what kind of stove does a house have? So let’s make a new variable to condense all this down. Run the following code:

recs <- recs %>% 
    mutate(range_type = case_when(RANGE == 0 ~ "None",
                                  RANGEFUEL == 1 ~ "Gas",
                                  RANGEFUEL == 2 ~ "Propane",
                                  RANGEFUEL == 13 ~ "Dual-fuel",
                                  RANGEFUEL == 5 & RANGEINDT == 0 ~ "Electric",
                                  RANGEFUEL == 5 & RANGEINDT == 1 ~ "Induction",
                                  TRUE ~ NA_character_) %>%
            fct_relevel("Electric", "Gas", "Propane", "Dual-fuel", "Induction", "None"), # Put our range types in order
           .after = "NWEIGHT") # Where the new column should be placed; without this it would default to the end

Basic weighting

Now that we’ve got our data loaded, let’s see how to handle survey weights in R. It’s pretty easy!

recs %>% # Start with your data
    group_by(range_type) %>% # Group by the variables you to analyze
    summarize(weight = sum(NWEIGHT)) %>% # Sum by your weighting variable
    mutate(pct = weight / sum(weight)) # Convert to percents
## # A tibble: 6 × 3
##   range_type   weight    pct
##   <fct>         <dbl>  <dbl>
## 1 Electric   3527825. 0.617 
## 2 Gas        1307520. 0.229 
## 3 Propane     248785. 0.0435
## 4 Dual-fuel   195092. 0.0341
## 5 Induction    78702. 0.0138
## 6 None        356062. 0.0623

Now you can see that about 3.5 million households, or 61.7 percent, have electric ranges, while 1.3 million households or 22.8 percent have gas stoves.

A simple graph could show this more clearly:

recs %>% # Start with your data
    group_by(range_type) %>% # Group by the variables you to analyze
    summarize(weight = sum(NWEIGHT)) %>% # Sum by your weighting variable
    mutate(pct = weight / sum(weight)) %>% # Convert to percents
    ggplot(aes(x = range_type, y = pct)) + # Create a graph with range_type and pct as our axis variables
    geom_col() + # Make it a bar graph
    coord_flip() # Flip the chart on its side

But often we’re interested in more than one variable! Fortunately, this trick scales up as much as you want. You just group by as many different variables as you want to analyze, then sum your weighting variable.

Here’s our data, broken out by state:

recs %>% # Start with your data
    group_by(state_name, range_type) %>% # Group by the variables you to analyze
    summarize(weight = sum(NWEIGHT)) %>% # Sum by your weighting variable
    mutate(pct = weight / sum(weight)) # Convert to percents
## `summarise()` has grouped output by 'state_name'. You can override using the
## `.groups` argument.
## # A tibble: 30 × 4
## # Groups:   state_name [5]
##    state_name range_type   weight    pct
##    <chr>      <fct>         <dbl>  <dbl>
##  1 Minnesota  Electric   1305204. 0.585 
##  2 Minnesota  Gas         480900. 0.216 
##  3 Minnesota  Propane     108876. 0.0488
##  4 Minnesota  Dual-fuel    79992. 0.0359
##  5 Minnesota  Induction    37763. 0.0169
##  6 Minnesota  None        217048. 0.0973
##  7 Montana    Electric    286868. 0.662 
##  8 Montana    Gas          89263. 0.206 
##  9 Montana    Propane      13279. 0.0306
## 10 Montana    Dual-fuel    12061. 0.0278
## # … with 20 more rows
recs %>% # Start with your data
    group_by(state_name, range_type) %>% # Group by the variables you to analyze
    summarize(weight = sum(NWEIGHT)) %>% # Sum by your weighting variable
    mutate(pct = weight / sum(weight)) %>% # Convert to percents
    ggplot(aes(x = range_type, y = pct)) + # Create a graph with range_type and pct as our axis variables
    geom_col() + # Make it a bar graph
    coord_flip() + # Flip the chart on its side
    facet_wrap(vars(state_name))
## `summarise()` has grouped output by 'state_name'. You can override using the
## `.groups` argument.

You can see that gas stoves are particularly rare in North Dakota, among other potentially interesting conclusions.

A note on group order.

One thing to be aware of: in R, when you group_by() and then summarize(), it automatically removes the right-most group from your data. Here’s what this means:

recs %>% 
    group_by(range_type) %>% # Group by ange_type
    summarize(weight = sum(NWEIGHT)) %>% # Summary removes the rightmost (and only) grouping variable
    mutate(pct = weight / sum(weight)) 
## # A tibble: 6 × 3
##   range_type   weight    pct
##   <fct>         <dbl>  <dbl>
## 1 Electric   3527825. 0.617 
## 2 Gas        1307520. 0.229 
## 3 Propane     248785. 0.0435
## 4 Dual-fuel   195092. 0.0341
## 5 Induction    78702. 0.0138
## 6 None        356062. 0.0623

In the above code, running mutate(pct = weight / sum(weight)) produced the percentages we expected because the data was *no longer grouped by range_type.

If the data were still grouped, it would divide weight by sum(weight) for each group, unhelpfully producing 1s!

recs %>% 
    group_by(range_type) %>% 
    summarize(weight = sum(NWEIGHT)) %>% # Summary removes the rightmost (and only) grouping variable
    group_by(range_type) %>% # Re-adding our grouping variable
    mutate(pct = weight / sum(weight)) 
## # A tibble: 6 × 3
## # Groups:   range_type [6]
##   range_type   weight   pct
##   <fct>         <dbl> <dbl>
## 1 Electric   3527825.     1
## 2 Gas        1307520.     1
## 3 Propane     248785.     1
## 4 Dual-fuel   195092.     1
## 5 Induction    78702.     1
## 6 None        356062.     1

If you have multiple grouping variables, it removes the right-most. When we group our data by state_name and range_type, that means our percents will be calculated within each state as the remaining grouped variable.

recs %>% 
    group_by(state_name, range_type) %>% 
    summarize(weight = sum(NWEIGHT)) %>% # Summary removes the rightmost grouping variable
    mutate(pct = weight / sum(weight)) 
## `summarise()` has grouped output by 'state_name'. You can override using the
## `.groups` argument.
## # A tibble: 30 × 4
## # Groups:   state_name [5]
##    state_name range_type   weight    pct
##    <chr>      <fct>         <dbl>  <dbl>
##  1 Minnesota  Electric   1305204. 0.585 
##  2 Minnesota  Gas         480900. 0.216 
##  3 Minnesota  Propane     108876. 0.0488
##  4 Minnesota  Dual-fuel    79992. 0.0359
##  5 Minnesota  Induction    37763. 0.0169
##  6 Minnesota  None        217048. 0.0973
##  7 Montana    Electric    286868. 0.662 
##  8 Montana    Gas          89263. 0.206 
##  9 Montana    Propane      13279. 0.0306
## 10 Montana    Dual-fuel    12061. 0.0278
## # … with 20 more rows

When the data’s been grouped like this, each state’s values should add up to 1:

0.585350262 + 0.215671046 + 0.048828105 + 0.035874385 + 0.016935786 + 0.097340417
## [1] 1

When in doubt, manually apply the grouping you want!

Continuous variables

Sometimes we have a continuous variable in our survey responses that we want to analyze. But you can’t group by a continuous variable. Similarly, sometimes a categorical variable includes way more detail than we want.

What we need to do is create a new categorical variable that condenses our data into discrete chunks.

For example, perhaps we were interested in whether older homes are more or less likely to have gas stoves than newer homes. The RECS has a variable, YEARMADERANGE, with very detailed buckets: “2” indicates 1950-59, “3” is 1960-69, and so on. In this case, we could check simply homes with a YEARMADERANGE of “1,” meaning homes built before 1950:

recs %>%
    select(state_name, range_type, YEARMADERANGE, NWEIGHT) %>%
    mutate(is_old = case_when(YEARMADERANGE == 1 ~ "Old", TRUE ~ "New"))
## # A tibble: 1,368 × 5
##    state_name   range_type YEARMADERANGE NWEIGHT is_old
##    <chr>        <fct>              <dbl>   <dbl> <chr> 
##  1 Minnesota    None                   2   6151. New   
##  2 South Dakota Electric               6   1998. New   
##  3 Minnesota    Electric               7   5349. New   
##  4 Minnesota    None                   1   7120. Old   
##  5 North Dakota Dual-fuel              8   1156. New   
##  6 North Dakota Electric               3   1327. New   
##  7 North Dakota Electric               4   1508. New   
##  8 Wisconsin    Electric               2   6823. New   
##  9 Wisconsin    Gas                    4   6217. New   
## 10 Wisconsin    Gas                    7   5363. New   
## # … with 1,358 more rows

The case_when() function has a little bit of an odd syntax. Here’s how that works, using a ~ to separate a statistical test on the left from the output if true on the right:

data %>%
    mutate(varname = case_when(test1 ~ output1,
                               test2 ~ output2,
                               ...,
                               TRUE ~ output_in_all_other_cases))

Going back to our data, we can now calculate a result:

recs %>%
    mutate(is_old = case_when(YEARMADERANGE == 1 ~ "Old", TRUE ~ "New")) %>%
    group_by(is_old, range_type) %>%
    summarize(weight = sum(NWEIGHT)) %>% # Summary removes the rightmost grouping variable
    mutate(pct = weight / sum(weight)) 
## `summarise()` has grouped output by 'is_old'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 4
## # Groups:   is_old [2]
##    is_old range_type   weight     pct
##    <chr>  <fct>         <dbl>   <dbl>
##  1 New    Electric   2915334. 0.640  
##  2 New    Gas         930354. 0.204  
##  3 New    Propane     208843. 0.0459 
##  4 New    Dual-fuel   127675. 0.0280 
##  5 New    Induction    68114. 0.0150 
##  6 New    None        303930. 0.0667 
##  7 Old    Electric    612490. 0.528  
##  8 Old    Gas         377166. 0.325  
##  9 Old    Propane      39941. 0.0344 
## 10 Old    Dual-fuel    67417. 0.0581 
## 11 Old    Induction    10589. 0.00913
## 12 Old    None         52132. 0.0450

Here we can see that gas stoves are actually more common in older homes than newer ones!

Multiple variables

So far we’ve worked with one or two grouping variables. We can go further!

Let’s stick with our home-age variable, and add in another variable about home type:

recs %>%
    mutate(is_old = case_when(YEARMADERANGE == 1 ~ "Old", TRUE ~ "New")) %>%
    mutate(home_type = case_when(TYPEHUQ == 1 ~ "Mobile home",
                                 TYPEHUQ == 2 ~ "Single-family home",
                                 TYPEHUQ == 3 ~ "Townhome/duplex",
                                 TYPEHUQ %in% 4:5 ~ "Apartment") %>% as_factor()) %>%
    group_by(is_old, home_type, range_type) %>%
    summarize(weight = sum(NWEIGHT)) %>% # Summary removes the rightmost grouping variable
    mutate(pct = weight / sum(weight)) 
## `summarise()` has grouped output by 'is_old', 'home_type'. You can override
## using the `.groups` argument.
## # A tibble: 33 × 5
## # Groups:   is_old, home_type [7]
##    is_old home_type          range_type   weight    pct
##    <chr>  <fct>              <fct>         <dbl>  <dbl>
##  1 New    Single-family home Electric   1604148. 0.553 
##  2 New    Single-family home Gas         736487. 0.254 
##  3 New    Single-family home Propane     157628. 0.0544
##  4 New    Single-family home Dual-fuel    75266. 0.0260
##  5 New    Single-family home Induction    39115. 0.0135
##  6 New    Single-family home None        286884. 0.0989
##  7 New    Apartment          Electric   1045174. 0.868 
##  8 New    Apartment          Gas          89308. 0.0742
##  9 New    Apartment          Dual-fuel    34190. 0.0284
## 10 New    Apartment          Induction    24312. 0.0202
## # … with 23 more rows

Remembering our grouping, we can see that this output is grouped by whether the home was built before 1950, and what type of home it is. So our percentages for old apartments add up to 1; so do our percentages for new apartments, for old single-family homes, etc.

But this is a bit hectic and hard to visualize. So let’s simplify this down: compare single-family homes and apartments by their prevalence of gas stoves:

recs %>%
    mutate(is_old = case_when(YEARMADERANGE == 1 ~ "Old", TRUE ~ "New")) %>%
    mutate(home_type = case_when(TYPEHUQ == 1 ~ "Mobile home",
                                 TYPEHUQ == 2 ~ "Single-family home",
                                 TYPEHUQ == 3 ~ "Townhome/duplex",
                                 TYPEHUQ %in% 4:5 ~ "Apartment") %>% as_factor()) %>%
    group_by(is_old, home_type, range_type) %>%
    summarize(weight = sum(NWEIGHT)) %>% # Summary removes the rightmost grouping variable
    mutate(pct = weight / sum(weight)) %>%
    filter(home_type %in% c("Single-family home", "Apartment"),
           range_type == "Gas") %>%
    ggplot(aes(x = is_old, y = pct)) +
    geom_col() +
    coord_flip() +
    facet_wrap(vars(home_type)) +
    labs(title = "Gas stove prevalence")
## `summarise()` has grouped output by 'is_old', 'home_type'. You can override
## using the `.groups` argument.

Aha! It turns out that our earlier finding that older homes were more likely to have gas stoves is actually a function of older apartments!

Explore the data

Fiddle around with the dataset on your own! There’s a codebook in the project folder, recs_codebook.xlsx.