Survey microdata often comes with one or more columns of “weights,” calculated variables indicating how many people (or households) each respondent to the survey is supposed to represent.
The logic of using weights is very simple: you multiply each respondent by its weight. For example, imagine a made-up two-person survey asking if people preferred Coke or Pepsi, where Person A said “Coke” and Person B “Pepsi.” In an unrepresentative survey, we’d say the results were 50%-50%. But say this were representative, and based on demographics we said Person A represented 150 people and Person B 100 people. Now instead of 1 vote for Coke and 1 vote for Pepsi, we have 150 votes for Coke and 100 for Pepsi, or 55%.
Moving back into reality, we can look at available survey microdata of the sort produced by the American Community Survey, the Current Population Survey, and others. All these datasets, available from IPUMS, include weighting variables.
You can use Excel or online tools to handle this kind of weighted survey microdata. But especially for some of these larger datasets, command-line tools like R are powerful — and surprisingly easy to use.
Here’s how to work with survey data in R.
The only package we’ll need for this is the
tidyverse
:
library(tidyverse)
For this lesson, we’re going to work with data from the U.S. Energy Information Administration’s invaluable Residential Energy Consumption Survey (RECS) about a topic of recent interest: gas stoves.
A subset of this data is in your project file in R’s hyper-efficient
RDS
format, and you can load it in with a single line of
code:
load("recs.rds")
Take a look at the data, which is 1,368 responses from the 2020 RECS — just those responses from Minnesota, Montana, North Dakota, South Dakota and Wisconsin.
There’s lots of data here, but I want to focus on just a few columns:
state_name
or state_postal
identify the
resident’s stateRANGEFUEL
indicates the fuel powering the home’s
primary stoveRANGEINDT
indicates whether an electric stove is an
induction range or notNWEIGHT
is our weighting variablerecs %>% select(state_name, state_postal, RANGEFUEL, RANGEINDT, NWEIGHT)
## # A tibble: 1,368 × 5
## state_name state_postal RANGEFUEL RANGEINDT NWEIGHT
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Minnesota MN -2 -2 6151.
## 2 South Dakota SD 5 0 1998.
## 3 Minnesota MN 5 0 5349.
## 4 Minnesota MN -2 -2 7120.
## 5 North Dakota ND 13 -2 1156.
## 6 North Dakota ND 5 0 1327.
## 7 North Dakota ND 5 0 1508.
## 8 Wisconsin WI 5 0 6823.
## 9 Wisconsin WI 1 -2 6217.
## 10 Wisconsin WI 1 -2 5363.
## # … with 1,358 more rows
This involves a few different variables to get at the question we really want to answer: what kind of stove does a house have? So let’s make a new variable to condense all this down. Run the following code:
recs <- recs %>%
mutate(range_type = case_when(RANGE == 0 ~ "None",
RANGEFUEL == 1 ~ "Gas",
RANGEFUEL == 2 ~ "Propane",
RANGEFUEL == 13 ~ "Dual-fuel",
RANGEFUEL == 5 & RANGEINDT == 0 ~ "Electric",
RANGEFUEL == 5 & RANGEINDT == 1 ~ "Induction",
TRUE ~ NA_character_) %>%
fct_relevel("Electric", "Gas", "Propane", "Dual-fuel", "Induction", "None"), # Put our range types in order
.after = "NWEIGHT") # Where the new column should be placed; without this it would default to the end
Now that we’ve got our data loaded, let’s see how to handle survey weights in R. It’s pretty easy!
recs %>% # Start with your data
group_by(range_type) %>% # Group by the variables you to analyze
summarize(weight = sum(NWEIGHT)) %>% # Sum by your weighting variable
mutate(pct = weight / sum(weight)) # Convert to percents
## # A tibble: 6 × 3
## range_type weight pct
## <fct> <dbl> <dbl>
## 1 Electric 3527825. 0.617
## 2 Gas 1307520. 0.229
## 3 Propane 248785. 0.0435
## 4 Dual-fuel 195092. 0.0341
## 5 Induction 78702. 0.0138
## 6 None 356062. 0.0623
Now you can see that about 3.5 million households, or 61.7 percent, have electric ranges, while 1.3 million households or 22.8 percent have gas stoves.
A simple graph could show this more clearly:
recs %>% # Start with your data
group_by(range_type) %>% # Group by the variables you to analyze
summarize(weight = sum(NWEIGHT)) %>% # Sum by your weighting variable
mutate(pct = weight / sum(weight)) %>% # Convert to percents
ggplot(aes(x = range_type, y = pct)) + # Create a graph with range_type and pct as our axis variables
geom_col() + # Make it a bar graph
coord_flip() # Flip the chart on its side
But often we’re interested in more than one variable! Fortunately, this trick scales up as much as you want. You just group by as many different variables as you want to analyze, then sum your weighting variable.
Here’s our data, broken out by state:
recs %>% # Start with your data
group_by(state_name, range_type) %>% # Group by the variables you to analyze
summarize(weight = sum(NWEIGHT)) %>% # Sum by your weighting variable
mutate(pct = weight / sum(weight)) # Convert to percents
## `summarise()` has grouped output by 'state_name'. You can override using the
## `.groups` argument.
## # A tibble: 30 × 4
## # Groups: state_name [5]
## state_name range_type weight pct
## <chr> <fct> <dbl> <dbl>
## 1 Minnesota Electric 1305204. 0.585
## 2 Minnesota Gas 480900. 0.216
## 3 Minnesota Propane 108876. 0.0488
## 4 Minnesota Dual-fuel 79992. 0.0359
## 5 Minnesota Induction 37763. 0.0169
## 6 Minnesota None 217048. 0.0973
## 7 Montana Electric 286868. 0.662
## 8 Montana Gas 89263. 0.206
## 9 Montana Propane 13279. 0.0306
## 10 Montana Dual-fuel 12061. 0.0278
## # … with 20 more rows
recs %>% # Start with your data
group_by(state_name, range_type) %>% # Group by the variables you to analyze
summarize(weight = sum(NWEIGHT)) %>% # Sum by your weighting variable
mutate(pct = weight / sum(weight)) %>% # Convert to percents
ggplot(aes(x = range_type, y = pct)) + # Create a graph with range_type and pct as our axis variables
geom_col() + # Make it a bar graph
coord_flip() + # Flip the chart on its side
facet_wrap(vars(state_name))
## `summarise()` has grouped output by 'state_name'. You can override using the
## `.groups` argument.
You can see that gas stoves are particularly rare in North Dakota, among other potentially interesting conclusions.
One thing to be aware of: in R, when you group_by()
and
then summarize()
, it automatically removes the right-most
group from your data. Here’s what this means:
recs %>%
group_by(range_type) %>% # Group by ange_type
summarize(weight = sum(NWEIGHT)) %>% # Summary removes the rightmost (and only) grouping variable
mutate(pct = weight / sum(weight))
## # A tibble: 6 × 3
## range_type weight pct
## <fct> <dbl> <dbl>
## 1 Electric 3527825. 0.617
## 2 Gas 1307520. 0.229
## 3 Propane 248785. 0.0435
## 4 Dual-fuel 195092. 0.0341
## 5 Induction 78702. 0.0138
## 6 None 356062. 0.0623
In the above code, running
mutate(pct = weight / sum(weight))
produced the percentages
we expected because the data was *no longer grouped by
range_type
.
If the data were still grouped, it would divide
weight
by sum(weight)
for each group,
unhelpfully producing 1s!
recs %>%
group_by(range_type) %>%
summarize(weight = sum(NWEIGHT)) %>% # Summary removes the rightmost (and only) grouping variable
group_by(range_type) %>% # Re-adding our grouping variable
mutate(pct = weight / sum(weight))
## # A tibble: 6 × 3
## # Groups: range_type [6]
## range_type weight pct
## <fct> <dbl> <dbl>
## 1 Electric 3527825. 1
## 2 Gas 1307520. 1
## 3 Propane 248785. 1
## 4 Dual-fuel 195092. 1
## 5 Induction 78702. 1
## 6 None 356062. 1
If you have multiple grouping variables, it removes the right-most.
When we group our data by state_name
and
range_type
, that means our percents will be calculated
within each state as the remaining grouped variable.
recs %>%
group_by(state_name, range_type) %>%
summarize(weight = sum(NWEIGHT)) %>% # Summary removes the rightmost grouping variable
mutate(pct = weight / sum(weight))
## `summarise()` has grouped output by 'state_name'. You can override using the
## `.groups` argument.
## # A tibble: 30 × 4
## # Groups: state_name [5]
## state_name range_type weight pct
## <chr> <fct> <dbl> <dbl>
## 1 Minnesota Electric 1305204. 0.585
## 2 Minnesota Gas 480900. 0.216
## 3 Minnesota Propane 108876. 0.0488
## 4 Minnesota Dual-fuel 79992. 0.0359
## 5 Minnesota Induction 37763. 0.0169
## 6 Minnesota None 217048. 0.0973
## 7 Montana Electric 286868. 0.662
## 8 Montana Gas 89263. 0.206
## 9 Montana Propane 13279. 0.0306
## 10 Montana Dual-fuel 12061. 0.0278
## # … with 20 more rows
When the data’s been grouped like this, each state’s values should add up to 1:
0.585350262 + 0.215671046 + 0.048828105 + 0.035874385 + 0.016935786 + 0.097340417
## [1] 1
When in doubt, manually apply the grouping you want!
Sometimes we have a continuous variable in our survey responses that we want to analyze. But you can’t group by a continuous variable. Similarly, sometimes a categorical variable includes way more detail than we want.
What we need to do is create a new categorical variable that condenses our data into discrete chunks.
For example, perhaps we were interested in whether older homes are
more or less likely to have gas stoves than newer homes. The RECS has a
variable, YEARMADERANGE
, with very detailed buckets: “2”
indicates 1950-59, “3” is 1960-69, and so on. In this case, we could
check simply homes with a YEARMADERANGE
of “1,” meaning
homes built before 1950:
recs %>%
select(state_name, range_type, YEARMADERANGE, NWEIGHT) %>%
mutate(is_old = case_when(YEARMADERANGE == 1 ~ "Old", TRUE ~ "New"))
## # A tibble: 1,368 × 5
## state_name range_type YEARMADERANGE NWEIGHT is_old
## <chr> <fct> <dbl> <dbl> <chr>
## 1 Minnesota None 2 6151. New
## 2 South Dakota Electric 6 1998. New
## 3 Minnesota Electric 7 5349. New
## 4 Minnesota None 1 7120. Old
## 5 North Dakota Dual-fuel 8 1156. New
## 6 North Dakota Electric 3 1327. New
## 7 North Dakota Electric 4 1508. New
## 8 Wisconsin Electric 2 6823. New
## 9 Wisconsin Gas 4 6217. New
## 10 Wisconsin Gas 7 5363. New
## # … with 1,358 more rows
The case_when()
function has a little bit of an odd
syntax. Here’s how that works, using a ~
to separate a
statistical test on the left from the output if true on the right:
data %>%
mutate(varname = case_when(test1 ~ output1,
test2 ~ output2,
...,
TRUE ~ output_in_all_other_cases))
Going back to our data, we can now calculate a result:
recs %>%
mutate(is_old = case_when(YEARMADERANGE == 1 ~ "Old", TRUE ~ "New")) %>%
group_by(is_old, range_type) %>%
summarize(weight = sum(NWEIGHT)) %>% # Summary removes the rightmost grouping variable
mutate(pct = weight / sum(weight))
## `summarise()` has grouped output by 'is_old'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 4
## # Groups: is_old [2]
## is_old range_type weight pct
## <chr> <fct> <dbl> <dbl>
## 1 New Electric 2915334. 0.640
## 2 New Gas 930354. 0.204
## 3 New Propane 208843. 0.0459
## 4 New Dual-fuel 127675. 0.0280
## 5 New Induction 68114. 0.0150
## 6 New None 303930. 0.0667
## 7 Old Electric 612490. 0.528
## 8 Old Gas 377166. 0.325
## 9 Old Propane 39941. 0.0344
## 10 Old Dual-fuel 67417. 0.0581
## 11 Old Induction 10589. 0.00913
## 12 Old None 52132. 0.0450
Here we can see that gas stoves are actually more common in older homes than newer ones!
So far we’ve worked with one or two grouping variables. We can go further!
Let’s stick with our home-age variable, and add in another variable about home type:
recs %>%
mutate(is_old = case_when(YEARMADERANGE == 1 ~ "Old", TRUE ~ "New")) %>%
mutate(home_type = case_when(TYPEHUQ == 1 ~ "Mobile home",
TYPEHUQ == 2 ~ "Single-family home",
TYPEHUQ == 3 ~ "Townhome/duplex",
TYPEHUQ %in% 4:5 ~ "Apartment") %>% as_factor()) %>%
group_by(is_old, home_type, range_type) %>%
summarize(weight = sum(NWEIGHT)) %>% # Summary removes the rightmost grouping variable
mutate(pct = weight / sum(weight))
## `summarise()` has grouped output by 'is_old', 'home_type'. You can override
## using the `.groups` argument.
## # A tibble: 33 × 5
## # Groups: is_old, home_type [7]
## is_old home_type range_type weight pct
## <chr> <fct> <fct> <dbl> <dbl>
## 1 New Single-family home Electric 1604148. 0.553
## 2 New Single-family home Gas 736487. 0.254
## 3 New Single-family home Propane 157628. 0.0544
## 4 New Single-family home Dual-fuel 75266. 0.0260
## 5 New Single-family home Induction 39115. 0.0135
## 6 New Single-family home None 286884. 0.0989
## 7 New Apartment Electric 1045174. 0.868
## 8 New Apartment Gas 89308. 0.0742
## 9 New Apartment Dual-fuel 34190. 0.0284
## 10 New Apartment Induction 24312. 0.0202
## # … with 23 more rows
Remembering our grouping, we can see that this output is grouped by whether the home was built before 1950, and what type of home it is. So our percentages for old apartments add up to 1; so do our percentages for new apartments, for old single-family homes, etc.
But this is a bit hectic and hard to visualize. So let’s simplify this down: compare single-family homes and apartments by their prevalence of gas stoves:
recs %>%
mutate(is_old = case_when(YEARMADERANGE == 1 ~ "Old", TRUE ~ "New")) %>%
mutate(home_type = case_when(TYPEHUQ == 1 ~ "Mobile home",
TYPEHUQ == 2 ~ "Single-family home",
TYPEHUQ == 3 ~ "Townhome/duplex",
TYPEHUQ %in% 4:5 ~ "Apartment") %>% as_factor()) %>%
group_by(is_old, home_type, range_type) %>%
summarize(weight = sum(NWEIGHT)) %>% # Summary removes the rightmost grouping variable
mutate(pct = weight / sum(weight)) %>%
filter(home_type %in% c("Single-family home", "Apartment"),
range_type == "Gas") %>%
ggplot(aes(x = is_old, y = pct)) +
geom_col() +
coord_flip() +
facet_wrap(vars(home_type)) +
labs(title = "Gas stove prevalence")
## `summarise()` has grouped output by 'is_old', 'home_type'. You can override
## using the `.groups` argument.
Aha! It turns out that our earlier finding that older homes were more likely to have gas stoves is actually a function of older apartments!
Fiddle around with the dataset on your own! There’s a codebook in the
project folder, recs_codebook.xlsx
.