11/29/2021

Introduction

  • A quick roundup of what you need to know to understand and use data as part of normal reporting
  • You don’t need to be an expert to use data!
  • Keep an eye out for common pitfalls
  • Some parts of this might be basic to some of you, but hopefully everyone will find at least some of this helpful

Three tiers of journalistic data literacy

  • BASIC: You can parse a spreadsheet, request data, and avoid simple pitfalls
  • INTERMEDIATE: You’re comfortable with spreadsheets, working with formulas and pivot tables, and making simple charts to visualize data
  • ADVANCED: You’re familiar with one or more tools for working with big databases, code, geographic data, and how to request this data

My opinion: Not everyone needs to be a data expert. But in a mid-sized newsroom like MPR, every journalist should have at least Basic levels of data literacy, a majority should be Intermediate, and a few should be Advanced.

Terms to know

  • CSV: “Comma-separated values,” a basic file format for spreadsheets, where each line is a row, and columns are separated by commas
    • Similar to Excel files (.XLS, .XLSX)
    • Open source and universal
    • Doesn’t store formatting
    • One sheet per document
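
A minimal sketch of what "each line is a row, columns separated by commas" looks like in practice, using Python's built-in csv module. The sample text and its column names are made up for illustration. Note that a value containing a comma has to be wrapped in quotes, and that every value is read as plain text.

```python
import csv
import io

# Made-up sample text standing in for the contents of a .csv file.
# With a real file you would use open("some_file.csv", newline="") instead.
csv_text = (
    "city,note,count\n"
    '"St. Paul","capital, east metro",3\n'
    "Duluth,port city,5\n"
)

# DictReader uses the first line as column names.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    print(row["city"], "|", row["note"], "|", row["count"])

# CSV stores no formatting or types, so numbers arrive as text.
total = sum(int(row["count"]) for row in rows)
print("Total:", total)
```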

Terms to know

  • GIS: Geographic Information System, the discipline of using computers to analyze geographic data and make maps
  • ArcGIS/QGIS: Two popular GIS programs. ArcGIS is the major commercial GIS program, while QGIS is the free, open-source alternative
  • Shapefile: A common file format for GIS data. Actually not one file but a collection of files with the same filename and different extensions — the map.shp file has the geographic data, the map.dbf file has accompanying data, and so on

Terms to know

  • R: A programming language optimized for statistics and data visualization
  • Python: A general-purpose programming language with strong support for data
  • JavaScript: A programming language focused on scripting web sites
  • D3: A JavaScript library focused on interactive visualizations
  • SQL: “Structured Query Language,” a programming language used for interacting with databases

Terms to know

  • Pivot table: A function in Excel and other spreadsheet programs to summarize a larger spreadsheet

Terms to know

  • Summary data/raw data

Raw data

## # A tibble: 8 × 3
##   carat cut       price
##   <dbl> <ord>     <chr>
## 1  0.23 Ideal     $326 
## 2  0.21 Premium   $326 
## 3  0.23 Good      $327 
## 4  0.29 Premium   $334 
## 5  0.31 Good      $335 
## 6  0.24 Very Good $336 
## 7  0.24 Very Good $336 
## 8  0.26 Very Good $337

One row per subject/incident

Summary data

## # A tibble: 5 × 3
##   cut       diamonds average_price
##   <ord>     <chr>    <chr>        
## 1 Fair      1,610    $4,359       
## 2 Good      4,906    $3,929       
## 3 Very Good 12,082   $3,982       
## 4 Premium   13,791   $4,584       
## 5 Ideal     21,551   $3,458



One row per category
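
A rough sketch of how raw data becomes summary data, which is also what a pivot table does. It uses only the eight raw rows shown above, so its counts and averages will not match the summary table, which was built from the full dataset. The variable names are illustrative.

```python
from collections import defaultdict

# The eight raw rows shown above: (carat, cut, price in dollars).
raw_rows = [
    (0.23, "Ideal", 326),
    (0.21, "Premium", 326),
    (0.23, "Good", 327),
    (0.29, "Premium", 334),
    (0.31, "Good", 335),
    (0.24, "Very Good", 336),
    (0.24, "Very Good", 336),
    (0.26, "Very Good", 337),
]

# Group the prices by category (one list of prices per cut).
prices_by_cut = defaultdict(list)
for carat, cut, price in raw_rows:
    prices_by_cut[cut].append(price)

# Collapse each group to one summary row: a count and an average.
summary = {
    cut: {"diamonds": len(prices), "average_price": sum(prices) / len(prices)}
    for cut, prices in prices_by_cut.items()
}

for cut, stats in summary.items():
    print(f"{cut:<10} {stats['diamonds']:>3}  ${stats['average_price']:,.2f}")
```

Going from raw to summary is easy; going the other way is impossible, which is one reason to ask for raw data when you can get it.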

Basics of analyzing data

  • Edit in a copy, not the original
  • Record your steps, and consider saving new versions of the file periodically
  • Never encode information in color or formatting

Basics of analyzing data

  • “Wide” vs. “Tall”/“Long” data
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9 wk10 wk11 wk12 wk13 wk14 wk15 wk16 wk17 wk18 wk19 wk20 wk21 wk22 wk23 wk24 wk25 wk26 wk27 wk28 wk29 wk30 wk31 wk32 wk33 wk34 wk35 wk36 wk37 wk38 wk39 wk40 wk41 wk42 wk43 wk44 wk45 wk46 wk47 wk48 wk49 wk50 wk51 wk52 wk53 wk54 wk55 wk56 wk57 wk58 wk59 wk60 wk61 wk62 wk63 wk64 wk65 wk66 wk67 wk68 wk69 wk70 wk71 wk72 wk73 wk74 wk75 wk76
2 Pac Baby Don’t Cry (Keep… 2000-02-26 87 82 72 77 87 94 99 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2Ge+her The Hardest Part Of … 2000-09-02 91 87 92 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 Doors Down Kryptonite 2000-04-08 81 70 68 67 66 57 54 53 51 51 51 51 47 44 38 28 22 18 18 14 12 7 6 6 6 5 5 4 4 4 4 3 3 3 4 5 5 9 9 15 14 13 14 16 17 21 22 24 28 33 42 42 49 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 Doors Down Loser 2000-10-21 76 76 72 69 67 65 55 59 62 61 61 59 61 66 72 76 75 67 73 70 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
504 Boyz Wobble Wobble 2000-04-15 57 34 25 17 17 31 36 49 53 57 64 70 75 76 78 85 92 96 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
98^0 Give Me Just One Nig… 2000-08-19 51 39 34 26 26 19 2 2 3 6 7 22 29 36 47 67 66 84 93 94 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
A*Teens Dancing Queen 2000-07-08 97 97 96 95 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Aaliyah I Don’t Wanna 2000-01-29 84 62 51 41 38 35 35 38 38 36 37 37 38 49 61 63 62 67 83 86 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Aaliyah Try Again 2000-03-18 59 53 38 28 21 18 16 14 12 10 9 8 6 1 2 2 2 2 3 4 5 5 6 9 13 14 16 23 22 33 36 43 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Adams, Yolanda Open My Heart 2000-08-26 76 76 74 69 68 67 61 58 57 59 66 68 61 67 59 63 67 71 79 89 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Basics of analyzing data

  • “Wide” vs. “Tall”/“Long” data
artist   track                  date.entered  week  rank
2 Pac    Baby Don’t Cry (Keep…  2000-02-26       1    87
2 Pac    Baby Don’t Cry (Keep…  2000-02-26       2    82
2 Pac    Baby Don’t Cry (Keep…  2000-02-26       3    72
2 Pac    Baby Don’t Cry (Keep…  2000-02-26       4    77
2 Pac    Baby Don’t Cry (Keep…  2000-02-26       5    87
2 Pac    Baby Don’t Cry (Keep…  2000-02-26       6    94
2 Pac    Baby Don’t Cry (Keep…  2000-02-26       7    99
2Ge+her  The Hardest Part Of …  2000-09-02       1    91
2Ge+her  The Hardest Part Of …  2000-09-02       2    87
2Ge+her  The Hardest Part Of …  2000-09-02       3    92

Basics of analyzing data

  • “Wide” vs. “Tall”/“Long” data
    • “Wide” data is often more easily read by humans; “Long” data is easier for machines to parse
    • “Long” data can lead to tidier column names
    • Sometimes one or the other will be easier to graph with whatever program you’re using
    • It’s not that hard to switch data between them, so don’t stress too much about this
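
To show that switching really isn't hard, here is a minimal sketch that turns wide rows into long rows in plain Python. It uses two rows from the chart data above, trimmed to the first four week columns, with None standing in for NA. The function name and the choice to drop empty weeks are illustrative assumptions; in practice you would more likely use a built-in reshaping tool in R, Python or your spreadsheet.

```python
# Two rows from the wide table, trimmed to four week columns for brevity.
wide_rows = [
    {"artist": "2 Pac", "track": "Baby Don’t Cry (Keep…",
     "date.entered": "2000-02-26", "wk1": 87, "wk2": 82, "wk3": 72, "wk4": 77},
    {"artist": "2Ge+her", "track": "The Hardest Part Of …",
     "date.entered": "2000-09-02", "wk1": 91, "wk2": 87, "wk3": 92, "wk4": None},
]

# Columns that identify a song; every other column is a week.
ID_COLUMNS = ["artist", "track", "date.entered"]


def wide_to_long(rows, id_columns):
    """Turn one row per song into one row per song per week."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column in id_columns or value is None:
                continue  # skip identifiers and weeks with no rank
            long_row = {key: row[key] for key in id_columns}
            long_row["week"] = int(column[2:])  # "wk3" -> 3
            long_row["rank"] = value
            long_rows.append(long_row)
    return long_rows


long_rows = wide_to_long(wide_rows, ID_COLUMNS)
for long_row in long_rows:
    print(long_row["artist"], long_row["week"], long_row["rank"])
```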

Which data tool to use?

  • Calculator to compare a few numbers
  • Spreadsheet (Excel, Google Sheets, LibreOffice) to compare hundreds or thousands
  • Database or code (SQL, Access, R, Python) for hundreds of thousands of numbers or more

Pitfalls to watch for

  • Missing data
  • Duplicated data
  • Spelling differences (“Saint Louis” vs. “St. Louis”)
  • Date formats (“12/25/2021” vs. “2021-12-25”)
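  • Placeholder values standing in for missing data, such as sequences like 999999
  • Default dates like Jan. 1, 1970 (the Unix epoch), or Jan. 1, 1900 (where Excel starts counting dates), which usually mean the real date is missing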
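
A rough sketch of how these checks might look in code. The records are made up to trigger each pitfall, and the field names, the placeholder value and the two accepted date formats are all assumptions you would adjust for a real dataset.

```python
from collections import Counter
from datetime import date, datetime

# Made-up records illustrating each pitfall; none of this is real data.
records = [
    {"id": "1", "city": "St. Louis", "date": "12/25/2021", "amount": "150"},
    {"id": "2", "city": "Saint Louis", "date": "2021-12-25", "amount": ""},
    {"id": "2", "city": "Saint Louis", "date": "2021-12-25", "amount": ""},
    {"id": "3", "city": "St. Louis", "date": "1970-01-01", "amount": "999999"},
]

PLACEHOLDER_AMOUNTS = {"999999"}
DEFAULT_DATES = {date(1970, 1, 1), date(1900, 1, 1)}
DATE_FORMATS = ["%m/%d/%Y", "%Y-%m-%d"]


def parse_date(text):
    """Try each known format; return None if none of them fit."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


# Missing data: blank cells.
missing = [r["id"] for r in records if r["amount"] == ""]

# Duplicated data: the same ID appearing more than once.
id_counts = Counter(r["id"] for r in records)
duplicated = [record_id for record_id, n in id_counts.items() if n > 1]

# Spelling differences: list every distinct spelling and eyeball it.
spellings = Counter(r["city"] for r in records)

# Placeholder values and default dates.
placeholders = [r["id"] for r in records if r["amount"] in PLACEHOLDER_AMOUNTS]
default_dated = [r["id"] for r in records if parse_date(r["date"]) in DEFAULT_DATES]

print("Missing amount:", missing)
print("Duplicated IDs:", duplicated)
print("City spellings:", dict(spellings))
print("Placeholder amounts:", placeholders)
print("Default dates:", default_dated)
```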

Pitfalls to watch for

  • Confounding variables
  • Weekly or seasonal patterns
  • Inflation
  • Definition/boundary changes

Pitfalls to watch for

  • Excel row limits (1,048,576 rows in .xlsx files, or 65,536 in the older .xls format); anything past the limit can be silently cut off
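
One way to guard against this is to count the rows in a file before opening it in Excel. The sketch below is a simple illustration; the function names are hypothetical, and the line count assumes no cell contains a line break.

```python
XLSX_MAX_ROWS = 1_048_576  # limit for modern .xlsx worksheets
XLS_MAX_ROWS = 65_536      # limit for legacy .xls worksheets


def count_rows(lines):
    """Count lines in an open text file, or any iterable of lines."""
    return sum(1 for _ in lines)


def fits_in_excel(row_count, legacy_xls=False):
    """Return True if row_count rows (header included) fit on one sheet."""
    limit = XLS_MAX_ROWS if legacy_xls else XLSX_MAX_ROWS
    return row_count <= limit


# With a real file: with open("some_file.csv") as f: n = count_rows(f)
n = count_rows(["name,value\n", "a,1\n", "b,2\n"])
print(n, fits_in_excel(n))
print(fits_in_excel(70_000, legacy_xls=True))
```

If the count is at or suspiciously near one of the limits, the file you were sent may already have been truncated.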

Pitfalls to watch for

  • Excel silently converting things to dates (for example, an ID like “1-2” or a code like “MAR1”)

Basics of visualizing data

  • Keep it simple
    • Avoid pointless embellishments like 3D bars or dozens of colors
  • Charts should tell a story
  • Recognize when words or tables would be better

Basics of visualizing data

What type of chart to use

  • Are you comparing the relationship between two different values for each point of data? Scatterplot
  • Are you comparing values from several different categories? Bar chart
  • Are you showing a change in one metric over time? A line chart if your data is relatively smooth or continuous, perhaps a bar chart if not
  • Is geography important for your data? If so, make a map. If it’s not important, stick with a chart.

What type of chart not to use

  • No pie charts!
    • People are bad at judging which wedge of a pie chart is larger, if sizes are at all close
    • Instead, use a bar chart. It’s easier to compare relative values
  • No dual-axis charts!
    • Your choice of scales can drastically change the apparent relationship between your two pieces of data
    • Instead, convert the metrics you want to compare into percent change from a fixed starting date
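
A minimal sketch of the percent-change conversion, using made-up numbers for two metrics on very different scales. Once both are expressed as percent change from the first month, they can share a single axis.

```python
# Made-up monthly values; the two metrics are on very different scales.
months = ["Jan", "Feb", "Mar", "Apr"]
metric_a = [200, 210, 230, 250]   # e.g. a count in the hundreds
metric_b = [4.0, 4.4, 4.2, 5.0]   # e.g. a rate in the single digits


def percent_change_from_start(values):
    """Express each value as percent change from the first value."""
    start = values[0]
    return [(value - start) / start * 100 for value in values]


change_a = percent_change_from_start(metric_a)
change_b = percent_change_from_start(metric_b)

for month, a, b in zip(months, change_a, change_b):
    print(f"{month}: metric A {a:+.1f}%, metric B {b:+.1f}%")
```

The choice of starting date still matters, so pick one you can justify and say what it is.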

What about truncated y-axes?

  • You can truncate your y-axis — in the right situation
  • Bar charts must always have y-axes that start at 0; if one bar is twice as tall as another, our minds assume the value of Bar A is twice as big as Bar B’s value, and truncated y-axes break that
  • For other kinds of charts, it’s generally OK to cut out wasted space; no one expects a summer weather chart to go down to 0°, for example
  • Be conscious of the story your choice of axis limits tells. Are you highlighting change or continuity? Is this fair?

Logarithmic scales

What is a logarithmic scale?

  • In a linear scale, you add a constant amount at each step: 10, 20, 30, 40
  • In a logarithmic scale, you multiply by a constant factor at each step: 10, 100, 1,000, 10,000

Graphing logarithmic scales

When to use a log scale

  • When your growth is exponential (watch for J swoops!)
  • When you want to plot really small numbers and really large numbers next to each other
  • When the point is NOT to highlight outliers

Important log scales:

  • log2 is a sequence of doubling: 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048
  • log10 is a sequence multiplying by 10 at each step: 10, 100, 1,000
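
A short sketch of why these sequences look evenly spaced on a log scale: taking the logarithm turns "multiply by the same factor" into "add the same amount." It uses Python's built-in math module, and the variable names are illustrative.

```python
import math

doubling = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
tenfold = [10, 100, 1_000, 10_000]

# Position of each value on a log2 and a log10 axis.
log2_positions = [math.log2(value) for value in doubling]
log10_positions = [math.log10(value) for value in tenfold]

# Gaps between neighboring positions: constant on a log scale.
log2_gaps = [b - a for a, b in zip(log2_positions, log2_positions[1:])]
log10_gaps = [b - a for a, b in zip(log10_positions, log10_positions[1:])]

# Gaps between the raw values: they grow at every step on a linear scale.
linear_gaps = [b - a for a, b in zip(doubling, doubling[1:])]

print("log2 gaps:  ", [round(gap, 6) for gap in log2_gaps])
print("log10 gaps: ", [round(gap, 6) for gap in log10_gaps])
print("linear gaps:", linear_gaps)
```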

Always explain!

Which scale to use?

How to ask for data

  • PDFs are not data!
  • Ask for data in “machine-readable” format such as CSV, Excel, or a database
  • Ask for the name of their database software
  • Ask for the “record layout”: the names of the tables and columns in the database, minus the actual data
  • Ask to talk to their IT person, who might actually know what’s going on
  • If they won’t give raw data, ask for summary data
  • Suggest they’d save themselves a lot of effort by just automatically publishing the data online

Sources for free data

Programs to consider:

  • Spreadsheets:
    • Excel: Industry standard
    • Google Sheets: Free online version
    • LibreOffice: Open-source alternative
    • Numbers: Default on Macs; avoid
  • GIS: QGIS is free and open-source
  • Extracting data from PDFs: Tabula
  • Programming: R is the primary language I use. You’ll need to download R, and you’ll want to download the app RStudio to provide a more user-friendly interface
  • A plaintext editor like Sublime Text or BBEdit

Resources

  • Edward Tufte, The Visual Display of Quantitative Information
  • Dona M. Wong, The Wall Street Journal Guide to Information Graphics
  • Scott Berinato, Good Charts: The HBR Guide to Making Smarter, More Persuasive Data Visualizations
  • Hadley Wickham & Garrett Grolemund, R for Data Science
  • The Quartz guide to bad data