11/29/2021

Introduction

  • A quick roundup of what you need to know to understand and use data in everyday reporting
  • You don’t need to be an expert to use data!
  • Keep an eye out for common pitfalls
  • Some parts of this might be basic to some of you, but hopefully everyone will find at least some of this helpful

Three tiers of journalistic data literacy

  • BASIC: You can parse a spreadsheet, request data, and avoid simple pitfalls
  • INTERMEDIATE: You’re comfortable with spreadsheets, working with formulas and pivot tables, and making simple charts to visualize data
  • ADVANCED: You’re familiar with one or more tools for working with big databases, code, or geographic data, and you know how to request this data

My opinion: Not everyone needs to be a data expert. But in a mid-sized newsroom like MPR, every journalist should have at least Basic levels of data literacy, a majority should be Intermediate, and a few should be Advanced.

Terms to know

  • CSV: “Comma-separated values,” a basic file format for spreadsheets, where each line is a row and columns are separated by commas
    • Similar to Excel files (.XLS, .XLSX)
    • Open, plain-text format that nearly any program can read
    • Doesn’t store formatting
    • One sheet per document
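
To make the format concrete, here is a minimal sketch of reading CSV text with Python's built-in csv module. The three rows are borrowed from the diamonds example used later in this deck, with the dollar signs dropped. Note that every value comes back as plain text, which is what "doesn't store formatting" means in practice.

```python
import csv
import io

# A tiny CSV document: the first line holds column names, each later line is one row.
# In real work this text would come from a file instead of a string.
csv_text = """carat,cut,price
0.23,Ideal,326
0.21,Premium,326
0.23,Good,327
"""

# DictReader uses the header line to label each value in a row.
rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]

for row in rows:
    # Every value is a string; convert numbers yourself before doing math.
    price = int(row["price"])
    print(row["cut"], price)
```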

Terms to know

  • GIS: Geographic Information System, the discipline of using computers to analyze geographic data and make maps
  • ArcGIS/QGIS: Two popular GIS programs. ArcGIS is the major commercial GIS program, while QGIS is a free, open-source alternative
  • Shapefile: A common file format for GIS data. Actually not one file but a collection of files with the same filename and different extensions — the map.shp file has the geographic data, the map.dbf file has accompanying data, and so on

Terms to know

  • R: A programming language optimized for statistics and data visualization
  • Python: A general-purpose programming language with strong support for data
  • JavaScript: A programming language focused on scripting web sites
  • D3: A JavaScript library focused on interactive visualizations
  • SQL: “Structured Query Language,” a programming language used for interacting with databases

Terms to know

  • Pivot table: A feature in Excel and other spreadsheet programs that summarizes a larger spreadsheet, for example by counting or averaging the rows in each category

Terms to know

  • Summary data/raw data

Raw data

## # A tibble: 8 × 3
##   carat cut       price
##   <dbl> <ord>     <chr>
## 1  0.23 Ideal     $326 
## 2  0.21 Premium   $326 
## 3  0.23 Good      $327 
## 4  0.29 Premium   $334 
## 5  0.31 Good      $335 
## 6  0.24 Very Good $336 
## 7  0.24 Very Good $336 
## 8  0.26 Very Good $337

One row per subject/incident

Summary data

## # A tibble: 5 × 3
##   cut       diamonds average_price
##   <ord>     <chr>    <chr>        
## 1 Fair      1,610    $4,359       
## 2 Good      4,906    $3,929       
## 3 Very Good 12,082   $3,982       
## 4 Premium   13,791   $4,584       
## 5 Ideal     21,551   $3,458



One row per category
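
Going from raw data to summary data is just grouping and counting. The sketch below groups the eight raw rows shown earlier by cut, then counts and averages them. It uses only those eight rows, so its numbers will not match the summary table above, which covers the full dataset. The variable names are illustrative.

```python
from collections import defaultdict

# The eight raw rows from the earlier slide: (carat, cut, price in dollars).
raw_rows = [
    (0.23, "Ideal", 326),
    (0.21, "Premium", 326),
    (0.23, "Good", 327),
    (0.29, "Premium", 334),
    (0.31, "Good", 335),
    (0.24, "Very Good", 336),
    (0.24, "Very Good", 336),
    (0.26, "Very Good", 337),
]

# Step 1: collect the prices for each cut (one list per category).
prices_by_cut = defaultdict(list)
for carat, cut, price in raw_rows:
    prices_by_cut[cut].append(price)

# Step 2: reduce each list to a count and an average (one row per category).
summary = {
    cut: {"diamonds": len(prices), "average_price": sum(prices) / len(prices)}
    for cut, prices in prices_by_cut.items()
}

for cut, stats in summary.items():
    print(f"{cut:<10} {stats['diamonds']:>3}  ${stats['average_price']:,.2f}")
```

A pivot table in a spreadsheet does the same two steps for you.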

Basics of analyzing data

  • Edit in a copy, not the original
  • Record your steps, and consider saving new versions of the file periodically
  • Never encode information in color or formatting

Basics of analyzing data

  • “Wide” vs. “Tall”/“Long” data
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9 wk10 wk11 wk12 wk13 wk14 wk15 wk16 wk17 wk18 wk19 wk20 wk21 wk22 wk23 wk24 wk25 wk26 wk27 wk28 wk29 wk30 wk31 wk32 wk33 wk34 wk35 wk36 wk37 wk38 wk39 wk40 wk41 wk42 wk43 wk44 wk45 wk46 wk47 wk48 wk49 wk50 wk51 wk52 wk53 wk54 wk55 wk56 wk57 wk58 wk59 wk60 wk61 wk62 wk63 wk64 wk65 wk66 wk67 wk68 wk69 wk70 wk71 wk72 wk73 wk74 wk75 wk76
2 Pac Baby Don’t Cry (Keep… 2000-02-26 87 82 72 77 87 94 99 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2Ge+her The Hardest Part Of … 2000-09-02 91 87 92 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 Doors Down Kryptonite 2000-04-08 81 70 68 67 66 57 54 53 51 51 51 51 47 44 38 28 22 18 18 14 12 7 6 6 6 5 5 4 4 4 4 3 3 3 4 5 5 9 9 15 14 13 14 16 17 21 22 24 28 33 42 42 49 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 Doors Down Loser 2000-10-21 76 76 72 69 67 65 55 59 62 61 61 59 61 66 72 76 75 67 73 70 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
504 Boyz Wobble Wobble 2000-04-15 57 34 25 17 17 31 36 49 53 57 64 70 75 76 78 85 92 96 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
98^0 Give Me Just One Nig… 2000-08-19 51 39 34 26 26 19 2 2 3 6 7 22 29 36 47 67 66 84 93 94 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
A*Teens Dancing Queen 2000-07-08 97 97 96 95 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Aaliyah I Don’t Wanna 2000-01-29 84 62 51 41 38 35 35 38 38 36 37 37 38 49 61 63 62 67 83 86 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Aaliyah Try Again 2000-03-18 59 53 38 28 21 18 16 14 12 10 9 8 6 1 2 2 2 2 3 4 5 5 6 9 13 14 16 23 22 33 36 43 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Adams, Yolanda Open My Heart 2000-08-26 76 76 74 69 68 67 61 58 57 59 66 68 61 67 59 63 67 71 79 89 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Basics of analyzing data

  • “Wide” vs. “Tall”/“Long” data
artist   track                   date.entered  week  rank
2 Pac    Baby Don’t Cry (Keep…   2000-02-26       1    87
2 Pac    Baby Don’t Cry (Keep…   2000-02-26       2    82
2 Pac    Baby Don’t Cry (Keep…   2000-02-26       3    72
2 Pac    Baby Don’t Cry (Keep…   2000-02-26       4    77
2 Pac    Baby Don’t Cry (Keep…   2000-02-26       5    87
2 Pac    Baby Don’t Cry (Keep…   2000-02-26       6    94
2 Pac    Baby Don’t Cry (Keep…   2000-02-26       7    99
2Ge+her  The Hardest Part Of …   2000-09-02       1    91
2Ge+her  The Hardest Part Of …   2000-09-02       2    87
2Ge+her  The Hardest Part Of …   2000-09-02       3    92

Basics of analyzing data

  • “Wide” vs. “Tall”/“Long” data
    • “Wide” data is often more easily read by humans; “Long” data is easier for machines to parse
    • “Long” data can lead to tidier column names
    • Sometimes one or the other will be easier to graph with whatever program you’re using
    • It’s not that hard to switch data between them, so don’t stress too much about this
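
As a rough illustration of how little work the switch takes, here is a sketch that turns wide rows into long rows in plain Python. It uses the first two songs from the chart example, cut down to four week columns to keep it short; `None` stands in for the NA cells. Tools like R and spreadsheet add-ons have built-in functions for this, so treat this as a picture of the idea rather than the recommended method.

```python
# Two wide rows: one column per week. None marks a week with no chart rank (NA).
wide_rows = [
    {"artist": "2 Pac", "track": "Baby Don’t Cry (Keep…", "date.entered": "2000-02-26",
     "wk1": 87, "wk2": 82, "wk3": 72, "wk4": 77},
    {"artist": "2Ge+her", "track": "The Hardest Part Of …", "date.entered": "2000-09-02",
     "wk1": 91, "wk2": 87, "wk3": 92, "wk4": None},
]

# Columns that identify the song and get repeated on every long row.
id_columns = ("artist", "track", "date.entered")

long_rows = []
for row in wide_rows:
    for column, value in row.items():
        if column in id_columns or value is None:
            continue  # skip identifier columns and empty weeks
        long_row = {name: row[name] for name in id_columns}
        long_row["week"] = int(column[2:])  # "wk3" -> 3
        long_row["rank"] = value
        long_rows.append(long_row)

for long_row in long_rows:
    print(long_row["artist"], long_row["week"], long_row["rank"])
```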

Which data tool to use?

  • Calculator to compare a few numbers
  • Spreadsheet (Excel, Google Sheets, LibreOffice) to compare hundreds or thousands of numbers
  • Database or code (SQL, Access, R, Python) to compare hundreds of thousands of numbers or more

Pitfalls to watch for

  • Missing data
  • Duplicated data
  • Spelling differences (“Saint Louis” vs. “St. Louis”)
  • Date formats (“12/25/2021” vs. “2021-12-25”)
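  • Placeholder sequences like 999999 standing in for missing or unknown values
  • Default dates like Jan. 1, 1970, or Jan. 1, 1900, which often mean the real date was never entered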
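
Most of these checks can be automated. The sketch below runs them over a handful of made-up records; the field names, the placeholder value and the list of default dates are all assumptions for illustration, and real data will need its own lists. Spelling differences are harder to catch automatically and usually need a human looking at a sorted list of unique values.

```python
import re
from collections import Counter

# Hypothetical records invented for this example.
records = [
    {"city": "St. Louis", "date": "2021-12-25", "amount": "120"},
    {"city": "Saint Louis", "date": "12/25/2021", "amount": "999999"},
    {"city": "Duluth", "date": "1970-01-01", "amount": ""},
    {"city": "Duluth", "date": "1970-01-01", "amount": ""},
]

PLACEHOLDERS = {"999999"}                      # assumed "unknown" codes
DEFAULT_DATES = {"1970-01-01", "1900-01-01"}   # common software default dates

# Missing data: any empty field.
missing = [i for i, r in enumerate(records) if any(v == "" for v in r.values())]

# Duplicated data: identical records appearing more than once.
counts = Counter(tuple(sorted(r.items())) for r in records)
duplicates = [dict(key) for key, n in counts.items() if n > 1]

# Date formats: anything that is not YYYY-MM-DD.
odd_dates = [i for i, r in enumerate(records)
             if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", r["date"])]

# Placeholder values and default dates.
placeholders = [i for i, r in enumerate(records) if r["amount"] in PLACEHOLDERS]
default_dates = [i for i, r in enumerate(records) if r["date"] in DEFAULT_DATES]

# A sorted list of unique names makes spelling variants easy to spot by eye.
unique_cities = sorted({r["city"] for r in records})

print(missing, odd_dates, placeholders, default_dates, unique_cities)
```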

Pitfalls to watch for

  • Confounding variables
  • Weekly or seasonal patterns
  • Inflation
  • Definition/boundary changes
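
Inflation is the easiest of these to correct for: multiply the old amount by the ratio of the price index now to the price index then. The sketch below shows the arithmetic. The index values and budget figures are made up for illustration; real work would use a published index such as the Consumer Price Index.

```python
def adjust_for_inflation(amount, index_then, index_now):
    """Express an old dollar amount in today's dollars using a price index."""
    return amount * index_now / index_then


# Hypothetical price index values, NOT real CPI figures.
price_index = {2011: 100.0, 2021: 120.0}

# Hypothetical budget figures for the same agency in two different years.
budget_2011 = 1_000_000
budget_2021 = 1_100_000

budget_2011_in_2021_dollars = adjust_for_inflation(
    budget_2011, price_index[2011], price_index[2021]
)
real_change = budget_2021 - budget_2011_in_2021_dollars

print(f"2011 budget in 2021 dollars: ${budget_2011_in_2021_dollars:,.0f}")
print(f"Change after inflation: ${real_change:,.0f}")
```

With these made-up numbers, a budget that looks like it grew 10 percent actually shrank once inflation is accounted for.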

Pitfalls to watch for

  • Excel row limits (1,048,576 rows in .XLSX files, or 65,536 in the older .XLS format)
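
One way to guard against this is to count the rows in a file before opening it in Excel. The sketch below does that for CSV data using Python's csv module; the function names are illustrative, and the sample data is generated in memory rather than read from a real file.

```python
import csv
import io

EXCEL_XLSX_MAX_ROWS = 1_048_576  # row limit for modern .XLSX worksheets
EXCEL_XLS_MAX_ROWS = 65_536      # row limit for the older .XLS format


def count_csv_rows(file_obj):
    """Count every row in an open CSV file, header included."""
    return sum(1 for _ in csv.reader(file_obj))


def fits_in_excel(row_count, limit=EXCEL_XLSX_MAX_ROWS):
    """Return True if a sheet with this many rows fits under the limit."""
    return row_count <= limit


# Generated sample data: one header line plus 70,000 data rows.
lines = ["id,value"] + [f"{i},{i * 2}" for i in range(70_000)]
sample_file = io.StringIO("\n".join(lines))

total_rows = count_csv_rows(sample_file)
print(total_rows, "rows")
print("Fits in .XLSX:", fits_in_excel(total_rows))
print("Fits in .XLS:", fits_in_excel(total_rows, EXCEL_XLS_MAX_ROWS))
```

If a spreadsheet someone sends you has exactly 65,536 or 1,048,576 rows, treat that as a warning sign that the data may have been cut off.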

Pitfalls to watch for

  • Excel silently converting values that look like dates (for example, “1-2” can become “2-Jan”)

Basics of visualizing data

  • Keep it simple
    • Avoid pointless embellishments like 3D bars or dozens of colors
  • Charts should tell a story
  • Recognize when words or tables would be better

Basics of visualizing data