Misadventures with Reproducibility in R

R Ladies Melbourne, 29 Nov 2022

Reproducibility! Now!

Why reproducibility?

An illustration of reasons why we should care about working reproducibly from The Turing Way, Guide for Reproducible Research

An ever growing pile of tools…

  • Quarto, R Markdown
  • Git/GitHub
  • R functions, scripts, packages
  • {targets}, {renv}, {lintr}, {styler}

We’re awash in information! What we need is curation.

A bit about my journey…

  • majored in economics at UniMelb – no coding! Mostly theory, very little econometrics…
  • taught myself bits of R, Python and shell scripting while working as a research assistant at UniMelb
  • designed, collected, wrangled and explored all sorts of wild-caught data – surveys, archives, time-series, spatial, panel, text…
  • recently began dipping my toes into machine learning and deep learning
  • currently working on interdisciplinary data science methods with applications to panel data harmonisation and satellite deep learning

Mapping the Landscape

A Gentle Stroll

Tidy data & code

“Like families, tidy datasets are all alike but every messy dataset is messy in its own way” - {tidyr}: Tidy data

“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.” - The tidyverse style guide

Jump in here:

Digital illustration of two cute fuzzy monsters sitting on a park bench with a smiling data table between them, all eating ice cream together. In text above the illustration are the hand drawn words "make friends with tidy data."

Illustration from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
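
To make "tidy" concrete, here is a minimal sketch in base R. The `messy` table, its column names and its values are all made up for illustration. In practice `tidyr::pivot_longer()` does the same reshaping with less typing.

```r
# Hypothetical "messy" table: one row per site, one column per year.
messy <- data.frame(
  site  = c("A", "B"),
  y2020 = c(10, 20),
  y2021 = c(11, 21)
)

# Tidy version: one row per observation (a site in a year),
# one column per variable (site, year, count).
year_cols <- c("y2020", "y2021")
tidy <- data.frame(
  site  = rep(messy$site, times = length(year_cols)),
  year  = rep(as.integer(sub("^y", "", year_cols)), each = nrow(messy)),
  count = unlist(messy[year_cols], use.names = FALSE)
)

# Sort so each site's years sit together.
tidy <- tidy[order(tidy$site, tidy$year), ]
rownames(tidy) <- NULL
tidy
```

Once every variable has its own column, filtering, grouping and plotting no longer need to know how the original spreadsheet happened to be laid out.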

Literate Programming

Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want the computer to do. – (Knuth 1984)

From the Modern Data Book by Martin Shepperd:

  1. move away from writing programs to ‘please’ the computer
  2. instead, focus on communication and understanding
  3. create a single document to integrate data analysis (executable code) with textual documentation, linking data, code, and explanation
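
The three points above can be sketched as a single R script, assuming knitr's "spin" convention, where lines starting with `#'` are prose and everything else is code. The analysis itself is a made-up example using the built-in `iris` data. A Quarto or R Markdown file expresses the same idea with prose outside code chunks.

```r
#' # Petal length by species
#'
#' This script mixes explanation and executable code in one file.
#' Lines starting with #' become text when the script is rendered.

#' First, summarise the built-in `iris` data by species.
petal_means <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
petal_means

#' The species with the longest petals on average is:
longest <- as.character(
  petal_means$Species[which.max(petal_means$Petal.Length)]
)
longest
```

Saved as, say, `analysis.R` (a hypothetical file name), the script could be turned into a report with `rmarkdown::render("analysis.R")` or `knitr::spin("analysis.R")`.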

R Packages

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data. - R Packages (2e)
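
As a rough illustration of "reusable function plus documentation", here is a minimal sketch of one documented function written with roxygen2-style comments. `rescale01()` is a hypothetical example, not a function from the talk.

```r
#' Rescale a numeric vector to the range 0 to 1
#'
#' @param x A numeric vector.
#' @return A numeric vector the same length as `x`.
#' @examples
#' rescale01(c(1, 2, 3))
#' @export
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(1, 2, 3))
```

Inside a package, running `devtools::document()` would turn the `#'` comments into a help page, so the explanation travels with the code.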

From simple to more involved:

Up, Up and Away

Version Control

A version control system, or VCS, tracks the history of changes as people and teams collaborate on projects together. As developers make changes to the project, any earlier version of the project can be recovered at any time.

- GitHub Docs: About Git

The Turing Way project illustration by Scriberia. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807.

Some good starting points:

Experimenting with more advanced features:

Command Line

“The command line is a tool for talking to your operating system (e.g., macOS, Windows, etc.) using text instead of by moving around a mouse and clicking on things”

- The Command Line from Practical Data Science by Nick Eubank

Dip your toes in with:

Then dive deeper…

Collaboration

From The Turing Way, Guide for Collaboration:

The Turing Way project illustration by Scriberia. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807.

Some (often) good enough tools:

  • Google Docs, Slides, and Sheets
  • Dropbox
  • GitHub

Some R specific resources:

Out to Sea

Environments

Ways of capturing computational environments from The Turing Way, Guide for Reproducible Research

Possible starting points:
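
As one concrete illustration, a low-effort way to capture an environment is to save the session details next to your outputs. This is a minimal sketch: the file name is hypothetical, and the commented {renv} calls assume the package is installed.

```r
# Record the R version, platform and loaded packages in a text file.
# "session-info.txt" is a hypothetical file name; tempdir() keeps the
# example from writing into your project.
info_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(sessionInfo()), info_file)

# With {renv} installed, a project-level lockfile goes a step further:
# renv::init()      # set up a project-specific package library
# renv::snapshot()  # write the package versions in use to renv.lock
```

A text file only describes the environment, while a lockfile lets a collaborator (or future you) restore it.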

Workflows and Pipelines

A pipeline is a computational workflow that does statistics, analytics, or data science… A pipeline contains tasks to prepare datasets, run models, and summarize results for a business deliverable or research paper.

- {targets} Overview
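
The prepare, model and summarise tasks in that description can be sketched as plain functions. The function names and the use of the built-in `airquality` data are assumptions for illustration, and the commented section shows roughly how the same steps might be declared for {targets}.

```r
# Each step is a plain function, so the steps can be run and checked
# one at a time.
prepare_data <- function(raw) raw[!is.na(raw$Ozone), ]
fit_model <- function(clean) lm(Ozone ~ Temp, data = clean)
summarise_model <- function(model) round(coef(model), 2)

# Run the steps by hand, with built-in data standing in for raw input.
clean <- prepare_data(airquality)
model <- fit_model(clean)
results <- summarise_model(model)
results

# In a {targets} project, the same steps could be declared in _targets.R
# (with the functions above defined or sourced in that file):
# library(targets)
# list(
#   tar_target(clean, prepare_data(airquality)),
#   tar_target(model, fit_model(clean)),
#   tar_target(results, summarise_model(model))
# )
```

Writing each step as a function is useful even without a pipeline tool. With one, steps whose inputs have not changed can be skipped on the next run.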

On my to-explore list:

Testing and Validation

The Turing Way project illustration by Scriberia. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807.

Motivation and guidance on testing:
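
As a concrete illustration of what a test looks like, here is a minimal sketch using base R's `stopifnot()`. The function under test is a made-up example, and the commented block shows how the same checks might be written with {testthat}, assuming it is installed.

```r
# Function under test (hypothetical example).
celsius_to_fahrenheit <- function(celsius) celsius * 9 / 5 + 32

# stopifnot() raises an error if any condition is not TRUE.
stopifnot(
  celsius_to_fahrenheit(0) == 32,
  celsius_to_fahrenheit(100) == 212,
  isTRUE(all.equal(celsius_to_fahrenheit(37), 98.6))
)

# The same checks written for {testthat}:
# testthat::test_that("known temperatures convert correctly", {
#   testthat::expect_equal(celsius_to_fahrenheit(0), 32)
#   testthat::expect_equal(celsius_to_fahrenheit(100), 212)
# })
```

Even a few checks like these catch the mistake of changing a function and silently breaking an earlier result.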

Lessons from my misadventures

Priorities, priorities, priorities…

  • Keep a list of frictions:
    • copy & pasting? comparing files?
    • “it would be nice if…”
  • Consider who you are learning for:
    • current you? future you? unknown others?
  • What outcomes have the most value for you?
    • automation, version control & experiment tracking, research communication?

Keep it up!

  • Get help!
  • Share your successes
  • Incremental improvements are better than complete overhauls
  • Carve out time to experiment with features

Special mentions to:

Thanks for listening!

Find me @cynthiahqy on:

Some shameless plugs: