home / skills / benchflow-ai / skillsbench / r-data-science

r-data-science skill

/registry/terminal_bench_2.0/full_batch_reviewed/terminal_bench_2_0_mcmc-sampling-stan/environment/skills/r-data-science

This skill generates high-quality R code and workflows using tidyverse, reproducibility practices, and defensive patterns for data analysis and visualization.

npx playbooks add skill benchflow-ai/skillsbench --skill r-data-science

Review the files below or copy the command above to add this skill to your agents.

Files (9)
SKILL.md
9.6 KB
---
name: r-data-science
description: "R programming for data analysis, visualization, and statistical workflows. Use when working with R scripts (.R), Quarto documents (.qmd), RMarkdown (.Rmd), or R projects. Covers tidyverse workflows, ggplot2 visualizations, statistical analysis, epidemiological methods, and reproducible research practices."
---

# R Data Science

## Overview

Generate high-quality R code following tidyverse conventions and modern best practices. This skill covers data manipulation, visualization, statistical analysis, and reproducible research workflows commonly used in public health, epidemiology, and data science.

## Core Principles

1. **Tidyverse-first**: Use tidyverse packages (dplyr, tidyr, ggplot2, purrr, readr) as the default approach
2. **Pipe-forward**: Use the native pipe `|>` for chains (R 4.1+); fall back to `%>%` for older versions
3. **Reproducibility**: Structure all work for reproducibility using Quarto, renv, and clear documentation
4. **Defensive coding**: Validate inputs, handle missing data explicitly, and fail informatively

## Quick Reference: Common Patterns

### Data Import
```r
library(tidyverse)

# CSV (most common)
df <- read_csv("data/raw/dataset.csv")

# Excel
df <- readxl::read_excel("data/raw/dataset.xlsx", sheet = "Sheet1")

# Clean column names immediately
df <- df |> janitor::clean_names()
```

### Data Wrangling Pipeline
```r
analysis_data <- raw_data |>
  # Clean and filter
  filter(!is.na(key_variable)) |>

  # Transform variables
  mutate(
    date = as.Date(date_string, format = "%Y-%m-%d"),
    age_group = cut(age, breaks = c(0, 18, 45, 65, Inf),
                    labels = c("0-17", "18-44", "45-64", "65+"))
  ) |>

  # Summarize
  group_by(region, age_group) |>
  summarize(
    n = n(),
    mean_value = mean(outcome, na.rm = TRUE),
    .groups = "drop"
  )
```

### Basic ggplot2 Visualization
```r
ggplot(df, aes(x = date, y = count, color = category)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Trend Over Time",
    subtitle = "By category",
    x = "Date",
    y = "Count",
    color = "Category",
    caption = "Source: Dataset Name"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )
```

## Tidyverse Style Guide Essentials

### Naming Conventions
- **snake_case** for objects and functions: `case_counts`, `calculate_rate()`
- **Verbs for functions**: `filter_outliers()`, `compute_summary()`
- **Nouns for data**: `patient_data`, `surveillance_df`
- **Avoid**: dots in names (reserved for S3), single letters except in lambdas

### Code Formatting
- **Indentation**: 2 spaces (never tabs)
- **Line length**: 80 characters maximum
- **Operators**: Spaces around `<-`, `=`, `+`, `|>`, but not `:`, `::`, `$`
- **Commas**: Space after, never before
- **Pipes**: New line after each `|>`

```r
# Good
result <- data |>
  filter(year >= 2020) |>
  group_by(county) |>
  summarize(total = sum(cases))

# Bad
result<-data|>filter(year>=2020)|>group_by(county)|>summarize(total=sum(cases))
```

### Assignment
- Use `<-` for assignment, never `=` or `->`
- Use `=` only for function arguments

### Comments
```r
# Load and clean surveillance data ------------------------------------------

# Calculate age-adjusted rates
# Using direct standardization method per CDC guidelines
adjusted_rate <- calculate_adjusted_rate(df, standard_pop)
```

## Package Ecosystem

### Core Tidyverse (Always Load)
```r
library(tidyverse)  # Loads: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
```

### Data Import/Export
| Task | Package | Key Functions |
|------|---------|---------------|
| CSV/TSV | readr | `read_csv()`, `write_csv()` |
| Excel | readxl, writexl | `read_excel()`, `write_xlsx()` |
| SAS/SPSS/Stata | haven | `read_sas()`, `read_spss()`, `read_stata()` |
| JSON | jsonlite | `read_json()`, `fromJSON()` |
| Databases | DBI, dbplyr | `dbConnect()`, `tbl()` |

### Data Manipulation
| Task | Package | Key Functions |
|------|---------|---------------|
| Column cleaning | janitor | `clean_names()`, `tabyl()` |
| Date handling | lubridate | `ymd()`, `mdy()`, `floor_date()` |
| String operations | stringr | `str_detect()`, `str_extract()` |
| Missing data | naniar | `vis_miss()`, `replace_with_na()` |

### Visualization
| Task | Package | Key Functions |
|------|---------|---------------|
| Core plotting | ggplot2 | `ggplot()`, `geom_*()` |
| Extensions | ggrepel, patchwork | `geom_text_repel()`, `+` operator |
| Interactive | plotly | `ggplotly()` |
| Tables | gt, kableExtra | `gt()`, `kable()` |

### Statistical Analysis
| Task | Package | Key Functions |
|------|---------|---------------|
| Model summaries | broom | `tidy()`, `glance()`, `augment()` |
| Regression | stats, lme4 | `lm()`, `glm()`, `lmer()` |
| Survival | survival | `Surv()`, `survfit()`, `coxph()` |
| Survey data | survey | `svydesign()`, `svymean()` |

### Epidemiology & Public Health
| Task | Package | Key Functions |
|------|---------|---------------|
| Epi calculations | epiR | `epi.2by2()`, `epi.conf()` |
| Outbreak tools | incidence2, epicontacts | `incidence()`, `make_epicontacts()` |
| Disease mapping | SpatialEpi | `expected()`, `EBlocal()` |
| Surveillance | surveillance | `sts()`, `farrington()` |
| Rate calculations | epitools | `riskratio()`, `oddsratio()`, `ageadjust.direct()` |

## Reproducibility Standards

### Project Structure
```
project/
├── project.Rproj
├── renv.lock
├── CLAUDE.md              # Claude Code configuration
├── README.md
├── data/
│   ├── raw/               # Never modify
│   └── processed/         # Analysis-ready
├── R/                     # Custom functions
├── scripts/               # Pipeline scripts
├── analysis/              # Quarto documents
└── output/
    ├── figures/
    └── tables/
```

### Quarto Document Header
```yaml
---
title: "Analysis Title"
author: "Your Name"
date: today
format:
  html:
    toc: true
    code-fold: true
    embed-resources: true
execute:
  warning: false
  message: false
---
```

### Package Management with renv
```r
# Initialize (once per project)
renv::init()

# Snapshot dependencies after installing packages
renv::snapshot()

# Restore environment (for collaborators)
renv::restore()
```

### Workflow Documentation
Always include at the top of scripts:
```r
# ============================================================================
# Title: Analysis of [Subject]
# Author: [Name]
# Date: [Date]
# Purpose: [One-sentence description]
# Input: data/processed/clean_data.csv
# Output: output/figures/trend_plot.png
# ============================================================================
```

## Common Analysis Patterns

### Descriptive Statistics Table
```r
df |>
  group_by(category) |>
  summarize(
    n = n(),
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    q25 = quantile(value, 0.25, na.rm = TRUE),
    q75 = quantile(value, 0.75, na.rm = TRUE)
  ) |>
  gt::gt() |>
  gt::fmt_number(columns = where(is.numeric), decimals = 2)
```

### Regression with Tidy Output
```r
model <- glm(outcome ~ exposure + age + sex, data = df, family = binomial)

# Tidy coefficients
tidy_results <- broom::tidy(model, conf.int = TRUE, exponentiate = TRUE) |>
  select(term, estimate, conf.low, conf.high, p.value)

# Model diagnostics
glance_results <- broom::glance(model)
```

### Epi Curve (Epidemic Curve)
```r
library(incidence2)

# Create incidence object
inc <- incidence(
  df,
  date_index = "onset_date",
  interval = "week",
  groups = "outcome_category"
)

# Plot
plot(inc) +
  labs(
    title = "Epidemic Curve",
    x = "Week of Onset",
    y = "Number of Cases"
  ) +
  theme_minimal()
```

### Rate Calculation
```r
# Age-adjusted rates using direct standardization
library(epitools)

# Stratum-specific counts and populations
result <- ageadjust.direct(
  count = df$cases,
  pop = df$population,
  stdpop = standard_population$pop  # e.g., US 2000 standard
)
```

## Error Handling

### Defensive Data Checks
```r
# Validate data before analysis
stopifnot(
  "Data frame is empty" = nrow(df) > 0,
  "Missing required columns" = all(c("id", "date", "value") %in% names(df)),
  "Duplicate IDs found" = !any(duplicated(df$id))
)

# Informative warnings for data quality issues
if (sum(is.na(df$key_var)) > 0) {
  warning(sprintf("%d missing values in key_var (%.1f%%)",
                  sum(is.na(df$key_var)),
                  100 * mean(is.na(df$key_var))))
}
```

### Safe File Operations
```r
# Check file exists before reading
if (!file.exists(filepath)) {
  stop(sprintf("File not found: %s", filepath))
}

# Create directories if needed
dir.create("output/figures", recursive = TRUE, showWarnings = FALSE)
```

## Performance Tips

### For Large Datasets
```r
# Use data.table for >1M rows
library(data.table)
dt <- fread("large_file.csv")

# Or use arrow for very large/parquet files
library(arrow)
df <- read_parquet("data.parquet")

# Lazy evaluation with duckdb
library(duckdb)
con <- dbConnect(duckdb())
df_lazy <- tbl(con, "data.csv")
```

### Vectorization Over Loops
```r
# Good: vectorized
df$rate <- df$cases / df$population * 100000

# Avoid: row-by-row loop
for (i in 1:nrow(df)) {
  df$rate[i] <- df$cases[i] / df$population[i] * 100000
}
```

## Additional Resources

For detailed patterns, consult:
- **Tidyverse Style Guide**: https://style.tidyverse.org/
- **R for Data Science (2e)**: https://r4ds.hadley.nz/
- **The Epidemiologist R Handbook**: https://epirhandbook.com/
- **Quarto Documentation**: https://quarto.org/

## Version History

- v1.0.0 (2025-12-04): Initial release for PubHealthAI community

Overview

This skill provides practical, production-ready R code and patterns for data analysis, visualization, and reproducible workflows using tidyverse conventions. It focuses on clear, defendable pipelines for public health, epidemiology, and general data science tasks. The goal is to enable fast, reproducible analyses with readable code and sensible defaults.

How this skill works

The skill generates R code and project templates that follow a tidyverse-first approach, using the native pipe (|>) where available and falling back to %>% when needed. It provides idiomatic examples for data import, cleaning, transformation, visualization with ggplot2, and statistical modeling with tidy outputs via broom. It also embeds reproducibility practices: renv for package management, Quarto headers for reports, and well-structured project layouts.

When to use it

  • Writing or reviewing R scripts (.R), RMarkdown (.Rmd), or Quarto (.qmd) documents
  • Building reproducible analysis projects for public health or epidemiology
  • Creating ggplot2 visualizations and publication-ready figures
  • Preparing tidy data pipelines for modeling or reporting
  • Managing dependencies and collaborative environments with renv

Best practices

  • Prefer tidyverse functions (dplyr, tidyr, ggplot2) and load library(tidyverse) early
  • Use native pipe |> for readability; maintain line breaks after each pipe
  • Adopt snake_case for objects and verbs for function names; use <- for assignment
  • Include defensive checks (stopifnot, file.exists) and informative warnings for missing data
  • Structure projects with data/raw, data/processed, R/, analysis/, and output/ directories
  • Snapshot dependencies with renv::snapshot() and restore with renv::restore()

Example use cases

  • Clean raw surveillance CSVs, create age groups, and produce weekly incidence curves
  • Fit logistic regression models, tidy coefficient tables with broom, and export results
  • Produce multi-panel ggplot2 figures for a Quarto report with consistent theme and captions
  • Calculate age-adjusted rates using epitools and document methods in Quarto
  • Handle large datasets by delegating to data.table, arrow, or duckdb for performance

FAQ

Which pipe should I use, |> or %>%?

Use the native pipe |> when running R 4.1+ for forward chaining; use %>% only for compatibility with older environments.

How do I make analysis reproducible for collaborators?

Use renv to snapshot dependencies, include a clear project structure, and render Quarto documents with embedded code and metadata.