
academic-pdf-redaction skill

/tasks/paper-anonymizer/environment/skills/academic-pdf-redaction

This skill redacts identifying information from academic PDFs for blind review while preserving references and ensuring data integrity.

npx playbooks add skill benchflow-ai/skillsbench --skill academic-pdf-redaction

---
name: academic-pdf-redaction
description: Redact text from PDF documents for blind review anonymization
---

# PDF Redaction for Blind Review

Redact identifying information from academic papers for blind review.

## CRITICAL RULES

1. **PRESERVE References section** - Self-citations MUST remain intact
2. **ONLY redact specific text matches** - Never redact entire pages/regions
3. **VERIFY output** - Check that 80%+ of original text remains (the verification function below raises an error under 70%)

## Common Pitfalls to AVOID

```python
# ❌ WRONG - This removes ALL text from the page:
for block in page.get_text("blocks"):
    page.add_redact_annot(fitz.Rect(block[:4]))

# ❌ WRONG - Drawing rectangles over text:
page.draw_rect(fitz.Rect(0, 0, 600, 100), fill=(0,0,0))

# ✅ CORRECT - Only redact specific search matches:
for rect in page.search_for("John Smith"):
    page.add_redact_annot(rect)
```

## Patterns to Redact (Before References Only)

**IMPORTANT: Use FULL names/phrases, not partial matches!**
- ✅ "John Smith" (full name)
- ❌ "Smith" (partial - would incorrectly match "Smith et al." citations in References)

1. **Author names** - FULL names only (e.g., "John Smith", not just "Smith")
2. **Affiliations** - Universities, companies (e.g., "Duke University")
3. **Email addresses** - Pattern: `*@*.edu`, `*@*.com`
4. **Venue names** - Conference/workshop names (e.g., "ICML 2024", "ICML Workshop")
5. **arXiv identifiers** - Pattern: `arXiv:XXXX.XXXXX`
6. **DOIs** - Pattern: `10.XXXX/...`
7. **Acknowledgement names** - Names in "Acknowledgements" section
8. **Equal contribution footnotes** - e.g., "Equal contribution", "* Equal contribution"
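
Items 3, 5, and 6 are written as wildcard shapes, but `page.search_for` looks for literal strings, so the concrete values have to be pulled out of the page text first. A minimal sketch of that step, assuming regular expressions that only approximate those shapes (the helper name `extract_identifier_strings` and the regexes are illustrative and not part of the skill):

```python
import re

# Approximate regexes for the wildcard shapes listed above (assumptions; tune per paper).
IDENTIFIER_REGEXES = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),  # email addresses
    re.compile(r"arXiv:\d{4}\.\d{4,5}(?:v\d+)?"),                    # arXiv identifiers
    re.compile(r"\b10\.\d{4,9}/\S+"),                                # DOIs
]

def extract_identifier_strings(text: str) -> list[str]:
    """Return the unique literal strings in `text` that match any of the regexes."""
    found: list[str] = []
    for regex in IDENTIFIER_REGEXES:
        for match in regex.finditer(text):
            # Drop trailing sentence punctuation that \S+ may have swallowed.
            literal = match.group(0).rstrip(".,;)")
            if literal not in found:
                found.append(literal)
    return found
```

Each returned string could then be appended to the pattern list for that page, for example `patterns + extract_identifier_strings(page.get_text())`.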

## PyMuPDF (fitz) - Recommended Approach

```python
import fitz
import os

def redact_with_pymupdf(input_path: str, output_path: str, patterns: list[str]):
    """Redact specific patterns from PDF using PyMuPDF."""
    doc = fitz.open(input_path)

    # Find References page - stop redacting there
    references_page = None
    for i, page in enumerate(doc):
        if "references" in page.get_text().lower():
            references_page = i
            break

    for page_num, page in enumerate(doc):
        if references_page is not None and page_num >= references_page:
            continue  # Skip References section

        for pattern in patterns:
            # ONLY redact exact search matches
            for rect in page.search_for(pattern):
                page.add_redact_annot(rect, fill=(0, 0, 0))
        page.apply_redactions()

    # dirname is "" for a bare filename, and os.makedirs("") raises
    out_dir = os.path.dirname(output_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    doc.save(output_path)
    doc.close()

    # MUST verify after saving
    verify_redaction(input_path, output_path)
```
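
The substring check in the function above treats any page that mentions "references" in running text as the start of the References section, which can end redaction too early. One possible refinement, sketched here under the assumption that the section starts with a standalone heading line (optionally numbered), is to match that heading instead. The helper name `find_references_page` is hypothetical:

```python
import re
from typing import Optional

# A line consisting only of "References", optionally numbered ("7 References", "7. References").
REFERENCES_HEADING = re.compile(
    r"^\s*(?:\d+\.?\s+)?references\s*$", re.IGNORECASE | re.MULTILINE
)

def find_references_page(page_texts: list[str]) -> Optional[int]:
    """Return the index of the first page with a standalone References heading, else None."""
    for index, text in enumerate(page_texts):
        if REFERENCES_HEADING.search(text):
            return index
    return None
```

With PyMuPDF this could be called as `find_references_page([p.get_text() for p in doc])`. Papers that title the section differently would need the regex adjusted.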

## REQUIRED: Verification Function

**Always run this after ANY redaction to catch errors early:**

```python
import fitz

def verify_redaction(original_path, output_path):
    """Verify redaction didn't corrupt the PDF."""
    orig = fitz.open(original_path)
    redc = fitz.open(output_path)

    orig_len = sum(len(p.get_text()) for p in orig)
    redc_len = sum(len(p.get_text()) for p in redc)

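    # Guard against image-only PDFs, where the ratio below would divide by zero
    if orig_len == 0:
        raise ValueError("Original PDF has no extractable text")

    print(f"Original: {len(orig)} pages, {orig_len} chars")
    print(f"Redacted: {len(redc)} pages, {redc_len} chars")
    print(f"Retained: {redc_len/orig_len:.1%}")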

    # DEFENSIVE CHECKS - fail fast if something went wrong
    if len(redc) != len(orig):
        raise ValueError(f"Page count changed: {len(orig)} -> {len(redc)}")
    if redc_len < 1000:
        raise ValueError(f"PDF corrupted: only {redc_len} chars remain!")
    if redc_len < orig_len * 0.7:
        raise ValueError(f"Too much removed: kept only {redc_len/orig_len:.0%}")

    orig.close()
    redc.close()
    print("✓ Verification passed")
```

Overview

This skill redacts identifying text from academic PDF manuscripts to prepare them for blind review. It targets exact matches for names, affiliations, emails, venues, arXiv IDs, DOIs, acknowledgement names, and equal-contribution notes while preserving the References section.

How this skill works

The skill scans each PDF page up to the References section and searches for the full-text patterns provided by the user. It applies redaction annotations only to the search matches and writes a new PDF, then runs automated verification to confirm that the document structure and most of the text remain intact. Defensive checks require an unchanged page count and at least 70% of the original text (the rules target 80% or more) to catch over-redaction or corruption.

When to use it

Preparing submissions for double-blind peer review
Removing personal identifiers before public preprint posting
Anonymizing author metadata for internal review workflows
Batch-processing many manuscripts while preserving citation integrity

Best practices

Always supply full-name patterns (e.g., "John Smith"), never single surnames or partial tokens
Stop redaction at the References section to avoid altering citations
Avoid page- or region-level redaction; only redact exact search matches
Run the verification step after saving, and fail if page counts differ or retained text falls below the retention threshold (70% in the verification function)
Keep a copy of the original PDF and log all redaction patterns applied

Example use cases

Redact author names and affiliations from a conference submission while leaving References untouched
Remove institutional email addresses and venue mentions before sharing a draft with external reviewers
Automate anonymization for a folder of manuscripts using a standard pattern list
Check redaction results programmatically and raise alerts if output looks corrupted or over-redacted
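
The folder-level use case can be sketched as a small driver that applies a redaction function to every PDF and collects failures instead of stopping at the first one. This is an illustrative outline only: `redact_folder` is a hypothetical name, and the `redact` argument is assumed to be a callable with the same signature as `redact_with_pymupdf`, which reports problems by raising `ValueError`.

```python
from pathlib import Path
from typing import Callable

def redact_folder(in_dir: str, out_dir: str, patterns: list[str],
                  redact: Callable[[str, str, list[str]], None]) -> dict[str, str]:
    """Run `redact` on every *.pdf in in_dir; return {filename: error} for failures."""
    source, target = Path(in_dir), Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    failures: dict[str, str] = {}
    for pdf in sorted(source.glob("*.pdf")):
        try:
            redact(str(pdf), str(target / pdf.name), patterns)
        except ValueError as exc:  # verification failures surface as ValueError
            failures[pdf.name] = str(exc)
    return failures
```

A non-empty return value lists the manuscripts that need manual inspection, which keeps one bad file from blocking the rest of the batch.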

FAQ

Will the References section be altered?

No. The workflow detects the first page containing "References" (case-insensitive) and skips redaction on that page and all following pages to preserve citations and self-references.

How do you prevent removing too much text?

The tool redacts only the matches returned by the search. After saving, it verifies the output and raises an error if the page count changed, fewer than 1,000 characters remain, or less than 70% of the original text is retained.
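
The retention floor in `verify_redaction` is a fixed constant in the code. If a stricter or looser limit is wanted, the ratio check could be factored out with the threshold as a parameter. A minimal sketch, where `check_retention` and `min_ratio` are hypothetical names:

```python
def check_retention(orig_len: int, redacted_len: int, min_ratio: float = 0.7) -> float:
    """Return the retained-text ratio, raising ValueError if it is below min_ratio."""
    if orig_len <= 0:
        # No extractable text in the original, so the ratio is undefined.
        raise ValueError("Original has no extractable text")
    ratio = redacted_len / orig_len
    if ratio < min_ratio:
        raise ValueError(
            f"Too much removed: kept only {ratio:.0%} (minimum {min_ratio:.0%})"
        )
    return ratio
```

Calling it with `min_ratio=0.8` would enforce the stricter 80% target from the rules.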

Can partial names like a surname be redacted?

Do not use partial names. Only full names or exact phrases should be provided to avoid accidental removal of legitimate citations or mentions.