home / skills / kthorn / research-superpower / cleaning-up-research-sessions

cleaning-up-research-sessions skill

safe

/skills/research/cleaning-up-research-sessions

This skill safely remove intermediate files from completed research sessions while preserving core outputs and essential data for reproducibility.

npx playbooks add skill kthorn/research-superpower --skill cleaning-up-research-sessions

Review the files below or copy the command above to add this skill to your agents.

Files (1)

SKILL.md

11.3 KB

---
name: Cleaning Up Research Sessions
description: Safely remove intermediate files from completed research sessions while preserving important data
when_to_use: After research session is complete and consolidated. When research folder has accumulated temporary files. Before archiving or sharing research session.
version: 1.0.0
---

# Cleaning Up Research Sessions

## Overview

Remove intermediate files created during research workflow while preserving all important data.

**Core principle:** Conservative cleanup with user confirmation. Never delete anything important.

## When to Use

Use this skill when:
- Research session is complete and consolidated
- Preparing to archive or share research session folder
- Research folder has accumulated temporary/intermediate files
- User explicitly asks to clean up

**When NOT to use:**
- Research is still in progress
- User hasn't reviewed final outputs yet
- Unsure what files are safe to delete

## Files That Are ALWAYS KEPT

**NEVER delete these (protected list):**

**Core outputs:**
- `SUMMARY.md` - Enhanced findings with methodology
- `relevant-papers.json` - Filtered relevant papers
- `papers-reviewed.json` - Complete screening history
- `papers/` directory - All PDFs and supplementary files
- `citations/citation-graph.json` - Citation relationships

**Methodology documentation:**
- `screening-criteria.json` - Rubric definition (if exists)
- `test-set.json` - Rubric validation papers (if exists)
- `abstracts-cache.json` - Cached abstracts for re-screening (if exists)
- `rubric-changelog.md` - Rubric version history (if exists)

**Auxiliary documentation (if exists):**
- `README.md` - Project overview
- `TOP_PRIORITY_PAPERS.md` - Curated priority list
- `evaluated-papers.json` - Rich structured data

**Project configuration:**
- `.claude/` directory - Permissions and settings
- `*.py` helper scripts that were created - Keep for reproducibility

## Files That May Be Cleaned Up

**Candidates for removal (with confirmation):**

**Intermediate search results:**
- `initial-search-results.json` - Raw PubMed results before screening
  - Safe to delete: Data is in papers-reviewed.json
  - Reason to keep: Shows raw search results for reproducibility

**Temporary files:**
- `*.tmp` files
- `*.swp` files (vim swap files)
- `.DS_Store` (macOS)
- `__pycache__/` (Python cache)
- `*.pyc` (Python compiled)

**Log files:**
- `*.log` files
- `debug-*.txt` files

## Cleanup Workflow

### Step 1: Analyze Research Session

```bash
cd research-sessions/YYYY-MM-DD-description/

# List all files with sizes
find . -type f -exec ls -lh {} \; | awk '{print $5, $9}' | sort -rh
```

**Identify files by category:**
- Core outputs (MUST keep)
- Methodology files (SHOULD keep)
- Intermediate files (candidates for cleanup)
- Temporary files (safe to delete)

### Step 2: Present Cleanup Plan to User

**Show what will be deleted:**

```
🧹 Cleanup Analysis for: research-sessions/2025-10-11-btk-selectivity/

Files to KEEP (protected):
  ✅ SUMMARY.md (45 KB)
  ✅ relevant-papers.json (12 KB)
  ✅ papers-reviewed.json (28 KB)
  ✅ papers/ (14 PDFs, 32 MB)
  ✅ citations/citation-graph.json (5 KB)
  ✅ screening-criteria.json (2 KB)
  ✅ abstracts-cache.json (156 KB)

Files that CAN be removed (intermediate):
  🗑️  initial-search-results.json (8 KB) - Raw PubMed results
  🗑️  .DS_Store (6 KB) - macOS metadata

Total space to recover: 14 KB

Proceed with cleanup? (y/n/review)
```

**Options:**
- `y` - Delete intermediate files
- `n` - Cancel cleanup, keep everything
- `review` - Show contents of each file before deciding

### Step 3: Confirm Deletions

**Before deleting ANY file:**

1. **Verify it's not in protected list**
2. **Check file isn't referenced in SUMMARY.md**
3. **Confirm with user one more time**

**Example confirmation:**
```
About to delete:
- initial-search-results.json (8 KB)

This file contains raw PubMed search results. The data is preserved in
papers-reviewed.json, so this is safe to delete.

Confirm deletion? (y/n)
```

### Step 4: Perform Cleanup

**Delete confirmed files:**

```bash
# Move to trash instead of rm (safer)
# On macOS:
mv initial-search-results.json ~/.Trash/

# On Linux:
mv initial-search-results.json ~/.local/share/Trash/files/

# Or use rm if user confirms
rm initial-search-results.json
```

**Report results:**
```
✅ Cleanup complete!

Removed:
- initial-search-results.json (8 KB)
- .DS_Store (6 KB)

Space recovered: 14 KB

Protected files preserved:
- All 8 core files kept
- All 14 PDFs kept
- All methodology documentation kept
```

### Step 5: Verify Integrity

**After cleanup, verify critical files:**

```bash
# Check core files exist
test -f SUMMARY.md && echo "✓ SUMMARY.md"
test -f relevant-papers.json && echo "✓ relevant-papers.json"
test -f papers-reviewed.json && echo "✓ papers-reviewed.json"
test -d papers && echo "✓ papers/ directory"

# Verify JSON files are valid
jq empty relevant-papers.json && echo "✓ relevant-papers.json valid JSON"
jq empty papers-reviewed.json && echo "✓ papers-reviewed.json valid JSON"
```

**Report to user:**
```
✅ Integrity check passed
   - All core files present
   - All JSON files valid
   - All PDFs intact
```

## Special Cases

### Case 1: Large abstracts-cache.json

**If abstracts-cache.json is very large (>100 MB):**

```
⚠️  abstracts-cache.json is 256 MB

This file enables re-screening if you update the rubric. Options:
1. Keep (recommended if you might refine rubric)
2. Compress (gzip to ~50 MB, can decompress later)
3. Delete (only if research is final and won't be updated)

Choice? (1/2/3)
```

**If user chooses compress:**
```bash
gzip abstracts-cache.json
# Creates abstracts-cache.json.gz

echo "Compressed abstracts-cache.json to $(du -h abstracts-cache.json.gz | cut -f1)"
```

### Case 2: Helper Scripts

**If user created helper scripts during research:**

```
📝 Found helper scripts:
   - screen_papers.py (created for batch screening)
   - deep_dive_papers.py (created for data extraction)

These scripts document your methodology. Recommendations:
- Keep for reproducibility
- Add comments if not already documented
- Reference in SUMMARY.md under "Reproducibility" section

Keep scripts? (y/n)
```

### Case 3: Multiple Research Sessions

**If cleaning up multiple sessions:**

```bash
# Find all research sessions
find research-sessions/ -maxdepth 1 -type d

# For each session:
for session in research-sessions/*/; do
    echo "Analyzing: $session"
    # Run cleanup analysis
done
```

**Ask user:**
```
Found 5 completed research sessions.

Clean up all sessions? (y/n/select)
- y: Analyze and clean all sessions
- n: Cancel
- select: Choose which sessions to clean
```

## Safety Mechanisms

### Protected File List

**Maintain hardcoded list of patterns to NEVER delete:**

```python
PROTECTED_PATTERNS = [
    'SUMMARY.md',
    'relevant-papers.json',
    'papers-reviewed.json',
    'papers/*.pdf',
    'papers/*.zip',
    'citations/citation-graph.json',
    'screening-criteria.json',
    'test-set.json',
    'abstracts-cache.json',
    'rubric-changelog.md',
    'README.md',
    'TOP_PRIORITY_PAPERS.md',
    'evaluated-papers.json',
    '*.py',  # Helper scripts
    '.claude/*',  # Project settings
]
```

**Before deleting any file:**
```python
def is_protected(filepath):
    """Check if file matches any protected pattern"""
    for pattern in PROTECTED_PATTERNS:
        if fnmatch(filepath, pattern):
            return True
    return False

# Never delete protected files
if is_protected(file_to_delete):
    print(f"⚠️  ERROR: {file_to_delete} is protected and cannot be deleted")
    return
```

### Dry Run Mode

**Always show what will be deleted before doing it:**

```bash
# Dry run (show only, don't delete)
echo "DRY RUN - No files will be deleted"

for file in $candidate_files; do
    if is_safe_to_delete "$file"; then
        echo "Would delete: $file ($(du -h $file | cut -f1))"
    fi
done

echo ""
echo "Proceed with actual deletion? (y/n)"
```

## Integration with Other Skills

**After answering-research-questions workflow:**

1. Complete Phase 8 (consolidation)
2. User reviews SUMMARY.md and relevant-papers.json
3. **Optionally**: Run cleaning-up-research-sessions
4. Archive or share research folder

**Add to answering-research-questions Phase 8:**
```markdown
### Optional: Cleanup

After reviewing outputs, optionally clean up intermediate files:

"Research session is complete. Would you like me to clean up intermediate files?
I'll show you what will be deleted before removing anything."

If yes: Use `cleaning-up-research-sessions` skill
```

## Common Mistakes

**Deleting papers-reviewed.json:** This is the deduplication database - NEVER delete → Always protect with hardcoded list
**Deleting abstracts-cache.json:** Needed for re-screening → Ask user, default to keep
**Deleting helper scripts:** Important for reproducibility → Keep by default, ask if user wants to remove
**Not showing user what will be deleted:** User needs to see the plan → Always show dry run first
**Deleting without confirmation:** Too risky → Always ask for final confirmation
**Not verifying after cleanup:** Files could be corrupted → Run integrity check after deletion

## Quick Reference

| Action | Command |
|--------|---------|
| List all files with sizes | `find . -type f -exec ls -lh {} \; \| sort -rh` |
| Check if JSON valid | `jq empty file.json` |
| Move to trash (macOS) | `mv file ~/.Trash/` |
| Move to trash (Linux) | `mv file ~/.local/share/Trash/files/` |
| Compress large file | `gzip file` |
| Calculate folder size | `du -sh .` |

## Success Criteria

Cleanup successful when:
- All protected files still exist and are valid
- Only confirmed intermediate files deleted
- User knows what was deleted and why
- Space recovered is reported
- Integrity check passed
- Research session is still fully functional

## Example Session

```
User: "Can you clean up the research session folder?"

Agent: "I'll analyze the research session and show you what can be safely cleaned up."

[Analyzes folder]

Agent: "🧹 Cleanup Analysis for: research-sessions/2025-10-11-btk-selectivity/

Files to KEEP (protected):
  ✅ SUMMARY.md (45 KB) - Final findings
  ✅ relevant-papers.json (12 KB) - Filtered results
  ✅ papers-reviewed.json (28 KB) - Complete history
  ✅ papers/ (14 PDFs, 32 MB) - Downloaded papers
  ✅ citations/citation-graph.json (5 KB) - Citation network
  ✅ screening-criteria.json (2 KB) - Rubric definition
  ✅ abstracts-cache.json (156 KB) - Cached abstracts

Files that CAN be removed:
  🗑️  initial-search-results.json (8 KB) - Raw PubMed results
     → Data is preserved in papers-reviewed.json
  🗑️  .DS_Store (6 KB) - macOS metadata
     → Not needed

Total space to recover: 14 KB

Proceed with cleanup? (y/n/review)"

User: "y"

Agent: "Final confirmation - About to delete:
- initial-search-results.json (8 KB)
- .DS_Store (6 KB)

All protected files will be preserved. Confirm? (y/n)"

User: "y"

Agent: [Deletes files]

"✅ Cleanup complete!

Removed:
- initial-search-results.json (8 KB)
- .DS_Store (6 KB)

Space recovered: 14 KB

Running integrity check...
✅ All core files intact
✅ All JSON files valid
✅ All 14 PDFs present

Research session is ready for archiving or sharing."
```

## Next Steps

After cleanup:
- Research folder is clean and ready to archive
- Share folder with collaborators
- Move to long-term storage
- Continue with follow-up research if needed

Overview

This skill safely removes intermediate and temporary files from completed research sessions while preserving core outputs and reproducibility artifacts. It follows a conservative, user-confirmed workflow: analyze files, present a dry-run plan, require explicit confirmations, and verify integrity after cleanup. The priority is never to delete important data.

How this skill works

The skill scans a session folder, classifies files into protected outputs, methodology docs, intermediates, and temp files, and estimates recoverable space. It presents a dry-run report showing which files are safe candidates and why, asks the user to review or inspect contents, then requests final confirmation before moving files to trash or deleting. After deletion it runs integrity checks on protected files and reports results.

When to use it

When a research session is complete and consolidated
Before archiving or sharing a session folder
When temporary/intermediate files have accumulated and you want to reduce clutter
When the user explicitly requests cleanup
When preparing multiple sessions for batch cleanup

Best practices

Always run a dry-run first and show the list of candidate deletions to the user
Keep protected core outputs: SUMMARY.md, relevant-papers.json, papers-reviewed.json, papers/, citation graphs, and helper scripts
Confirm any action that affects abstracts-cache.json or other large files (offer compress instead of delete)
Move files to system Trash rather than immediate rm when possible for easy recovery
Run JSON validation and file existence checks after cleanup to ensure integrity

Example use cases

Clean a single completed session folder before sharing with collaborators
Batch-analyze multiple dated research-sessions directories and select a subset to clean
Compress a very large abstracts-cache.json to save space while retaining re-screening capability
Remove temporary editor swap and OS metadata files (.swp, .DS_Store, *.tmp) after final review
Move initial raw search result files to trash after confirming their data is preserved in papers-reviewed.json

FAQ

What files are never deleted?

Core outputs and reproducibility artifacts are protected by default: SUMMARY.md, relevant-papers.json, papers-reviewed.json, the papers/ directory and PDFs, citation graphs, screening criteria, helper scripts (*.py), and project settings. The skill will refuse to delete these.

What if abstracts-cache.json is huge?

Offer options: keep, compress (gzip), or delete only after explicit confirmation. Default recommendation is to keep or compress to preserve re-screening capability.

How do you recover deleted files?

By default the skill moves files to the system Trash on macOS or Linux when available; advise users to restore from Trash or their backup if recovery is needed.