This skill safely removes intermediate files from completed research sessions while preserving core outputs and essential data for reproducibility.
npx playbooks add skill kthorn/research-superpower --skill cleaning-up-research-sessions
---
name: Cleaning Up Research Sessions
description: Safely remove intermediate files from completed research sessions while preserving important data
when_to_use: After research session is complete and consolidated. When research folder has accumulated temporary files. Before archiving or sharing research session.
version: 1.0.0
---
# Cleaning Up Research Sessions
## Overview
Remove intermediate files created during research workflow while preserving all important data.
**Core principle:** Conservative cleanup with user confirmation. Never delete anything important.
## When to Use
Use this skill when:
- Research session is complete and consolidated
- Preparing to archive or share research session folder
- Research folder has accumulated temporary/intermediate files
- User explicitly asks to clean up
**When NOT to use:**
- Research is still in progress
- User hasn't reviewed final outputs yet
- Unsure what files are safe to delete
## Files That Are ALWAYS KEPT
**NEVER delete these (protected list):**
**Core outputs:**
- `SUMMARY.md` - Enhanced findings with methodology
- `relevant-papers.json` - Filtered relevant papers
- `papers-reviewed.json` - Complete screening history
- `papers/` directory - All PDFs and supplementary files
- `citations/citation-graph.json` - Citation relationships
**Methodology documentation:**
- `screening-criteria.json` - Rubric definition (if exists)
- `test-set.json` - Rubric validation papers (if exists)
- `abstracts-cache.json` - Cached abstracts for re-screening (if exists)
- `rubric-changelog.md` - Rubric version history (if exists)
**Auxiliary documentation (if exists):**
- `README.md` - Project overview
- `TOP_PRIORITY_PAPERS.md` - Curated priority list
- `evaluated-papers.json` - Rich structured data
**Project configuration:**
- `.claude/` directory - Permissions and settings
- `*.py` helper scripts that were created - Keep for reproducibility
## Files That May Be Cleaned Up
**Candidates for removal (with confirmation):**
**Intermediate search results:**
- `initial-search-results.json` - Raw PubMed results before screening
  - Safe to delete: Data is in `papers-reviewed.json`
  - Reason to keep: Shows raw search results for reproducibility
**Temporary files:**
- `*.tmp` files
- `*.swp` files (vim swap files)
- `.DS_Store` (macOS)
- `__pycache__/` (Python cache)
- `*.pyc` (Python compiled)
**Log files:**
- `*.log` files
- `debug-*.txt` files
## Cleanup Workflow
### Step 1: Analyze Research Session
```bash
cd research-sessions/YYYY-MM-DD-description/
# List all files with sizes
find . -type f -exec ls -lh {} \; | awk '{print $5, $9}' | sort -rh
```
**Identify files by category:**
- Core outputs (MUST keep)
- Methodology files (SHOULD keep)
- Intermediate files (candidates for cleanup)
- Temporary files (safe to delete)
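The categorization above could be sketched in Python roughly as follows. This is a minimal illustration, not the skill's actual implementation: the table names, the `classify` function, and the extra `protected` and `unknown` categories are assumptions, with patterns transcribed from the lists in this document.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Hypothetical category tables, transcribed from the lists in this document.
CATEGORIES = [
    ("core", ["SUMMARY.md", "relevant-papers.json", "papers-reviewed.json",
              "papers/*", "citations/citation-graph.json"]),
    ("methodology", ["screening-criteria.json", "test-set.json",
                     "abstracts-cache.json", "rubric-changelog.md"]),
    # Auxiliary documentation and project configuration are also kept.
    ("protected", ["README.md", "TOP_PRIORITY_PAPERS.md",
                   "evaluated-papers.json", ".claude/*", "*.py"]),
    ("intermediate", ["initial-search-results.json"]),
    # Temporary files and log files are grouped together here.
    ("temporary", ["*.tmp", "*.swp", ".DS_Store", "__pycache__/*",
                   "*.pyc", "*.log", "debug-*.txt"]),
]


def classify(relative_path: str) -> str:
    """Return the category for a path relative to the session folder."""
    path = PurePosixPath(relative_path)  # collapses a leading "./"
    full, name = path.as_posix(), path.name
    # Keep-categories are checked first, so protection wins over cleanup.
    for category, patterns in CATEGORIES:
        if any(fnmatch(full, p) or fnmatch(name, p) for p in patterns):
            return category
    return "unknown"  # unrecognized files are shown to the user, not deleted
```

Checking the keep-categories before the cleanup categories means a file such as `papers/notes.tmp` is treated as core rather than temporary, which matches the conservative principle of this skill.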
### Step 2: Present Cleanup Plan to User
**Show what will be deleted:**
```
🧹 Cleanup Analysis for: research-sessions/2025-10-11-btk-selectivity/
Files to KEEP (protected):
✅ SUMMARY.md (45 KB)
✅ relevant-papers.json (12 KB)
✅ papers-reviewed.json (28 KB)
✅ papers/ (14 PDFs, 32 MB)
✅ citations/citation-graph.json (5 KB)
✅ screening-criteria.json (2 KB)
✅ abstracts-cache.json (156 KB)
Files that CAN be removed (intermediate):
🗑️ initial-search-results.json (8 KB) - Raw PubMed results
🗑️ .DS_Store (6 KB) - macOS metadata
Total space to recover: 14 KB
Proceed with cleanup? (y/n/review)
```
**Options:**
- `y` - Delete intermediate files
- `n` - Cancel cleanup, keep everything
- `review` - Show contents of each file before deciding
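The sizes and the "Total space to recover" figure in such a plan can be computed before anything is touched. The sketch below is one possible way to do it; `human_size` and `build_plan` are hypothetical helper names, and sizes are rounded to whole units as in the example report.

```python
from pathlib import Path


def human_size(num_bytes: float) -> str:
    """Format a byte count in whole units, e.g. 8192 -> '8 KB'."""
    for unit in ("B", "KB", "MB", "GB"):
        if num_bytes < 1024 or unit == "GB":
            return f"{num_bytes:.0f} {unit}"
        num_bytes /= 1024


def build_plan(session: Path, candidates: list[str]) -> tuple[list[str], int]:
    """Return report lines and the total bytes for candidates that exist."""
    lines, total = [], 0
    for rel in candidates:
        path = session / rel
        if path.is_file():  # silently skip candidates that are not present
            size = path.stat().st_size
            total += size
            lines.append(f"{rel} ({human_size(size)})")
    return lines, total
```

Nothing in this sketch deletes files; it only gathers what the user needs to see in order to answer `y`, `n`, or `review`.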
### Step 3: Confirm Deletions
**Before deleting ANY file:**
1. **Verify it's not in protected list**
2. **Check file isn't referenced in SUMMARY.md**
3. **Confirm with user one more time**
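The second check, whether a file is referenced in `SUMMARY.md`, could be approximated with a plain substring search. This is a rough sketch under that assumption; `referenced_in_summary` is a hypothetical name, and a substring match can give false positives, which errs on the side of keeping files.

```python
from pathlib import Path


def referenced_in_summary(session: Path, relative_path: str) -> bool:
    """True if the file's relative path or bare name appears in SUMMARY.md."""
    summary = session / "SUMMARY.md"
    if not summary.is_file():
        # Nothing to check against; a missing SUMMARY.md is itself a warning
        # sign that the session may not be consolidated yet.
        return False
    text = summary.read_text(encoding="utf-8", errors="replace")
    return relative_path in text or Path(relative_path).name in text
```

A file for which this returns `True` would be dropped from the deletion list or flagged to the user before the final confirmation.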
**Example confirmation:**
```
About to delete:
- initial-search-results.json (8 KB)
This file contains raw PubMed search results. The data is preserved in
papers-reviewed.json, so this is safe to delete.
Confirm deletion? (y/n)
```
### Step 4: Perform Cleanup
**Delete confirmed files:**
```bash
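# Move to trash instead of rm (safer). Use ONE of the following.
# On macOS:
mv initial-search-results.json ~/.Trash/
# On Linux (a plain mv writes no .trashinfo metadata, so the desktop
# "Restore" action may not work; recover the file from this folder manually):
mkdir -p ~/.local/share/Trash/files
mv initial-search-results.json ~/.local/share/Trash/files/
# Or, only if the user explicitly confirms permanent deletion:
# rm initial-search-results.json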
```
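A plain `mv` into a trash folder overwrites any file of the same name that is already there. The following sketch shows one way to avoid that by renaming on collision; `move_to_trash` and its `trash_dir` parameter are hypothetical, and the caller is assumed to pass whichever trash location applies to the platform.

```python
import shutil
from pathlib import Path


def move_to_trash(path: Path, trash_dir: Path) -> Path:
    """Move path into trash_dir, renaming on collision instead of overwriting."""
    trash_dir.mkdir(parents=True, exist_ok=True)
    target = trash_dir / path.name
    counter = 1
    while target.exists():
        # e.g. results.json -> results.1.json, results.2.json, ...
        target = trash_dir / f"{path.stem}.{counter}{path.suffix}"
        counter += 1
    shutil.move(str(path), str(target))
    return target
```

The returned path can be included in the cleanup report so the user knows exactly where each removed file went.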
**Report results:**
```
✅ Cleanup complete!
Removed:
- initial-search-results.json (8 KB)
- .DS_Store (6 KB)
Space recovered: 14 KB
Protected files preserved:
- All 8 core files kept
- All 14 PDFs kept
- All methodology documentation kept
```
### Step 5: Verify Integrity
**After cleanup, verify critical files:**
```bash
# Check core files exist
test -f SUMMARY.md && echo "✓ SUMMARY.md" || echo "MISSING: SUMMARY.md"
test -f relevant-papers.json && echo "✓ relevant-papers.json" || echo "MISSING: relevant-papers.json"
test -f papers-reviewed.json && echo "✓ papers-reviewed.json" || echo "MISSING: papers-reviewed.json"
test -d papers && echo "✓ papers/ directory" || echo "MISSING: papers/ directory"
# Verify JSON files are valid
jq empty relevant-papers.json && echo "✓ relevant-papers.json valid JSON"
jq empty papers-reviewed.json && echo "✓ papers-reviewed.json valid JSON"
```
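Where `jq` is not installed, the same checks can be done with the Python standard library. This is a minimal sketch; `check_integrity` and the two constant lists are hypothetical names covering only the files checked in the shell commands above.

```python
import json
from pathlib import Path

REQUIRED_FILES = ["SUMMARY.md", "relevant-papers.json", "papers-reviewed.json"]
JSON_FILES = ["relevant-papers.json", "papers-reviewed.json"]


def check_integrity(session: Path) -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for name in REQUIRED_FILES:
        if not (session / name).is_file():
            problems.append(f"missing file: {name}")
    if not (session / "papers").is_dir():
        problems.append("missing directory: papers/")
    for name in JSON_FILES:
        path = session / name
        if path.is_file():  # missing files were already reported above
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except (json.JSONDecodeError, UnicodeDecodeError) as exc:
                problems.append(f"invalid JSON: {name} ({exc})")
    return problems
```

Returning a list of problems rather than a boolean lets the report name exactly which file failed.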
**Report to user:**
```
✅ Integrity check passed
- All core files present
- All JSON files valid
- All PDFs intact
```
## Special Cases
### Case 1: Large abstracts-cache.json
**If abstracts-cache.json is very large (>100 MB):**
```
⚠️ abstracts-cache.json is 256 MB
This file enables re-screening if you update the rubric. Options:
1. Keep (recommended if you might refine rubric)
2. Compress (gzip to ~50 MB, can decompress later)
3. Delete (only if research is final and won't be updated)
Choice? (1/2/3)
```
**If user chooses compress:**
```bash
gzip abstracts-cache.json
# Creates abstracts-cache.json.gz
echo "Compressed abstracts-cache.json to $(du -h abstracts-cache.json.gz | cut -f1)"
```
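Because the original file disappears after compression, it may be worth verifying the archive first. The sketch below does this with the standard `gzip` and `hashlib` modules; `compress_and_verify` is a hypothetical helper, not part of the skill as written.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def _sha256(stream) -> str:
    """Hash a binary stream in 1 MiB chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(1 << 20), b""):
        digest.update(chunk)
    return digest.hexdigest()


def compress_and_verify(path: Path) -> Path:
    """Gzip path to '<name>.gz', verify the round trip, then remove the original."""
    gz_path = path.with_name(path.name + ".gz")
    with path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    with path.open("rb") as src:
        original = _sha256(src)
    with gzip.open(gz_path, "rb") as packed:
        restored = _sha256(packed)
    if original != restored:
        gz_path.unlink()  # keep the original, discard the bad archive
        raise OSError(f"verification failed for {path}")
    path.unlink()
    return gz_path
```

The original is only removed after the decompressed content hashes to the same value, so a failed or interrupted compression leaves `abstracts-cache.json` in place.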
### Case 2: Helper Scripts
**If user created helper scripts during research:**
```
Found helper scripts:
- screen_papers.py (created for batch screening)
- deep_dive_papers.py (created for data extraction)
These scripts document your methodology. Recommendations:
- Keep for reproducibility
- Add comments if not already documented
- Reference in SUMMARY.md under "Reproducibility" section
Keep scripts? (y/n)
```
### Case 3: Multiple Research Sessions
**If cleaning up multiple sessions:**
```bash
# Find all research sessions (-mindepth 1 excludes research-sessions/ itself)
find research-sessions/ -mindepth 1 -maxdepth 1 -type d
# For each session:
for session in research-sessions/*/; do
  echo "Analyzing: $session"
  # Run cleanup analysis
done
```
**Ask user:**
```
Found 5 completed research sessions.
Clean up all sessions? (y/n/select)
- y: Analyze and clean all sessions
- n: Cancel
- select: Choose which sessions to clean
```
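To build the list of sessions offered to the user, one option is to treat a folder as completed only if it already contains `SUMMARY.md`. That criterion is an assumption made for this sketch, as is the `find_sessions` name.

```python
from pathlib import Path


def find_sessions(root: Path) -> list[Path]:
    """Return session folders under root that already contain SUMMARY.md."""
    return sorted(
        child for child in root.iterdir()
        if child.is_dir() and (child / "SUMMARY.md").is_file()
    )
```

Folders without a summary are skipped, which keeps in-progress sessions out of a bulk cleanup.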
## Safety Mechanisms
### Protected File List
**Maintain hardcoded list of patterns to NEVER delete:**
```python
PROTECTED_PATTERNS = [
    'SUMMARY.md',
    'relevant-papers.json',
    'papers-reviewed.json',
    'papers/*.pdf',
    'papers/*.zip',
    'citations/citation-graph.json',
    'screening-criteria.json',
    'test-set.json',
    'abstracts-cache.json',
    'rubric-changelog.md',
    'README.md',
    'TOP_PRIORITY_PAPERS.md',
    'evaluated-papers.json',
    '*.py',       # Helper scripts
    '.claude/*',  # Project settings
]
```
**Before deleting any file:**
```python
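import os
from fnmatch import fnmatch


def is_protected(filepath):
    """Check if file matches any protected pattern"""
    for pattern in PROTECTED_PATTERNS:
        if fnmatch(filepath, pattern):
            return True
    return False


def delete_if_unprotected(file_to_delete):
    """Delete a file unless it matches the protected list."""
    # Never delete protected files
    if is_protected(file_to_delete):
        print(f"⚠️ ERROR: {file_to_delete} is protected and cannot be deleted")
        return
    os.remove(file_to_delete)  # or move to trash, as in Step 4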
```
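One caveat with `fnmatch` is that it compares the string literally, so `./SUMMARY.md` does not match the pattern `SUMMARY.md`. The sketch below normalizes paths relative to the session root before matching. The function name and the short `EXAMPLE_PATTERNS` list are illustrative assumptions, and `papers/*` is used here to cover everything under `papers/`, since `*` in `fnmatch` also matches `/`.

```python
import os
from fnmatch import fnmatch

# Hypothetical subset of the protected list, for illustration only.
EXAMPLE_PATTERNS = ["SUMMARY.md", "papers/*", "*.py", ".claude/*"]


def is_protected_normalized(filepath: str, session_root: str = ".") -> bool:
    """Match on the path relative to the session root, with '/' separators."""
    rel = os.path.relpath(filepath, session_root).replace(os.sep, "/")
    return any(fnmatch(rel, pattern) for pattern in EXAMPLE_PATTERNS)


# The un-normalized form slips past a literal pattern:
print(fnmatch("./SUMMARY.md", "SUMMARY.md"))      # False
print(is_protected_normalized("./SUMMARY.md"))    # True
```

This matters most when candidate paths come from `find .`, which prefixes every result with `./`.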
### Dry Run Mode
**Always show what will be deleted before doing it:**
```bash
# Dry run (show only, don't delete)
# is_safe_to_delete is a placeholder for the protected-list check above
echo "DRY RUN - No files will be deleted"
for file in $candidate_files; do
  if is_safe_to_delete "$file"; then
    echo "Would delete: $file ($(du -h "$file" | cut -f1))"
  fi
done
echo ""
echo "Proceed with actual deletion? (y/n)"
```
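In Python, the same idea is often expressed as a `dry_run` flag that defaults to `True`, so that deleting requires an explicit opt-in. The `cleanup` function below is a hypothetical sketch of that pattern, not the skill's own code.

```python
from pathlib import Path


def cleanup(session: Path, candidates: list[str], dry_run: bool = True) -> list[str]:
    """Delete candidate files; with dry_run=True only report what would happen."""
    actions = []
    for rel in candidates:
        path = session / rel
        if not path.is_file():
            continue  # nothing to do for candidates that are not present
        if dry_run:
            actions.append(f"Would delete: {rel}")
        else:
            path.unlink()
            actions.append(f"Deleted: {rel}")
    return actions
```

Calling `cleanup(session, candidates)` first and `cleanup(session, candidates, dry_run=False)` only after the user confirms mirrors the two-stage flow described in this section.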
## Integration with Other Skills
**After answering-research-questions workflow:**
1. Complete Phase 8 (consolidation)
2. User reviews SUMMARY.md and relevant-papers.json
3. **Optionally**: Run cleaning-up-research-sessions
4. Archive or share research folder
**Add to answering-research-questions Phase 8:**
```markdown
### Optional: Cleanup
After reviewing outputs, optionally clean up intermediate files:
"Research session is complete. Would you like me to clean up intermediate files?
I'll show you what will be deleted before removing anything."
If yes: Use `cleaning-up-research-sessions` skill
```
## Common Mistakes
**Deleting papers-reviewed.json:** This is the deduplication database - NEVER delete → Always protect with hardcoded list
**Deleting abstracts-cache.json:** Needed for re-screening → Ask user, default to keep
**Deleting helper scripts:** Important for reproducibility → Keep by default, ask if user wants to remove
**Not showing user what will be deleted:** User needs to see the plan → Always show dry run first
**Deleting without confirmation:** Too risky → Always ask for final confirmation
**Not verifying after cleanup:** Files could be corrupted → Run integrity check after deletion
## Quick Reference
| Action | Command |
|--------|---------|
| List all files with sizes | `find . -type f -exec du -h {} + \| sort -rh` |
| Check if JSON valid | `jq empty file.json` |
| Move to trash (macOS) | `mv file ~/.Trash/` |
| Move to trash (Linux) | `mv file ~/.local/share/Trash/files/` |
| Compress large file | `gzip file` |
| Calculate folder size | `du -sh .` |
## Success Criteria
Cleanup successful when:
- All protected files still exist and are valid
- Only confirmed intermediate files deleted
- User knows what was deleted and why
- Space recovered is reported
- Integrity check passed
- Research session is still fully functional
## Example Session
```
User: "Can you clean up the research session folder?"
Agent: "I'll analyze the research session and show you what can be safely cleaned up."
[Analyzes folder]
Agent: "🧹 Cleanup Analysis for: research-sessions/2025-10-11-btk-selectivity/
Files to KEEP (protected):
✅ SUMMARY.md (45 KB) - Final findings
✅ relevant-papers.json (12 KB) - Filtered results
✅ papers-reviewed.json (28 KB) - Complete history
✅ papers/ (14 PDFs, 32 MB) - Downloaded papers
✅ citations/citation-graph.json (5 KB) - Citation network
✅ screening-criteria.json (2 KB) - Rubric definition
✅ abstracts-cache.json (156 KB) - Cached abstracts
Files that CAN be removed:
🗑️ initial-search-results.json (8 KB) - Raw PubMed results
   → Data is preserved in papers-reviewed.json
🗑️ .DS_Store (6 KB) - macOS metadata
   → Not needed
Total space to recover: 14 KB
Proceed with cleanup? (y/n/review)"
User: "y"
Agent: "Final confirmation - About to delete:
- initial-search-results.json (8 KB)
- .DS_Store (6 KB)
All protected files will be preserved. Confirm? (y/n)"
User: "y"
Agent: [Deletes files]
"✅ Cleanup complete!
Removed:
- initial-search-results.json (8 KB)
- .DS_Store (6 KB)
Space recovered: 14 KB
Running integrity check...
✅ All core files intact
✅ All JSON files valid
✅ All 14 PDFs present
Research session is ready for archiving or sharing."
Research session is ready for archiving or sharing."
```
## Next Steps
After cleanup:
- Research folder is clean and ready to archive
- Share folder with collaborators
- Move to long-term storage
- Continue with follow-up research if needed
This skill safely removes intermediate and temporary files from completed research sessions while preserving core outputs and reproducibility artifacts. It follows a conservative, user-confirmed workflow: analyze files, present a dry-run plan, require explicit confirmations, and verify integrity after cleanup. The priority is never to delete important data.
The skill scans a session folder, classifies files into protected outputs, methodology docs, intermediates, and temp files, and estimates recoverable space. It presents a dry-run report showing which files are safe candidates and why, asks the user to review or inspect contents, then requests final confirmation before moving files to trash or deleting. After deletion it runs integrity checks on protected files and reports results.
What files are never deleted?
Core outputs and reproducibility artifacts are protected by default: SUMMARY.md, relevant-papers.json, papers-reviewed.json, the papers/ directory and PDFs, citation graphs, screening criteria, helper scripts (*.py), and project settings. The skill will refuse to delete these.
What if abstracts-cache.json is huge?
Offer options: keep, compress (gzip), or delete only after explicit confirmation. Default recommendation is to keep or compress to preserve re-screening capability.
How do you recover deleted files?
By default the skill moves files to the system Trash on macOS or Linux when available; advise users to restore from Trash or their backup if recovery is needed.