
retention-archival skill


/70-data-platform-governance/retention-archival

This skill helps manage data retention, archival, and deletion policies to cut storage costs, improve query performance, and support GDPR and other legal compliance requirements.

npx playbooks add skill amnadtaowsoam/cerebraskills --skill retention-archival

---
name: Retention and Archival
description: Policy and automation for data retention, archival, and deletion according to compliance requirements
---

# Retention and Archival

## Overview

Policy and automation for data retention (how long to keep data), archival (moving data to cold storage), and deletion (permanently removing data).

## Why This Matters

- **Compliance**: GDPR, CCPA retention limits
- **Cost**: Cold storage is cheaper than hot storage
- **Performance**: Less data = faster queries
- **Legal**: Right-to-deletion requests from users must be honored

---

## Retention Policy

```yaml
# retention-policy.yaml
policies:
  - table: users
    retention: 7 years
    reason: Legal requirement
    archive_after: 2 years
    delete_after: 7 years
  
  - table: logs
    retention: 90 days
    reason: Operational needs
    archive_after: 30 days
    delete_after: 90 days
  
  - table: analytics_events
    retention: 2 years
    reason: Business analytics
    archive_after: 6 months
    delete_after: 2 years
```
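
A policy file like this is most useful when something checks it before any job runs. The following is a minimal sketch, not part of the original skill, of how the duration strings could be parsed and validated. The helper names `parse_duration` and `validate_policy` are hypothetical, months and years are approximated as 30 and 365 days, and the policies are inlined as Python dicts so the example runs without a YAML parser.

```python
import re
from datetime import timedelta

# Approximate unit lengths in days (assumption: calendar precision is not needed)
_UNIT_DAYS = {"day": 1, "month": 30, "year": 365}


def parse_duration(text: str) -> timedelta:
    """Convert strings such as '90 days' or '2 years' to a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s+(day|month|year)s?\s*", text)
    if match is None:
        raise ValueError(f"Unrecognized duration: {text!r}")
    count, unit = int(match.group(1)), match.group(2)
    return timedelta(days=count * _UNIT_DAYS[unit])


def validate_policy(policy: dict) -> list[str]:
    """Return the problems found in one policy entry (empty list if none)."""
    problems = []
    retention = parse_duration(policy["retention"])
    archive_after = parse_duration(policy["archive_after"])
    delete_after = parse_duration(policy["delete_after"])
    if archive_after >= delete_after:
        problems.append(f"{policy['table']}: archive_after must be shorter than delete_after")
    if delete_after != retention:
        problems.append(f"{policy['table']}: delete_after should match retention")
    return problems


# Same entries as retention-policy.yaml, inlined for the example
policies = [
    {"table": "users", "retention": "7 years", "archive_after": "2 years", "delete_after": "7 years"},
    {"table": "logs", "retention": "90 days", "archive_after": "30 days", "delete_after": "90 days"},
    {"table": "analytics_events", "retention": "2 years", "archive_after": "6 months", "delete_after": "2 years"},
]

for policy in policies:
    for problem in validate_policy(policy):
        print(problem)
```

Running a check like this in CI would catch an entry whose archive window is longer than its deletion window before a scheduled job acts on it.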

---

## Automated Archival

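```python
# Archive old data to S3
import re
import time
from datetime import datetime, timedelta

import schedule  # third-party scheduler: pip install schedule

# Assumption: `db` is an existing connection to an engine whose COPY can write
# Parquet to S3. The COPY syntax below is engine-specific; adapt it as needed.

def archive_old_data(table_name: str, cutoff_days: int):
    # Table names cannot be bound as parameters, so validate before interpolating
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"Invalid table name: {table_name!r}")

    cutoff_date = datetime.now() - timedelta(days=cutoff_days)
    cutoff = cutoff_date.isoformat(sep=" ", timespec="seconds")

    # Export to S3, one dated file per run so later runs do not overwrite earlier ones
    query = f"""
        COPY (
            SELECT * FROM {table_name}
            WHERE created_at < '{cutoff}'
        )
        TO 's3://archive-bucket/{table_name}/{cutoff_date:%Y/%m/%d}.parquet'
        WITH (FORMAT PARQUET, COMPRESSION GZIP)
    """
    db.execute(query)  # assumed to raise on failure, which skips the delete below

    # Delete from hot storage only after the export has succeeded
    db.execute(f"""
        DELETE FROM {table_name}
        WHERE created_at < '{cutoff}'
    """)

    print(f"Archived {table_name} data older than {cutoff_days} days")

# Schedule daily
schedule.every().day.at("02:00").do(archive_old_data, 'logs', 30)

# The schedule library only runs jobs when it is polled
while True:
    schedule.run_pending()
    time.sleep(60)
```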

---

## Deletion Policy

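```python
# Right to deletion (GDPR)
from datetime import datetime, timezone

# Assumption: `db` and `audit_log` are existing handles provided by the application.

def delete_user_data(user_id: str):
    """Delete all user data across all tables"""

    # Dependent tables first and `users` last, in case foreign keys reference users
    tables_with_user_data = [
        'orders',
        'analytics_events',
        'audit_logs',
        'users'
    ]

    for table in tables_with_user_data:
        # Bind user_id as a parameter instead of interpolating it (prevents SQL injection).
        # Placeholder style depends on the driver: %s for psycopg, ? for sqlite3.
        db.execute(f"DELETE FROM {table} WHERE user_id = %s", (user_id,))

    # Log deletion
    audit_log.write({
        'action': 'user_data_deletion',
        'user_id': user_id,
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'tables_affected': tables_with_user_data
    })

    print(f"Deleted all data for user {user_id}")
```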

---

## Lifecycle Management

```sql
-- Partition by date for easy archival
CREATE TABLE logs (
  id UUID,
  message TEXT,
  created_at TIMESTAMP
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE logs_2024_01 PARTITION OF logs
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Drop old partitions (fast deletion)
DROP TABLE logs_2023_01;  -- Deletes entire month instantly
```
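
Creating and dropping partitions by hand does not scale past a few months, so the DDL is usually generated. The sketch below is one possible way to do that in Python, assuming the monthly `logs_YYYY_MM` naming used above. The function names `create_partition_sql` and `expired_partitions` are hypothetical and not part of the original skill. A partition counts as expired only when its entire date range is older than the retention cutoff.

```python
from datetime import date, timedelta


def next_month(d: date) -> date:
    """First day of the month following d."""
    return date(d.year + 1, 1, 1) if d.month == 12 else date(d.year, d.month + 1, 1)


def create_partition_sql(parent: str, first_day: date) -> str:
    """Build the CREATE TABLE statement for one monthly partition."""
    upper = next_month(first_day)
    name = f"{parent}_{first_day:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent}\n"
        f"  FOR VALUES FROM ('{first_day.isoformat()}') TO ('{upper.isoformat()}');"
    )


def expired_partitions(parent: str, existing_months: list[date],
                       today: date, retention_days: int) -> list[str]:
    """Names of partitions whose whole range lies before the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    # The upper bound is exclusive, so a partition is safe to drop once
    # its upper bound is on or before the cutoff date.
    return [f"{parent}_{m:%Y_%m}" for m in existing_months if next_month(m) <= cutoff]


print(create_partition_sql("logs", date(2024, 1, 1)))

months = [date(2023, 12, 1), date(2024, 1, 1), date(2024, 2, 1)]
for name in expired_partitions("logs", months, today=date(2024, 5, 15), retention_days=90):
    print(f"DROP TABLE {name};")
```

Printing the statements instead of executing them keeps the sketch reviewable; a real job would run them against the database after a dry-run check.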

---

## Summary

**Retention:** How long to keep data

**Archival:** Move to cold storage (S3, Glacier)

**Deletion:** Delete according to policy or user request

**Automation:**
- Scheduled archival jobs
- Partition-based deletion
- Audit logging

**Compliance:**
- GDPR (right to deletion)
- Data retention limits
- Audit trails

Overview

This skill provides policy and automation for data retention, archival, and deletion to meet compliance, cost, and operational goals. It defines retention windows, archival triggers, and deletion flows so teams can consistently manage hot and cold data. The implementation includes scheduled archival jobs, partitioned storage strategies, and audit-aware deletion routines.

How this skill works

The skill inspects configured retention policies per table or data domain and runs automated workflows to archive records older than the configured cutoff to cold storage (e.g., S3/Glacier) in compressed Parquet format. It then removes archived records from hot storage or drops old partitions for fast deletion. For user-initiated deletion, it executes cross-table delete routines and writes audit entries for traceability.

When to use it

- To enforce regulatory retention limits (GDPR, CCPA) across databases.
- When you want to reduce hot storage costs by moving historical data to cold storage.
- To speed up queries by removing or partitioning stale data.
- When building automated right-to-deletion workflows for user requests.
- To create auditable, repeatable archival and deletion operations.

Best practices

- Define explicit retention and archive_after values per table or data domain.
- Archive in compact, columnar formats (Parquet) with compression for cost and query efficiency.
- Use partitioned tables or time-based partitions to enable fast bulk drops.
- Schedule archival during low-traffic windows and monitor success/failure metrics.
- Keep immutable audit logs of deletion actions to support compliance and incident review.

Example use cases

- Archive application logs older than 30 days to S3 and delete them from the primary database daily at 02:00.
- Move analytics_events older than six months to Glacier-class storage and retain metadata for two years.
- Automate GDPR right-to-deletion by running a cross-table deletion routine and recording an audit entry.
- Drop monthly partitions for logs older than retention to instantaneously reclaim space.
- Implement a retention-policy.yaml to drive automated archival and deletion schedules across environments.

FAQ

How do I avoid accidental data loss?

Test archival and deletion workflows in a staging environment, maintain backups for a recovery window, and require multi-step approvals or soft-delete flags before permanent drops.
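
The soft-delete approach mentioned above could look roughly like the following sketch. It uses an in-memory SQLite database only so that the example runs on its own; the `deleted_at` column and the function names `soft_delete_older_than` and `purge_soft_deleted` are assumptions for illustration, not part of the original skill. Rows are first marked as deleted, and a separate purge step removes them only after a grace period has passed.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT, created_at TEXT, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO logs (message, created_at) VALUES (?, ?)",
    [("old entry", datetime(2024, 1, 1).isoformat()),
     ("recent entry", datetime(2024, 5, 25).isoformat())],
)


def soft_delete_older_than(conn, cutoff: datetime, now: datetime) -> int:
    """Mark rows older than cutoff as deleted without removing them."""
    cur = conn.execute(
        "UPDATE logs SET deleted_at = ? WHERE created_at < ? AND deleted_at IS NULL",
        (now.isoformat(), cutoff.isoformat()),
    )
    conn.commit()
    return cur.rowcount


def purge_soft_deleted(conn, now: datetime, grace: timedelta) -> int:
    """Permanently remove rows that were soft-deleted before the grace window."""
    cur = conn.execute(
        "DELETE FROM logs WHERE deleted_at IS NOT NULL AND deleted_at < ?",
        ((now - grace).isoformat(),),
    )
    conn.commit()
    return cur.rowcount


now = datetime(2024, 6, 1)
grace = timedelta(days=7)
marked = soft_delete_older_than(conn, cutoff=now - timedelta(days=30), now=now)
purged_early = purge_soft_deleted(conn, now=now, grace=grace)  # still inside the grace window
purged_late = purge_soft_deleted(conn, now=now + timedelta(days=8), grace=grace)
print(marked, purged_early, purged_late)
```

During the grace window a mistaken soft delete can be reversed by setting `deleted_at` back to NULL, which is the recovery path a hard `DELETE` does not offer.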

What storage formats are recommended for archived data?

Use columnar formats like Parquet with compression (GZIP/SNAPPY) for cost-effective storage and efficient downstream analytics.

How should I prove compliance?

Keep immutable audit logs of archival and deletion operations, retain policy versions, and export activity reports for audits.
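
One way to make an audit log tamper-evident is to chain entries by hash, so that editing or removing an earlier entry invalidates every later one. The sketch below illustrates that idea under stated assumptions: `append_entry` and `verify_chain` are hypothetical names, the log is a plain in-memory list, and a production setup would also need write-once storage, since hashing alone detects changes but does not prevent them.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    """Stable SHA-256 over the entry body (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list, action: str, details: dict, timestamp: str) -> dict:
    """Append an entry that commits to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    body = {"action": action, "details": details,
            "timestamp": timestamp, "prev_hash": prev_hash}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash and check that each entry links to the one before it."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


audit = []
append_entry(audit, "archive", {"table": "logs", "cutoff_days": 30}, "2024-06-01T02:00:00Z")
append_entry(audit, "user_data_deletion", {"tables": ["orders", "users"]}, "2024-06-01T09:30:00Z")
print(verify_chain(audit))

audit[0]["details"]["cutoff_days"] = 365  # simulate tampering with an old entry
print(verify_chain(audit))
```

An exported report can include the final hash so that an auditor can later confirm the log they receive matches the one that was reported.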