This skill helps you design and implement data retention and archiving policies to reduce costs, protect compliance, and optimize performance.
---
name: Data Retention and Archiving
description: Strategies for managing the lifecycle of data to balance storage costs, system performance, and legal compliance.
---
# Data Retention and Archiving
## Overview
Data is an asset, but after a certain point, it becomes a liability. **Retention** defines how long you keep data available in primary systems. **Archiving** is the process of moving infrequently accessed data to secondary, low-cost storage.
**Core Principle**: "Delete data you don't need; archive data you might need; protect data you must keep."
---
## 1. Retention vs. Archiving
| Feature | Data Retention | Data Archiving |
| :--- | :--- | :--- |
| **Purpose** | Compliance and daily operations. | Cost reduction and performance. |
| **Location** | Primary Database (SSD, Hot). | Cold Storage (Glacier, Deep Archive). |
| **Access Time** | Milliseconds/Seconds. | Minutes to 12+ Hours. |
| **Integrity** | High (Primary system). | Extremely High (Immutable storage). |
---
## 2. Compliance and Legal Requirements
Different jurisdictions and industries impose mandatory retention periods, and some (such as GDPR) impose deletion obligations instead:
* **GDPR (Right to be Forgotten)**: Must delete personal data when no longer needed or on request.
* **HIPAA (Healthcare)**: Typically 6-7 years.
* **SOX (Financial)**: 7 years.
* **Employment Records**: 3-7 years.
### The "Legal Hold"
In the event of litigation, you must be able to suspend automated deletion policies for specific records immediately.
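
A minimal sketch of how a deletion job could honor a legal hold. The `Record` type, the `legal_hold` flag, and `select_deletable` are hypothetical names used for illustration; a real system would read the hold from wherever the legal team records it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Record:
    record_id: str
    created_at: datetime
    legal_hold: bool = False  # hypothetical flag set when litigation starts


def select_deletable(records, retention, now):
    """Return records past retention that are NOT under legal hold."""
    cutoff = now - retention
    return [r for r in records if r.created_at < cutoff and not r.legal_hold]


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
records = [
    Record("a", datetime(2015, 1, 1, tzinfo=timezone.utc)),
    Record("b", datetime(2015, 1, 1, tzinfo=timezone.utc), legal_hold=True),
    Record("c", datetime(2023, 6, 1, tzinfo=timezone.utc)),
]
deletable = select_deletable(records, timedelta(days=7 * 365), now)
print([r.record_id for r in deletable])
```

Checking the hold inside the selection step, rather than in a separate pre-filter, means a hold set moments before the job runs is still respected.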
---
## 3. Storage Tiering and Cost Optimization
Move data to cheaper storage as it ages.
| Tier | AWS Service | GCP Service | Typical Use Case | Approx. AWS cost per GB-month (USD) |
| :--- | :--- | :--- | :--- | :--- |
| **Hot** | S3 Standard | Standard | Active app data. | $0.023 |
| **Cool** | S3 IA | Nearline | Monthly reports. | $0.0125 |
| **Cold** | Glacier | Coldline | Annual audits. | $0.004 |
| **Frozen** | Deep Archive | Archive | Legal requirements (7-10 yrs) | $0.00099 |
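
To make the table concrete, here is a rough storage-only estimate. It assumes the prices above are per GB per month and ignores retrieval, request, and minimum-duration charges; the 10 TB figure is a hypothetical workload.

```python
# Approximate prices from the table above, treated as USD per GB per month.
PRICE_PER_GB = {
    "hot": 0.023,
    "cool": 0.0125,
    "cold": 0.004,
    "frozen": 0.00099,
}


def monthly_cost(size_gb, tier):
    """Storage-only cost per month; ignores retrieval and request fees."""
    return size_gb * PRICE_PER_GB[tier]


size_gb = 10_000  # hypothetical: 10 TB of historical data
hot = monthly_cost(size_gb, "hot")
frozen = monthly_cost(size_gb, "frozen")
print(f"hot: ${hot:.2f}/month, frozen: ${frozen:.2f}/month")
print(f"saving: ${hot - frozen:.2f}/month")
```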
---
## 4. Implementation Strategies
### A. TTL (Time To Live)
Automatically delete records after a set duration.
* **DynamoDB/Redis**: Direct support for TTL fields.
* **MongoDB**: TTL indexes.
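
As a sketch of the TTL approach for DynamoDB, the helper below stamps each item with an expiry time. DynamoDB TTL expects a Number attribute holding Unix epoch seconds; the attribute name `expires_at`, the 30-day window, and the `with_expiry` helper are assumptions for illustration.

```python
import time

SESSION_TTL_SECONDS = 30 * 24 * 60 * 60  # hypothetical 30-day retention


def with_expiry(item, now=None, ttl_seconds=SESSION_TTL_SECONDS):
    """Return a copy of item with an epoch-seconds expiry attribute.

    The attribute name must match the TTL attribute configured on the table.
    """
    now = int(time.time()) if now is None else int(now)
    return {**item, "expires_at": now + ttl_seconds}


item = with_expiry({"session_id": "abc123"}, now=1_700_000_000)
print(item)
```

TTL deletion runs in the background and is not instantaneous, so reads that must never return expired items should also filter on the expiry attribute.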
### B. Partition Dropping (Databases)
For large tables (logs, transactions), partition by date (e.g., `orders_2023_01`).
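* **Action**: Dropping a partition is effectively `O(1)` with respect to row count, because it is a metadata operation. It is much faster and safer than a massive `DELETE`, which scans the table, holds row locks for a long time, and leaves bloat behind.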
```sql
-- PostgreSQL example: Drop old data
DROP TABLE IF EXISTS events_p2022_12;
```
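
A small sketch of how a scheduled job might decide which partitions to drop. It assumes monthly partitions named like `events_p2022_12`; the `partitions_to_drop` helper and the 12-month window are hypothetical, and the list of existing partitions would normally come from the database catalog.

```python
from datetime import date


def partitions_to_drop(existing, today, keep_months, prefix="events_p"):
    """Return partition names older than the retention window.

    Keeps the current month plus the `keep_months` months before it.
    """
    # Months are compared as a single index: year * 12 + (month - 1).
    oldest_kept = today.year * 12 + (today.month - 1) - keep_months
    stale = []
    for name in existing:
        if not name.startswith(prefix):
            continue
        year, month = name[len(prefix):].split("_")
        if int(year) * 12 + (int(month) - 1) < oldest_kept:
            stale.append(name)
    return sorted(stale)


existing = ["events_p2022_11", "events_p2022_12", "events_p2023_06", "events_p2024_01"]
for name in partitions_to_drop(existing, date(2024, 1, 15), keep_months=12):
    print(f"DROP TABLE IF EXISTS {name};")
```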
### C. Archiving to Apache Parquet
Primary DB (SQL) is expensive for historical reads.
1. Export 2-year-old data to **Parquet** files on S3.
2. Parquet is compressed and columnar, so archives are often far smaller than the same data in SQL storage (80-90% is a common figure, but the ratio depends on the data).
3. Query via **AWS Athena** or **BigQuery Omni** if needed.
---
## 5. Deletion Strategies: Hard vs. Soft
* **Soft Delete**: `is_deleted = true`. Data stays in the DB.
    - *Pros*: Easy to undo.
    - *Cons*: Doesn't save cost; DB continues to grow. Doesn't meet GDPR deletion requirements.
* **Hard Delete**: `DELETE FROM table`. Permanently removed.
    - *Pros*: Real cost savings; compliance met.
    - *Cons*: Irreversible; can be slow on large tables.
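
The two strategies can be combined: soft-delete first, then hard-delete once a grace period has passed. The sketch below uses an in-memory SQLite database; the `users` table, the `deleted_at` column (a timestamp rather than a boolean, so the job knows when the grace period began), and the 30-day window are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO users (id, email, deleted_at) VALUES (?, ?, ?)",
    [
        (1, "active@example.com", None),
        (2, "recent@example.com", "2024-01-10"),  # soft-deleted 5 days ago
        (3, "old@example.com", "2023-11-01"),     # soft-deleted 75 days ago
    ],
)


def purge_soft_deleted(conn, today, grace_days=30):
    """Hard-delete rows whose soft delete is older than the grace period."""
    cur = conn.execute(
        "DELETE FROM users WHERE deleted_at IS NOT NULL "
        "AND deleted_at < date(?, ?)",
        (today, f"-{grace_days} days"),
    )
    conn.commit()
    return cur.rowcount


purged = purge_soft_deleted(conn, "2024-01-15")
remaining = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]
print(purged, remaining)
```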
---
## 6. Automated Lifecycle Policies
Don't rely on human memory. Use provider policies.
### AWS S3 Lifecycle Configuration
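The rule below moves objects under `logs/` to Glacier after 90 days and deletes them after 2,555 days (about 7 years).
```json
{
  "Rules": [
    {
      "ID": "MoveOldLogs",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 2555 }
    }
  ]
}
```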
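
When many buckets or prefixes need similar rules, generating the configuration avoids hand-computing day counts. This is a sketch only: `lifecycle_rule` is a hypothetical helper, it uses the `Filter` form of the prefix, and it approximates a year as 365 days.

```python
import json


def lifecycle_rule(rule_id, prefix, archive_after_days, retention_years):
    """Build one S3 lifecycle rule as a plain dict."""
    expire_after_days = retention_years * 365  # ignores leap days
    if archive_after_days >= expire_after_days:
        raise ValueError("transition must happen before expiration")
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": expire_after_days},
    }


config = {"Rules": [lifecycle_rule("MoveOldLogs", "logs/", 90, 7)]}
print(json.dumps(config, indent=2))
```

The resulting dict has the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument, though applying it to a real bucket is outside this sketch.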
---
## 7. Data Anonymization before Archiving
If you must keep data for analytics but don't need the identity:
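1. **Masking**: Replace most of an email's local part, keeping only the first character, e.g. `j***@example.com`.
2. **Hashing**: `SHA256(user_id)`, ideally keyed or salted (e.g., HMAC), since plain hashes of guessable IDs can be reversed by enumeration.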
3. **Aggregation**: Store "Total Sales per Zip Code" instead of individual transactions.
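
The three techniques above can be sketched with the standard library alone. All names here (`mask_email`, `pseudonymize`, `sales_by_zip`, `SECRET_KEY`) are hypothetical, and the keyed hash is an assumption that goes slightly beyond plain `SHA256(user_id)` to resist guessing attacks.

```python
import hashlib
import hmac
from collections import defaultdict

SECRET_KEY = b"replace-me"  # hypothetical key; store it outside the archive


def mask_email(email):
    """Keep only the first character of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"


def pseudonymize(user_id):
    """Keyed SHA-256: without the key, guessed IDs cannot be checked."""
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()


def sales_by_zip(transactions):
    """Aggregate (zip_code, amount) pairs into totals per zip code."""
    totals = defaultdict(float)
    for zip_code, amount in transactions:
        totals[zip_code] += amount
    return dict(totals)


print(mask_email("jane@example.com"))
print(pseudonymize("user-42")[:16])
print(sales_by_zip([("10001", 20.0), ("10001", 5.5), ("94105", 12.0)]))
```

Note that hashed identifiers still allow records to be linked to each other, so they are usually treated as pseudonymized rather than anonymous data.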
---
## 8. Data Re-hydration Strategy
If a regulator asks for data from 5 years ago:
1. **Request**: Initiate retrieval from S3 Glacier.
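2. **Wait**: 3–5 hours for "Standard" retrieval or 5–12 hours for "Bulk" from S3 Glacier Flexible Retrieval; Deep Archive takes up to 12 hours (Standard) or 48 hours (Bulk).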
3. **Validate**: Verify checksums.
4. **Export**: Provide data in requested format (CSV/JSON).
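
The validate and export steps might look like the sketch below. It assumes the archive is stored as JSON lines and that a SHA-256 digest was recorded at archive time; `validate_and_export` is a hypothetical helper, and the retrieval itself is replaced by an in-memory stand-in.

```python
import csv
import hashlib
import io
import json


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def validate_and_export(restored, expected_sha256):
    """Verify the restored bytes, then convert JSON lines to CSV text."""
    if sha256_hex(restored) != expected_sha256:
        raise ValueError("checksum mismatch: restored archive is corrupt")
    rows = [json.loads(line) for line in restored.decode().splitlines() if line]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=sorted(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Stand-in for bytes restored from cold storage.
archive = b'{"id": 1, "total": 9.5}\n{"id": 2, "total": 3.0}\n'
recorded = sha256_hex(archive)  # normally stored when the archive was written
print(validate_and_export(archive, recorded))
```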
---
## 9. Data Retention Checklist
- [ ] **Inventory**: Do we have a list of all data types and their retention periods?
- [ ] **Automation**: Are lifecycle policies enabled for all S3/GCS buckets?
- [ ] **Backup vs. Archive**: Are we clear that backups are for recovery, not long-term storage?
- [ ] **Legal Hold**: Do we have a manual override for the deletion script?
- [ ] **Privacy**: Are we hard-deleting PII within 30 days of a user deletion request?
- [ ] **Cost**: Have we calculated the savings of moving "Cold" data out of RDS/SQL?
---
## Related Skills
- `42-cost-engineering/cloud-cost-models`
- `43-data-reliability/data-quality-monitoring`
- `40-system-resilience/disaster-recovery`
This skill explains practical strategies for managing the lifecycle of data to balance storage costs, system performance, and legal compliance. It clarifies when to retain, archive, or delete data and provides concrete tactics for implementing automated policies and safe deletion. The guidance targets engineers and data owners responsible for storage, compliance, and cost optimization.
The skill inspects data access patterns, regulatory requirements, and storage costs to recommend tiering, TTLs, partition strategies, and archiving formats. It shows how to implement lifecycle policies (e.g., S3 lifecycle rules), move historical data to columnar formats like Parquet, and handle legal holds and rehydration. Practical examples cover hard vs. soft deletes, anonymization before archiving, and automated policy enforcement.
## FAQ
### Hard delete vs. soft delete: which to choose?
Use soft delete when you need quick undo and auditability; use hard delete when you need cost savings and regulatory compliance. Combine both: soft delete plus scheduled hard delete after retention expires.
### How do I handle legal holds?
Implement an override that suspends lifecycle and deletion jobs for specific records, log the hold with metadata, and require approval to release the hold.
### What's the best archive format?
Parquet is ideal for analytics: columnar, compressed, and cost-efficient for long-term storage and query engines like Athena or BigQuery.