home / skills / plurigrid / asi / compositional-acset-comparison

compositional-acset-comparison skill

/skills/compositional-acset-comparison

This skill helps you compare compositional ACSet schemas across DuckDB and LanceDB to reveal density, versioning, and traversal trade-offs.

npx playbooks add skill plurigrid/asi --skill compositional-acset-comparison

Review the files below or copy the command above to add this skill to your agents.

Files (25)
SKILL.md
12.9 KB
---
name: compositional-acset-comparison
description: "Compositional algorithm and data analysis via algebraic databases"
license: UNLICENSED
metadata:
  trit: 0
---

# Compositional ACSet Comparison Skill

**Trit**: 0 (ERGODIC - Coordinator)
**Color**: #26D826 (Green)
**Domain**: Compositional algorithm/data analysis via algebraic databases

## Homoiconic Insight

In self-hosted Lisps, the boundary between data structures and algorithms dissolves:
- Code is data, data is code (homoiconicity)
- Evaluation time is phase-scoped (RED/BLUE/GREEN gadgets)
- Entanglement avoided by leaving phases open until explicitly closed
- Compositional structure preserved across algorithm ↔ data boundary

## Overview

Compare data structures and their properties (density/sparsity, dynamic/static, versioning strategies) using the richness afforded by ACSets. Uses Gay.jl-aided superrandom walks for deterministic exploration of comparison dimensions.

## Canonical Triads

```
schema-validation (-1) ⊗ compositional-acset-comparison (0) ⊗ gay-mcp (+1) = 0 ✓  [Property Analysis]
three-match (-1) ⊗ compositional-acset-comparison (0) ⊗ koopman-generator (+1) = 0 ✓  [Dynamic Traversal]
temporal-coalgebra (-1) ⊗ compositional-acset-comparison (0) ⊗ oapply-colimit (+1) = 0 ✓  [Versioning]
polyglot-spi (-1) ⊗ compositional-acset-comparison (0) ⊗ gay-mcp (+1) = 0 ✓  [Homoiconic Interop]
```

## Golden Thread Walk Dimensions

Each dimension is explored via φ-angle (137.508°) golden spiral for maximal dispersion:

| Step | Dimension | Hex Color | Hue |
|------|-----------|-----------|-----|
| 1 | Storage Hierarchy | #EE2B2B | 0° |
| 2 | Density/Sparsity | #2BEE64 | 137.51° |
| 3 | Dynamic/Static | #9D2BEE | 275.02° |
| 4 | Versioning Strategy | #EED52B | 52.52° |
| 5 | Traversal Patterns | #2BCDEE | 190.03° |
| 6 | Index Structures | #EE2B94 | 327.54° |
| 7 | Compression | #5BEE2B | 105.05° |
| 8 | Query Model | #332BEE | 242.55° |
| 9 | Embedding Support | #EE6C2B | 20.06° |
| 10 | Interoperability | #2BEEA5 | 157.57° |
| 11 | Concurrency | #DE2BEE | 295.08° |
| 12 | Memory Model | #C5EE2B | 72.59° |

## Comparison Matrix: DuckDB vs LanceDB

### Dimension 1: Storage Hierarchy (#EE2B2B)

```
DuckDB                          LanceDB
──────                          ───────
Table                           Database
  └─RowGroup (122K rows)          └─Table
      └─Column                        └─Manifest (version)
          └─Segment                       └─Fragment
              └─Block                         └─Column
                                                  └─VectorColumn
```

**ACSet Morphism Depth**:
- DuckDB: 4 levels (Table→RowGroup→Column→Segment)
- LanceDB: 5 levels (Database→Table→Manifest→Fragment→Column)

### Dimension 2: Density/Sparsity (#2BEE64)

| Property | DuckDB | LanceDB |
|----------|--------|---------|
| **Default** | Dense columnar | Dense Arrow arrays |
| **Sparse Support** | Via NULL bitmask | Via Arrow validity bitmask |
| **Vector Sparsity** | N/A | Sparse via IVF partitioning |
| **Storage Efficiency** | ALP, ZSTD compression | Lance columnar format |
| **ACSet Rep** | `DenseFinColumn` | `DenseFinColumn` with `VectorColumn` extension |

**Density Formula**:
```julia
density(acset, obj) = nparts(acset, obj) / theoretical_max(acset, obj)
# DuckDB Segment: ~2048 rows per vector batch
# LanceDB Fragment: variable, optimized for vector search
```

### Dimension 3: Dynamic/Static (#9D2BEE)

| Property | DuckDB | LanceDB |
|----------|--------|---------|
| **Schema Evolution** | ALTER TABLE | Manifest versioning |
| **Row Updates** | In-place (TRANSIENT→PERSISTENT) | Append + compaction |
| **Index Updates** | Dynamic B-Tree/ART | Rebuild IVF partitions |
| **ACSet Mutation** | `set_subpart!`, `rem_part!` | Append-only, version chains |

**State Machine**:
```
DuckDB Segment: TRANSIENT ⟷ PERSISTENT (bidirectional)
LanceDB Manifest: V1 → V2 → V3 → ... (append-only chain)
```

### Dimension 4: Versioning Strategy (#EED52B) ⭐ Lance SDK 1.0.0

**Critical Update (December 15, 2025)**: Lance SDK adopts SemVer 1.0.0

| Component | Versioning | Strategy |
|-----------|------------|----------|
| **Lance SDK** | SemVer 1.0.0 | MAJOR.MINOR.PATCH |
| **Lance File Format** | 2.1 | Binary compatibility, independent |
| **Lance Table Format** | Feature flags | Full backward compat, no linear versions |
| **Lance Namespace Spec** | Per-operation | Iceberg REST Catalog style |

**Key Insight**: Breaking SDK changes will NOT invalidate existing Lance data.

```julia
# ACSet representation of versioning strategies
@present SchVersioning(FreeSchema) begin
  SDKVersion::Ob      # SemVer (1.0.0)
  FileFormat::Ob      # Binary compat (2.1)
  TableFormat::Ob     # Feature flags
  NamespaceSpec::Ob   # Per-operation
  
  # Morphisms: SDK ≠ Format
  sdk_file::Hom(SDKVersion, FileFormat)      # Many-to-one
  file_table::Hom(FileFormat, TableFormat)   # Independent
  table_ns::Hom(TableFormat, NamespaceSpec)  # Independent
end
```

**DuckDB Versioning**:
- Temporal tables via `VERSION AT`
- Extension versioning separate from core

### Dimension 5: Traversal Patterns (#2BCDEE)

| Pattern | DuckDB | LanceDB |
|---------|--------|---------|
| **Sequential Scan** | RowGroup→Column→Segment | Fragment→Column |
| **Index Scan** | ART/B-Tree navigation | IVF partition probe |
| **Vector Search** | N/A (extension) | Centroid→Partition→Rows |
| **Time Travel** | `FOR SYSTEM_TIME AS OF` | `checkout(version)` |

**ACSet Incident Queries**:
```julia
# DuckDB: Find all segments in a column
incident(duckdb_acset, col_id, :column)

# LanceDB: Find all centroids for an index
incident(lancedb_acset, idx_id, :partition_index) |>
  flatmap(p -> incident(lancedb_acset, p, :centroid_partition))
```

### Dimension 6: Index Structures (#EE2B94)

| Index Type | DuckDB | LanceDB |
|------------|--------|---------|
| **Primary** | None (heap) | None (Lance format) |
| **Secondary** | ART (Radix Tree) | Scalar indexes |
| **Vector** | Extension (vss) | IVF_PQ, IVF_HNSW_SQ, IVF_HNSW_PQ |
| **Full-Text** | Extension (fts) | N/A |

**ACSet Index Representation**:
```julia
# LanceDB vector index hierarchy
VectorIndex → Partition → Centroid
    ↓
index_column → VectorColumn → Column
```

### Dimension 7: Compression (#5BEE2B)

| Algorithm | DuckDB | LanceDB |
|-----------|--------|---------|
| **Numeric** | ALP (Adaptive Lossless) | Arrow encoding |
| **String** | Dictionary, FSST | Dictionary |
| **General** | ZSTD, LZ4 | ZSTD |
| **Vector** | N/A | PQ (Product Quantization) |

### Dimension 8: Query Model (#332BEE)

| Aspect | DuckDB | LanceDB |
|--------|--------|---------|
| **Language** | SQL | Python/Rust API + SQL filter |
| **Optimization** | Volcano/push-based | Vector-first + filter |
| **Execution** | Vectorized (2048 batch) | Arrow RecordBatch |
| **Parallelism** | Morsel-driven | Partition-parallel |

### Dimension 9: Embedding Support (#EE6C2B)

| Feature | DuckDB | LanceDB |
|---------|--------|---------|
| **Native** | No | Yes (FixedSizeList<Float>) |
| **Generation** | UDF/Extension | EmbeddingFunction registry |
| **Storage** | ARRAY type | VectorColumn |
| **Search** | Extension (vss) | Native (IVF, HNSW) |

### Dimension 10: Interoperability (#2BEEA5)

| Format | DuckDB | LanceDB |
|--------|--------|---------|
| **Arrow** | Full support | Native (Lance = Arrow extension) |
| **Parquet** | Read/Write | Read (convert to Lance) |
| **CSV/JSON** | Read/Write | Via Arrow |
| **ACSets** | Via Tables.jl | Via Arrow → Tables.jl |

**Cross-Language (from ACSets Intertypes)**:
```julia
# Generate interoperable types
generate_module(DuckDBACSet, [PydanticTarget, JacksonTarget])
generate_module(LanceDBACSet, [PydanticTarget, JacksonTarget])
```

### Dimension 11: Concurrency (#DE2BEE)

| Aspect | DuckDB | LanceDB |
|--------|--------|---------|
| **Model** | MVCC | Optimistic (manifest-based) |
| **Writers** | Single (or WAL) | Single (append) |
| **Readers** | Unlimited concurrent | Unlimited concurrent |
| **Isolation** | Snapshot | Version snapshot |

### Dimension 12: Memory Model (#C5EE2B)

| Aspect | DuckDB | LanceDB |
|--------|--------|---------|
| **Buffer Pool** | BufferManager | Memory-mapped Arrow |
| **Eviction** | LRU | OS page cache |
| **Allocation** | Unified allocator | Arrow allocator |
| **Out-of-Core** | Automatic spill | Lazy loading |

## Interleaved 3-Stream Comparison

Using GF(3) conservation for balanced parallel analysis:

```
Stream 1 (Blue, -1): Validation/Constraints
  #31945E → #B3DA86 → #8810F2 → #2F5194 → #2452AA → #245FB4

Stream 2 (Green, 0): Coordination/Transport
  #6D59D2 → #9E2981 → #72E24F → #31C5B4 → #C04DDD → #1C8EEE

Stream 3 (Red, +1): Generation/Composition
  #E22FA7 → #E812C8 → #6F68E6 → #25D840 → #DA387F → #A82358
```

## Crystal Family Analogy

Data structures map to crystal symmetry:

| Crystal Family | Symmetry | DuckDB Analog | LanceDB Analog |
|----------------|----------|---------------|----------------|
| Cubic (#9E94DD) | Order 48 | RowGroup uniformity | Fragment uniformity |
| Hexagonal (#65F475) | Order 24 | Column types | Vector dimensions |
| Tetragonal (#E764F1) | Order 16 | Segment blocking | Partition structure |
| Orthorhombic (#2ADC56) | Order 8 | Type system | Index types |
| Monoclinic (#CD7B61) | Order 4 | Compression | Quantization |
| Triclinic (#E4338F) | Order 2 | Raw storage | Raw Arrow |

## Hierarchical Control Palette

Powers PCT cascade for harmonious comparison:

```
Level 5 (Program): "Compare DuckDB vs LanceDB"
    ↓ sets reference for
Level 4 (Transition): Dimension sequence [30° steps]
    ↓ sets reference for
Level 3 (Configuration): Property relationships
    ↓ sets reference for
Level 2 (Sensation): Individual metrics
    ↓ sets reference for
Level 1 (Intensity): Numeric values
```

Colors: #B322C0 → #D5268C → #DC3946 → #DF884A → #E0D551 → #A3E04E

## XY Model Phenomenology

At τ=0.5 (ordered phase, τ < τ_c=0.893):
- Smooth field, defects bound in pairs
- High valence, disentangled
- Antivortex at (4,3): #C33567

**Interpretation**: Both DuckDB and LanceDB are in "ordered phase" - mature, production-ready systems with well-defined structures.

## Usage

```julia
using ACSets, Catlab

# Load both schemas
include("DuckDBACSet.jl")
include("LanceDBACSet.jl")

# Compare morphism structures
compare_schemas(SchDuckDB, SchLanceDB)

# Analyze density
density_analysis = map([SchDuckDB, SchLanceDB]) do sch
  Dict(ob => sparsity_metric(sch, ob) for ob in obs(sch))
end

# Traverse with Gay.jl colors
for (i, dimension) in enumerate(DIMENSIONS)
  color = gay_color_at(1000000, i)
  analyze_dimension(dimension, color)
end
```

## Skill Files

| File | Purpose | Gay.jl Seed |
|------|---------|-------------|
| `DuckDBACSet.jl` | Schema for DuckDB storage layer | 1000000 |
| `LanceDBACSet.jl` | Schema for LanceDB vector store | 1000000 |
| `IrreversibleMorphisms.jl` | Analysis of lossy morphisms | 2000000 |
| `SideBySideComparison.jl` | Visual comparison tables | 3000000 |
| `ComparisonUtils.jl` | 12-dimension comparison utilities | 1000000 |
| `GhristCoverage.jl` | Persistent homology coverage analysis | 4000000 |
| `ColoringFunctor.jl` | Schema coloring + GF(3) verification | 4000000 |
| `GeometricMorphism.jl` | Presheaf topos translation analysis | 4000000 |

## Ghrist Persistent Homology Integration

Based on de Silva & Ghrist "Coverage in Sensor Networks via Persistent Homology":

**AM Radio Coverage Analogy**:
- Radio stations = Schema objects (Table, Column, etc.)
- Coverage radius = Morphism composability range
- Signal overlap = Translatable concepts between schemas
- Dead zones = Irreversible information loss

**Betti Numbers for Schemas**:
- β₀: Connected components (isolated subsystems)
- β₁: Coverage holes (information flow gaps)
- β₂: Enclosed voids (unreachable regions)

**Persistent Holes (never die)**:
- 🔴 `parent_manifest`: Temporal irreversibility (version chain)
- 🔴 `source_column`: Semantic irreversibility (embedding loss)

## Geometric Morphism Analysis

For presheaf topoi PSh(SchDuckDB) and PSh(SchLanceDB):

**Essential Image** (lossless translation):
- Table ↔ Table ✓
- Column ↔ Column ✓

**Partial Coverage** (lossy translation):
- RowGroup ~ Fragment
- VectorColumn → Column (loses vector semantics)

**Dead Zones** (no translation):
- Segment → ??? (DuckDB-only)
- Manifest ← ??? (LanceDB-only)
- VectorIndex ← ??? (LanceDB-only)

## References

- [de Silva & Ghrist, Coverage via Persistent Homology](https://www2.math.upenn.edu/~ghrist/preprints/persistent.pdf)
- [Lance SDK 1.0.0 Announcement](https://lancedb.github.io/lancedb/blog/announcing-lance-sdk-1.0.0/) (December 15, 2025)
- [DuckDB Architecture](https://duckdb.org/internals/overview)
- [ACSets.jl Documentation](https://algebraicjulia.github.io/ACSets.jl/)
- [StructuredDecompositions.jl](https://github.com/AlgebraicJulia/StructuredDecompositions.jl)
- [Gay.jl Deterministic Colors](https://github.com/bmorphism/Gay.jl)

Overview

This skill compares compositional data structures and runtime properties using algebraic condition-sets (ACSets) to produce a structured, multidimensional analysis. It focuses on practical differences across storage hierarchy, density, versioning, traversal, indexing, and embedding support. Results are deterministic and reproducible using golden-thread walk sampling and Gay.jl color seeds.

How this skill works

The skill models storage engines as ACSet schemas and computes morphism relationships, incidents, and Betti-number style coverage metrics to reveal information flow and irreversible mappings. Each comparison dimension is explored programmatically (storage, sparsity, dynamic/static, versioning, traversal, indexes, compression, query model, embeddings, interoperability, concurrency, memory). Outputs include ACSet representations, traversal patterns, and persistent-homology indicators for translation loss.

When to use it

  • Choosing between a columnar OLAP engine and a vector-first store for mixed analytics + search workloads
  • Evaluating schema evolution and versioning strategies for long-lived datasets
  • Designing interoperability paths when embedding vectors must interoperate with relational data
  • Auditing where semantic information is lost when translating between storage systems
  • Benchmarking storage and traversal trade-offs for hybrid analytic + embedding pipelines

Best practices

  • Represent each storage engine as an explicit ACSet schema before comparison to make morphisms first-class
  • Explore all 12 canonical dimensions; avoid optimizing a single metric in isolation
  • Use deterministic seeds (Gay.jl) for reproducible sampling across dimensions
  • Treat vector semantics as first-class: prefer VectorColumn-aware ACSet nodes when embeddings matter
  • Map versioning strategies explicitly (append-only vs in-place) to preserve temporal guarantees

Example use cases

  • Compare DuckDB-style rowgrouped columnar layout with LanceDB-style fragment + manifest vector store to decide storage layout
  • Detect translation dead zones where vector indexes or manifests cannot be losslessly mapped to a relational schema
  • Quantify density/sparsity trade-offs and expected compression for numeric and vector data across engines
  • Plan migration that preserves version semantics by mapping SDK, file, and table format morphisms
  • Generate interoperable type bindings (e.g., Pydantic/Jackson) from ACSet schemas for cross-language integration

FAQ

Does this skill require running both systems to compare them?

No. The comparison operates on ACSet schemas and morphisms that model structural and semantic properties; runtime traces are optional for deeper perf analysis.

Can it identify irreversible information loss between systems?

Yes. It flags dead zones and persistent homology features that indicate irreversible mappings, such as vector-index semantics lost when translating to plain columns.