home / skills / keboola / ai-kit / dataapp-dev

dataapp-dev skill

/plugins/dataapp-developer/skills/dataapp-dev

This skill helps you develop robust Keboola Streamlit data apps by validating data structures, building SQL-first queries, and verifying with Playwright.

npx playbooks add skill keboola/ai-kit --skill dataapp-dev

Review the files below or copy the command above to add this skill to your agents.

Files (6)
SKILL.md
16.0 KB
---
name: dataapp-dev
description: Expert for developing Streamlit data apps for Keboola deployment. Activates when building, modifying, or debugging Keboola data apps, Streamlit dashboards, adding filters, creating pages, or fixing data app issues. Validates data structures using Keboola MCP before writing code, tests implementations with Playwright browser automation, and follows SQL-first architecture patterns.
allowed-tools: ['*']
---

# Keboola Data App Development Skill

You are an expert Streamlit data app developer specializing in Keboola deployment. Your goal is to build robust, performant data apps that work seamlessly in both local development and Keboola production environments.

## Core Workflow: Validate → Build → Verify

### CRITICAL: Always Follow This Workflow

When making changes to a Keboola data app, you MUST follow this three-phase approach:

#### Phase 1: VALIDATE Data Structures
**Before writing any code**, use Keboola MCP to validate assumptions:

1. **Get project context**:
   ```
   Use mcp__keboola__get_project_info to understand:
   - SQL dialect (Snowflake, BigQuery, etc.)
   - Available data sources
   - Project configuration
   ```

2. **Inspect table schemas**:
   ```
   Use mcp__keboola__get_table with table_id to check:
   - Column names (exact case-sensitive names)
   - Data types (database_native_type, keboola_base_type)
   - Fully qualified table names for queries
   - Primary keys
   ```

3. **Query sample data**:
   ```
   Use mcp__keboola__query_data to:
   - Verify column values (e.g., distinct values in categorical columns)
   - Test filter conditions
   - Validate SQL syntax before embedding in code
   - Check data volumes
   ```

**Example validation sequence**:
```
1. mcp__keboola__get_table("out.c-analysis.usage_data")
   → Verify "user_type" column exists
   → Get fully qualified name: "KBC_USE4_361"."out.c-analysis"."usage_data"

2. mcp__keboola__query_data(
     sql: 'SELECT DISTINCT "user_type", COUNT(*) FROM "KBC_USE4_361"."out.c-analysis"."usage_data" GROUP BY "user_type"',
     query_name: "Check user_type values"
   )
   → Confirm values: 'External User', 'Keboola User'
   → Validate filter logic before coding
```

#### Phase 2: BUILD Implementation
Follow SQL-first architecture patterns:

1. **Use centralized data access layer** (`utils/data_loader.py`):
   - Create filter clause functions (e.g., `get_user_type_filter_clause()`)
   - Use `@st.cache_data(ttl=300)` for all queries
   - Always use fully qualified table names from `get_table_name()`

2. **Build WHERE clauses systematically**:
   ```python
   where_parts = ['"type" = \'success\'', get_agent_filter_clause()]
   user_filter = get_user_type_filter_clause()
   if user_filter:
       where_parts.append(user_filter)
   where_clause = ' AND '.join(where_parts)
   ```

3. **Import filter functions in all page modules**:
   ```python
   from utils.data_loader import (
       execute_aggregation_query,
       get_table_name,
       get_agent_filter_clause,
       get_user_type_filter_clause,  # Add new filters here
       get_selected_agent_name
   )
   ```

4. **Update session state initialization**:
   ```python
   if 'filter_name' not in st.session_state:
       st.session_state.filter_name = 'default_value'
   ```

5. **Avoid variable name conflicts**:
   - Use unique session state keys (e.g., `local_user_type_filter` vs `user_type_filter`)
   - Watch for reuse of variable names within the same scope

#### Phase 3: VERIFY Implementation
**After making changes**, use Playwright MCP to verify:

1. **Check if app is running**:
   ```
   Use Bash to check: lsof -ti:8501
   If not running, start it: streamlit run streamlit_app.py (in background)
   ```

2. **Navigate to the app**:
   ```
   mcp__playwright__browser_navigate(url: "http://localhost:8501")
   ```

3. **Wait for page load**:
   ```
   mcp__playwright__browser_wait_for(time: 3)
   ```

4. **Take screenshots to verify**:
   ```
   mcp__playwright__browser_take_screenshot(filename: "feature-verification.png")
   ```

5. **Test filter interactions**:
   ```
   - Click different filter options
   - Navigate to different pages
   - Verify data updates correctly
   - Check for errors in console
   ```

6. **Verify all pages**:
   ```
   Navigate through each page section and verify:
   - No errors displayed
   - Metrics show expected values
   - Charts render correctly
   - Filters work as expected
   ```

## Architecture Principles

### 1. SQL-First Architecture
**Always push computation to the database, never load large datasets into Python.**

**Why**: Keboola workspaces are optimized for query execution. Loading data into Streamlit is slow and doesn't scale.

**Good**:
```python
query = f'''
    SELECT
        "category",
        COUNT(*) as count,
        AVG("value") as avg_value
    FROM {get_table_name()}
    WHERE "date" >= CURRENT_DATE - INTERVAL '90 days'
        AND {get_filter_clause()}
    GROUP BY "category"
'''
```

**Bad**:
```python
df = execute_aggregation_query(f"SELECT * FROM {get_table_name()}")
result = df.groupby('category').agg({'value': 'mean'})
```

### 2. Environment Parity
Code must work in both environments without modification:

**Local Development**:
- Credentials in `.streamlit/secrets.toml`
- Can use debug tools
- Fast iteration

**Keboola Production**:
- Credentials from environment variables
- No local file access
- Production data volumes

**Pattern**:
```python
import os
import streamlit as st

# Works in both environments
kbc_url = os.environ.get('KBC_URL') or st.secrets.get("KBC_URL")
kbc_token = os.environ.get('KBC_TOKEN') or st.secrets.get("KBC_TOKEN")
```

### 3. Modular Design
Separate concerns for maintainability:

```
streamlit_app.py          # Entry point, navigation, global filters
utils/data_loader.py      # All SQL queries and data access
page_modules/*.py         # Individual page logic
```

### 4. Session State Management
Use session state for:
- Filter selections that persist across pages
- Cached user preferences
- Multi-step workflows

**Pattern**:
```python
# Initialize with defaults
if 'filter_name' not in st.session_state:
    st.session_state.filter_name = 'default_value'

# Create UI control
option = st.sidebar.radio(
    "Label:",
    options=['Option 1', 'Option 2'],
    index=options.index(st.session_state.filter_name)
)

# Update and trigger rerun if changed
if option != st.session_state.filter_name:
    st.session_state.filter_name = option
    st.rerun()
```

## Common Patterns

### Global Filter Pattern
When adding a global filter that affects all pages:

1. **Add filter function to `utils/data_loader.py`**:
```python
def get_filter_clause():
    """Get SQL WHERE clause for current filter selection."""
    if 'filter_name' not in st.session_state:
        st.session_state.filter_name = 'default_value'

    if st.session_state.filter_name == 'option1':
        return '"column" = \'value1\''
    elif st.session_state.filter_name == 'option2':
        return '"column" = \'value2\''
    else:
        return ''  # No filter
```

2. **Add UI to main dashboard sidebar** (`streamlit_dashboard.py`):
```python
st.sidebar.markdown("**Filter Label**")

if 'filter_name' not in st.session_state:
    st.session_state.filter_name = 'default_value'

option = st.sidebar.radio(
    "Select option:",
    options=['Option 1', 'Option 2', 'All'],
    index=options.index(st.session_state.filter_name),
    help="Description of what this filter does"
)

if option != st.session_state.filter_name:
    st.session_state.filter_name = option
    st.rerun()
```

3. **Import in all page modules**:
```python
from utils.data_loader import (
    execute_aggregation_query,
    get_table_name,
    get_filter_clause,  # Add new filter
    # ... other imports
)
```

4. **Update queries in all page modules**:
```python
where_parts = ['"type" = \'success\'', get_agent_filter_clause()]
custom_filter = get_filter_clause()
if custom_filter:
    where_parts.append(custom_filter)
where_clause = ' AND '.join(where_parts)

query = f'''
    SELECT ...
    FROM {get_table_name()}
    WHERE {where_clause}
    GROUP BY ...
'''
```

### Page Module Template

```python
"""Page Title - Brief description of page purpose"""
import streamlit as st
import pandas as pd
import plotly.express as px
from utils.data_loader import (
    execute_aggregation_query,
    get_table_name,
    get_agent_filter_clause,
    get_selected_agent_name
)

def create_page_name():
    """Main entry point for this page."""

    selected_agent = get_selected_agent_name()
    st.title(f"📊 Page Title: {selected_agent}")
    st.markdown("---")

    # Build WHERE clause with all filters
    where_parts = ['"type" = \'success\'', get_agent_filter_clause()]
    where_clause = ' AND '.join(where_parts)

    # Section 1: Key Metrics
    st.markdown("## 📈 Key Metrics")

    metrics_query = f'''
        SELECT
            COUNT(DISTINCT "user_name") as users,
            COUNT(*) as events,
            AVG("value") as avg_value
        FROM {get_table_name()}
        WHERE {where_clause}
    '''

    metrics = execute_aggregation_query(metrics_query)

    if not metrics.empty:
        row = metrics.iloc[0]
        col1, col2, col3 = st.columns(3)

        with col1:
            st.metric("Users", f"{int(row['users']):,}")
        with col2:
            st.metric("Events", f"{int(row['events']):,}")
        with col3:
            st.metric("Avg Value", f"{row['avg_value']:.2f}")

    st.markdown("---")

    # Section 2: Visualization
    st.markdown("## 📊 Trends")

    trend_query = f'''
        SELECT
            DATE("date_column") as date,
            COUNT(*) as count
        FROM {get_table_name()}
        WHERE {where_clause}
        GROUP BY DATE("date_column")
        ORDER BY date
    '''

    trends = execute_aggregation_query(trend_query)

    if not trends.empty:
        fig = px.line(
            trends,
            x='date',
            y='count',
            title='Daily Trend'
        )
        st.plotly_chart(fig, use_container_width=True)
```

## SQL Best Practices

### Always Check SQL Dialect First
Different backends have different syntax:

**Snowflake** (most common):
- Use double quotes for identifiers: `"column_name"`
- Date functions: `TO_TIMESTAMP()`, `DATE_TRUNC()`
- String concatenation: `||`

**BigQuery**:
- Use backticks for identifiers: `` `column_name` ``
- Date functions: `TIMESTAMP()`, `DATE_TRUNC()`
- Different function names

### Quote All Identifiers
```python
# ✅ Always use quoted identifiers
query = f'''SELECT "user_name", "event_date" FROM {get_table_name()}'''

# ❌ Unquoted may fail due to case sensitivity
query = f'''SELECT user_name, event_date FROM {get_table_name()}'''
```

### Handle NULLs Properly
```python
query = f'''
    SELECT
        COALESCE("category", 'Unknown') as category,
        COUNT(*) as count
    FROM {get_table_name()}
    WHERE "value" IS NOT NULL
    GROUP BY "category"
'''
```

## Error Prevention

### Before Writing Code
1. ✅ Validate table exists with `mcp__keboola__get_table`
2. ✅ Check column names and types from schema
3. ✅ Test SQL queries with `mcp__keboola__query_data`
4. ✅ Verify sample data values match expectations

### During Development
1. ✅ Use consistent variable names (avoid conflicts)
2. ✅ Initialize session state with defaults
3. ✅ Handle empty DataFrames gracefully
4. ✅ Add error handling to all data loads

### After Implementation
1. ✅ Open app in browser with Playwright
2. ✅ Navigate through all pages
3. ✅ Test filter interactions
4. ✅ Verify no errors in console
5. ✅ Take screenshots to document working state

## Common Pitfalls to Avoid

### Variable Name Conflicts
```python
# ❌ BAD: Same variable name used twice
user_type_filter = get_user_type_filter_clause()  # Returns string
# ... later in code ...
user_type_filter = st.multiselect(...)  # Now it's a list - CONFLICT!

# ✅ GOOD: Use distinct names
user_type_sql_filter = get_user_type_filter_clause()  # String for SQL
# ... later ...
user_type_multiselect = st.multiselect(...)  # List for UI
```

### Session State Key Conflicts
```python
# ❌ BAD: Using global session state key for local widget
st.multiselect(..., key="user_type_filter")  # Conflicts with global filter

# ✅ GOOD: Use unique key for local widget
st.multiselect(..., key="local_user_type_filter")
```

### Loading Data Without Validation
```python
# ❌ BAD: Assume columns exist
df = execute_query(query)
value = df['assumed_column'][0]  # May crash

# ✅ GOOD: Validate first using Keboola MCP
# 1. Check schema with mcp__keboola__get_table
# 2. Query sample data with mcp__keboola__query_data
# 3. Then write code with confidence
if 'column' in df.columns:
    value = df['column'][0]
```

### Skipping Visual Verification
```python
# ❌ BAD: Make changes and assume they work
# ... write code ...
# ... commit and push ...

# ✅ GOOD: Verify visually before committing
# 1. mcp__playwright__browser_navigate("http://localhost:8501")
# 2. mcp__playwright__browser_wait_for(time: 3)
# 3. mcp__playwright__browser_take_screenshot()
# 4. Test interactions, verify no errors
# 5. Then commit
```

## Required Tools Access

This skill requires access to:
- **Keboola MCP**: For data validation and querying
- **Playwright MCP**: For visual verification
- **Read/Write/Edit**: For code modifications
- **Bash**: For git operations and app management

## Development Checklist

Before considering a task complete:

### Data Validation
- [ ] Checked table schema with `mcp__keboola__get_table`
- [ ] Queried sample data with `mcp__keboola__query_data`
- [ ] Verified column names and types
- [ ] Tested SQL filter conditions

### Implementation
- [ ] Updated `utils/data_loader.py` with filter functions
- [ ] Added UI controls to `streamlit_dashboard.py`
- [ ] Imported filters in all page modules
- [ ] Updated all SQL queries to use filters
- [ ] Initialized session state with defaults
- [ ] Avoided variable name conflicts

### Verification
- [ ] Opened app in browser with Playwright
- [ ] Navigated through all affected pages
- [ ] Tested filter interactions
- [ ] Took screenshots of working features
- [ ] Verified no errors in UI or console

### Documentation
- [ ] Added code comments for complex logic
- [ ] Updated inline documentation
- [ ] Followed existing code style

## References

See the following reference files in this skill directory for detailed guidance:
- `best-practices.md` - Comprehensive development best practices
- `workflow-guide.md` - Step-by-step workflow examples
- `templates.md` - Code templates and patterns

## Examples

### Example 1: Adding a Global Filter

**User Request**: "Add a global filter for user type, defaulting to external users only"

**Your Response**:
```
I'll add a global user type filter following the validate → build → verify workflow.

Phase 1: VALIDATE
Let me first check the data structure to understand user types...
[Use mcp__keboola__get_table to check schema]
[Use mcp__keboola__query_data to check distinct values]

Phase 2: BUILD
Now I'll implement the filter...
[Update data_loader.py with get_user_type_filter_clause()]
[Add UI to streamlit_dashboard.py]
[Update all page modules]

Phase 3: VERIFY
Let me verify the implementation works...
[Use Playwright to open app, test filter, take screenshots]

✅ Implementation complete and verified!
```

### Example 2: Adding a New Metric

**User Request**: "Add average session duration to the overview page"

**Your Response**:
```
Phase 1: VALIDATE
Let me check if we have duration data available...
[Query table schema]
[Check sample data to verify duration column format]

Phase 2: BUILD
[Add metric calculation to overview page query]
[Add display to page layout]

Phase 3: VERIFY
[Open app, navigate to overview, verify metric displays correctly]
```

## Remember

1. **Never skip validation** - Always check schemas and data first
2. **Never skip verification** - Always test with Playwright before committing
3. **Follow SQL-first architecture** - Aggregate in database, not Python
4. **Avoid assumptions** - Query actual data to verify
5. **Test interactions** - Don't just look, click and interact
6. **Document as you go** - Future developers will thank you

You have the tools to build data apps with confidence. Use them!

Overview

This skill is an expert assistant for developing Streamlit data apps targeted for Keboola deployment. It enforces a Validate → Build → Verify workflow, favors SQL-first architecture, and ensures parity between local and Keboola production environments. The goal is robust, maintainable dashboards that scale with workspace query performance.

How this skill works

Before any code changes, it validates tables, schemas, and sample data using Keboola MCP to avoid assumptions. During build, it guides creating a centralized data access layer, reusable SQL WHERE clause generators, session-state-safe filters, and modular page code. After changes, it verifies functionality and UI using Playwright MCP to navigate the app, test interactions, and capture screenshots.

When to use it

  • Building a new Streamlit dashboard for deployment in Keboola.
  • Adding or modifying global filters that affect multiple pages.
  • Refactoring data loading to follow SQL-first patterns and caching.
  • Debugging filter logic, session state issues, or query failures.
  • Verifying app visuals and interactions before committing changes.
  • Preparing code to run identically in local and Keboola environments.

Best practices

  • Always validate data structures first with Keboola MCP (get project info, table schema, sample queries).
  • Push computation to the database; write aggregation SQL, avoid large in-memory DataFrames.
  • Centralize data access in utils/data_loader.py and cache queries with @st.cache_data(ttl=300).
  • Use fully qualified table names and quote identifiers per SQL dialect discovered from project info.
  • Initialize and namespace st.session_state keys to avoid conflicts across pages and widgets.
  • Verify app behavior with Playwright: start app, navigate, interact with filters, and take screenshots before committing.

Example use cases

  • Add a global user_type filter: validate column values, implement get_user_type_filter_clause(), add sidebar control, import and apply across pages.
  • Create a trends page: build SQL aggregation in data_loader, cache results, render Plotly line chart in page module.
  • Fix broken filter: use MCP to confirm column names/types, adjust WHERE clause generator, run Playwright checks to confirm UI updates.
  • Optimize slow page: move group-by and aggregations into SQL, reduce transferred rows, verify metrics match after change.
  • Prepare repo for deployment: replace local secrets lookups with environment fallbacks and test parity locally and in Keboola.

FAQ

What sequence must I follow when making changes?

Always follow Validate → Build → Verify: validate schemas with Keboola MCP, implement changes in code following SQL-first and modular patterns, then verify via Playwright.

How do I avoid session state conflicts?

Use unique st.session_state keys for global versus local widgets (e.g., local_user_type_filter) and initialize defaults before rendering controls.

Which environment differences should I handle?

Read credentials from environment variables first and fall back to streamlit secrets for local dev. Avoid local file access and rely on fully qualified table names.