home / skills / benchflow-ai / skillsbench / bn-fit-modify
This skill guides Bayesian Network structure learning, parameter fitting, interventions, and sampling using pgmpy to recover DAGs and reason about causal
npx playbooks add skill benchflow-ai/skillsbench --skill bn-fit-modifyReview the files below or copy the command above to add this skill to your agents.
---
name: bn-fit-modify
description: Guide for Bayesian Network tasks involving structure learning, parameter fitting, intervention, and sampling. This skill should be used when working with pgmpy or similar libraries to recover DAG structures from data, fit conditional probability distributions, perform causal interventions (do-calculus), or sample from modified networks.
---
# Bayesian Network Fitting and Modification
## Overview
This skill provides guidance for tasks involving Bayesian Network structure recovery, parameter estimation, causal intervention, and sampling. It covers common pitfalls when using libraries like pgmpy for Linear Gaussian Bayesian Networks and other BN types.
## Workflow
### Phase 1: Structure Learning
When recovering a DAG structure from observational data:
1. **Explore the data first** - Understand variable types, distributions, and potential relationships before applying algorithms
2. **Choose appropriate algorithms** based on data size:
- For large datasets, constraint-based methods (PC algorithm) may cause memory issues
- Score-based methods (HillClimbSearch) can also be memory-intensive
- Consider correlation-based greedy approaches for very large datasets
3. **Apply domain constraints** - If constraints are given (e.g., "variable U has no parents", "exactly N edges"), incorporate them into the search
4. **Handle ambiguous edges** - When multiple edges have similar scores, follow any specified ordering rules (e.g., alphabetical) for deterministic results
### Phase 2: Parameter Estimation
After structure is determined:
1. **Use library methods for fitting** - Always prefer `model.fit(data)` over manual parameter computation
2. **Verify fitted parameters** - Print and inspect CPD parameters after fitting:
```python
for cpd in model.get_cpds():
print(cpd)
```
3. **For Linear Gaussian BNs**, verify:
- Intercept values
- Coefficient values for each parent
- Variance estimates
4. **Compare against expected values** if test cases provide them
### Phase 3: Intervention (do-calculus)
When performing causal interventions:
1. **Understand intervention semantics**: `do(X=x)` means:
- Remove ALL incoming edges to X (X no longer depends on its parents)
- Fix X to the specified value x
- Keep all outgoing edges from X (X still influences its children)
2. **Create the interventional DAG correctly**:
```python
# Remove incoming edges to intervention variable
intervened_dag = original_dag.copy()
for parent in list(intervened_dag.predecessors(intervention_var)):
intervened_dag.remove_edge(parent, intervention_var)
```
3. **Re-fit or transfer parameters appropriately**:
- Parameters for non-intervened variables remain the same
- The intervened variable becomes a constant (delta distribution)
### Phase 4: Sampling from Intervened Network
**Critical: Use library methods, not custom implementations**
1. **Prefer built-in sampling methods**:
```python
# If library supports intervention in sampling
samples = model.simulate(n_samples, do={intervention_var: value})
```
2. **If manual sampling is necessary**, ensure correct handling:
- Sample in topological order of the DAG
- For the intervened variable, use an array of the fixed value (not a scalar):
```python
# Correct
samples[intervention_var] = np.full(n_samples, intervention_value)
# Incorrect - causes broadcasting issues
samples[intervention_var] = intervention_value
```
- For downstream variables, use the fixed intervention value correctly in conditional distributions
3. **Verify array dimensions** at each step to catch broadcasting errors early
## Verification Strategies
### Incremental Testing
Test each phase independently before proceeding:
1. **Structure verification**: Compare learned edges against expected structure
2. **Parameter verification**: Compare fitted parameters against expected values
3. **Intervention verification**: Confirm correct edges removed and added
4. **Sampling verification**: Check sample statistics match expected distributions
### Statistical Validation
For final samples:
1. Compute mean and variance of sampled values
2. Compare against theoretical expectations from the fitted model
3. Use statistical tests (e.g., chi-square, KS test) to validate distributions
4. A p-value of 0.0 indicates fundamental errors in sampling logic
### Debugging Checklist
When tests fail:
- [ ] Are fitted parameters correct? Print and verify.
- [ ] Is the intervention DAG correct? Visualize or print edges.
- [ ] Are array dimensions consistent throughout sampling?
- [ ] Is the intervention value propagating correctly to downstream variables?
- [ ] Are library methods being used where available?
## Common Pitfalls
### 1. Custom Sampling Instead of Library Methods
**Problem**: Writing custom sampling functions when library methods exist.
**Why it fails**: Custom implementations often have subtle bugs in:
- Array dimension handling
- Conditional distribution computation
- Topological ordering
**Solution**: Always check if the library provides sampling with intervention support before implementing custom code.
### 2. Scalar vs Array for Fixed Values
**Problem**: Setting intervention variable to a scalar instead of an array.
```python
# Wrong
samples['Y'] = 0.0
# Correct
samples['Y'] = np.zeros(n_samples)
```
**Why it fails**: Causes broadcasting issues when used as parent values in conditional distributions.
### 3. Incorrect Intervention Semantics
**Problem**: Misunderstanding what `do(X=x)` means in causal inference.
**Common mistakes**:
- Keeping incoming edges to X
- Removing outgoing edges from X
- Not fixing X to the exact value
**Solution**: Review Pearl's do-calculus - intervention removes causes (parents) but preserves effects (children).
### 4. Not Verifying Intermediate Results
**Problem**: Running the full pipeline and only checking final outputs.
**Why it fails**: Errors compound through the pipeline; early errors produce misleading final results.
**Solution**: Verify structure, then parameters, then intervention, then sampling - each step independently.
### 5. Memory Issues with Large Data
**Problem**: OOM errors with PC or HillClimbSearch algorithms on large datasets.
**Solutions**:
- Subsample the data for structure learning
- Use simpler correlation-based approaches
- Increase available memory
- Use algorithms with lower memory footprint
### 6. Ignoring Numerical Precision
**Problem**: Floating-point precision issues in parameter estimation and sampling.
**Solution**:
- Use appropriate tolerances in comparisons
- Check for near-zero variances that could cause division issues
- Validate that covariance matrices are positive definite
## Code Organization
- Maintain a single script and iterate on it rather than creating multiple versions (v1, v2, v3)
- Use functions to separate structure learning, parameter fitting, intervention, and sampling
- Add assertions or checks after each major step
- Clean up intermediate files after successful completion
This skill guides Bayesian Network tasks for structure learning, parameter fitting, causal intervention (do-calculus), and sampling. It focuses on practical steps and checks when using libraries like pgmpy for Linear Gaussian and other BN types. Use it to recover DAGs, fit CPDs, perform interventions, and generate valid samples from modified networks.
The skill breaks the workflow into four phases: structure learning, parameter estimation, causal intervention, and sampling. It describes algorithm selection, how to enforce domain constraints, using library fit methods, constructing interventional DAGs (removing incoming edges to intervention variables), and sampling correctly (prefer built-in methods or follow topological order with correct array shapes). It also lists verification and debugging strategies to validate each phase independently.
What does do(X=x) change in the DAG?
do(X=x) removes all incoming edges to X and fixes X to value x while preserving outgoing edges; other variables keep their original parameters.
Why are my sampled downstream values incorrect?
Common causes: intervention assigned as a scalar instead of an array, wrong topological sampling order, or CPDs not verified. Ensure the intervened variable is an array of length n_samples and sample in topological order, or use built-in intervention-aware sampling.