
This skill helps stabilize deep network training by applying manifold-constrained hyper-connections with doubly stochastic mixing and Sinkhorn projection.

npx playbooks add skill benchflow-ai/skillsbench --skill mhc-algorithm

Review the files below or copy the command above to add this skill to your agents.

Files (6)
SKILL.md
---
name: mhc-algorithm
description: Implement mHC (Manifold-Constrained Hyper-Connections) for stabilizing deep network training. Use when implementing residual connection improvements with doubly stochastic matrices via Sinkhorn-Knopp algorithm. Based on DeepSeek's 2025 paper (arXiv:2512.24880).
---

# mHC: Manifold-Constrained Hyper-Connections

## Overview

mHC (Manifold-Constrained Hyper-Connections) stabilizes deep network training by constraining residual mixing matrices to be **doubly stochastic**. It provides:

- **Stable Training**: Lower gradient norm variance via doubly stochastic constraints
- **Multiple Streams**: Hyper-Connections with learnable mixing across residual streams
- **Sinkhorn Projection**: Log-space Sinkhorn-Knopp algorithm for doubly stochastic projection
- **GPT Integration**: Pattern for wrapping attention and MLP layers

Two components:
- **HyperConnections Module**: Core PyTorch module with H_res, H_pre, H_post matrices
- **Sinkhorn-Knopp**: Log-space projection to doubly stochastic manifold

## Quick Reference

| Topic | Reference |
|-------|-----------|
| Core Concepts & Math | [Core Concepts](references/core-concepts.md) |
| Sinkhorn Algorithm | [Sinkhorn-Knopp](references/sinkhorn-knopp.md) |
| HyperConnections Module | [Module Implementation](references/module-implementation.md) |
| GPT Integration | [GPT Integration](references/gpt-integration.md) |
| Common Pitfalls | [Pitfalls](references/pitfalls.md) |

## Installation

```bash
# Required packages
pip install torch einops numpy
```

## Minimal Example

```python
import torch
import torch.nn as nn
from einops import rearrange, einsum

def sinkhorn_knopp(logits, num_iters=20, tau=0.05):
    # Project logits onto the doubly stochastic manifold in log space:
    # alternately normalize rows (dim=-1) and columns (dim=-2) via logsumexp.
    log_alpha = logits / tau
    for _ in range(num_iters):
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=-1, keepdim=True)
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=-2, keepdim=True)
    return torch.exp(log_alpha)

class HyperConnections(nn.Module):
    def __init__(self, num_streams, dim, branch=None, layer_idx=0):
        super().__init__()
        self.num_streams = num_streams
        self.branch = branch

        # Initialize H_res near identity (use small negative for gradient flow)
        init_h_res = torch.full((num_streams, num_streams), -0.1)
        init_h_res.fill_diagonal_(0.0)
        self.H_res_logits = nn.Parameter(init_h_res)

        # H_pre/H_post for depth connections
        init_h_pre = torch.full((1, num_streams), -0.1)
        init_h_pre[0, layer_idx % num_streams] = 0.0
        self.H_pre_logits = nn.Parameter(init_h_pre)
        self.H_post_logits = nn.Parameter(torch.zeros(1, num_streams))

    def forward(self, x):
        s = self.num_streams
        # (batch * streams, seq, dim) -> (batch, seq, streams, dim)
        x = rearrange(x, "(b s) t d -> b t s d", s=s)

        # Mix streams with a doubly stochastic H_res
        h_res = sinkhorn_knopp(self.H_res_logits)
        x_mixed = einsum(h_res, x, "s r, b t s d -> b t r d")

        # Branch input is a convex combination of streams
        h_pre = self.H_pre_logits.softmax(dim=-1)
        branch_in = einsum(h_pre, x, "v s, b t s d -> b t v d").squeeze(-2)

        branch_out = self.branch(branch_in) if self.branch is not None else branch_in

        # Distribute the branch output back across streams
        h_post = self.H_post_logits.softmax(dim=-1)
        depth_out = einsum(branch_out, h_post, "b t d, v s -> b t s d")

        output = x_mixed + depth_out
        return rearrange(output, "b t s d -> (b s) t d")
```
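
A short usage sketch (an assumption, not taken from the paper): one simple way to drive the module is to expand a standard `(batch, seq, dim)` tensor into `num_streams` identical copies with `repeat`, run a HyperConnections-wrapped branch, and average the streams back down with `reduce`. The toy MLP branch and the expand/collapse strategy here are illustrative choices only.

```python
import torch
import torch.nn as nn
from einops import repeat, reduce

num_streams, dim = 4, 64
mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
hc = HyperConnections(num_streams, dim, branch=mlp, layer_idx=0)

x = torch.randn(2, 16, dim)                                  # (batch, seq, dim)
x = repeat(x, "b t d -> (b s) t d", s=num_streams)           # duplicate into streams
x = hc(x)                                                    # mix, branch, merge
x = reduce(x, "(b s) t d -> b t d", "mean", s=num_streams)   # collapse streams
print(x.shape)  # torch.Size([2, 16, 64])
```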

## Common Imports

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange, einsum, repeat, reduce
```

## When to Use What

| Scenario | Approach |
|----------|----------|
| Standard residual connection | No mHC needed |
| Deep networks (>12 layers) with stability issues | Use mHC with num_streams=4 |
| GPT/Transformer training | Wrap both attention and MLP with HyperConnections (see sketch below) |
| Custom Sinkhorn iterations | Adjust num_iters (20 default) and tau (0.05 default) |
| Memory-constrained training | Reduce num_streams or batch size |
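
The canonical wrapping pattern is documented in [GPT Integration](references/gpt-integration.md); the sketch below only illustrates the table row above. `SelfAttentionBranch` and `Block` are hypothetical placeholders, and causal masking is omitted for brevity.

```python
class SelfAttentionBranch(nn.Module):
    """Placeholder attention branch: LayerNorm + multi-head self-attention
    (causal masking omitted in this sketch)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return out


class Block(nn.Module):
    """One transformer block with both sub-layers wrapped in HyperConnections."""
    def __init__(self, dim, num_streams, layer_idx, num_heads=4):
        super().__init__()
        mlp = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.attn_hc = HyperConnections(num_streams, dim,
                                        branch=SelfAttentionBranch(dim, num_heads),
                                        layer_idx=layer_idx)
        self.mlp_hc = HyperConnections(num_streams, dim, branch=mlp, layer_idx=layer_idx)

    def forward(self, x):
        # x: (batch * num_streams, seq, dim); each call mixes the streams,
        # runs its branch, and adds the result back across streams.
        x = self.attn_hc(x)
        x = self.mlp_hc(x)
        return x
```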

## External Resources

- mHC Paper: https://arxiv.org/abs/2512.24880
- Hyper-Connections: https://arxiv.org/abs/2409.19606
- Sinkhorn's Theorem: https://en.wikipedia.org/wiki/Sinkhorn%27s_theorem

Overview

This skill implements mHC (Manifold-Constrained Hyper-Connections) to stabilize deep network training by enforcing doubly stochastic residual mixing. It provides a compact HyperConnections module and a numerically stable log-space Sinkhorn-Knopp projector to constrain learnable mixing matrices. Use it to add controlled, learnable residual mixing across multiple streams and to wrap attention and MLP blocks for improved gradient behavior.

How this skill works

The skill creates H_res, H_pre, and H_post logits, which are projected into valid probability matrices. H_res is projected to a doubly stochastic matrix with a log-space Sinkhorn-Knopp loop (alternating row and column normalization). H_pre and H_post use softmax for depth-wise mixing. The HyperConnections forward pass rearranges inputs into streams, mixes them via H_res, passes a weighted combination of streams through an optional branch (e.g., attention or MLP), and merges the branch output back across streams via H_post before returning the combined residual.
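
A quick sanity check of the projection, reusing `sinkhorn_knopp` from the Minimal Example above: the final column normalization makes column sums exactly 1 (up to floating point), and row sums approach 1 as the number of iterations grows.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(4, 4)

for iters in (5, 20, 100):
    h = sinkhorn_knopp(logits, num_iters=iters, tau=0.05)
    row_err = (h.sum(dim=-1) - 1).abs().max().item()
    col_err = (h.sum(dim=-2) - 1).abs().max().item()
    print(f"iters={iters:3d}  max row-sum error={row_err:.2e}  max col-sum error={col_err:.2e}")
```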

When to use it

  • Training deep residual models that suffer instability or exploding/vanishing gradients
  • Transformer/GPT-style architectures where both attention and MLP blocks need stabilized residuals
  • Very deep networks (>12 layers) where standard skip connections degrade performance
  • When you need learnable, structured mixing across residual streams to improve gradient flow
  • Prototyping alternative residual strategies using doubly stochastic constraints

Best practices

  • Initialize H_res logits near identity (small negative off-diagonals, zeros on diag) to preserve initial residual paths (see the sketch after this list)
  • Use log-space Sinkhorn with a small temperature (tau ~0.05) and ~20 iterations for stable projection
  • Keep num_streams modest (e.g., 2–8) to balance expressivity and memory; increase only if justified
  • Wrap both attention and MLP modules to get consistent stabilization across layers
  • Tune num_iters and tau for your dataset; fewer iterations reduce cost but may weaken doubly stochastic enforcement
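
A small illustration of the initialization bullet, reusing `sinkhorn_knopp` and the init scheme from the Minimal Example; the printed values are what these specific defaults produce, not tuned recommendations.

```python
import torch

num_streams = 4
init = torch.full((num_streams, num_streams), -0.1)
init.fill_diagonal_(0.0)

h_res = sinkhorn_knopp(init, num_iters=20, tau=0.05)
print(h_res)
# Diagonal entries dominate (~0.71 each for 4 streams with tau=0.05), so each
# stream initially passes mostly through itself, preserving the residual path.
```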

Example use cases

  • Stabilizing training of deep GPT-style models by wrapping attention and feed-forward blocks with HyperConnections
  • Improving convergence in very deep ResNet variants by replacing plain residual mixes with doubly stochastic mixing
  • Experimenting with learned inter-layer routing where H_pre/H_post control depth-wise contribution
  • Memory-aware training: reduce num_streams to trade mixing capacity for lower memory use
  • Research on architecture priors that constrain mixing to a manifold (doubly stochastic) for better generalization

FAQ

How many Sinkhorn iterations are needed?

20 iterations is a reliable default; reduce to 5–10 for lower cost, or increase if the projection shows numerical drift.

Why use log-space Sinkhorn?

Log-space stabilizes normalization for small temperatures and prevents underflow/overflow when exponentiating logits to probabilities.
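
A minimal illustration of the failure mode, reusing `sinkhorn_knopp` from the Minimal Example: dividing even modest logits by tau=0.05 makes a naive `exp()` overflow float32, while the log-space version stays finite.

```python
import torch

logits = torch.tensor([[5.0, 0.0],
                       [0.0, 5.0]])
tau = 0.05

naive = torch.exp(logits / tau)                          # exp(100) overflows float32 -> inf
stable = sinkhorn_knopp(logits, num_iters=20, tau=tau)   # log-space, stays finite

print(naive)   # contains inf
print(stable)  # finite and approximately doubly stochastic
```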