Understanding Parallel Layers
This guide explains Loom's Parallel layer—how to run multiple sub-networks simultaneously, combine their outputs in different ways, and build complex architectures like Mixture of Experts.
What is a Parallel Layer?
A Parallel layer feeds its input into multiple "branches", processes them simultaneously, then combines the results:
Input
│
┌─────────────┼─────────────┐
│ │ │
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│Branch │ │Branch │ │Branch │
│ 0 │ │ 1 │ │ 2 │
│(Dense)│ │(LSTM) │ │ (MHA) │
└───┬───┘ └───┬───┘ └───┬───┘
│ │ │
└─────────────┼─────────────┘
│
Combine
(concat/add/avg/filter/grid_scatter)
│
▼
Output
Each branch can be a different layer type—this is what makes Parallel layers so powerful.
Combine Modes
The key decision: How do you combine branch outputs?
Concat (Default)
Concatenate all outputs into one large vector:
Branch 0 output: [a, b, c] (3 values)
Branch 1 output: [d, e] (2 values)
Branch 2 output: [f, g, h, i] (4 values)
Combined (concat): [a, b, c, d, e, f, g, h, i] (9 values)
Use when:
- Branches produce different feature types
- You want the next layer to see all information
- Output sizes differ between branches
Add
Element-wise addition (requires same-sized outputs):
Branch 0 output: [1.0, 2.0, 3.0]
Branch 1 output: [0.5, 0.5, 0.5]
Branch 2 output: [0.2, 0.3, 0.2]
Combined (add): [1.7, 2.8, 3.7]
Use when:
- Branches are processing the same features differently
- You want to aggregate responses
- Building residual-like connections
Average
Element-wise average (requires same-sized outputs):
Branch 0 output: [1.0, 2.0, 3.0]
Branch 1 output: [0.5, 0.5, 0.5]
Branch 2 output: [0.2, 0.3, 0.2]
Combined (avg): [0.57, 0.93, 1.23] (mean of each position)
Use when:
- Building ensemble predictions
- You want balanced contribution from each branch
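A minimal sketch of an averaging ensemble, assuming the same InitParallelLayer / CombineMode / ParallelBranches API used in the concat and add examples later in this guide:

// Three same-sized branches averaged into one ensemble output.
ensemble := nn.InitParallelLayer()
ensemble.CombineMode = "avg"
ensemble.ParallelBranches = []nn.LayerConfig{
	nn.InitDenseLayer(64, 32, nn.ActivationReLU),
	nn.InitDenseLayer(64, 32, nn.ActivationTanh),
	nn.InitDenseLayer(64, 32, nn.ActivationLeakyReLU),
}
// Output size: 32 (each position is the mean of the three branch outputs)
network.SetLayer(0, 0, 1, ensemble)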
Grid Scatter
Place outputs at specific 2D/3D grid positions:
Branch 0 → position (0, 0)
Branch 1 → position (0, 1)
Branch 2 → position (1, 0)
Branch 3 → position (1, 1)
Grid output:
┌─────────────┬─────────────┐
│ Branch 0 │ Branch 1 │ Row 0
│ output │ output │
├─────────────┼─────────────┤
│ Branch 2 │ Branch 3 │ Row 1
│ output │ output │
└─────────────┴─────────────┘
Col 0 Col 1
Use when:
- Building spatially-aware architectures
- Multi-agent systems with spatial positioning
- Image processing with region-specific branches
Filter (Softmax-Gated) — Dynamic Logic Gates
The most powerful combine mode: a learnable gate network decides how much to use each branch.
Input
│
┌───────────────────────┼───────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Gate Network │ │ Expert 0 │ │ Expert 1 │
│ (Dense→Softmax)│ │ (Dense) │ │ (LSTM) │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
▼ ▼ ▼
[0.7, 0.3] [e0_out...] [e1_out...]
│ │ │
└───────────────────────┼───────────────────────┘
│
Weighted Sum:
0.7 × Expert0 + 0.3 × Expert1
│
▼
Output
This is a learnable conditional computation system—the gate learns WHEN to use each expert.
How Filter Mode Works
Gate network predicts: [0.6, 0.3, 0.1] (sum to 1.0 via softmax)
Branch 0 output: [1.0, 2.0, 3.0] × 0.6 = [0.60, 1.20, 1.80]
Branch 1 output: [0.5, 0.5, 0.5] × 0.3 = [0.15, 0.15, 0.15]
Branch 2 output: [0.2, 0.3, 0.2] × 0.1 = [0.02, 0.03, 0.02]
Combined (filter): [0.77, 1.38, 1.97]
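To make the arithmetic concrete, here is a small self-contained Go snippet (plain slices, no Loom calls) that reproduces this weighted sum:

package main

import "fmt"

func main() {
	gate := []float32{0.6, 0.3, 0.1}
	branches := [][]float32{
		{1.0, 2.0, 3.0},
		{0.5, 0.5, 0.5},
		{0.2, 0.3, 0.2},
	}
	combined := make([]float32, len(branches[0]))
	for b, out := range branches {
		for i, v := range out {
			combined[i] += gate[b] * v
		}
	}
	fmt.Println(combined) // [0.77 1.38 1.97] (up to float rounding)
}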
The Power: Dynamic Learned Routing
Unlike static combine modes, the gate network learns from data which expert to use for which inputs:
Training data:
Pattern A: input[0] > 0.5 → Expert 0 is better
Pattern B: input[0] ≤ 0.5 → Expert 1 is better
After training:
Input with high input[0] → gate outputs [0.95, 0.05]
Input with low input[0] → gate outputs [0.10, 0.90]
The gate has learned to route each input to the right expert!
Building a Filtered Parallel Layer
inputSize := 16
expertSize := 8
// Create two expert branches (can be any layer type)
expert1 := nn.InitDenseLayer(inputSize, expertSize, nn.ActivationLeakyReLU)
expert2 := nn.InitDenseLayer(inputSize, expertSize, nn.ActivationLeakyReLU)
// Create gate layer: Input → 2 outputs (one per expert)
gateLayer := nn.InitDenseLayer(inputSize, 2, nn.ActivationScaledReLU)
// Build the filtered parallel layer
filterLayer := nn.LayerConfig{
Type: nn.LayerParallel,
ParallelBranches: []nn.LayerConfig{expert1, expert2},
CombineMode: "filter",
FilterGateConfig: &gateLayer,
FilterSoftmax: nn.SoftmaxStandard, // How to normalize gate outputs
FilterTemperature: 1.0, // Controls routing sharpness
}
// Add to network
network.SetLayer(0, 0, 1, filterLayer)
Gate Configuration Options
FilterSoftmax — Controls how gate outputs are normalized:
| Type | Effect | Use Case |
|---|---|---|
| SoftmaxStandard | Smooth routing, all experts get some weight | Ensemble learning |
| SoftmaxEntmax | Sparse routing, some experts get exact zero | Efficiency |
| SoftmaxSparsemax | Very sparse, picks 1-2 experts | Hard routing |
| SoftmaxTemperature | Adjustable sharpness | Curriculum learning |
FilterTemperature — Controls routing "sharpness":
Temperature = 1.0 (default):
Gate logits [2.0, 1.0] → [0.73, 0.27] (soft mix)
Temperature = 0.5 (sharper):
Gate logits [2.0, 1.0] → [0.88, 0.12] (mostly expert 0)
Temperature = 0.1 (nearly hard):
Gate logits [2.0, 1.0] → [0.99, 0.01] (almost hard selection)
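Temperature is nothing more than dividing the gate logits by T before the softmax. A minimal, Loom-independent Go sketch that reproduces the numbers above:

package main

import (
	"fmt"
	"math"
)

// softmaxWithTemperature divides logits by T, then applies a standard softmax.
func softmaxWithTemperature(logits []float64, T float64) []float64 {
	out := make([]float64, len(logits))
	var sum float64
	for i, l := range logits {
		out[i] = math.Exp(l / T)
		sum += out[i]
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}

func main() {
	logits := []float64{2.0, 1.0}
	for _, T := range []float64{1.0, 0.5, 0.1} {
		p := softmaxWithTemperature(logits, T)
		fmt.Printf("T=%.1f → [%.2f %.2f]\n", T, p[0], p[1])
	}
	// T=1.0 → [0.73 0.27], T=0.5 → [0.88 0.12], T=0.1 → [1.00 0.00] (effectively a hard selection)
}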
Advanced: Expert Pre-Training + Gate Training
A powerful pattern: train experts separately, then train just the gate.
// ============================================================
// STEP 1: Pre-train Expert 1 on "high signal" patterns
// ============================================================
expert1 := nn.InitDenseLayer(8, 8, nn.ActivationSigmoid)
e1Net := nn.NewNetwork(8, 1, 1, 1)
e1Net.SetLayer(0, 0, 0, expert1)
// Train: high first element → high output
trainData1 := make([]nn.TrainingBatch, 2000)
for i := range trainData1 {
	input := randomInput(8)
	var target float32
	if rand.Float32() > 0.5 {
		input[0] = 0.7 + rand.Float32()*0.3 // High
		target = 1.0
	} else {
		input[0] = rand.Float32() * 0.3 // Low
		target = 0.0
	}
	trainData1[i] = nn.TrainingBatch{Input: input, Target: []float32{target}}
}
e1Net.Train(trainData1, &nn.TrainingConfig{Epochs: 10, LearningRate: 0.1})
expert1 = *e1Net.GetLayer(0, 0, 0) // Get trained weights
// ============================================================
// STEP 2: Pre-train Expert 2 on "low signal" patterns
// ============================================================
expert2 := nn.InitDenseLayer(8, 8, nn.ActivationSigmoid)
// (similar training, but responds to low input[0])
// ============================================================
// STEP 3: Combine with filter layer, train ONLY the gate
// ============================================================
gateLayer := nn.InitDenseLayer(8, 2, nn.ActivationScaledReLU)
filterLayer := nn.LayerConfig{
Type: nn.LayerParallel,
ParallelBranches: []nn.LayerConfig{expert1, expert2}, // Pre-trained!
CombineMode: "filter",
FilterGateConfig: &gateLayer, // This will be trained
FilterSoftmax: nn.SoftmaxStandard,
FilterTemperature: 0.5, // Sharper routing
}
net := nn.NewNetwork(8, 1, 1, 2)
net.SetLayer(0, 0, 0, filterLayer)
net.SetLayer(0, 0, 1, nn.InitDenseLayer(8, 1, nn.ActivationSigmoid))
// Train with tweening (gate learns to route to correct expert)
ts := nn.NewTweenState(net, nil)
for epoch := 0; epoch < 2000; epoch++ {
input := randomInput(8)
if epoch%2 == 0 {
input[0] = 0.7 + rand.Float32()*0.3 // High → should route to expert1
} else {
input[0] = rand.Float32() * 0.3 // Low → should route to expert2
}
ts.TweenStep(net, input, 0, 1, 0.01)
}
After training, the gate will have learned:
- High input[0] → route to Expert 1
- Low input[0] → route to Expert 2
This is essentially a learned IF/ELSE statement!
Freezing Layers: Train Only the Gate
Loom supports freezing layers so their weights don't update during training. This is essential for the filter pattern: freeze pre-trained experts, train only the gate.
// LayerConfig has a Frozen field
type LayerConfig struct {
// ... other fields ...
Frozen bool // If true, weights will NOT be updated during training
}
Using the Frozen Field
// Freeze a single layer
expertLayer.Frozen = true
// Freeze recursively (for Sequential/Parallel layers)
func freezeLayer(cfg *nn.LayerConfig) {
cfg.Frozen = true
// Recurse into nested branches
for i := range cfg.ParallelBranches {
freezeLayer(&cfg.ParallelBranches[i])
}
}
Complete Pattern: Frozen Experts + Trainable Gate
// ============================================================
// STEP 1: Create and pre-train Expert 1
// ============================================================
expert1 := nn.InitSequentialLayer(
nn.InitDenseLayer(8, 8, nn.ActivationLeakyReLU),
nn.InitDenseLayer(8, 1, nn.ActivationSigmoid),
)
// Train expert1 to respond to HIGH input[0]
e1Net := nn.NewNetwork(8, 1, 1, 1)
e1Net.SetLayer(0, 0, 0, expert1)
e1Net.Train(highPatternData, &nn.TrainingConfig{
Epochs: 5, LearningRate: 0.05,
})
expert1 = *e1Net.GetLayer(0, 0, 0)
// ============================================================
// STEP 2: Create and pre-train Expert 2
// ============================================================
expert2 := nn.InitSequentialLayer(
nn.InitDenseLayer(8, 8, nn.ActivationLeakyReLU),
nn.InitDenseLayer(8, 1, nn.ActivationSigmoid),
)
// Train expert2 to respond to LOW input[0]
e2Net := nn.NewNetwork(8, 1, 1, 1)
e2Net.SetLayer(0, 0, 0, expert2)
e2Net.Train(lowPatternData, &nn.TrainingConfig{
Epochs: 5, LearningRate: 0.05,
})
expert2 = *e2Net.GetLayer(0, 0, 0)
// ============================================================
// STEP 3: FREEZE both experts
// ============================================================
freezeLayer(&expert1) // expert1.Frozen = true (recursive)
freezeLayer(&expert2) // expert2.Frozen = true (recursive)
// ============================================================
// STEP 4: Create filter layer with frozen experts + trainable gate
// ============================================================
gateLayer := nn.InitDenseLayer(8, 2, nn.ActivationScaledReLU)
// Note: gateLayer.Frozen is FALSE (default) - it WILL be trained
filterLayer := nn.LayerConfig{
Type: nn.LayerParallel,
ParallelBranches: []nn.LayerConfig{expert1, expert2}, // FROZEN
CombineMode: "filter",
FilterGateConfig: &gateLayer, // TRAINABLE
FilterSoftmax: nn.SoftmaxStandard,
FilterTemperature: 0.5,
}
net := nn.NewNetwork(8, 1, 1, 1)
net.SetLayer(0, 0, 0, filterLayer)
// ============================================================
// STEP 5: Train - only gate weights will update!
// ============================================================
ts := nn.NewTweenState(net, nil)
ts.Config.UseChainRule = true
for epoch := 0; epoch < 1000; epoch++ {
input := randomInput(8)
if epoch%2 == 0 {
input[0] = 0.9 // Should route to expert1
} else {
input[0] = 0.1 // Should route to expert2
}
ts.TweenStep(net, input, 0, 1, 0.05)
// Gate learns to route correctly
// Expert weights stay FROZEN - no updates
}
How Freezing Works Internally
When a layer has Frozen = true:
During backward pass:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Input │────▶│ Expert │────▶│ Output │
└─────────────┘ │ (FROZEN) │ └─────────────┘
└──────┬──────┘
│
▼
Gradients still flow THROUGH
(for upstream layers)
But weights are NOT updated:
kernel_grad = 0
bias_grad = 0
The gradient passes through frozen layers (so upstream trainable layers can learn), but the frozen layer's own weights are never modified.
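An illustrative sketch of the idea (not Loom's actual internals): a simplified dense backward pass in which the input gradient is always computed, but the weight update is skipped when the layer is frozen.

// Simplified dense backward pass (illustration only, not Loom's implementation).
type denseLayer struct {
	W      [][]float32 // W[out][in]
	frozen bool
}

func (l *denseLayer) backward(input, gradOut []float32, lr float32) []float32 {
	gradIn := make([]float32, len(input))
	for o := range l.W {
		for i := range input {
			gradIn[i] += l.W[o][i] * gradOut[o] // gradient always flows through to upstream layers
			if !l.frozen {
				l.W[o][i] -= lr * gradOut[o] * input[i] // weights update only when trainable
			}
		}
	}
	return gradIn
}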
Visualization of Filter Training with Frozen Experts
Forward Pass:
Input
│
┌───────────────────┼───────────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Gate │ │ Expert1 │ │ Expert2 │
│(trainable)│ │ (FROZEN) │ │ (FROZEN) │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
▼ ▼ ▼
[0.8, 0.2] [0.9] [0.1]
│ │ │
└──────────────────┼──────────────────┘
│
0.8×0.9 + 0.2×0.1 = 0.74
│
▼
Output
Backward Pass:
∂Loss/∂Output
│
┌───────────────────┼───────────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Gate │ │ Expert1 │ │ Expert2 │
│ UPDATED! │ │ NOT UPD │ │ NOT UPD │
│ dW = ... │ │ dW = 0 │ │ dW = 0 │
└───────────┘ └───────────┘ └───────────┘
Only the gate learns! Experts stay frozen.
Use Cases for Frozen Layers
| Scenario | What to Freeze | What to Train |
|---|---|---|
| Pre-trained experts + gate | Expert branches | Gate layer only |
| Transfer learning | Base model layers | New head layers |
| Feature extraction | Encoder | Decoder |
| Fine-tuning on new task | Lower layers | Output layers |
| Debugging | Suspected broken layer | Rest of network |
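As a sketch of the transfer-learning row, assuming GetLayer returns a pointer into the network's stored configuration (so setting Frozen through it takes effect), and using hypothetical loadPretrainedNetwork / newTaskData placeholders:

// Freeze the pre-trained base, train only a new head (sketch).
base := loadPretrainedNetwork() // hypothetical helper returning a *nn.Network with 3 layers

// Freeze layers 0 and 1 (the pre-trained feature extractor)
for layerIdx := 0; layerIdx < 2; layerIdx++ {
	freezeLayer(base.GetLayer(0, 0, layerIdx)) // recursive freeze helper from above
}

// Swap in a fresh, trainable head for the new task
base.SetLayer(0, 0, 2, nn.InitDenseLayer(32, 3, nn.ActivationSigmoid))

// Only the new head's weights update; frozen layers keep their pre-trained weights.
base.Train(newTaskData, &nn.TrainingConfig{Epochs: 10, LearningRate: 0.05})

If your Loom version returns a copy from GetLayer, set Frozen on the LayerConfig before calling SetLayer instead.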
The Dynamic Logic Gate Concept
Filter mode enables networks that learn conditional logic:
Traditional programming:
if (input[0] > 0.5) {
use_expert_1()
} else {
use_expert_2()
}
Filter mode equivalent:
Gate learns the decision boundary automatically!
Different experts can specialize in:
- Different input ranges
- Different feature patterns
- Different task types
- Different modalities
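In plain Go (no Loom calls), the contrast looks like this; expertA and expertB are placeholder functions standing in for any two sub-networks with equal-sized outputs:

// expertA and expertB stand in for any two sub-networks that produce equal-sized outputs.
var expertA = func(x []float32) []float32 { return x }
var expertB = func(x []float32) []float32 { return x }

// Hard routing (traditional code): exactly one path runs, and the decision is not differentiable.
func hardRoute(x []float32) []float32 {
	if x[0] > 0.5 {
		return expertA(x)
	}
	return expertB(x)
}

// Soft routing (filter mode): both experts run and are blended by the learned gate weights.
// As training sharpens the gate toward [0.95, 0.05], this approaches the hard IF/ELSE above.
func softRoute(x, gate []float32) []float32 {
	a, b := expertA(x), expertB(x)
	out := make([]float32, len(a))
	for i := range out {
		out[i] = gate[0]*a[i] + gate[1]*b[i]
	}
	return out
}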
Use Cases
| Scenario | Experts | What Gate Learns |
|---|---|---|
| Multi-task learning | Task A expert, Task B expert | Which task this input belongs to |
| Feature specialization | Low-frequency expert, High-frequency expert | Signal characteristics |
| Temporal patterns | Recent-memory expert (Dense), Long-memory expert (LSTM) | Time horizon to focus on |
| Difficulty routing | Simple expert (small), Complex expert (deep) | Input complexity |
| Modality fusion | Image expert (Conv), Text expert (LSTM) | Which modality is more informative |
Use when:
- Building Mixture of Experts architectures
- You want the network to learn conditional computation
- Different inputs need fundamentally different processing
- You want interpretable routing decisions
Case Study: Parallel KMeans Experts (The RN6 Pattern)
A powerful application of Parallel layers is using multiple KMeans "experts" in parallel. This is the core of the RN6 (Recursive Neuro-Symbolic 6) benchmark.
Input
│
┌─────────┴─────────┐
▼ ▼
┌───────────┐ ┌───────────┐
│ Expert A │ │ Expert B │
│ (KMeans) │ │ (KMeans) │
└─────┬─────┘ └─────┬─────┘
│ │
└─────────┬─────────┘
│
Combine
(concat/filter)
│
▼
Dense Head
│
▼
Classification
Why do this?
- Diverse Perspectives: Each expert can learn to cluster the data differently (e.g., one focusing on spatial proximity, another on feature-based similarity).
- Redundancy & Reliability: If one expert fails to capture a complex boundary, others can compensate.
- Interpretable MoE: In filter mode, you can see exactly which "cluster" or "concept" expert the model is choosing for any given input.
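A sketch of the topology above. The KMeans constructor name below (nn.InitKMeansLayer) is a placeholder, not a confirmed Loom API; substitute whatever constructor your Loom version exposes for KMeans layers. The gate and combine settings reuse the filter pattern shown earlier.

// RN6-style sketch: two KMeans experts in parallel, gated, then a Dense head.
expertA := nn.InitKMeansLayer(16, 8) // placeholder constructor: 16 inputs → 8 cluster activations
expertB := nn.InitKMeansLayer(16, 8)

gate := nn.InitDenseLayer(16, 2, nn.ActivationScaledReLU)
moe := nn.LayerConfig{
	Type:              nn.LayerParallel,
	ParallelBranches:  []nn.LayerConfig{expertA, expertB},
	CombineMode:       "filter", // or "concat" to let the head see both experts
	FilterGateConfig:  &gate,
	FilterSoftmax:     nn.SoftmaxStandard,
	FilterTemperature: 1.0,
}

net := nn.NewNetwork(16, 1, 1, 2)
net.SetLayer(0, 0, 0, moe)
net.SetLayer(0, 0, 1, nn.InitDenseLayer(8, 3, nn.ActivationSigmoid)) // classification head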
Stitching Oddly-Shaped Networks
One of Loom's unique capabilities: combine networks with different output sizes using stitch layers.
The Problem
Filter mode (and avg mode) requires all branches to output the same size. But what if your pre-trained experts have different architectures?
Expert A: input → 16 features
Expert B: input → 32 features
Expert C: input → 7 features
Filter mode needs all outputs to be the same size!
The Solution: Stitch Layers
Use InitStitchLayer() to project each expert's output to a common size:
Input
│
┌─────────────┼─────────────┐
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Expert A │ │Expert B │ │Expert C │
│ (→16) │ │ (→32) │ │ (→7) │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Stitch │ │Stitch │ │Stitch │
│(16→10) │ │(32→10) │ │(7→10) │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
▼ ▼ ▼
[10] [10] [10] ← All same size now!
│ │ │
└─────────────┼─────────────┘
│
Filter Combine
│
▼
[10]
InitStitchLayer
A stitch layer is a linear projection (Dense layer without activation):
// Create a stitch layer: 16 → 10
stitch := nn.InitStitchLayer(16, 10)
// Equivalent to:
stitch := nn.InitDenseLayer(16, 10, nn.ActivationType(-1)) // Linear
Building Stitched Branches
Wrap each expert with its stitch layer using InitSequentialLayer:
inputSize := 16
commonOutputSize := 10
// Expert 1: outputs 5 features → stitch to 10
expert1 := nn.InitDenseLayer(inputSize, 5, nn.ActivationLeakyReLU)
stitch1 := nn.InitStitchLayer(5, commonOutputSize)
branch1 := nn.InitSequentialLayer(expert1, stitch1)
// Expert 2: outputs 7 features → stitch to 10
expert2 := nn.InitDenseLayer(inputSize, 7, nn.ActivationSigmoid)
stitch2 := nn.InitStitchLayer(7, commonOutputSize)
branch2 := nn.InitSequentialLayer(expert2, stitch2)
// Now both branches output [10] - ready for filter combine!
gateLayer := nn.InitDenseLayer(inputSize, 2, nn.ActivationScaledReLU)
filterLayer := nn.LayerConfig{
Type: nn.LayerParallel,
ParallelBranches: []nn.LayerConfig{branch1, branch2},
CombineMode: "filter",
FilterGateConfig: &gateLayer,
FilterSoftmax: nn.SoftmaxStandard,
FilterTemperature: 1.0,
}
Multi-Expert Stitching Example
Combine 4 experts with wildly different output sizes:
inputSize := 16
commonOutputSize := 8
expertSizes := []int{4, 12, 6, 20} // Very different!
branches := make([]nn.LayerConfig, len(expertSizes))
for i, size := range expertSizes {
expert := nn.InitDenseLayer(inputSize, size, nn.ActivationLeakyReLU)
stitch := nn.InitStitchLayer(size, commonOutputSize)
branches[i] = nn.InitSequentialLayer(expert, stitch)
}
gateLayer := nn.InitDenseLayer(inputSize, len(branches), nn.ActivationScaledReLU)
filterLayer := nn.LayerConfig{
Type: nn.LayerParallel,
ParallelBranches: branches,
CombineMode: "filter",
FilterGateConfig: &gateLayer,
FilterSoftmax: nn.SoftmaxEntmax, // Sparse routing
FilterTemperature: 0.5, // Sharp selection
}
// All 4 experts (sizes 4, 12, 6, 20) now work together!
Pre-Training + Stitching + Gate Training
The complete pattern with oddly-shaped pre-trained networks:
// ============================================================
// STEP 1: Pre-train Expert 1 (outputs 3 features)
// ============================================================
commonOutputSize := 4 // shared width both experts are stitched to (example value)
expert1Core := nn.InitDenseLayer(8, 3, nn.ActivationSigmoid)
stitch1 := nn.InitStitchLayer(3, commonOutputSize)
net1 := nn.NewNetwork(8, 1, 1, 2)
net1.SetLayer(0, 0, 0, expert1Core)
net1.SetLayer(0, 0, 1, stitch1)
// Train on "high input" detection task
net1.Train(highInputData, &nn.TrainingConfig{
Epochs: 10, LearningRate: 0.1,
})
// Bundle expert + stitch as one branch
branch1 := nn.InitSequentialLayer(
*net1.GetLayer(0, 0, 0), // Pre-trained expert
*net1.GetLayer(0, 0, 1), // Pre-trained stitch
)
// ============================================================
// STEP 2: Pre-train Expert 2 (outputs 5 features)
// ============================================================
expert2Core := nn.InitDenseLayer(8, 5, nn.ActivationSigmoid)
stitch2 := nn.InitStitchLayer(5, commonOutputSize)
net2 := nn.NewNetwork(8, 1, 1, 2)
net2.SetLayer(0, 0, 0, expert2Core)
net2.SetLayer(0, 0, 1, stitch2)
// Train on "low input" detection task
net2.Train(lowInputData, &nn.TrainingConfig{
Epochs: 10, LearningRate: 0.1,
})
branch2 := nn.InitSequentialLayer(
*net2.GetLayer(0, 0, 0),
*net2.GetLayer(0, 0, 1),
)
// ============================================================
// STEP 3: Freeze experts, combine with filter, train gate
// ============================================================
freezeLayer(&branch1)
freezeLayer(&branch2)
gateLayer := nn.InitDenseLayer(8, 2, nn.ActivationScaledReLU)
filterLayer := nn.LayerConfig{
Type: nn.LayerParallel,
ParallelBranches: []nn.LayerConfig{branch1, branch2},
CombineMode: "filter",
FilterGateConfig: &gateLayer, // Trainable!
FilterSoftmax: nn.SoftmaxStandard,
FilterTemperature: 0.5,
}
net := nn.NewNetwork(8, 1, 1, 2)
net.SetLayer(0, 0, 0, filterLayer)
net.SetLayer(0, 0, 1, nn.InitDenseLayer(commonOutputSize, 1, nn.ActivationSigmoid))
// Train gate only
ts := nn.NewTweenState(net, nil)
ts.Config.UseChainRule = true
for epoch := 0; epoch < 1000; epoch++ {
	input := randomInput(8)
	if epoch%2 == 0 {
		input[0] = 0.9 // High → should route to expert1
	} else {
		input[0] = 0.1 // Low → should route to expert2
	}
	ts.TweenStep(net, input, 0, 1, 0.05)
}
When to Use Stitching
| Scenario | Solution |
|---|---|
| Experts have different output sizes | Stitch to common size |
| Loading pre-trained models with different architectures | Stitch before combining |
| Ensemble of heterogeneous models | Stitch outputs, then average or filter |
| Transfer learning from models with different feature dimensions | Stitch to target dimension |
Creating Parallel Layers
Basic Parallel (Concat)
// Create a parallel layer with 3 branches
parallel := nn.InitParallelLayer()
parallel.CombineMode = "concat"
// Add branches of different types
parallel.ParallelBranches = []nn.LayerConfig{
nn.InitDenseLayer(64, 32, nn.ActivationReLU), // Dense branch
nn.InitLSTMLayer(64, 32, 1, 10), // LSTM branch
nn.InitMultiHeadAttentionLayer(64, 4, 1, 16), // Attention branch
}
// Add to network
network.SetLayer(0, 0, 1, parallel)
// Output size: 32 + 32 + 64 = 128 (concatenated)
Parallel with Add Mode
parallel := nn.InitParallelLayer()
parallel.CombineMode = "add"
// All branches must output same size!
parallel.ParallelBranches = []nn.LayerConfig{
nn.InitDenseLayer(64, 32, nn.ActivationReLU),
nn.InitDenseLayer(64, 32, nn.ActivationTanh),
nn.InitDenseLayer(64, 32, nn.ActivationLeakyReLU),
}
// Output size: 32 (element-wise sum)
Grid Scatter
parallel := nn.InitParallelLayer()
parallel.CombineMode = "grid_scatter"
parallel.GridOutputRows = 2
parallel.GridOutputCols = 2
parallel.GridOutputLayers = 1
// Position each branch in the grid
parallel.GridPositions = []nn.GridPosition{
{TargetRow: 0, TargetCol: 0, TargetLayer: 0}, // Branch 0 → top-left
{TargetRow: 0, TargetCol: 1, TargetLayer: 0}, // Branch 1 → top-right
{TargetRow: 1, TargetCol: 0, TargetLayer: 0}, // Branch 2 → bottom-left
{TargetRow: 1, TargetCol: 1, TargetLayer: 0}, // Branch 3 → bottom-right
}
parallel.ParallelBranches = []nn.LayerConfig{
nn.InitDenseLayer(64, 16, nn.ActivationReLU),
nn.InitLSTMLayer(64, 16, 1, 5),
nn.InitDenseLayer(64, 16, nn.ActivationTanh),
nn.InitMultiHeadAttentionLayer(64, 2, 1, 8),
}
Filtered Parallel (Mixture of Experts)
// Create MoE with softmax gating
branches := []nn.LayerConfig{
nn.InitDenseLayer(64, 32, nn.ActivationReLU), // Expert 0
nn.InitDenseLayer(64, 32, nn.ActivationReLU), // Expert 1
nn.InitDenseLayer(64, 32, nn.ActivationReLU), // Expert 2
nn.InitDenseLayer(64, 32, nn.ActivationReLU), // Expert 3
}
moe := nn.InitFilteredParallelLayer(
branches, // The expert branches
64, // Input size for gate network
nn.SoftmaxStandard, // Softmax type for gating
1.0, // Temperature (1.0 = no scaling)
)
network.SetLayer(0, 0, 1, moe)
Heterogeneous Architectures
The real power: Each branch can be a completely different architecture.
Multi-Modal Fusion
Input: [image_features | text_features | audio_features]
┌────────────────────────────────────────────────────────────────┐
│ Parallel Layer │
│ │
│ Branch 0: Conv2D Branch 1: LSTM Branch 2: │
│ (for image) (for text) Dense+MHA │
│ (for audio) │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────┐│
│ │ Conv2D 3×3 │ │ LSTM 64→32 │ │ Dense 64→32││
│ │ Filters=16 │ │ SeqLen=10 │ │ MHA heads=4││
│ └──────┬───────┘ └──────┬───────┘ └─────┬──────┘│
│ │ │ │ │
└─────────┼─────────────────────────┼───────────────────┼───────┘
│ │ │
└─────────────────────────┼───────────────────┘
│
Concat: 256 + 32 + 32 = 320
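A sketch of this fusion block using concat combine. The Conv2D constructor below is a placeholder (its name and signature are not confirmed by this guide); the LSTM, Dense, MHA and Sequential constructors are the ones used elsewhere here, with illustrative sizes.

fusion := nn.InitParallelLayer()
fusion.CombineMode = "concat"
fusion.ParallelBranches = []nn.LayerConfig{
	nn.InitConv2DLayer(3, 16, 3), // placeholder Conv2D constructor, image branch
	nn.InitLSTMLayer(64, 32, 1, 10), // text branch
	nn.InitSequentialLayer( // audio branch: Dense → MHA
		nn.InitDenseLayer(64, 32, nn.ActivationReLU),
		nn.InitMultiHeadAttentionLayer(32, 4, 1, 8),
	),
}
// Concatenated output = image features + 32 + 32 (e.g. 256 + 32 + 32 = 320 as in the diagram above)
network.SetLayer(0, 0, 1, fusion)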
Expert Specialization
Different experts for different input types:
// Fast expert (small, quick)
fastExpert := nn.InitDenseLayer(64, 32, nn.ActivationReLU)
// Deep expert (more layers, better quality)
deepExpert := nn.InitSequentialLayer()
deepExpert.ParallelBranches = []nn.LayerConfig{
nn.InitDenseLayer(64, 128, nn.ActivationReLU),
nn.InitDenseLayer(128, 64, nn.ActivationReLU),
nn.InitDenseLayer(64, 32, nn.ActivationReLU),
}
// Memory expert (LSTM for temporal patterns)
memoryExpert := nn.InitLSTMLayer(64, 32, 1, 10)
// Combine with gating
moe := nn.InitFilteredParallelLayer(
[]nn.LayerConfig{fastExpert, deepExpert, memoryExpert},
64, nn.SoftmaxTemperature, 0.5, // Low temp = sharper routing
)
Nested Parallel Layers
Parallel layers can contain other parallel layers:
Input
│
┌───────────┼───────────┐
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Parallel │ │ Dense │ │ LSTM │
│ (MoE) │ │ │ │ │
│ ┌─┬─┬─┐ │ │ │ │ │
│ │E│E│E│ │ │ │ │ │
│ └─┴─┴─┘ │ │ │ │ │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└────────────┼────────────┘
│
Combine
// Inner parallel (level 1 MoE)
innerMoE := nn.InitFilteredParallelLayer(
[]nn.LayerConfig{
nn.InitDenseLayer(64, 32, nn.ActivationReLU),
nn.InitDenseLayer(64, 32, nn.ActivationReLU),
nn.InitDenseLayer(64, 32, nn.ActivationReLU),
},
64, nn.SoftmaxStandard, 1.0,
)
// Outer parallel (combines MoE with other branches)
outerParallel := nn.InitParallelLayer()
outerParallel.CombineMode = "concat"
outerParallel.ParallelBranches = []nn.LayerConfig{
innerMoE, // Nested MoE
nn.InitDenseLayer(64, 32, nn.ActivationReLU), // Simple dense
nn.InitLSTMLayer(64, 32, 1, 10), // LSTM
}
How Gradients Flow
Gradients flow through parallel layers differently based on combine mode:
Concat Mode
Gradient splits by output regions:
gradOutput = [g0, g1, g2, g3, g4, g5, g6, g7, g8]
│ │ │
▼ ▼ ▼
Branch 0 Branch 1 Branch 2
gets gets gets
[g0,g1,g2] [g3,g4] [g5,g6,g7,g8]
Add Mode
Each branch gets full gradient:
gradOutput = [g0, g1, g2]
│
┌─────────────┼─────────────┐
▼ ▼ ▼
Branch 0 Branch 1 Branch 2
[g0,g1,g2] [g0,g1,g2] [g0,g1,g2]
All branches receive identical gradients.
Filter Mode
Each branch gets gradient weighted by its gate value:
gradOutput = [g0, g1, g2]
gateWeights = [0.6, 0.3, 0.1]
Branch 0: [g0*0.6, g1*0.6, g2*0.6]
Branch 1: [g0*0.3, g1*0.3, g2*0.3]
Branch 2: [g0*0.1, g1*0.1, g2*0.1]
Gate also gets gradients to learn better routing.
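The three routing rules, written out in plain Go (illustration only, not Loom's internals):

// concat: each branch receives the slice of the gradient matching its output region.
func splitConcatGrad(gradOut []float32, branchSizes []int) [][]float32 {
	grads := make([][]float32, len(branchSizes))
	offset := 0
	for b, size := range branchSizes {
		grads[b] = gradOut[offset : offset+size]
		offset += size
	}
	return grads
}

// add: every branch receives the full output gradient unchanged.
func splitAddGrad(gradOut []float32, numBranches int) [][]float32 {
	grads := make([][]float32, numBranches)
	for b := range grads {
		grads[b] = gradOut
	}
	return grads
}

// filter: each branch's gradient is scaled by its gate weight.
func splitFilterGrad(gradOut []float32, gateWeights []float32) [][]float32 {
	grads := make([][]float32, len(gateWeights))
	for b, w := range gateWeights {
		grads[b] = make([]float32, len(gradOut))
		for i, g := range gradOut {
			grads[b][i] = w * g
		}
	}
	return grads
}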
Auto-Padding in Filter Mode
Filter mode automatically pads smaller outputs to match the largest:
Branch 0 output: [1.0, 2.0, 3.0, 4.0] (4 values)
Branch 1 output: [5.0, 6.0] (2 values)
Branch 2 output: [7.0, 8.0, 9.0] (3 values)
After auto-padding:
Branch 0: [1.0, 2.0, 3.0, 4.0] (unchanged)
Branch 1: [5.0, 6.0, 0.0, 0.0] (padded with zeros)
Branch 2: [7.0, 8.0, 9.0, 0.0] (padded with zeros)
Now weighted sum works element-wise.
This allows mixing branches of different output sizes in filter mode.
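The padding step itself is straightforward; a plain-Go illustration (not Loom's internals):

// Every branch output is zero-padded to the length of the largest before the weighted sum.
func padToMax(outputs [][]float32) [][]float32 {
	maxLen := 0
	for _, out := range outputs {
		if len(out) > maxLen {
			maxLen = len(out)
		}
	}
	padded := make([][]float32, len(outputs))
	for b, out := range outputs {
		padded[b] = make([]float32, maxLen) // zero-initialized
		copy(padded[b], out)
	}
	return padded
}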
Sequential Layers
For completeness: Sequential layers run branches in order rather than in parallel:
Input → Branch0 → Branch1 → Branch2 → Output
Each branch's output becomes the next branch's input.
sequential := nn.InitSequentialLayer()
sequential.ParallelBranches = []nn.LayerConfig{
nn.InitDenseLayer(64, 128, nn.ActivationReLU),
nn.InitDenseLayer(128, 64, nn.ActivationReLU),
nn.InitDenseLayer(64, 32, nn.ActivationReLU),
}
This is useful inside parallel branches when you want a multi-layer expert.
Observers for Parallel Layers
You can attach observers to monitor each branch:
parallel.Observer = nn.NewConsoleObserver()
// During forward pass, you'll see output from each branch:
// [Branch 0] Dense: 64 → 32, output mean=0.12, std=0.45
// [Branch 1] LSTM: 64 → 32, output mean=0.08, std=0.32
// [Branch 2] MHA: 64 → 64, output mean=0.15, std=0.51
Practical Example: Multi-Agent System
// Input: game state (64 features)
// Output: 3 agents × 4 actions = 12 action probabilities
network := nn.NewNetwork(64, 1, 1, 3)
// Shared feature extraction
network.SetLayer(0, 0, 0, nn.InitDenseLayer(64, 32, nn.ActivationReLU))
// Parallel agent heads - each agent has different architecture
agentHeads := nn.InitParallelLayer()
agentHeads.CombineMode = "concat"
agentHeads.ParallelBranches = []nn.LayerConfig{
// Agent 0: Fast reactive (Dense)
nn.InitDenseLayer(32, 4, nn.ActivationReLU),
// Agent 1: Memory-based (LSTM)
nn.InitSequentialLayer(), // Contains LSTM + Dense
// Agent 2: Attention-based (MHA + Dense)
nn.InitSequentialLayer(), // Contains MHA + Dense
}
// Configure Agent 1's sequential branch
agentHeads.ParallelBranches[1].ParallelBranches = []nn.LayerConfig{
nn.InitLSTMLayer(32, 16, 1, 5),
nn.InitDenseLayer(16, 4, nn.ActivationReLU),
}
// Configure Agent 2's sequential branch
agentHeads.ParallelBranches[2].ParallelBranches = []nn.LayerConfig{
nn.InitMultiHeadAttentionLayer(32, 2, 1, 8),
nn.InitDenseLayer(32, 4, nn.ActivationReLU),
}
network.SetLayer(0, 0, 1, agentHeads)
// Grid softmax: 3 agents × 4 actions
network.SetLayer(0, 0, 2, nn.InitGridSoftmaxLayer(3, 4))
// Each agent now has:
// - Shared feature extraction (layer 0)
// - Specialized decision making (parallel branches)
// - Independent action probabilities (grid softmax)
Summary
Parallel layers enable:
Multiple Branch Types
- Mix Dense, Conv, LSTM, Attention, etc.
- Each branch can have a different architecture
Combine Modes
- concat: Concatenate all outputs
- add: Element-wise sum (same size required)
- avg: Element-wise average
- grid_scatter: Place at 2D/3D positions
- filter: Softmax-gated weighted sum (MoE)
Nesting
- Parallel can contain Parallel
- Sequential branches for multi-layer experts
- Hierarchical MoE architectures
Auto-Features
- Auto-padding for filter mode
- Gradient routing handled automatically
- Observer support for debugging
Use parallel layers to build ensemble models, mixture of experts, multi-modal fusion, and agent-based systems.