Understanding Layer Types
This guide explains what each layer type actually does—not just the math, but the intuition behind why these layers exist and when you'd use them.
Dense Layers: The Universal Connector
A Dense layer (also called "fully connected" or "linear") is the simplest and most fundamental building block. Every input connects to every output.
What It Actually Does
Imagine you have 4 inputs and want 3 outputs:
Inputs Outputs
x₁ ─────┬────┬────┬────▶ y₁
│ │ │
x₂ ─────┼────┼────┼────▶ y₂
│ │ │
x₃ ─────┼────┼────┼────▶ y₃
│ │ │
x₄ ─────┴────┴────┴────▶ (every input connects to every output)
Total connections: 4 × 3 = 12 weights
Plus 3 biases (one per output)
Each output is computed as:
y₁ = activation( w₁₁×x₁ + w₁₂×x₂ + w₁₃×x₃ + w₁₄×x₄ + b₁ )
y₂ = activation( w₂₁×x₁ + w₂₂×x₂ + w₂₃×x₃ + w₂₄×x₄ + b₂ )
y₃ = activation( w₃₁×x₁ + w₃₂×x₂ + w₃₃×x₃ + w₃₄×x₄ + b₃ )
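To make the arithmetic concrete, here is a minimal NumPy sketch of this 4-input, 3-output Dense layer. The random weights and the tanh activation are illustrative assumptions, not any particular library's defaults:

```python
import numpy as np

def dense_forward(x, W, b, activation=np.tanh):
    # y = activation(W @ x + b): every input contributes to every output.
    return activation(W @ x + b)

x = np.array([1.0, 2.0, 3.0, 4.0])     # 4 inputs
W = np.random.randn(3, 4) * 0.1        # 3 outputs × 4 inputs = 12 weights
b = np.zeros(3)                        # one bias per output

y = dense_forward(x, W, b)
print(y.shape)                         # (3,)
```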
Why It's Called "Dense"
Because the weight matrix is dense—every possible connection exists. This is the opposite of sparse connections (like Conv2D) where only local neighborhoods connect.
The Weight Matrix Visualized
Input features (1024)
┌─────────────────────────┐
│ │
▼ ▼
┌──────────────────────────────┐
Row 0: │ w₀,₀ w₀,₁ w₀,₂ ... w₀,₁₀₂₃│ → Output 0
Row 1: │ w₁,₀ w₁,₁ w₁,₂ ... w₁,₁₀₂₃│ → Output 1
Row 2: │ w₂,₀ w₂,₁ w₂,₂ ... w₂,₁₀₂₃│ → Output 2
... │ ... ... ... ... │
Row 511: │ w₅₁₁,₀ ... w₅₁₁,₁₀₂₃│ → Output 511
└──────────────────────────────┘
Matrix shape: [512 outputs × 1024 inputs] = 524,288 weights
Each row computes one output neuron.
Each column represents how much one input affects all outputs.
When to Use Dense Layers
- Classification heads: Map features to class probabilities
- Fully connected networks: Simple stacked architectures
- Dimensionality changes: Go from 1024 features to 256, or vice versa
- After flattening: Following Conv2D layers in CNNs
Conv2D Layers: Finding Patterns in Images
Convolutional layers look for local patterns that can appear anywhere in an image. Instead of connecting everything to everything, they slide a small "kernel" (also called "filter") across the image.
The Key Insight
Consider detecting a vertical edge. This pattern:
Dark | Light | Dark
0 | 1 | 0
0 | 1 | 0
0 | 1 | 0
Can appear anywhere in an image. A Dense layer would need to learn separate weights for each position. A Conv2D layer learns the pattern once and slides it across the entire image.
How Convolution Works
Input Image (4×4): Kernel (3×3):
┌─────┬─────┬─────┬─────┐ ┌─────┬─────┬─────┐
│ 1 │ 2 │ 3 │ 4 │ │ a │ b │ c │
├─────┼─────┼─────┼─────┤ ├─────┼─────┼─────┤
│ 5 │ 6 │ 7 │ 8 │ │ d │ e │ f │
├─────┼─────┼─────┼─────┤ ├─────┼─────┼─────┤
│ 9 │ 10 │ 11 │ 12 │ │ g │ h │ i │
├─────┼─────┼─────┼─────┤ └─────┴─────┴─────┘
│ 13 │ 14 │ 15 │ 16 │
└─────┴─────┴─────┴─────┘
Step 1: Place kernel at top-left
┌─────┬─────┬─────┐─────┐
│ 1 │ 2 │ 3 │ 4 │ Output[0,0] =
├─────┼─────┼─────┤─────┤ 1×a + 2×b + 3×c +
│ 5 │ 6 │ 7 │ 8 │ 5×d + 6×e + 7×f +
├─────┼─────┼─────┤─────┤ 9×g + 10×h + 11×i
│ 9 │ 10 │ 11 │ 12 │
└─────┴─────┴─────┘─────┘
Step 2: Slide kernel one position right
┌─────┬─────┬─────┬─────┐
│ 1 │ 2 │ 3 │ 4 │ Output[0,1] =
├─────┼─────┼─────┼─────┤ 2×a + 3×b + 4×c +
│ 5 │ 6 │ 7 │ 8 │ 6×d + 7×e + 8×f +
├─────┼─────┼─────┼─────┤     10×g + 11×h + 12×i
│  9  │ 10  │ 11  │ 12  │
└─────┴─────┴─────┴─────┘
Continue until covering entire image...
Output (2×2):
┌─────┬─────┐
│ o₀₀ │ o₀₁ │
├─────┼─────┤
│ o₁₀ │ o₁₁ │
└─────┴─────┘
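The sliding-window computation above is easy to verify numerically. The sketch below is plain NumPy, with a kernel of all ones standing in for the symbolic weights a..i:

```python
import numpy as np

def conv2d_single(image, kernel):
    # "Valid" cross-correlation (the convolution used in deep learning):
    # slide the kernel over every position where it fully fits.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(1, 17, dtype=float).reshape(4, 4)   # the 4×4 input above
kernel = np.ones((3, 3))                              # stand-in for a..i
print(conv2d_single(image, kernel))                   # 2×2 output
```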
Stride and Padding
Stride controls how far the kernel moves each step:
Stride = 1: Kernel moves 1 pixel at a time (detailed output)
Stride = 2: Kernel moves 2 pixels at a time (smaller output)
Stride 1 (kernel start positions):     Stride 2 (kernel start positions):
Step 0: ■□□□□                          Step 0: ■□□□□
Step 1: □■□□□                          Step 1: □□■□□
Step 2: □□■□□
Step 3: □□□■□
Padding adds pixels around the edges:
Without Padding: With Padding (1 pixel):
Input: 4×4 Input: 4×4 → Padded: 6×6
Kernel: 3×3 Kernel: 3×3
Output: 2×2 Output: 4×4 (same as input!)
┌───┬───┬───┬───┬───┬───┐
│ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
├───┼───┼───┼───┼───┼───┤
│ 0 │ │ │ │ │ 0 │
├───┤ Input │───┤
│ 0 │ 4×4 │ 0 │
├───┤ │───┤
│ 0 │ │ │ │ │ 0 │
├───┼───┼───┼───┼───┼───┤
│ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
└───┴───┴───┴───┴───┴───┘
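The output size follows directly from kernel size, stride, and padding. A small helper function makes the relationship explicit (a sketch, not any library's API):

```python
def conv_output_size(in_size, kernel_size, stride=1, padding=0):
    # Number of positions where the kernel fully fits on the padded input.
    return (in_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(4, 3))               # 2  (4×4 input, 3×3 kernel, no padding)
print(conv_output_size(4, 3, padding=1))    # 4  ("same" output size as the input)
print(conv_output_size(28, 3))              # 26 (matches the 28×28 → 26×26 example below)
```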
Multiple Filters = Multiple Feature Maps
Real Conv2D layers have multiple filters, each detecting a different pattern:
Input Image Filters Feature Maps
(28×28×1) (3×3×1 each) (26×26×32)
│ ┌──────────┐ Output
│ │ Filter 0 │───▶ Edge detector
│ ├──────────┤
└───────────────────▶│ Filter 1 │───▶ Blob detector 32 different
├──────────┤ feature maps
│ Filter 2 │───▶ Corner detector
├──────────┤
│ ... │
├──────────┤
│Filter 31 │───▶ Some pattern
└──────────┘
Conv1D Layers: Local Patterns in Sequences
Conv1D is the 1D version of convolution. Instead of sliding a 2D kernel across an image, it slides a 1D kernel along a sequence (audio, time series, token embeddings). The key idea is the same: learn a local pattern once and reuse it everywhere.
Input (channels × seqLen) Kernel (channels × kernelSize)
┌───────────────┐ ┌───────────────┐
│ c0: 1 2 3 4 5 │ ─────▶ │ c0: a b c │
│ c1: 6 7 8 9 0 │ │ c1: d e f │
└───────────────┘ └───────────────┘
Output length:
outLen = floor((seqLen + 2*padding - kernelSize) / stride) + 1
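A minimal sketch of a single Conv1D filter over a 2-channel sequence, using the values from the diagram above (the all-ones kernel is illustrative):

```python
import numpy as np

def conv1d_single(x, kernel, stride=1, padding=0):
    # x: [channels, seqLen], kernel: [channels, kernelSize].
    # One filter: multiply-and-sum across ALL channels at each window position.
    if padding:
        x = np.pad(x, ((0, 0), (padding, padding)))
    channels, seq_len = x.shape
    k = kernel.shape[1]
    out_len = (seq_len - k) // stride + 1
    out = np.zeros(out_len)
    for t in range(out_len):
        start = t * stride
        out[t] = np.sum(x[:, start:start+k] * kernel)
    return out

x = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 0]], dtype=float)   # channels × seqLen = 2 × 5
kernel = np.ones((2, 3))                        # channels × kernelSize = 2 × 3
print(conv1d_single(x, kernel))                 # outLen = (5 - 3)/1 + 1 = 3
```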
Use Conv1D when:
- Sequences have local structure (phonemes, n-grams, sensor spikes)
- You want translation invariance along time
- You need efficient feature extraction before RNN/Attention
Embedding Layers: Token Lookup Tables
Embeddings convert discrete IDs (tokens) into dense vectors. Think of it as a learned dictionary:
Embedding table: [vocabSize × embeddingDim]
Token ID 42 → row 42 → vector of length embeddingDim
Only the rows used in the current batch receive gradients. This makes embedding updates efficient even with huge vocabularies.
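An embedding layer is literally a row lookup into a trainable matrix. A minimal NumPy sketch (the vocabulary size and embedding dimension here are made up for illustration):

```python
import numpy as np

vocab_size, embedding_dim = 1000, 8
table = np.random.randn(vocab_size, embedding_dim) * 0.02   # [vocabSize × embeddingDim]

token_ids = np.array([42, 7, 42, 999])      # a batch of token IDs
vectors = table[token_ids]                  # row lookup: shape [4, 8]
print(vectors.shape)

# During training, only rows 42, 7, and 999 would receive gradient updates.
```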
Use embeddings when:
- Inputs are categorical (tokens, IDs, symbols)
- You need a learned representation before sequence models
- You want compact, trainable vector spaces
SwiGLU: Gated MLP Blocks
SwiGLU is a modern gated feedforward block used in LLMs. It uses two projections plus a gating nonlinearity:
gate = SiLU(x · W_gate + b_gate)
up = x · W_up + b_up
out = (gate ⊙ up) · W_down + b_down
This gating tends to train better than a plain ReLU MLP, especially in deep transformer stacks.
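The three projections translate directly from the formulas above. A minimal NumPy sketch, with biases omitted for brevity (many implementations drop them) and illustrative dimensions:

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_block(x, W_gate, W_up, W_down):
    gate = silu(x @ W_gate)        # [d_model] -> [d_hidden]
    up = x @ W_up                  # [d_model] -> [d_hidden]
    return (gate * up) @ W_down    # gated product projected back to [d_model]

d_model, d_hidden = 16, 64
x = np.random.randn(d_model)
W_gate = np.random.randn(d_model, d_hidden) * 0.1
W_up   = np.random.randn(d_model, d_hidden) * 0.1
W_down = np.random.randn(d_hidden, d_model) * 0.1
print(swiglu_block(x, W_gate, W_up, W_down).shape)   # (16,)
```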
Use SwiGLU when:
- Building transformer-style feedforward blocks
- You want stronger expressivity than a single dense layer
Multi-Head Attention: Learning What to Focus On
Attention is the mechanism that powers Transformers. It lets the network decide which parts of the input are relevant for each output.
The Core Idea
Imagine reading the sentence: "The cat sat on the mat because it was tired."
When interpreting "it", you need to figure out what "it" refers to. Attention learns to look back at "cat" when processing "it".
Input sequence: [The] [cat] [sat] [on] [the] [mat] [because] [it] [was] [tired]
↑ ↑ |
└─────┴──────────────────────────────────────┘
"it" attends strongly to "cat"
Query, Key, Value: The Mechanism
Attention uses three projections of each token:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
Each token gets transformed into Q, K, and V vectors:
Token "it" Token "cat"
│ │
▼ ▼
┌───────┐ ┌───────┐
│ Q: "?" │ │ K: "animal"│
│ K: "pronoun"│ │ V: cat info │
│ V: "it"│ └───────┘
└───────┘
│ │
│ Attention │
│ Computation: │
│ │
│ score = Q · K │
│ weight = softmax(score)
│ output = weight × V
│ │
▼ ▼
"it" looks at "cat" with high weight,
retrieves cat's information
Multi-Head: Multiple Perspectives
"Multi-head" means running multiple attention computations in parallel, each learning different relationships:
Input: [batch, sequence, 512]
│
│ Split into 8 heads (512 ÷ 8 = 64 dims each)
▼
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│Head 0│Head 1│Head 2│Head 3│Head 4│Head 5│Head 6│Head 7│
│64d │64d │64d │64d │64d │64d │64d │64d │
│ │ │ │ │ │ │ │ │
│syntax│coref │topic │entity│tense │... │... │... │
└──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┘
│ │ │ │ │ │ │ │
└──────┴──────┴──────┼──────┴──────┴──────┴──────┘
│
▼ Concatenate
[batch, seq, 512]
│
▼ Output projection
[batch, seq, 512]
Each head can specialize: one might track grammatical relationships, another coreference, another topic similarity.
The Attention Formula
Attention(Q, K, V) = softmax(Q × Kᵀ / √d) × V
Where:
- Q × Kᵀ produces [seq, seq] matrix of "how much does each token attend to each other"
- √d is a scaling factor to prevent dot products from getting too large
- softmax normalizes each row to sum to 1 (probabilistic attention)
- × V retrieves the actual information based on attention weights
Visual:
Keys (all tokens)
┌───┬───┬───┬───┐
Queries │0.1│0.7│0.1│0.1│ "it" → mostly attends to "cat"
(all ├───┼───┼───┼───┤
tokens) │0.3│0.3│0.2│0.2│ "cat" → attends somewhat evenly
├───┼───┼───┼───┤
│0.2│0.2│0.3│0.3│ "sat" → attends to later tokens
└───┴───┴───┴───┘
Each row sums to 1.0 (softmax normalization)
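The whole formula fits in a few lines. The sketch below implements single-head scaled dot-product attention in NumPy; the random projections and shapes are illustrative only:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # [seq, seq]: how much each token attends to each other
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of the values

seq_len, d = 10, 64                     # e.g. the 10-token sentence above
Q = np.random.randn(seq_len, d)
K = np.random.randn(seq_len, d)
V = np.random.randn(seq_len, d)
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))  # (10, 64), every row sums to 1.0
```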
RNN: Remembering the Past
Recurrent Neural Networks process sequences by maintaining a "hidden state" that summarizes everything seen so far.
The Key Insight
Dense layers have no memory—they process each input independently. RNNs carry information forward through time.
Without RNN (Dense):
Token 1 → [Dense] → Output 1 (no connection)
Token 2 → [Dense] → Output 2 (no connection)
Token 3 → [Dense] → Output 3 (can't see tokens 1 or 2!)
With RNN:
Token 1 → [RNN] → Output 1
│
▼ hidden state
Token 2 → [RNN] → Output 2
│
▼ hidden state carries info from tokens 1-2
Token 3 → [RNN] → Output 3 ← can "remember" earlier tokens!
How the Hidden State Works
At each time step, the RNN combines:
1. The current input
2. The previous hidden state
Time step t:
Previous hidden state
h_{t-1}
│
▼
┌───────────────────┐
Current ───▶ │ │ ───▶ Output y_t
input x_t │ h_t = tanh( │
│ W_ih × x_t + │
│ W_hh × h_{t-1}│
│ + bias ) │
└─────────┬─────────┘
│
▼
New hidden state h_t
(passed to next step)
Unrolled Through Time
Sequence: [x₁, x₂, x₃, x₄]
x₁ x₂ x₃ x₄
│ │ │ │
▼ ▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│ RNN │→│ RNN │→│ RNN │→│ RNN │
│ Cell │ │ Cell │ │ Cell │ │ Cell │
└──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘
│ │ │ │
▼ ▼ ▼ ▼
y₁ y₂ y₃ y₄
h₀→ h₁→ h₂→ h₃→ h₄
(init) (final)
The same weights are used at every time step—the network "time-shares" its parameters.
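A single RNN step is just the equation inside the box above, and the same weights are reused at every step. A minimal NumPy sketch with illustrative dimensions:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_ih, W_hh, b):
    # h_t = tanh(W_ih @ x_t + W_hh @ h_{t-1} + b)
    return np.tanh(W_ih @ x_t + W_hh @ h_prev + b)

input_dim, hidden_dim, seq_len = 4, 8, 5
W_ih = np.random.randn(hidden_dim, input_dim) * 0.1
W_hh = np.random.randn(hidden_dim, hidden_dim) * 0.1
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                         # h₀
for x_t in np.random.randn(seq_len, input_dim):  # same weights at every time step
    h = rnn_step(x_t, h, W_ih, W_hh, b)
print(h.shape)                                   # final hidden state
```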
LSTM: Solving the Vanishing Gradient Problem
Standard RNNs struggle with long sequences. Gradients either vanish (become tiny) or explode (become huge) when backpropagating through many time steps.
LSTM (Long Short-Term Memory) solves this with gates—learned mechanisms that control information flow.
The Three Gates
LSTM Cell:
┌─────────────────────────────────────────┐
│ │
┌─────────────────┼──────────────────────────────────────┐ │
│ ▼ │ │
│ ┌──────────────────────────────────────────────┐ │ │
│ │ Cell State (cₜ) │ │ │
│ │ "The highway for information" │ │ │
│ └───────┬──────────┬──────────┬────────────────┘ │ │
│ │ │ │ │ │
│ │ ▼ │ │ │
│ ┌─────▼─────┐ ┌───┐ ┌─────▼─────┐ │ │
│ │ Forget │ │ + │ │ Input │ │ │
│ │ Gate │ │ │ │ Gate │ │ │
│ │ (what to │ └─┬─┘ │ (what to │ │ │
│ │ forget) │ │ │ add) │ │ │
│ └─────┬─────┘ │ └─────┬─────┘ │ │
│ │ │ │ │ │
│ │ ┌─────┴─────┐ │ │ │
│ │ │ tanh │ │ │ │
│ ▼ │ (new info)│ ▼ │ │
│ ┌───────┐ └─────┬─────┘ ┌───────┐ │ │
│ │ × │ │ │ × │ │ │
│ └───┬───┘ │ └───┬───┘ │ │
│ │ │ │ │ │
│ └────────────┼────────────┘ │ │
│ │ │ │
│ ▼ │ │
│ ┌───────────┐ │ │
│ │ Output │ │ │
│ │ Gate │◀──────────────────────────┘ │
│ │ (what to │ │
│ │ output) │ │
│ └─────┬─────┘ │
│ │ │
│ ▼ │
│ Hidden State (hₜ) │
│ │ │
└──────────────────────┼────────────────────────────────────┘
│
▼
Output
What Each Gate Does
Forget Gate: "What should I throw away from the cell state?"
f_t = σ(W_f × [h_{t-1}, x_t] + b_f)
Output: values between 0 and 1
- 0 means "forget completely"
- 1 means "remember completely"
Example: Processing a period "." might signal to forget the subject
of the previous sentence.
Input Gate: "What new information should I store?"
i_t = σ(W_i × [h_{t-1}, x_t] + b_i) ← how much to add
g_t = tanh(W_g × [h_{t-1}, x_t] + b_g) ← what to add
Example: Seeing a new subject noun might store that in the cell state.
Output Gate: "What should I output based on the cell state?"
o_t = σ(W_o × [h_{t-1}, x_t] + b_o)
h_t = o_t × tanh(c_t)
Example: When generating a verb, output information about the subject
(for agreement) but not necessarily everything in the cell.
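Putting the three gates together, one LSTM step looks like this in NumPy. This is a sketch of the standard equations, using a single fused weight matrix over [h_{t-1}, x_t] for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W projects [h_{t-1}, x_t] to all four gate pre-activations at once.
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    g = np.tanh(g)                                 # candidate new information
    c_t = f * c_prev + i * g                       # update the cell-state "highway"
    h_t = o * np.tanh(c_t)                         # expose part of it as the hidden state
    return h_t, c_t

input_dim, hidden_dim = 4, 8
W = np.random.randn(4 * hidden_dim, hidden_dim + input_dim) * 0.1
b = np.zeros(4 * hidden_dim)
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, c = lstm_step(np.random.randn(input_dim), h, c, W, b)
print(h.shape, c.shape)   # (8,), (8,)
```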
Why This Solves Vanishing Gradients
The cell state has a "highway" path through time:
c₀ ──×f₁──+──×f₂──+──×f₃──+──×f₄──+── c₄
The gradient can flow through the + operations almost unchanged!
Unlike vanilla RNN where gradients must pass through tanh repeatedly.
KMeans Layer: Learnable Concept Prototypes
The KMeans layer is a differentiable clustering module that organizes data into "concepts" or "prototypes." Unlike standard layers that learn abstract weights, a KMeans layer learns geometric centers in a feature space.
How It Works
- Feature Extraction: The input passes through an attached sub-network (which can be any layer: Dense, Conv, etc.) to extract features ($z$).
- Distance Calculation: The layer computes the distance between the extracted features ($z$) and a set of learnable cluster centers ($C$).
- Soft Assignment: Distances are converted into probabilities (or feature assignments) via a specialized softmax: $$P(\text{cluster } k) = \frac{\exp(-\text{dist}(z, c_k) / \tau)}{\sum_j \exp(-\text{dist}(z, c_j) / \tau)}$$
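The soft assignment is a temperature-scaled softmax over negative distances. The sketch below shows only that arithmetic in NumPy; the Euclidean distance, the dimensions, and the cluster centers are illustrative assumptions, not Loom's actual configuration:

```python
import numpy as np

def soft_assign(z, centers, tau=1.0):
    # z: [dim] extracted features, centers: [K, dim] learnable prototypes
    dists = np.linalg.norm(centers - z, axis=1)   # distance to each center
    logits = -dists / tau                         # closer center → larger logit
    e = np.exp(logits - logits.max())
    return e / e.sum()                            # P(cluster k), sums to 1

z = np.array([0.9, 0.1])
centers = np.array([[1.0, 0.0],    # prototype "A" (closest)
                    [0.0, 1.0],    # prototype "B"
                    [5.0, 5.0]])   # far-away prototype → near-zero probability
print(soft_assign(z, centers, tau=0.5))
```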
The Power of Recursion
KMeans layers in Loom are recursive. You can use a KMeans layer as the attached sub-network for another KMeans layer.
Input ──▶ [ KMeans L1: Finds "Edges" ] ──▶ [ KMeans L2: Finds "Shapes" ] ──▶ Output
Output Modes
- probabilities: Outputs a probability distribution over the $K$ clusters (the "assignment").
- features: Outputs the actual feature vector extracted by the sub-network.
- reconstruction: (Experimental) Outputs the coordinates of the nearest cluster center.
When to Use KMeans Layers
- Interpretability: The learned cluster centers are actual points in the feature space that represent prototypical examples.
- Hierarchical Classification: Building "Concept Taxonomies" (e.g., Species inside Kingdoms).
- Out-of-Distribution Detection: Large distances to all learned clusters indicate "unknown" data.
- Neuro-Symbolic Reasoning: Bridging continuous neural features with discrete symbolic categories.
Softmax: Turning Numbers into Probabilities
Softmax converts a vector of arbitrary real numbers into a probability distribution (values between 0 and 1 that sum to 1).
The Basic Transformation
Input (logits): [2.0, 1.0, 0.1]
│
▼ exp(each value)
[7.39, 2.72, 1.11]
│
▼ divide by sum (11.22)
Output: [0.66, 0.24, 0.10]
─────────────────
sums to 1.0 ✓
The largest input gets the largest probability, but all outputs are positive and normalized.
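The transformation is a one-liner to check. This NumPy sketch reproduces the numbers above (nothing library-specific):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # ≈ [0.66, 0.24, 0.10]
```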
Why Not Just Normalize Directly?
The exponential has important properties:
1. Always positive: Even negative inputs become positive
2. Amplifies differences: Large inputs dominate
3. Smooth gradients: Differentiable everywhere
Without exp:                         With exp (softmax):
[2, 1, -1] → ?                       [2, 1, -1] → [7.39, 2.72, 0.37]
Can't normalize sensibly when        All positive! Can normalize.
there are negatives.

[10, 10, 10] → [.33, .33, .33]       Equal values → equal probs ✓
[100, 0, 0]  → [≈1.0, ≈0.0, ≈0.0]    Large diff → confident prediction
Loom's 10 Softmax Variants
Loom treats Softmax as a first-class layer with multiple variants:
Standard Softmax:
[logits] → [probabilities that sum to 1]
Grid Softmax (Native Mixture of Experts!):
┌──────────────────────────────────────┐
│ Expert 0: [0.1, 0.2, 0.3, 0.4] = 1 │
│ Expert 1: [0.5, 0.2, 0.2, 0.1] = 1 │ Each ROW sums to 1
│ Expert 2: [0.1, 0.1, 0.1, 0.7] = 1 │ independently
└──────────────────────────────────────┘
Temperature Softmax:
- Low temperature (0.1): Sharp, confident, picks one option
- High temperature (2.0): Smooth, uncertain, spreads probability
Masked Softmax:
[logits] + [mask: True, False, True, True]
→ [0.33, 0.00, 0.33, 0.34]
Masked positions get zero probability
(Useful for: legal moves in games)
Sparsemax:
Like softmax, but can produce exact zeros
[logits] → [0.6, 0.4, 0.0, 0.0]
Interpretable! Only a few options selected.
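Two of these variants are small tweaks to the same function. A sketch of temperature scaling and masking (the mask convention here, True = keep, follows the example above and is an assumption about intent, not a specific API):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def temperature_softmax(logits, temperature=1.0):
    # Low temperature sharpens the distribution, high temperature flattens it.
    return softmax(logits / temperature)

def masked_softmax(logits, mask):
    # Masked-out positions (mask == False) get -inf logits → exactly zero probability.
    return softmax(np.where(mask, logits, -np.inf))

logits = np.array([2.0, 1.0, 0.1])
print(temperature_softmax(logits, 0.1))   # sharp, near one-hot
print(temperature_softmax(logits, 2.0))   # smooth, spread out
print(masked_softmax(np.zeros(4), np.array([True, False, True, True])))  # [1/3, 0, 1/3, 1/3]
```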
Normalization Layers: Keeping Activations Stable
As data flows through many layers, values can drift—becoming very large or very small. Normalization layers re-center and re-scale activations.
Layer Normalization
Normalizes across the feature dimension for each sample independently:
Input: [batch, features]
┌────────────────────────────┐
Sample 0│ 100 │ -50 │ 25 │ 75 │ │ ← Normalize this row
├────────────────────────────┤
Sample 1│ 10 │ 20 │ 30 │ 40 │ │ ← Normalize this row
├────────────────────────────┤
Sample 2│ 0.1 │ 0.2 │ 0.3 │ 0.4 │ │ ← Normalize this row
└────────────────────────────┘
For each sample:
1. Compute mean: μ = (100 + -50 + 25 + 75) / 4 = 37.5
2. Compute std: σ = sqrt(variance) ≈ 57.3
3. Normalize: (x - μ) / σ
Output: values with mean≈0, std≈1
Plus learnable parameters:
output = γ × normalized + β
(γ and β are learned per feature)
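A minimal NumPy sketch of LayerNorm applied to Sample 0 from the diagram, with γ initialized to 1 and β to 0 as is standard:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each sample across its feature dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[100.0, -50.0, 25.0, 75.0]])     # Sample 0 from the diagram
gamma, beta = np.ones(4), np.zeros(4)
y = layer_norm(x, gamma, beta)
print(y, y.mean(), y.std())                    # mean ≈ 0, std ≈ 1
```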
RMS Normalization (Llama-style)
Like LayerNorm but simpler—only divides by root-mean-square, no mean subtraction:
rms = sqrt(mean(x²))
output = x / rms × γ
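The same idea without the mean subtraction, as a short sketch (γ again initialized to 1):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([[100.0, -50.0, 25.0, 75.0]])
print(rms_norm(x, np.ones(4)))
```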
Why use it?
- Slightly faster (no mean computation)
- Works well empirically for modern LLMs
- Used in Llama, Mistral, etc.
Why Normalization Matters
Without Normalization:               With Normalization:
Layer 1 output: ~100                 Layer 1 output: ~100 → norm → mean ≈ 0, std ≈ 1
Layer 2 output: ~10,000              Layer 2 output: stays small → norm → mean ≈ 0, std ≈ 1
Layer 3 output: ~1,000,000           Layer 3 output: stays small → norm → mean ≈ 0, std ≈ 1
...                                  ...
Values explode!                      Values stay controlled throughout
Training becomes unstable            Training is stable
Structural Layers: Composing Complex Architectures
Sequential Layer
Chains sub-layers one after another:
Sequential([Dense(512), ReLU(), Dense(256), ReLU(), Dense(10)])
Input
│
▼
┌────────────┐
│Dense(512) │
└─────┬──────┘
│
▼
┌────────────┐
│ ReLU │
└─────┬──────┘
│
▼
┌────────────┐
│Dense(256) │
└─────┬──────┘
│
▼
┌────────────┐
│ ReLU │
└─────┬──────┘
│
▼
┌────────────┐
│Dense(10) │
└─────┬──────┘
│
▼
Output
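Conceptually, a Sequential container just folds the input through its sub-layers in order. A Python sketch of the composition pattern (the stand-in layers here are hypothetical, not Loom's actual constructors):

```python
def sequential(layers):
    # Returns a function that pipes the input through each layer in order.
    def forward(x):
        for layer in layers:
            x = layer(x)
        return x
    return forward

# Hypothetical stand-in layers, just to show the chaining:
double = lambda x: x * 2
add_one = lambda x: x + 1
model = sequential([double, add_one, double])
print(model(3))   # ((3 * 2) + 1) * 2 = 14
```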
Parallel Layer
Runs multiple branches simultaneously, then combines results:
Input
│
┌──────┼──────┐
│ │ │
▼ ▼ ▼
┌─────┐┌─────┐┌─────┐
│LSTM ││Dense││Conv │ Three different "experts"
└──┬──┘└──┬──┘└──┬──┘ process the same input
│ │ │
└──────┼──────┘
│
▼
┌─────────┐
│ Combine │
└────┬────┘
│
▼
Output
Combine modes:
- concat: Concatenate all outputs [lstm_out, dense_out, conv_out]
- add: Element-wise sum
- avg: Element-wise average
- filter: Softmax-weighted combination (learned gating)
The filter mode is particularly interesting—it's a learned routing mechanism:
Filter mode (Soft Mixture of Experts):
Input
│
├──────────────────────────────┐
│ │
┌────▼────┐ ┌────▼────┐
│ Branch 0│ │ Gating │
│ (expert)│ │ Network │
└────┬────┘ └────┬────┘
│ │
┌────▼────┐ │
│ Branch 1│ ┌────▼────┐
│ (expert)│ │ Softmax │
└────┬────┘ │ [0.6, │
│ │ 0.3, │
┌────▼────┐ │ 0.1] │
│ Branch 2│ └────┬────┘
│ (expert)│ │
└────┬────┘ │
│ │
└───────────┬──────────────────┘
│
▼
0.6×branch0 + 0.3×branch1 + 0.1×branch2
│
▼
Output
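The filter combine mode is a softmax-weighted sum of the branch outputs. A NumPy sketch of the routing arithmetic, with the gating network reduced to a single learned matrix and the branches replaced by stand-in functions for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def parallel_filter(x, branches, W_gate):
    outputs = np.stack([branch(x) for branch in branches])   # [num_branches, dim]
    weights = softmax(W_gate @ x)                             # e.g. [0.6, 0.3, 0.1]
    return weights @ outputs                                  # weighted combination

dim = 4
x = np.random.randn(dim)
branches = [lambda v: np.tanh(v), lambda v: v * 2.0, lambda v: v + 1.0]  # stand-in "experts"
W_gate = np.random.randn(len(branches), dim) * 0.1
print(parallel_filter(x, branches, W_gate).shape)   # (4,)
```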
Residual Layer
Adds the input directly to the output (skip connection):
Input
│
┌────┴────┐
│ │
▼ │
┌────────┐ │
│ Sub │ │ Skip
│ Layers │ │ Connection
└───┬────┘ │
│ │
▼ │
┌───┐ │
│ + │◀──────┘
└─┬─┘
│
▼
Output = SubLayers(Input) + Input
Why this matters:
- Gradients can flow directly through the skip connection
- Makes very deep networks trainable
- If the sub-layers learn to output zero (contribute nothing), the block still passes the input through unchanged
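In code, the skip connection is a single addition. A sketch, with a stand-in for the wrapped sub-layers:

```python
import numpy as np

def residual(sub_layers, x):
    # Output = SubLayers(Input) + Input; shapes must match for the addition.
    return sub_layers(x) + x

x = np.random.randn(8)
sub = lambda v: np.tanh(v) * 0.1   # stand-in for the wrapped sub-layers
print(np.allclose(residual(lambda v: np.zeros_like(v), x), x))  # True: zero sub-layers → identity
print(residual(sub, x).shape)
```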
Summary: Choosing the Right Layer
| Task | Layer | Why |
|---|---|---|
| General feature transformation | Dense | Universal approximator |
| Image features | Conv2D | Locality, translation invariance |
| Sequence relationships | Attention | Long-range dependencies |
| Sequential memory | LSTM | Handles long sequences |
| Classification output | Softmax | Probabilities that sum to 1 |
| Training stability | LayerNorm/RMSNorm | Prevents value drift |
| Multiple experts | Parallel + Filter | Learned routing |
| Deep networks | Residual | Skip connections for gradient flow |