
Understanding Layer Types

This guide explains what each layer type actually does—not just the math, but the intuition behind why these layers exist and when you'd use them.


Dense Layers: The Universal Connector

A Dense layer (also called "fully connected" or "linear") is the simplest and most fundamental building block. Every input connects to every output.

What It Actually Does

Imagine you have 4 inputs and want 3 outputs:

Text
    Inputs                    Outputs

    x₁ ─────┬────┬────┬────▶ y₁
            │    │    │
    x₂ ─────┼────┼────┼────▶ y₂
            │    │    │
    x₃ ─────┼────┼────┼────▶ y₃
            │    │    │
    x₄ ─────┴────┴────┴────▶ (every input connects to every output)

    Total connections: 4 × 3 = 12 weights
    Plus 3 biases (one per output)

Each output is computed as:

Text
y₁ = activation( w₁₁×x₁ + w₁₂×x₂ + w₁₃×x₃ + w₁₄×x₄ + b₁ )
y₂ = activation( w₂₁×x₁ + w₂₂×x₂ + w₂₃×x₃ + w₂₄×x₄ + b₂ )
y₃ = activation( w₃₁×x₁ + w₃₂×x₂ + w₃₃×x₃ + w₃₄×x₄ + b₃ )

Why It's Called "Dense"

Because the weight matrix is dense—every possible connection exists. This is the opposite of sparse connections (like Conv2D) where only local neighborhoods connect.

The Weight Matrix Visualized

Text
               Input features (1024)
              ┌─────────────────────────┐
              │                         │
              ▼                         ▼
           ┌──────────────────────────────┐
Row 0:     │ w₀,₀  w₀,₁  w₀,₂  ...  w₀,₁₀₂₃│ → Output 0
Row 1:     │ w₁,₀  w₁,₁  w₁,₂  ...  w₁,₁₀₂₃│ → Output 1
Row 2:     │ w₂,₀  w₂,₁  w₂,₂  ...  w₂,₁₀₂₃│ → Output 2
  ...      │  ...   ...   ...        ...   │
Row 511:   │ w₅₁₁,₀ ...             w₅₁₁,₁₀₂₃│ → Output 511
           └──────────────────────────────┘

Matrix shape: [512 outputs × 1024 inputs] = 524,288 weights

Each row computes one output neuron.
Each column represents how much one input affects all outputs.
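
To make this concrete, here is a minimal NumPy sketch of the forward pass described above (the sizes and the ReLU activation are illustrative choices, not Loom's API):

Python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 1024 input features, 512 output neurons (as in the diagram).
in_features, out_features = 1024, 512

W = rng.normal(0, 0.02, size=(out_features, in_features))  # one row per output neuron
b = np.zeros(out_features)                                 # one bias per output

x = rng.normal(size=in_features)        # a single input vector

y = np.maximum(0, W @ x + b)            # dense layer followed by ReLU
print(y.shape)                          # (512,)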

When to Use Dense Layers

  • Classification heads: Map features to class probabilities
  • Fully connected networks: Simple stacked architectures
  • Dimensionality changes: Go from 1024 features to 256, or vice versa
  • After flattening: Following Conv2D layers in CNNs

Conv2D Layers: Finding Patterns in Images

Convolutional layers look for local patterns that can appear anywhere in an image. Instead of connecting everything to everything, they slide a small "kernel" (also called "filter") across the image.

The Key Insight

Consider detecting a vertical edge. This pattern:

Text
Dark | Light | Dark
  0  |   1   |  0
  0  |   1   |  0
  0  |   1   |  0

Can appear anywhere in an image. A Dense layer would need to learn separate weights for each position. A Conv2D layer learns the pattern once and slides it across the entire image.

How Convolution Works

Text
Input Image (4×4):                  Kernel (3×3):
┌─────┬─────┬─────┬─────┐          ┌─────┬─────┬─────┐
│  1  │  2  │  3  │  4  │          │ a   │ b   │ c   │
├─────┼─────┼─────┼─────┤          ├─────┼─────┼─────┤
│  5  │  6  │  7  │  8  │          │ d   │ e   │ f   │
├─────┼─────┼─────┼─────┤          ├─────┼─────┼─────┤
│  9  │ 10  │ 11  │ 12  │          │ g   │ h   │ i   │
├─────┼─────┼─────┼─────┤          └─────┴─────┴─────┘
│ 13  │ 14  │ 15  │ 16  │
└─────┴─────┴─────┴─────┘

Step 1: Place kernel at top-left

┌─────┬─────┬─────┬─────┐
│ [1] │ [2] │ [3] │  4  │     Output[0,0] =
├─────┼─────┼─────┼─────┤       1×a + 2×b + 3×c +
│ [5] │ [6] │ [7] │  8  │       5×d + 6×e + 7×f +
├─────┼─────┼─────┼─────┤       9×g + 10×h + 11×i
│ [9] │[10] │[11] │ 12  │
├─────┼─────┼─────┼─────┤
│ 13  │ 14  │ 15  │ 16  │     [ ] marks the cells under the kernel
└─────┴─────┴─────┴─────┘

Step 2: Slide kernel one position right

┌─────┬─────┬─────┬─────┐
│  1  │ [2] │ [3] │ [4] │     Output[0,1] =
├─────┼─────┼─────┼─────┤       2×a + 3×b + 4×c +
│  5  │ [6] │ [7] │ [8] │       6×d + 7×e + 8×f +
├─────┼─────┼─────┼─────┤       10×g + 11×h + 12×i
│  9  │[10] │[11] │[12] │
├─────┼─────┼─────┼─────┤
│ 13  │ 14  │ 15  │ 16  │
└─────┴─────┴─────┴─────┘

Continue until covering entire image...

Output (2×2):
┌─────┬─────┐
│ o₀₀ │ o₀₁ │
├─────┼─────┤
│ o₁₀ │ o₁₁ │
└─────┴─────┘
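
The sliding-window computation above can be written as a naive double loop. This is only an illustration of the arithmetic (valid convolution, stride 1, no padding), not how a real implementation is optimized:

Python
import numpy as np

image = np.arange(1, 17, dtype=float).reshape(4, 4)    # the 4x4 input above
kernel = np.random.default_rng(0).normal(size=(3, 3))  # stands in for a..i

kh, kw = kernel.shape
out_h = image.shape[0] - kh + 1   # 4 - 3 + 1 = 2
out_w = image.shape[1] - kw + 1

output = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        # multiply the 3x3 window by the kernel and sum -> one output value
        window = image[i:i+kh, j:j+kw]
        output[i, j] = np.sum(window * kernel)

print(output.shape)   # (2, 2), matching the diagram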

Stride and Padding

Stride controls how far the kernel moves each step:

Text
Stride = 1: Kernel moves 1 pixel at a time (detailed output)
Stride = 2: Kernel moves 2 pixels at a time (smaller output)

Stride 1:                    Stride 2:
Step: 0 1 2 3                Step: 0   2
      ▼ ▼ ▼ ▼                      ▼   ▼
      ■□□□□                        ■□□□□
       ■□□□                          ■□□
        ■□□
         ■□

Padding adds pixels around the edges:

Text
Without Padding:              With Padding (1 pixel):
Input: 4×4                   Input: 4×4 → Padded: 6×6
Kernel: 3×3                  Kernel: 3×3
Output: 2×2                  Output: 4×4 (same as input!)

                             ┌───┬───┬───┬───┬───┬───┐
                             │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
                             ├───┼───┴───┴───┴───┼───┤
                             │ 0 │               │ 0 │
                             │ 0 │     Input     │ 0 │
                             │ 0 │      4×4      │ 0 │
                             │ 0 │               │ 0 │
                             ├───┼───┬───┬───┬───┼───┤
                             │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
                             └───┴───┴───┴───┴───┴───┘
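
The output sizes above (2×2 without padding, 4×4 with padding) come from one arithmetic rule, the same formula given for Conv1D below. A quick check with a hypothetical helper function:

Python
def conv_output_size(in_size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1

print(conv_output_size(4, 3, stride=1, padding=0))  # 2  -> the 2x2 output above
print(conv_output_size(4, 3, stride=1, padding=1))  # 4  -> "same" size as the input
print(conv_output_size(28, 3))                      # 26 -> the 26x26 feature maps below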

Multiple Filters = Multiple Feature Maps

Real Conv2D layers have multiple filters, each detecting a different pattern:

Text
Input Image                 Filters                    Feature Maps
(28×28×1)                  (3×3×1 each)               (26×26×32)

    │                    ┌──────────┐                   Output
    │                    │ Filter 0 │───▶ Edge detector
    │                    ├──────────┤                   
    └───────────────────▶│ Filter 1 │───▶ Blob detector   32 different
                         ├──────────┤                     feature maps
                         │ Filter 2 │───▶ Corner detector
                         ├──────────┤
                         │   ...    │
                         ├──────────┤
                         │Filter 31 │───▶ Some pattern
                         └──────────┘

Conv1D Layers: Local Patterns in Sequences

Conv1D is the 1D version of convolution. Instead of sliding a 2D kernel across an image, it slides a 1D kernel along a sequence (audio, time series, token embeddings). The key idea is the same: learn a local pattern once and reuse it everywhere.

Text
Input (channels × seqLen)         Kernel (channels × kernelSize)
┌───────────────┐                 ┌───────────────┐
│ c0: 1 2 3 4 5 │                 │ c0: a b c     │
│ c1: 6 7 8 9 0 │                 │ c1: d e f     │
└───────────────┘                 └───────────────┘

Output length:

Text
outLen = (seqLen + 2*padding - kernelSize) / stride + 1
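
A minimal NumPy sketch of a single-filter Conv1D over the toy input above (assuming stride 1, no padding, and summing over channels; not Loom's actual API):

Python
import numpy as np

x = np.array([[1, 2, 3, 4, 5],          # c0
              [6, 7, 8, 9, 0]], float)  # c1: shape (channels=2, seqLen=5)
k = np.random.default_rng(0).normal(size=(2, 3))  # stands in for a..f

seq_len, ksize = x.shape[1], k.shape[1]
out_len = (seq_len + 2 * 0 - ksize) // 1 + 1      # (5 + 0 - 3)/1 + 1 = 3

# slide the kernel along the sequence, multiplying and summing over both channels
out = np.array([np.sum(x[:, t:t + ksize] * k) for t in range(out_len)])
print(out_len, out.shape)                          # 3 (3,)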

Use Conv1D when:

  • Sequences have local structure (phonemes, n-grams, sensor spikes)
  • You want translation invariance along time
  • You need efficient feature extraction before RNN/Attention


Embedding Layers: Token Lookup Tables

Embeddings convert discrete IDs (tokens) into dense vectors. Think of it as a learned dictionary:

Text
Embedding table: [vocabSize × embeddingDim]
Token ID 42 → row 42 → vector of length embeddingDim

Only the rows used in the current batch receive gradients. This makes embedding updates efficient even with huge vocabularies.
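
A minimal sketch of the lookup itself (table sizes are illustrative):

Python
import numpy as np

vocab_size, embedding_dim = 50_000, 256
table = np.random.default_rng(0).normal(0, 0.02, size=(vocab_size, embedding_dim))

token_ids = np.array([42, 7, 42, 1003])   # a tiny batch of token IDs
vectors = table[token_ids]                # row lookup: shape (4, 256)

# During training, only rows 42, 7, and 1003 would receive gradient updates.
print(vectors.shape)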

Use embeddings when:

  • Inputs are categorical (tokens, IDs, symbols)
  • You need a learned representation before sequence models
  • You want compact, trainable vector spaces


SwiGLU: Gated MLP Blocks

SwiGLU is a modern gated feedforward block used in LLMs. It uses two projections plus a gating nonlinearity:

Text
gate = SiLU(x · W_gate + b_gate)
up   = x · W_up + b_up
out  = (gate ⊙ up) · W_down + b_down

This gating tends to train better than a plain ReLU MLP, especially in deep transformer stacks.
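
A NumPy sketch of the block above (sizes are illustrative and the biases are omitted for brevity):

Python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))            # SiLU / swish: x * sigmoid(x)

rng = np.random.default_rng(0)
d_model, d_hidden = 512, 2048                 # illustrative sizes only

W_gate = rng.normal(0, 0.02, size=(d_model, d_hidden))
W_up   = rng.normal(0, 0.02, size=(d_model, d_hidden))
W_down = rng.normal(0, 0.02, size=(d_hidden, d_model))

def swiglu(x):
    gate = silu(x @ W_gate)                   # gating path
    up   = x @ W_up                           # linear path
    return (gate * up) @ W_down               # elementwise gate, then project back down

x = rng.normal(size=(4, d_model))             # (batch, d_model)
print(swiglu(x).shape)                        # (4, 512)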

Use SwiGLU when:

  • Building transformer-style feedforward blocks
  • You want stronger expressivity than a single dense layer


Multi-Head Attention: Learning What to Focus On

Attention is the mechanism that powers Transformers. It lets the network decide which parts of the input are relevant for each output.

The Core Idea

Imagine reading the sentence: "The cat sat on the mat because it was tired."

When interpreting "it", you need to figure out what "it" refers to. Attention learns to look back at "cat" when processing "it".

Text
Input sequence: [The] [cat] [sat] [on] [the] [mat] [because] [it] [was] [tired]
                  ↑     ↑                                      |
                  └─────┴──────────────────────────────────────┘
                        "it" attends strongly to "cat"

Query, Key, Value: The Mechanism

Attention uses three projections of each token:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide?"

Text
Each token gets transformed into Q, K, and V vectors:

Token "it"                Token "cat"
    │                         │
    ▼                         ▼
┌──────────────┐          ┌──────────────┐
│ Q: "?"       │          │ K: "animal"  │
│ K: "pronoun" │          │ V: cat info  │
│ V: "it"      │          └──────────────┘
└──────────────┘
    │                         │
    │    Attention            │
    │    Computation:         │
    │                         │
    │   score  = Q · K        │
    │   weight = softmax(score)
    │   output = weight × V   │
    │                         │
    ▼                         ▼
"it" looks at "cat" with high weight,
retrieves cat's information

Multi-Head: Multiple Perspectives

"Multi-head" means running multiple attention computations in parallel, each learning different relationships:

Text
Input: [batch, sequence, 512]
        │
        │ Split into 8 heads (512 ÷ 8 = 64 dims each)
        ▼
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│Head 0│Head 1│Head 2│Head 3│Head 4│Head 5│Head 6│Head 7│
│64d   │64d   │64d   │64d   │64d   │64d   │64d   │64d   │
│      │      │      │      │      │      │      │      │
│syntax│coref │topic │entity│tense │...   │...   │...   │
└──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┴──┬───┘
   │      │      │      │      │      │      │      │
   └──────┴──────┴──────┼──────┴──────┴──────┴──────┘
                        │
                        ▼ Concatenate
                   [batch, seq, 512]
                        │
                        ▼ Output projection
                   [batch, seq, 512]

Each head can specialize: one might track grammatical relationships, another coreference, another topic similarity.

The Attention Formula

Text
Attention(Q, K, V) = softmax(Q × Kᵀ / √d) × V

Where:
- Q × Kᵀ produces a [seq, seq] matrix of how strongly each token attends to every other token
- √d is a scaling factor to prevent dot products from getting too large
- softmax normalizes each row to sum to 1 (probabilistic attention)
- × V retrieves the actual information based on attention weights

Visual:

Text
        Keys (all tokens)
        ┌───┬───┬───┬───┐
Queries │0.1│0.7│0.1│0.1│ "it" → mostly attends to "cat"
(all    ├───┼───┼───┼───┤
tokens) │0.3│0.3│0.2│0.2│ "cat" → attends somewhat evenly
        ├───┼───┼───┼───┤
        │0.2│0.2│0.3│0.3│ "sat" → attends to later tokens
        └───┴───┴───┴───┘

        Each row sums to 1.0 (softmax normalization)
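
Putting the pieces together, here is a single-head NumPy sketch of scaled dot-product attention with toy sizes and random projections:

Python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, d = 10, 64                               # 10 tokens, 64-dim head

x   = rng.normal(size=(seq, d))               # token representations
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores  = Q @ K.T / np.sqrt(d)                # [seq, seq]: raw attention scores
weights = softmax(scores, axis=-1)            # each row sums to 1
output  = weights @ V                         # weighted sum of the values

print(weights.shape, weights[0].sum(), output.shape)   # (10, 10) ~1.0 (10, 64)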

RNN: Remembering the Past

Recurrent Neural Networks process sequences by maintaining a "hidden state" that summarizes everything seen so far.

The Key Insight

Dense layers have no memory—they process each input independently. RNNs carry information forward through time.

Text
Without RNN (Dense):

    Token 1 → [Dense] → Output 1    (no connection)
    Token 2 → [Dense] → Output 2    (no connection)
    Token 3 → [Dense] → Output 3    (can't see tokens 1 or 2!)


With RNN:

    Token 1 → [RNN] → Output 1
                │
                ▼ hidden state
    Token 2 → [RNN] → Output 2
                │
                ▼ hidden state carries info from tokens 1-2
    Token 3 → [RNN] → Output 3  ← can "remember" earlier tokens!

How the Hidden State Works

At each time step, the RNN combines the current input with the previous hidden state:

Text
Time step t:

                Previous hidden state
                h_{t-1}
                    │
                    ▼
              ┌───────────────────┐
Current  ───▶ │                   │ ───▶ Output y_t
input x_t     │   h_t = tanh(     │
              │     W_ih × x_t +  │
              │     W_hh × h_{t-1}│
              │     + bias )      │
              └─────────┬─────────┘
                        │
                        ▼
                New hidden state h_t
                (passed to next step)

Unrolled Through Time

Text
Sequence: [x₁, x₂, x₃, x₄]

     x₁        x₂        x₃        x₄
      │         │         │         │
      ▼         ▼         ▼         ▼
   ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
   │ RNN  │→│ RNN  │→│ RNN  │→│ RNN  │
   │ Cell │ │ Cell │ │ Cell │ │ Cell │
   └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘
      │         │         │         │
      ▼         ▼         ▼         ▼
     y₁        y₂        y₃        y₄

   h₀→      h₁→      h₂→      h₃→      h₄
   (init)                              (final)

The same weights are used at every time step—the network "time-shares" its parameters.
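
A sketch of this recurrence, reusing the same weights at every step (sizes are illustrative):

Python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, steps = 8, 16, 4

W_ih = rng.normal(0, 0.1, size=(hidden_size, input_size))
W_hh = rng.normal(0, 0.1, size=(hidden_size, hidden_size))
b    = np.zeros(hidden_size)

xs = rng.normal(size=(steps, input_size))    # the sequence [x1..x4]
h  = np.zeros(hidden_size)                   # h0: initial hidden state

outputs = []
for x_t in xs:                               # same W_ih, W_hh, b at every time step
    h = np.tanh(W_ih @ x_t + W_hh @ h + b)
    outputs.append(h)

print(len(outputs), h.shape)                 # 4 (16,)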


LSTM: Solving the Vanishing Gradient Problem

Standard RNNs struggle with long sequences. Gradients either vanish (become tiny) or explode (become huge) when backpropagating through many time steps.

LSTM (Long Short-Term Memory) solves this with gates—learned mechanisms that control information flow.

The Three Gates

Text
LSTM Cell:

                      ┌─────────────────────────────────────────┐
                      │                                         │
    ┌─────────────────┼──────────────────────────────────────┐  │
    │                 ▼                                      │  │
    │   ┌──────────────────────────────────────────────┐     │  │
    │   │                 Cell State (cₜ)              │     │  │
    │   │  "The highway for information"               │     │  │
    │   └───────┬──────────┬──────────┬────────────────┘     │  │
    │           │          │          │                      │  │
    │           │          ▼          │                      │  │
    │     ┌─────▼─────┐  ┌───┐  ┌─────▼─────┐               │  │
    │     │  Forget   │  │ + │  │  Input    │               │  │
    │     │   Gate    │  │   │  │   Gate    │               │  │
    │     │  (what to │  └─┬─┘  │  (what to │               │  │
    │     │   forget) │    │    │   add)    │               │  │
    │     └─────┬─────┘    │    └─────┬─────┘               │  │
    │           │          │          │                      │  │
    │           │    ┌─────┴─────┐    │                      │  │
    │           │    │  tanh     │    │                      │  │
    │           ▼    │ (new info)│    ▼                      │  │
    │     ┌───────┐  └─────┬─────┘  ┌───────┐               │  │
    │     │ ×     │        │        │  ×    │               │  │
    │     └───┬───┘        │        └───┬───┘               │  │
    │         │            │            │                    │  │
    │         └────────────┼────────────┘                    │  │
    │                      │                                 │  │
    │                      ▼                                 │  │
    │                ┌───────────┐                           │  │
    │                │  Output   │                           │  │
    │                │   Gate    │◀──────────────────────────┘  │
    │                │ (what to  │                              │
    │                │  output)  │                              │
    │                └─────┬─────┘                              │
    │                      │                                    │
    │                      ▼                                    │
    │                 Hidden State (hₜ)                         │
    │                      │                                    │
    └──────────────────────┼────────────────────────────────────┘
                           │
                           ▼
                       Output

What Each Gate Does

Forget Gate: "What should I throw away from the cell state?"

Text
f_t = σ(W_f × [h_{t-1}, x_t] + b_f)

Output: values between 0 and 1
- 0 means "forget completely"
- 1 means "remember completely"

Example: Processing a period "." might signal to forget the subject
of the previous sentence.

Input Gate: "What new information should I store?"

Text
i_t = σ(W_i × [h_{t-1}, x_t] + b_i)     ← how much to add
g_t = tanh(W_g × [h_{t-1}, x_t] + b_g)  ← what to add

Example: Seeing a new subject noun might store that in the cell state.

Output Gate: "What should I output based on the cell state?"

Text
o_t = σ(W_o × [h_{t-1}, x_t] + b_o)
h_t = o_t × tanh(c_t)

Example: When generating a verb, output information about the subject
(for agreement) but not necessarily everything in the cell.
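
Combining the three gates into one step, as a sketch (real implementations typically fuse the four matrix multiplies; sizes here are illustrative):

Python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16
concat = input_size + hidden_size

# One weight matrix per gate, each acting on [h_{t-1}, x_t]
W_f, W_i, W_g, W_o = (rng.normal(0, 0.1, size=(hidden_size, concat)) for _ in range(4))
b_f = b_i = b_g = b_o = np.zeros(hidden_size)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)          # forget gate: what to drop from the cell state
    i = sigmoid(W_i @ z + b_i)          # input gate: how much new info to add
    g = np.tanh(W_g @ z + b_g)          # candidate new info
    o = sigmoid(W_o @ z + b_o)          # output gate
    c = f * c_prev + i * g              # the "+" highway through time
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.normal(size=input_size), h, c)
print(h.shape, c.shape)                 # (16,) (16,)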

Why This Solves Vanishing Gradients

The cell state has a "highway" path through time:

Text
c₀ ──×f₁──+──×f₂──+──×f₃──+──×f₄──+── c₄

The gradient can flow through the + operations almost unchanged!
Unlike vanilla RNN where gradients must pass through tanh repeatedly.

KMeans Layer: Learnable Concept Prototypes

The KMeans layer is a differentiable clustering module that organizes data into "concepts" or "prototypes." Unlike standard layers that learn abstract weights, a KMeans layer learns geometric centers in a feature space.

How It Works

  1. Feature Extraction: The input passes through an attached sub-network (which can be any layer: Dense, Conv, etc.) to extract features ($z$).
  2. Distance Calculation: The layer computes the distance between the extracted features ($z$) and a set of learnable cluster centers ($C$).
  3. Soft Assignment: Distances are converted into probabilities (or feature assignments) via a specialized softmax (sketched just below): $$P(\text{cluster } k) = \frac{\exp(-\text{dist}(z, c_k) / \tau)}{\sum_j \exp(-\text{dist}(z, c_j) / \tau)}$$
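
A minimal sketch of the soft assignment in step 3, assuming squared Euclidean distance and temperature τ (illustrative only, not the layer's actual implementation):

Python
import numpy as np

rng = np.random.default_rng(0)
K, dim, tau = 4, 32, 0.5                      # clusters, feature dim, temperature

centers = rng.normal(size=(K, dim))           # learnable cluster centers C
z = rng.normal(size=dim)                      # features from the attached sub-network

dists = np.sum((centers - z) ** 2, axis=1)    # squared distance to each center
logits = -dists / tau
p = np.exp(logits - logits.max())
p /= p.sum()                                  # soft assignment: P(cluster k)

print(p.round(3), p.sum())                    # probabilities over K clusters, sum to ~1.0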

The Power of Recursion

KMeans layers in Loom are recursive. You can use a KMeans layer as the attached sub-network for another KMeans layer.

Text
Input ──▶ [ KMeans L1: Finds "Edges" ] ──▶ [ KMeans L2: Finds "Shapes" ] ──▶ Output

Output Modes

  • probabilities: Outputs a probability distribution over the $K$ clusters (the "assignment").
  • features: Outputs the actual feature vector extracted by the sub-network.
  • reconstruction: (Experimental) Outputs the coordinates of the nearest cluster center.

When to Use KMeans Layers

  • Interpretability: The learned cluster centers are actual points in the feature space that represent prototypical examples.
  • Hierarchical Classification: Building "Concept Taxonomies" (e.g., Species inside Kingdoms).
  • Out-of-Distribution Detection: Large distances to all learned clusters indicate "unknown" data.
  • Neuro-Symbolic Reasoning: Bridging continuous neural features with discrete symbolic categories.

Softmax: Turning Numbers into Probabilities

Softmax converts a vector of arbitrary real numbers into a probability distribution (values between 0 and 1 that sum to 1).

The Basic Transformation

Text
Input (logits):  [2.0, 1.0, 0.1]
                   │
                   ▼ exp(each value)
             [7.39, 2.72, 1.11]
                   │
                   ▼ divide by sum (11.22)
Output:      [0.66, 0.24, 0.10]
             ─────────────────
               sums to 1.0 ✓

The largest input gets the largest probability, but all outputs are positive and normalized.
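
The same transformation in a few lines, using the standard max-subtraction trick for numerical stability:

Python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])).round(2))   # [0.66 0.24 0.1 ]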

Why Not Just Normalize Directly?

The exponential has important properties:

  1. Always positive: Even negative inputs become positive
  2. Amplifies differences: Large inputs dominate
  3. Smooth gradients: Differentiable everywhere

Text
Without exp:                  With exp (softmax):
[2, 1, -1] → ?                [2, 1, -1] → exp → [7.39, 2.72, 0.37]
Can't normalize directly      All positive! Can normalize.
when negatives are present.

[10, 10, 10] → [.33, .33, .33]     Equal values → equal probs ✓
[100, 0, 0]  → [≈1.0, ≈0.0, ≈0.0]  Large diff → confident prediction

Loom's 10 Softmax Variants

Loom treats Softmax as a first-class layer with multiple variants:

Text
Standard Softmax:
[logits] → [probabilities that sum to 1]

Grid Softmax (Native Mixture of Experts!):
┌──────────────────────────────────────┐
│  Expert 0: [0.1, 0.2, 0.3, 0.4] = 1  │
│  Expert 1: [0.5, 0.2, 0.2, 0.1] = 1  │  Each ROW sums to 1
│  Expert 2: [0.1, 0.1, 0.1, 0.7] = 1  │  independently
└──────────────────────────────────────┘

Temperature Softmax:
- Low temperature (0.1): Sharp, confident, picks one option
- High temperature (2.0): Smooth, uncertain, spreads probability

Masked Softmax:
[logits] + [mask: True, False, True, True]
         → [0.33, 0.00, 0.33, 0.34]
           Masked positions get zero probability
           (Useful for: legal moves in games)

Sparsemax:
Like softmax, but can produce exact zeros
[logits] → [0.6, 0.4, 0.0, 0.0]
            Interpretable! Only a few options selected.
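
A sketch of how two of these variants behave, temperature scaling and masking (behavioral illustration only, not Loom's API):

Python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0])

# Temperature: divide logits by T before the softmax
print(softmax(logits / 0.1).round(3))   # low T  -> sharp, nearly one-hot
print(softmax(logits / 2.0).round(3))   # high T -> smooth, spread out

# Masking: push masked logits to -inf so they get exactly zero probability
mask = np.array([True, False, True, True])
masked = np.where(mask, logits, -np.inf)
print(softmax(masked).round(3))         # masked position -> 0.0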

Normalization Layers: Keeping Activations Stable

As data flows through many layers, values can drift—becoming very large or very small. Normalization layers re-center and re-scale activations.

Layer Normalization

Normalizes across the feature dimension for each sample independently:

Text
Input: [batch, features]
         ┌──────┬──────┬──────┬──────┐
Sample 0 │ 100  │ -50  │  25  │  75  │ ← Normalize this row
         ├──────┼──────┼──────┼──────┤
Sample 1 │  10  │  20  │  30  │  40  │ ← Normalize this row
         ├──────┼──────┼──────┼──────┤
Sample 2 │ 0.1  │ 0.2  │ 0.3  │ 0.4  │ ← Normalize this row
         └──────┴──────┴──────┴──────┘

For each sample:
1. Compute mean: μ = (100 + -50 + 25 + 75) / 4 = 37.5
2. Compute std:  σ = sqrt(variance) ≈ 57.3
3. Normalize:    (x - μ) / σ

Output: values with mean≈0, std≈1

Plus learnable parameters:
output = γ × normalized + β
(γ and β are learned per feature)

RMS Normalization (Llama-style)

Like LayerNorm but simpler—only divides by root-mean-square, no mean subtraction:

Text
rms = sqrt(mean(x²))
output = x / rms × γ

Why use it?
- Slightly faster (no mean computation)
- Works well empirically for modern LLMs
- Used in Llama, Mistral, etc.
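
Both normalizations in a few lines (an epsilon is added for numerical safety; γ initialized to 1 and β to 0, as is conventional):

Python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([[100.0, -50.0, 25.0, 75.0]])    # Sample 0 from the diagram above
gamma, beta = np.ones(4), np.zeros(4)

print(layer_norm(x, gamma, beta).round(2))    # ≈ [[ 1.09 -1.53 -0.22  0.65]]
print(rms_norm(x, gamma).round(2))            # scaled by the root-mean-square only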

Why Normalization Matters

Text
Without Normalization:         With Normalization:

Layer 1 output: ~100           Layer 1 output: ~100 → norm → ~0
Layer 2 output: ~10000         Layer 2 output: ~0 → x → ~0 → norm → ~0
Layer 3 output: ~1000000       Layer 3 output: ~0 → x → ~0 → norm → ~0
...                            ...
Values explode!                Values stay controlled throughout
Training becomes unstable      Training is stable

Structural Layers: Composing Complex Architectures

Sequential Layer

Chains sub-layers one after another:

Text
Sequential([Dense(512), ReLU(), Dense(256), ReLU(), Dense(10)])

    Input
      │
      ▼
  ┌────────────┐
  │Dense(512)  │
  └─────┬──────┘
        │
        ▼
  ┌────────────┐
  │   ReLU     │
  └─────┬──────┘
        │
        ▼
  ┌────────────┐
  │Dense(256)  │
  └─────┬──────┘
        │
        ▼
  ┌────────────┐
  │   ReLU     │
  └─────┬──────┘
        │
        ▼
  ┌────────────┐
  │Dense(10)   │
  └─────┬──────┘
        │
        ▼
    Output

Parallel Layer

Runs multiple branches simultaneously, then combines results:

Text
          Input
            │
     ┌──────┼──────┐
     │      │      │
     ▼      ▼      ▼
  ┌─────┐┌─────┐┌─────┐
  │LSTM ││Dense││Conv │   Three different "experts"
  └──┬──┘└──┬──┘└──┬──┘   process the same input
     │      │      │
     └──────┼──────┘
            │
            ▼
       ┌─────────┐
       │ Combine │
       └────┬────┘
            │
            ▼
         Output

Combine modes:
- concat: Concatenate all outputs [lstm_out, dense_out, conv_out]
- add:    Element-wise sum
- avg:    Element-wise average
- filter: Softmax-weighted combination (learned gating)

The filter mode is particularly interesting—it's a learned routing mechanism:

Text
Filter mode (Soft Mixture of Experts):

     Input
       │
       ├──────────────────────────────┐
       │                              │
  ┌────▼────┐                    ┌────▼────┐
  │ Branch 0│                    │ Gating  │
  │ (expert)│                    │ Network │
  └────┬────┘                    └────┬────┘
       │                              │
  ┌────▼────┐                         │
  │ Branch 1│                    ┌────▼────┐
  │ (expert)│                    │ Softmax │
  └────┬────┘                    │ [0.6,   │
       │                         │  0.3,   │
  ┌────▼────┐                    │  0.1]   │
  │ Branch 2│                    └────┬────┘
  │ (expert)│                         │
  └────┬────┘                         │
       │                              │
       └───────────┬──────────────────┘
                   │
                   ▼
         0.6×branch0 + 0.3×branch1 + 0.1×branch2
                   │
                   ▼
                Output
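
The filter combination itself is just a softmax-weighted sum. A sketch with three hypothetical branch outputs:

Python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
branch_outputs = [rng.normal(size=8) for _ in range(3)]   # outputs of 3 "experts"

gate_logits = rng.normal(size=3)          # produced by a small gating network
weights = softmax(gate_logits)            # e.g. roughly [0.6, 0.3, 0.1]

combined = sum(w * out for w, out in zip(weights, branch_outputs))
print(weights.round(2), combined.shape)   # learned soft routing over the experts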

Residual Layer

Adds the input directly to the output (skip connection):

Text
        Input
          │
     ┌────┴────┐
     │         │
     ▼         │
 ┌────────┐    │
 │  Sub   │    │ Skip
 │ Layers │    │ Connection
 └───┬────┘    │
     │         │
     ▼         │
   ┌───┐       │
   │ + │◀──────┘
   └─┬─┘
     │
     ▼
   Output = SubLayers(Input) + Input

Why this matters:

  • Gradients can flow directly through the skip connection
  • Makes very deep networks trainable
  • If the sublayers learn the identity (do nothing), the layer still passes the input through
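
In code, the whole idea is one addition (a sketch with a stand-in sub-layer):

Python
import numpy as np

def sub_layers(x, W):
    return np.tanh(x @ W)                 # stands in for any stack of inner layers

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, size=(16, 16))     # square so the shapes match for the addition
x = rng.normal(size=(4, 16))

output = sub_layers(x, W) + x             # skip connection: the input is added back in
print(output.shape)                       # (4, 16)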


Summary: Choosing the Right Layer

Task                             Layer              Why
General feature transformation   Dense              Universal approximator
Image features                   Conv2D             Locality, translation invariance
Sequence relationships           Attention          Long-range dependencies
Sequential memory                LSTM               Handles long sequences
Classification output            Softmax            Probabilities that sum to 1
Training stability               LayerNorm/RMSNorm  Prevents value drift
Multiple experts                 Parallel + Filter  Learned routing
Deep networks                    Residual           Skip connections for gradient flow