Project Aquarius // Protocol v1.0

Giving AI Memory: The Journey from RNNs to LSTMs

2026-01-12
Aban Hasan

The Dream: Teaching Computers to Remember

It's 1995. You're a PhD student staring at a fundamental problem that feels almost philosophical: how do you teach a machine to remember?

Imagine you're reading a sentence, but the moment you finish each word, you completely forget it ever existed. By the time you reach the end of "The cat sat on the mat," you have no idea what was sitting, where it was sitting, or even that there was a cat at all.

That's essentially what early neural networks were like. They were brilliant at recognizing a single image or classifying a single data point, but they had the attention span of a goldfish on espresso. Every input was treated as if nothing had come before it.

Why does memory matter?

Think about translation. To translate "The bank is by the river" vs "The bank raised interest rates," you need to remember context from earlier in the sentence. Or consider music—a note only makes sense in relation to the notes that came before it. Without memory, AI is stuck in an eternal present.

Here's what kept researchers up at night in the early 90s:

  • Speech recognition required understanding phonemes in the context of entire words
  • Language modeling needed to track subjects across multiple clauses
  • Time series prediction demanded remembering patterns from hundreds of steps back

The stakes were real: whoever solved memory would unlock entire categories of AI applications.

In the late 1980s and early 1990s, researchers had a dream: build neural networks that could learn from sequences. The tool they created was called a Recurrent Neural Network (RNN). And for a brief, hopeful moment, it seemed like the problem of memory was solved.

It wasn't. It failed spectacularly. And understanding why it failed—not just mathematically, but viscerally—is the key to understanding the brilliance of the solution that followed.


Part 1: The Loop — Recurrent Neural Networks

The insight behind RNNs is elegant: what if we just loop the output back as input?

Regular neural networks are like assembly lines—data goes in one end, gets processed, and predictions come out the other. Each piece of data is independent. But an RNN adds a feedback loop. At each step, the network takes the current input and its own previous "hidden state" (its memory).

# The RNN Forward Pass (Simplified)
h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b)

At each timestep t:

  • x_t is the current input (like a word).
  • h_{t-1} is the memory of everything seen before.
  • W_hh is the "memory weight" that decides how much of the old state to keep.
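To make that concrete, here's the update as runnable NumPy. The weight values, sizes, and seed are arbitrary illustrative choices, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(2, 3))  # input -> hidden
W_hh = rng.normal(scale=0.5, size=(2, 2))  # hidden -> hidden: the "memory weight"
b = np.zeros(2)

def rnn_step(x_t, h_prev):
    """One RNN timestep: mix the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = np.zeros(2)              # h_0 = [0, 0]
for x in np.eye(3):          # feed three one-hot inputs, one per timestep
    h = rnn_step(x, h)       # the loop: the state feeds back into itself
```

The same two matrices are reused at every step; only the state changes.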

Let's Watch it Work: A Concrete Example

Let's trace through a tiny RNN trying to process the sequence "CAT" (encoded as one-hot vectors). Our RNN has a 2-dimensional hidden state. We'll initialize h_0 = [0, 0] and watch what happens.

Step 1: Processing 'C'

x_1 = [1, 0, 0]  # 'C' encoded
h_1 = tanh(W_xh @ [1,0,0] + W_hh @ [0,0] + b)
    = tanh([0.5, -0.3])  # After matrix multiplications
    = [0.46, -0.29]

Step 2: Processing 'A'

x_2 = [0, 1, 0]  # 'A' encoded
h_2 = tanh(W_xh @ [0,1,0] + W_hh @ [0.46,-0.29] + b)
    = tanh([0.3 + 0.23, 0.4 - 0.15])  # Mixed with previous state
    = [0.49, 0.24]

Notice how h_2 contains information from BOTH 'C' (via h_1) and 'A' (via x_2). This is the magic: the hidden state is carrying forward a "summary" of everything seen so far.
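You can verify this dependence directly: run 'A' once with the memory of 'C' and once with a blank memory, and the resulting states differ. (A sketch with arbitrary random weights, not the exact numbers in the trace above.)

```python
import numpy as np

rng = np.random.default_rng(1)
W_xh = rng.normal(scale=0.5, size=(2, 3))
W_hh = rng.normal(scale=0.5, size=(2, 2))

def step(x, h):
    return np.tanh(W_xh @ x + W_hh @ h)

C = np.array([1.0, 0.0, 0.0])
A = np.array([0.0, 1.0, 0.0])

h1 = step(C, np.zeros(2))        # h_1: after seeing 'C'
h2_with_C = step(A, h1)          # h_2: 'A' in the context of 'C'
h2_alone = step(A, np.zeros(2))  # counterfactual: 'A' with no history
```

The gap between `h2_with_C` and `h2_alone` is exactly the trace of 'C' that the hidden state carries forward.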

The Intuition

You can think of the hidden state h_t as a summary of the past. As you read a sentence, your brain doesn't store every single character; it maintains a "gist" of the meaning so far. That's what h_t is trying to do.

Here's the beautiful part: because we use the same weight matrices (W_hh, W_xh) at every timestep, the network learns to create a summary that's useful across time. It's not just random compression—it's learned compression optimized for your specific task.

So can the hidden state carry information from arbitrarily far back? In theory, yes! That's the mathematical promise. The hidden state at step 100 has been influenced by all 100 inputs. But here's the catch: you're trying to compress an arbitrarily long sequence into a fixed-size vector (say, 128 numbers). Old information naturally gets "washed away" as new information arrives.

But that's not even the real killer. The real killer isn't the forward pass—it's the learning.

Think about it: if your hidden state is [0.46, -0.29] at step 100, and somewhere in that compressed representation is a signal from step 1... how do you teach the network which weight at step 1 needs to change? You have to trace the error backward through 99 steps of nonlinear transformations. And that's where the math breaks down.


Part 2: The Calculus of Failure — BPTT and the Vanishing Gradient

To teach a network, we use Backpropagation Through Time (BPTT). We look at the error at the end of a sequence and trace it backward to figure out how to adjust the weights at the beginning.

Imagine you're at the top of a mountain (the error) and you want to roll a ball down to the valley (the minimum error). The gradient tells you which way is down. In an RNN, to find the gradient for a weight at step 1, you have to roll that ball through every single timestep from 100 back to 1.

The Multiplication Chain

This is where the math breaks—and I mean truly, catastrophically breaks. Because of the Chain Rule, the gradient at step t depends on the product of all the gradients between now and the end of the sequence.

Let's trace through what actually happens. If we want to know how the hidden state at the final step T was affected by a change in the state at some early step t:

\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}

In a simple RNN, each term in that product is approximately:

\frac{\partial h_k}{\partial h_{k-1}} \approx W_{hh} \cdot \text{diag}(\tanh'(h_{k-1}))

Here's the critical insight: \tanh'(x) (the derivative of tanh) is at most 1.0, and it equals 1.0 only at x = 0. For most values, it's much closer to 0.1 or 0.2.

Watch the Gradient Die: A Numerical Walkthrough

Let's say W_{hh} has a spectral norm (largest singular value) of 0.9, and the average value of \tanh' across your sequence is 0.3. For a sequence of length 100:

Step 100 → Step 99:
Gradient = 0.9 \times 0.3 = 0.27

Step 100 → Step 98:
Gradient = 0.27 \times 0.27 = 0.073

Step 100 → Step 90:
Gradient = (0.27)^{10} \approx 2 \times 10^{-6}

Step 100 → Step 1:
Gradient = (0.27)^{99} \approx 10^{-56}

That's not a typo. By step 1, the gradient is smaller than a 32-bit float can even represent; it underflows to exactly zero. The weights at step 1 receive NO learning signal whatsoever.

The Exponential Catastrophe

Think about what happens when you multiply a number by itself 100 times:

  • 1.1^{100} \approx 13{,}780 (Exploding! Your gradients blow up and the network "breaks" with NaNs.)
  • 0.9^{100} \approx 0.000027 (Vanishing! Your learning signal effectively hits zero.)
  • 1.0^{100} = 1.0 (The LSTM's secret sauce—we'll get there.)

This isn't a bug—it's a fundamental mathematical consequence of chained multiplication. You can't just "tune your learning rate" out of this. The problem is structural.
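The arithmetic above is one-line Python you can check yourself:

```python
# Multiplying a per-step factor by itself 100 times: three regimes.
explode = 1.1 ** 100   # gradients blow up
vanish = 0.9 ** 100    # learning signal dies
stable = 1.0 ** 100    # exactly 1.0: the regime the LSTM engineers for
chain = 0.27 ** 99     # the walkthrough above: numerically negligible
```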

The Vanishing Gradient Problem

[Interactive demo: choose a weight W and a number of timesteps t, and watch a gradient of 1.0 get multiplied by W at every step. At W = 0.80 and t = 10, the signal is already down to 0.80^{10} ≈ 0.107. W < 1.0 vanishes, W = 1.0 is stable, W > 1.0 explodes.]

What This Means in Practice

When the gradient vanishes, the weights at the beginning of the sequence never change. The network can't learn long-term dependencies because it literally cannot measure how earlier inputs affect later outputs.

Here's what broke people's hearts in 1995:

  • RNNs could learn "predict the next character given the previous 5 characters" ✅
  • RNNs could NOT learn "remember a symbol from position 2 and use it at position 100" ❌

The theoretical promise of infinite memory crashed into the computational reality of vanishing gradients. Papers were published. Conferences debated whether recurrent learning was fundamentally impossible.

And then, in 1997, two researchers proposed something radical.


Part 3: The Fix — The Constant Error Carousel

In 1997, Sepp Hochreiter and Jürgen Schmidhuber proposed a radical fix. If the problem is that gradients vanish because of repeated multiplication, why not just... add?

They introduced the Constant Error Carousel (CEC). This is the heart of the LSTM, and once you see it, you can't unsee it—it's the same principle that powers modern Transformers.

The "Highway" of Information

Instead of a hidden state that gets mangled by a weight matrix at every step, the LSTM introduces a Cell State (s_c). In the original 1997 paper, the core update looks like this:

s_{c,t} = s_{c,t-1} + y_{in} \cdot g(\text{net}_c)

Notice the magic: the previous state s_{c,t-1} is added directly to the new information. There is no weight matrix multiplying s_{c,t-1}!

In calculus terms:

\frac{\partial s_{c,t}}{\partial s_{c,t-1}} = 1.0

No tanh derivative. No weight matrix. Just 1.0.

The Carousel Insight

The gradient "rides" this carousel through time. Since 1 \times 1 \times 1 \cdots = 1, the error signal never shrinks and never grows. It can travel across 1,000 timesteps as easily as it travels across 1.

This is why Hochreiter called it the "Constant Error Carousel"—the error literally circulates at constant magnitude, like a carousel spinning at fixed speed.
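A sketch of both backward paths in plain Python (0.27 is the illustrative per-step RNN factor from Part 2, not a measured value):

```python
# Push an error signal of 1.0 backward through 100 timesteps.
rnn_grad = 1.0  # travels through a W_hh * tanh' factor at every step
cec_grad = 1.0  # travels through the additive cell-state path

for _ in range(100):
    rnn_grad *= 0.27  # multiplicative path: shrinks every single step
    cec_grad *= 1.0   # carousel: ds_t/ds_{t-1} = 1.0, nothing changes
```

After 100 steps the RNN signal is numerically gone, while the carousel signal is still exactly 1.0.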

The Constant Error Carousel

[Interactive demo: a signal of strength 1.0 is pushed through time under two update rules. The RNN update (h = h × W) shrinks it step after step, while the LSTM update (s = s + gated input) carries it forward at constant strength.]

But Wait—How Do We Control This Highway?

If information flows perfectly through time, two problems emerge:

  1. How do we decide WHAT to write to the cell state? (Not every input is important.)
  2. How do we decide WHEN to read from the cell state? (Not every output needs the memory.)

This is where the gates come in. And they're absurdly elegant.

The Input Gate: "Is This Worth Remembering?"

The input gate (y_in) is a number between 0 and 1 (produced by a sigmoid). It acts as a write permission:

y_in = sigmoid(W_in @ x_t + U_in @ h_{t-1} + b_in)
# y_in ≈ 0 → "Ignore this input, it's noise"
# y_in ≈ 1 → "Important! Write this to memory"
 
s_c_t = s_c_prev + (y_in * g_val)  # Controlled write

Concrete Example:
You're reading: "The cat, which had gray stripes and a bushy tail, sat on the..."

  • "cat" → y_in = 0.9 (IMPORTANT, remember this!)
  • "which" → y_in = 0.1 (filler word, skip it)
  • "gray" → y_in = 0.3 (minor detail, light write)
  • "sat" → y_in = 0.8 (action, remember!)

The input gate learns to be a relevance filter. It protects the cell state from being polluted by noise.

The Output Gate: "Should I Expose This Memory Now?"

The output gate (y_out) controls read permission:

y_out = sigmoid(W_out @ x_t + U_out @ h_{t-1} + b_out)  
h_t = y_out * h(s_c_t)  # Controlled read
# y_out ≈ 0 → "Don't reveal the memory yet"
# y_out ≈ 1 → "Output needs this info NOW"

Concrete Example:
You're predicting the next word: "The cat ___ on the mat."

At the blank, the output gate spikes to 1.0 because NOW is when you need to recall what the subject was. Earlier in the sentence, the output gate stays low—the model is still gathering context.

The Forget Gate: "Is It Time to Let Go?"

There's a catch, though: the original 1997 LSTM was actually a bit too good at remembering. It didn't have a way to "reset" its memory.

That feature—the Forget Gate—wasn't added until 1999 by Felix Gers. In the 1997 version, the only way to "clear" the cell state was to wait for it to be overwritten by new inputs. This worked for short sequences but struggled on infinite streams.

The forget gate (y_forget) multiplies the cell state by a number between 0 and 1:

s_c_t = (y_forget * s_c_prev) + (y_in * g_val)

Now the network can actively "forget" irrelevant old memories. After a sentence ends, the forget gate can zero out the cell state for the next sentence.
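Here's a toy version of that update. The pre-activation values (±6.0) and the stored state (1.5) are made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cell_update(s_prev, g_val, net_in, net_forget):
    """Forget-gate cell update: s_t = y_forget * s_prev + y_in * g_val."""
    return sigmoid(net_forget) * s_prev + sigmoid(net_in) * g_val

s = 1.5                                     # memory built up mid-sentence
s_kept = cell_update(s, 0.0, -6.0, 6.0)     # forget gate open: memory survives
s_reset = cell_update(s, 0.0, -6.0, -6.0)   # sentence boundary: gate slams shut
```

With the forget gate open (σ(6.0) ≈ 1) the state is preserved almost perfectly; with it shut (σ(-6.0) ≈ 0) the memory is wiped clean for the next sentence.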


Part 4: Deep Dive into the 1997 Architecture

Most modern LSTM tutorials show you the "standard" version with forget gates and layer norm. But the 1997 original was a masterpiece of "bare metal" gradient engineering. Let's decode the tricks Hochreiter and Schmidhuber used.

1. The "Scissors" (Truncated BPTT)

Here is a detail most people miss: to prevent the gates themselves from re-introducing the vanishing gradient problem, the authors "cut" the gradient flow through the recurrent gate connections.

Think about it: the input and output gates have their own recurrent connections (U_in, U_out). If gradients flowed through those paths over 100 timesteps, we'd be right back where we started!

Their solution: detach the hidden state when computing gate activations. The gates can see the past, but they can't learn from the distant past. Only the CEC gets that privilege.

In our reproduction code, this looks like:

# From simple-lstm/src/aquarius_lstm/cell_torch.py
h_frozen = h_prev.detach()  # THE SCISSORS ✂️
 
# Gates learn from current input and 'frozen' past
y_in = torch.sigmoid(W_in @ x_t + U_in @ h_frozen + b_in)
y_out = torch.sigmoid(W_out @ x_t + U_out @ h_frozen + b_out)
 
# The CEC stays open! We do NOT detach s_c_prev
s_c_t = s_c_prev + (y_in * g_val)  # ← Gradients flow through time HERE

By using these "scissors," they ensured that the only way for gradients to flow long-distance was through the protected CEC highway.

Why This Is Brilliant:
The gates can make context-aware decisions ("Should I write now?") based on recent history, but they don't need to learn from 100 steps ago. The cell state learns from 100 steps ago. This division of labor is key.

2. The "Abuse Problem" and Negative Biases

If the CEC is a perfect memory highway, what stops the network from trying to store everything? In the 1997 paper, they talk about the Abuse Problem: the network gets "lazy" and tries to use the memory cell for short-term noise instead of long-term patterns.

Imagine you're trying to learn "remember the first character and output it after 50 steps." If the CEC is too easy to access, the network might cheat: store every character, not just the important first one. The memory fills with garbage.

Their solution? Negative Biases. By initializing the input gate biases to -3.0 or even -6.0, the gates are "locked" shut at the start of training:

y_{in} = \sigma(-6.0 + \text{small signal}) \approx 0.0025

The network has to earn the right to use the memory by proving an input is consistently useful enough to overcome that negative bias.

This forces the network to be selective. Only truly important, long-term patterns get stored. Short-term noise gets filtered out.
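The effect of the bias is easy to see numerically (the incoming-signal values here are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Input gate pre-activation = bias + incoming signal.
locked = sigmoid(-6.0 + 0.0)  # gate starts essentially shut
nudged = sigmoid(-6.0 + 2.0)  # a weak signal barely moves it
earned = sigmoid(-6.0 + 8.0)  # only a strong, learned signal opens it
```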

3. The Specific Activations

The 1997 paper used very specific ranges for its activation functions:

  • g(x) = 4\sigma(x) - 2: Squashes input to [-2, 2].
  • h(x) = 2\sigma(x) - 1: Squashes output to [-1, 1].

Why not just use \tanh, which also ranges over [-1, 1]?

Because range matters for gradient flow. When you're adding small values to a cell state over 100 timesteps, using g(x) \in [-2, 2] instead of [-1, 1] means you can accumulate values faster. It prevents the signal from becoming too small too quickly.

And centering around 0 (not 0.5) allows the cell state to store both positive and negative "evidence":

  • Positive cell state: "I saw an opening bracket"
  • Negative cell state: "I saw a closing bracket"

This bidirectional representation is crucial for tasks like balanced parenthesis detection.
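Both squashers are one-liners, and you can confirm the ranges directly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(x):
    """1997 input squashing: range (-2, 2), centered at 0."""
    return 4.0 * sigmoid(x) - 2.0

def h(x):
    """1997 output squashing: range (-1, 1), centered at 0."""
    return 2.0 * sigmoid(x) - 1.0

xs = np.linspace(-10.0, 10.0, 1001)  # sample the functions over a wide range
```

Both are zero at x = 0, so the cell can accumulate positive or negative "evidence" symmetrically.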

The 1997 LSTM Memory Cell

[Interactive diagram: the raw input net_c passes through the input squash g (range [-2, 2]), is scaled by the sigmoid input gate ("write" permission), and is added into the CEC, the Constant Error Carousel with its self-connection of 1.0, the heart of the LSTM. The cell state then passes through the output squash h (range [-1, 1]) and is scaled by the sigmoid output gate ("read" permission) to produce the final output.]

The Full Picture: What's Actually Happening

Let's trace through one timestep with concrete numbers. Suppose:

  • Cell state s_{c,t-1} = 1.5 (we're holding onto some past info)
  • Current input x_t = "important word"
  • Previous hidden state h_{t-1} = [0.3, -0.4]

Step 1: Decide what to write (Input Gate)

net_in = W_in @ x_t + U_in @ h_{t-1}.detach() + b_in  
      = 0.8 + 0.1 - 3.0 = -2.1  # Note the negative bias
y_in = sigmoid(-2.1) = 0.11  # "Low confidence, write cautiously"

Step 2: What are we trying to write? (Input Squashing)

net_c = W_c @ x_t + b_c = 1.5
g_val = 4*sigmoid(1.5) - 2 = 4*0.82 - 2 = 1.28

Step 3: Write to memory (CEC Update)

s_c_t = s_c_prev + (y_in * g_val)
      = 1.5 + (0.11 * 1.28)  
      = 1.5 + 0.14 = 1.64  # Slow, controlled update

Step 4: Decide what to reveal (Output Gate)

net_out = W_out @ x_t + U_out @ h_{t-1}.detach() + b_out = 0.5
y_out = sigmoid(0.5) = 0.62  # "Moderate confidence, reveal most of it"

Step 5: Produce hidden state (Output Squashing)

h_val = 2*sigmoid(s_c_t) - 1 = 2*sigmoid(1.64) - 1 = 0.68
h_t = y_out * h_val = 0.62 * 0.68 = 0.42

The cell state grew from 1.5 → 1.64 (preserving the past, adding new info), while the hidden state is a controlled "view" (0.42) of that internal memory.
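Here are those five steps as one runnable sketch, using the same illustrative pre-activations (-2.1, 1.5, 0.5) so you can check the arithmetic:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

net_in, net_c, net_out = -2.1, 1.5, 0.5  # pre-activations from the example
s_c_prev = 1.5                           # cell state carried in

y_in  = sigmoid(net_in)                  # step 1: write permission (~0.11)
g_val = 4.0 * sigmoid(net_c) - 2.0       # step 2: candidate value (~1.27)
s_c_t = s_c_prev + y_in * g_val          # step 3: CEC update (~1.64)
y_out = sigmoid(net_out)                 # step 4: read permission (~0.62)
h_val = 2.0 * sigmoid(s_c_t) - 1.0       # step 5a: squashed memory (~0.68)
h_t   = y_out * h_val                    # step 5b: exposed hidden state (~0.42)
```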


Part 5: The Gauntlet — Original Experiments

The 1997 paper didn't just show equations; it proved LSTMs could solve tasks that were provably impossible for standard RNNs to learn. These weren't toy problems—they were carefully designed stress tests that isolated specific failure modes of gradient-based learning.

We reproduced these classic benchmarks to see if the "primitive" LSTM, with none of the modern tricks (no layer norm, no dropout, no fancy optimizers), still holds up in 2026.

Repository
simple-1997-lstm-reproduction

A faithful reproduction of the 1997 Hochreiter & Schmidhuber paper. Implements the exact architecture, including the 'scissors' and paper-accurate activation ranges.

Python / PyTorch

1. The Adding Problem (Section 5.4)

The Challenge: A sequence of 100+ numbers. Two random numbers are "marked" with a 1.0 in a second input channel. The network must ignore all the noise and output the sum of only those two numbers at the very end.

Why it's impossibly hard for RNNs: The network has to "hold" a precise numerical value in its head for 90+ steps of pure distraction. Standard RNNs fail this because they can't maintain the precision—the "sum" evaporates before they reach the end.

What makes this diabolical: A baseline that always predicts the mean sum gets an MSE of ~0.167. To beat that, you need to selectively remember exactly two values out of 100, ignore the other 98, and maintain floating-point precision across the entire sequence.
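Here's a minimal generator for this task, assuming the common formulation (values uniform in [0, 1], target is the raw sum of the two marked values; the sequence length and seed are arbitrary):

```python
import numpy as np

def adding_problem(seq_len=100, seed=0):
    """One sample: x has a value channel and a marker channel; y is the sum."""
    rng = np.random.default_rng(seed)
    values = rng.uniform(0.0, 1.0, seq_len)
    markers = np.zeros(seq_len)
    i, j = rng.choice(seq_len, size=2, replace=False)
    markers[[i, j]] = 1.0                    # mark exactly two positions
    x = np.stack([values, markers], axis=1)  # shape: (seq_len, 2)
    return x, values[i] + values[j]
```

Everything in the value channel except the two marked entries is pure distraction.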

The Adding Problem (Section 5.4): PASS

  • Paper criterion: absolute error < 0.04
  • Measured final error: 0.0004
  • Sequence length: 100

The CEC acts like a perfect accumulator. It just waits for the markers and 'clicks' the values into place. Watch the input gate: it spikes to ~1.0 only at the marked positions, staying near ~0.0 everywhere else.

What we learned: The input gate learns to be a perfect signal detector. It's not doing fuzzy pattern matching—it's doing binary classification at each timestep: "Is this marked? Yes → write. No → ignore." The cell state becomes a literal accumulator: s_c = 0 + val_1 + 0 + 0 + \dots + val_2.

2. Embedded Reber Grammar (Section 5.1)

The Challenge: Predict the next symbol in a sequence generated by a nested state machine. The grammar looks like:

B → T → S → X → ... → T → E
     ↓         ↓
   (inner)   (inner)

To predict the final symbol 'E', you need to remember that the second symbol you saw was 'T' (indicating you're in the T-branch), even though you've processed an entire "inner" sequence in between.

Why it's hard: It requires hierarchical memory. You're tracking two levels of state: "Where am I in the outer loop?" and "Where am I in the inner loop?"
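Here's a sketch of a generator for this data, using the standard Reber grammar transition table as it's usually presented (this is our reconstruction for illustration, not the paper's exact pipeline):

```python
import random

# Standard Reber grammar: state -> [(symbol, next_state), ...]; state 5 is final.
REBER = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
}

def reber_string(rng):
    """Walk the state machine from start to final state."""
    state, out = 0, ['B']
    while state != 5:
        sym, state = rng.choice(REBER[state])
        out.append(sym)
    return out + ['E']

def embedded_reber(seed=0):
    """Outer shell B T ... T E or B P ... P E around an inner Reber string."""
    rng = random.Random(seed)
    branch = rng.choice(['T', 'P'])
    return ['B', branch] + reber_string(rng) + [branch, 'E']

sample = embedded_reber()
```

Predicting the second-to-last symbol correctly requires remembering `branch` across the entire inner string, which is exactly the long-range dependency the task is designed to test.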

Embedded Reber Grammar (Section 5.1): PASS

  • Measured accuracy: 100%
  • Epochs to convergence: 182

Tests hierarchical memory: 'I'm in the T-branch of the outer loop, and the S-loop of the inner loop.' The cell state maintains a *distributed* encoding of nested context.

What we learned: The cell state doesn't just store a single "fact"—it stores a compressed representation of nested structure. Different dimensions of the cell state vector encode different levels of the hierarchy. This is emergent behavior; we didn't design it, the gradient discovered it.

3. Temporal Order (Section 5.6)

The Challenge: Two symbols (X or Y) appear at random times in a sequence of distractors. Classify the sequence as XX, XY, YX, or YY.

Why it's hard: You can't just remember "I saw an X." You have to remember the order. If you saw an X then a Y, it's different from a Y then an X. And the symbols might be separated by 50 distractors.

The failure mode for RNNs: By the time you see the second symbol, the gradient has vanished so much that the network can't tell the difference between "X appeared first" vs "Y appeared first." Both feel like distant, blurry noise.
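A sketch of a data generator for this task (the distractor alphabet, length, and label encoding are our assumptions, chosen to match the description above):

```python
import random

def temporal_order_sample(seq_len=100, seed=0):
    """X or Y at two random positions among distractors; label is their order."""
    rng = random.Random(seed)
    seq = [rng.choice('abcd') for _ in range(seq_len)]  # distractor symbols
    i, j = sorted(rng.sample(range(seq_len), 2))        # two marked positions
    first, second = rng.choice('XY'), rng.choice('XY')
    seq[i], seq[j] = first, second
    return seq, first + second  # label in {XX, XY, YX, YY}

seq, label = temporal_order_sample()
```

The classifier only sees the label at the very end, so the order of the two symbols has to survive the whole stretch of distractors in between.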

Gradient Highway: LSTM vs RNN

[Interactive diagram: the gradient superhighway. Error flows backward through time along the cell-state chain without decaying. Two things to notice: the additive update means gradients pass through each step unchanged (the derivative is 1.0), and the multiplicative input/output gates are side branches, not roadblocks on the main road.]
What we learned: The LSTM can maintain two pieces of information: "What was the first symbol?" and "Have I seen the second one yet?" The input gate writes the first symbol, stays closed during distractors, then writes the second symbol to a different part of the cell state. The CEC preserves both.

The Meta-Lesson: Gradient Engineering Works

What these experiments prove is not just that "LSTMs are better than RNNs." They prove that designing architectures specifically for gradient flow is possible and necessary.

Every one of these tasks is learnable in principle by an RNN—the information is mathematically present in the hidden state. But the RNN can't learn it because the gradient signal vanishes. The LSTM doesn't have better expressive power; it has better learnability.


Part 6: Legacy — From 1997 to GPT-5

The LSTM was the king of AI for nearly 20 years. If you used Google Translate before 2017, Siri before 2019, or Alexa... ever, you were using an LSTM.

But more importantly: the LSTM taught us a fundamental lesson that echoes through every modern architecture.

1997

Birth of LSTM

Hochreiter and Schmidhuber introduce the CEC and gates to solve vanishing gradients. The paper is largely ignored at first—recurrent networks are considered a dead end.

1999

The Forget Gate

Felix Gers adds the ability to reset memory, allowing LSTMs to process infinite streams of data. This turns the LSTM from a 'batch processor' into a true online learner.

2005-2013

The Silent Takeover

LSTMs quietly become the backbone of speech recognition (Google, Microsoft), translation (Google Translate), and text-to-speech. No fanfare—just results.

2014

The GRU

Kyunghyun Cho simplifies the LSTM into the Gated Recurrent Unit—fewer parameters, similar performance. Shows that the core idea (gated additive updates) is more important than the specific architecture.

2017

The Transformer

The 'Attention is All You Need' paper shows we can process sequences without any loops at all by using Attention. LSTMs begin to fade from cutting-edge research.

Today

Residual Connections

The core trick of the Transformer—the Residual Connection ($x + f(x)$)—is mathematically the same idea as the CEC: a linear pathway for gradients to flow. Hochreiter's insight is now *everywhere*.

The Idea That Wouldn't Die

Here's the thing that blows my mind: if you look at a modern Transformer, you'll find the LSTM's DNA everywhere.

Residual Connections (used in ResNets, Transformers, etc.): output = x + f(x)

LSTM Cell State Update: s_c = s_{c,prev} + (gate · input)

See it? They're the same idea. Don't multiply the signal—add to it. Give the gradient a highway.

The only difference is that Transformers use residual connections between layers, while LSTMs use them across time. But the principle is identical: preserve a direct path for gradients to flow.
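You can even check the claim numerically: wrap any sub-layer f in a residual connection, and the derivative picks up an identity term, so the gradient survives even where f itself saturates. (A finite-difference sketch with an arbitrary f.)

```python
import numpy as np

def f(x):
    return np.tanh(1.7 * x)     # some saturating sub-layer

def residual(x):
    return x + f(x)             # Transformer-style: output = x + f(x)

x0, eps = 3.0, 1e-6             # deep in tanh's saturated region

# Central finite differences: gradient through f alone vs through x + f(x).
d_plain = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)              # nearly 0
d_res = (residual(x0 + eps) - residual(x0 - eps)) / (2 * eps)  # nearly 1
```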

So why still learn LSTMs today, when Transformers dominate? Three reasons:

  1. Conceptual Clarity: Understanding LSTMs teaches you the physics of gradient flow in a simpler setting than Attention. The CEC is pure and clean; it isolates the single idea of "additive updates for constant gradients."

  2. Resource Constraints: Transformers have O(N^2) memory cost for sequence length N. LSTMs are O(N). For real-time sensor processing, long audio streams, or edge devices, LSTMs are still superior.

  3. Online Learning: Transformers need the whole sequence upfront (for self-attention). LSTMs can process infinite streams token-by-token with constant memory. For robotics, streaming applications, and online adaptation, LSTMs remain essential.

Plus, there's a philosophical reason: Transformers parallelized sequence processing by removing recurrence. But many problems are inherently sequential. The LSTM's constraint (process one step at a time) is sometimes a feature, not a bug.

What the LSTM Really Taught Us

The 1997 paper wasn't just a new architecture. It was a proof of concept for a new way of thinking about neural network design:

  1. Don't just make networks deeper; make them learnable. It doesn't matter how expressive your model is if the gradient can't reach the parameters.

  2. You can engineer gradient flow. The CEC wasn't discovered by accident—it was carefully designed based on first-principles analysis of backpropagation.

  3. Simple mechanisms can solve hard problems. The CEC is literally just addition. But combined with gates (sigmoids), it solves problems that stumped researchers for years.

This mindset led directly to:

  • Batch Normalization (2015): Normalize activations to stabilize gradients
  • Residual Connections (2015): Highway for gradients in deep networks
  • Layer Normalization (2016): Stabilize recurrent network training
  • Attention (2017): Replace recurrence with parallelizable weighted averaging

Every one of these innovations shares the LSTM's core insight: design for the gradient first, the forward pass second.


Conclusion: The Water Bearer

In the Aquarius constellation, the water bearer carries a vessel across the sky, ensuring not a drop is spilled. That is the LSTM. It carries the "water" of information across the desert of time, protecting it from the evaporation of vanishing gradients.

The 1997 paper wasn't just a new architecture; it was a masterclass in gradient engineering. It taught us that if you want a network to learn, you have to build it a road that the signal can actually travel on.

What You Should Take Away

If you remember nothing else from this post, remember these three insights:

  1. The forward pass is not the bottleneck; the backward pass is. RNNs could represent long-term dependencies in theory. But they couldn't learn them because the gradient vanished. Architecture design is about learnability, not just expressiveness.

  2. Addition is magical. s_c = s_{c,prev} + new_info has a derivative of exactly 1.0 with respect to s_{c,prev}. This simple algebraic fact enabled 20 years of AI progress. Sometimes the most powerful tools are the simplest.

  3. Gates are learnable control flow. The input/output gates are not just "hyperparameters" or "activations"—they're conditional statements that learn themselves. "IF this input looks important THEN write it to memory" becomes a differentiable computation.

The Human Element

Here's what I love most about the LSTM story: it wasn't an accident. Hochreiter didn't stumble upon this architecture by randomly trying things. He derived it from first principles.

He asked: "Why do gradients vanish?" → Repeated multiplication.
He asked: "How do we prevent repeated multiplication?" → Use addition.
He asked: "How do we control what gets added?" → Use gates.

This is engineering at its finest. Not trial and error, not hyperparameter sweeps, not throwing compute at the wall. Deep understanding leading to elegant solutions.

When I reproduced the 1997 experiments, I felt like an archaeologist finding a working piece of ancient technology. The LSTM is 29 years old. It predates Google, predates Facebook, predates the term "deep learning." And yet it still works, with zero modifications, on tasks designed to break it.

That's the mark of a fundamental idea.


Join the Reproduction

This post is part of Project Aquarius—our mission to rebuild the classics from scratch and extract the timeless principles of AI.

We believe that true understanding comes from reimplementation. Check out our full implementation guide to:

  • Run the exact experiments from the 1997 paper
  • Experiment with the "scissors" and see what happens when you remove them
  • Compare 1997 LSTM vs modern LSTM vs GRU vs Transformer
  • Visualize gradient flow through time in real-time

The code is clean, documented, and designed for learning. Join us in bringing the foundations of AI back to life.

END OF TRANSMISSION
ARCHIVE