Memory in Transformers (2): Associative Memory as Test-Time Regression

Instead of naively compressing the KV cache, newer architectures (Mamba, RWKV, DeltaNet, Titans, and others) differentially weight associative memories. We can generalise these approaches as test-time regression solvers that dynamically adjust their weights to reflect the relevance of past information.
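
As a rough illustration of the test-time regression view (a sketch with invented names, not code from the post): a delta-rule update treats the memory matrix as an online least-squares problem, nudging it whenever a stored key recalls the wrong value.

```python
import numpy as np

def delta_rule_update(S, k, v, lr=1.0):
    """One test-time regression step on the memory matrix S.

    S maps keys to values; the update moves S @ k toward v,
    i.e. an online least-squares (delta rule) correction.
    """
    pred = S @ k                         # current recall for key k
    S = S + lr * np.outer(v - pred, k)   # correct the residual
    return S

# toy usage: store one association, then query it back
d = 4
S = np.zeros((d, d))
k1, v1 = np.eye(d)[0], np.arange(d, dtype=float)
S = delta_rule_update(S, k1, v1)
print(S @ k1)  # recovers v1
```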

Memory in Transformers (1): Linear Attention

Your transformer uses its KV cache as working memory. The cache scales linearly with context length, when it should scale with the amount of information. Linear attention seeks to compress it into a fixed-size hidden state – a rudimentary first step toward selective associative memory.
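
A minimal sketch of that compression, assuming a simple positive feature map (the names and normalisation are illustrative, not the post's implementation): the KV cache is replaced by a single matrix accumulated from outer products of keys and values, and each query reads from that state.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Causal linear attention with a fixed-size recurrent state.

    Instead of caching all keys/values, maintain S = sum_t phi(k_t) v_t^T
    and z = sum_t phi(k_t); each output is read from this compressed state.
    """
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # associative memory (d x d_v)
    z = np.zeros(d)                 # running normaliser
    out = np.zeros_like(V)
    for t in range(T):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z)
    return out

# toy usage
T, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d))
print(linear_attention(Q, K, V).shape)  # (8, 4)
```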

The Unreasonable Effectiveness of Reasoning

A very unscientific framework for understanding reasoning models, and its implications for the future of training.

Scaling Self-Play

New scaling laws may be on the horizon – R1 and o3 are RL-trained by letting the models learn via self-play. Why is this possible now? And what does this mean for the future of model capabilities?

The Blessings of Dimensionality

High-dimensional space is often said to be cursed: it has unintuitive properties that complicate the separation of meaningful information in traditional machine learning. On closer inspection, this turns out to be a blessing in disguise for real-world data.
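
A quick numerical illustration of one such property (a toy experiment, not from the post): as dimension grows, pairwise distances between random points concentrate, so "near" and "far" neighbours become hard to tell apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
for d in (2, 100, 10_000):
    X = rng.normal(size=(n, d))
    sq = (X * X).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T          # squared pairwise distances
    dist = np.sqrt(np.maximum(d2[np.triu_indices(n, 1)], 0))
    # relative spread of distances shrinks as d grows: distances concentrate
    print(f"d={d:>6}: relative spread = {dist.std() / dist.mean():.3f}")
```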

Diffusion Models

The maths behind diffusion models, covering forward and reverse processes, ELBO, and training objectives.
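
For reference, the forward (noising) process in standard DDPM notation, as a sketch of the kind of maths the post covers:

```latex
% Forward process and its closed form, standard DDPM notation
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t) I\right),
\quad \bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s)
```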