Memory in Transformers (2): Associative Memory as Test-Time Regression

Instead of naively compressing the KV cache, newer architectures (Mamba, RWKV, DeltaNet, Titans, and others) differentially weight associative memories. We can generalise these approaches as test-time regression solvers that dynamically adjust their weights to reflect the relevance of past information.
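
As a rough illustration of the test-time regression view (a sketch with invented names, not code from the post): a delta-rule update treats the memory matrix as an online least-squares problem, nudging it whenever a stored key recalls the wrong value.

```python
import numpy as np

def delta_rule_update(S, k, v, lr=1.0):
    """One test-time regression step on the memory matrix S.

    S maps keys to values; the update moves S @ k toward v,
    i.e. an online least-squares (delta rule) correction.
    """
    pred = S @ k                         # current recall for key k
    S = S + lr * np.outer(v - pred, k)   # correct the residual
    return S

# toy usage: store one association, then query it back
d = 4
S = np.zeros((d, d))
k1, v1 = np.eye(d)[0], np.arange(d, dtype=float)
S = delta_rule_update(S, k1, v1)
print(S @ k1)  # recovers v1
```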

Memory in Transformers (1): Linear Attention

Your transformer uses its KV cache as working memory. The cache scales linearly with context length, when it should scale with the amount of information. Linear attention seeks to compress it into a fixed-size hidden state – a rudimentary first step toward selective associative memory.
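
A minimal sketch of that compression, assuming a simple positive feature map (the names and normalisation are illustrative, not the post's implementation): the KV cache is replaced by a single matrix accumulated from outer products of keys and values, and each query reads from that state.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Causal linear attention with a fixed-size recurrent state.

    Instead of caching all keys/values, maintain S = sum_t phi(k_t) v_t^T
    and z = sum_t phi(k_t); each output is read from this compressed state.
    """
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # associative memory (d x d_v)
    z = np.zeros(d)                 # running normaliser
    out = np.zeros_like(V)
    for t in range(T):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z)
    return out

# toy usage
T, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d))
print(linear_attention(Q, K, V).shape)  # (8, 4)
```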

The Unreasonable Effectiveness of Reasoning

A very unscientific framework for understanding reasoning models, and its implications for the future of training.

Scaling Self-Play

New scaling laws may be on the horizon – R1 and o3 are RL-trained by letting the models learn via self-play. Why is this possible now? And what does this mean for the future of model capabilities?

The Blessings of Dimensionality

High-dimensional space is often said to be cursed: it has unintuitive properties that complicate the separation of meaningful information in traditional machine learning. On closer inspection, this turns out to be a blessing in disguise for real-world data.
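
A quick numerical illustration of one such property (a toy experiment, not from the post): as dimension grows, pairwise distances between random points concentrate, so "near" and "far" neighbours become hard to tell apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
for d in (2, 100, 10_000):
    X = rng.normal(size=(n, d))
    sq = (X * X).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T          # squared pairwise distances
    dist = np.sqrt(np.maximum(d2[np.triu_indices(n, 1)], 0))
    # relative spread of distances shrinks as d grows: distances concentrate
    print(f"d={d:>6}: relative spread = {dist.std() / dist.mean():.3f}")
```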

Diffusion Models

The maths behind diffusion models, covering forward and reverse processes, ELBO, and training objectives.
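
For reference, the forward (noising) process in standard DDPM notation, as a sketch of the kind of maths the post covers:

```latex
% Forward process and its closed form, standard DDPM notation
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t) I\right),
\quad \bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s)
```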