Memory in Transformers (2): Associative Memory as Test-Time Regression
February 08, 2025
Instead of naively compressing the KV cache, newer architectures (e.g. Mamba, RWKV, DeltaNet, Titans) differentially weight associative memories. We can generalise these approaches as test-time regression solvers that dynamically adjust weights to reflect the relevance of past information.
Memory in Transformers (1): Linear Attention
February 02, 2025
Your transformer uses a KV cache as its working memory. This scales linearly with context size, when it should scale with information content. Linear attention seeks to compress this cache into a fixed-size hidden state – a rudimentary first step toward selective associative memory.
The Unreasonable Effectiveness of Reasoning
January 25, 2025
A very unscientific framework for understanding reasoning models, and its implications for the future of training.
January 23, 2025
New scaling laws may be on the horizon – R1 and o3 are RL-trained by letting the models learn via self-play. Why is this possible now? And what does this mean for the future of model capabilities?
The Blessings of Dimensionality
December 15, 2024
High-dimensional space is often said to be cursed. It has some unintuitive properties that complicate the separation of meaningful information in traditional machine learning. Looking closer, it turns out that this is actually a blessing in disguise for real-world data.
December 07, 2024
The maths behind diffusion models, covering forward and reverse processes, ELBO, and training objectives.