March 02, 2025
A cheap way to clean out near-duplicates from large image datasets.
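A minimal sketch of one cheap approach (perceptual hashing; the post's exact method may differ, and `max_distance` is an illustrative threshold):

```python
# Near-duplicate detection via perceptual hashing.
# Requires the Pillow and imagehash packages.
from pathlib import Path
from PIL import Image
import imagehash

def find_near_duplicates(image_dir: str, max_distance: int = 4):
    """Return pairs of paths whose perceptual hashes are within max_distance bits."""
    seen = {}          # path -> perceptual hash
    duplicates = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        for other_path, other_hash in seen.items():
            if h - other_hash <= max_distance:  # Hamming distance between hashes
                duplicates.append((path, other_path))
        seen[path] = h
    return duplicates
```

The pairwise scan is O(n²); for genuinely large datasets you would bucket images by hash prefix, or use an approximate nearest-neighbour index, before comparing.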
Memory in Transformers (2): Associative Memory as Test-Time Regression
February 08, 2025
Instead of naively compressing the KV cache, newer architectures (e.g. Mamba, RWKV, DeltaNet, Titans) differentially weight associative memories. We can generalise these approaches as test-time regression solvers that dynamically adjust weights to reflect the relevance of past information.
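A hedged sketch of that framing in common notation (the symbols $S_t, k_i, v_i, \gamma_{t,i}, \beta_t$ are assumptions, not necessarily the post's): the memory $S_t$ is the solution to a weighted least-squares problem over past key-value associations,

$$
S_t \;=\; \arg\min_{S} \sum_{i \le t} \gamma_{t,i} \,\lVert S k_i - v_i \rVert^2,
$$

and the delta rule used by DeltaNet-style models is one online solver: a gradient step on the current pair's error with a data-dependent step size,

$$
S_t \;=\; S_{t-1} - \beta_t \,(S_{t-1} k_t - v_t)\, k_t^\top.
$$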
Memory in Transformers (1): Linear Attention
February 02, 2025
Transformers use a KV cache as their working memory. This scales linearly with context size, when it should scale with the amount of information. Linear attention seeks to compress this cache into a fixed-size hidden state, which can be understood as selective associative memory.
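A minimal sketch of the recurrent view, roughly following Katharopoulos et al. (2020) (the feature map $\phi$ and symbols are assumptions): with keys $k_t$, values $v_t$, and queries $q_t$,

$$
S_t = S_{t-1} + v_t\,\phi(k_t)^\top, \qquad
z_t = z_{t-1} + \phi(k_t), \qquad
o_t = \frac{S_t\,\phi(q_t)}{z_t^\top \phi(q_t)},
$$

so the entire history is folded into the fixed-size state $(S_t, z_t)$, independent of context length.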
The Unreasonable Effectiveness of Reasoning
January 25, 2025
Completely unfinished scribbles on understanding reasoning models.
The Blessings of Dimensionality
December 15, 2024
High-dimensional space has some unintuitive properties that complicate the separation of information. It turns out, however, that this is actually a blessing in disguise for real-world data. This is your antidote to statistical learning theory.
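One concrete instance of those unintuitive properties: two independent random unit vectors $u, v \in \mathbb{R}^d$ are nearly orthogonal with high probability, since

$$
\mathbb{E}[\langle u, v\rangle] = 0, \qquad \mathrm{Var}[\langle u, v\rangle] = \tfrac{1}{d},
$$

so in high dimension randomly placed features barely interfere with one another.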
December 07, 2024
Some maths behind diffusion models, connecting the ELBO and training objectives. As with most things... you are probably better off writing the [code](https://github.com/victorfiz/stable_diffusion/blob/main/pipeline/diffusion.py) than reading the maths.
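For reference, a sketch of the punchline in standard DDPM notation (Ho et al., 2020; the exact weighting in the post may differ): the ELBO reduces, up to per-timestep weights, to a simple noise-prediction objective,

$$
\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\big\lVert \epsilon - \epsilon_\theta\!\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\big)\big\rVert^2\right].
$$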