PCA hash deduplication

A cheap way to clean out near-duplicates from large image datasets.
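A rough sketch of the idea, assuming image features are already extracted and in memory; the bit count, threshold, and function names are illustrative rather than what the post actually uses:

```python
import numpy as np

def pca_hashes(features: np.ndarray, n_bits: int = 64) -> np.ndarray:
    """Project features onto the top principal components and sign-binarise.

    features: (n_images, d) array, e.g. embeddings or flattened pixels.
    Returns an (n_images, n_bits) boolean array of hash bits.
    """
    centred = features - features.mean(axis=0)
    # Top principal directions via SVD of the centred data.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    projected = centred @ vt[:n_bits].T
    return projected > 0  # one bit per principal component

def near_duplicates(hashes: np.ndarray, max_hamming: int = 3):
    """Return index pairs whose hashes differ in at most `max_hamming` bits."""
    pairs = []
    for i in range(len(hashes)):
        # Hamming distance between image i and every later image.
        dists = (hashes[i] != hashes[i + 1:]).sum(axis=1)
        for j in np.flatnonzero(dists <= max_hamming):
            pairs.append((i, i + 1 + j))
    return pairs
```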

Memory in Transformers (2): Associative Memory as Test-Time Regression

Instead of naively compressing the KV cache, new architectures (e.g. Mamba, RWKV, DeltaNet, Titans) weight associative memories differentially. We can generalise these approaches as test-time regression solvers that dynamically adjust weights to reflect the relevance of past information.
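In delta-rule form, the test-time-regression view looks roughly like the sketch below; shapes and the step size are illustrative, not any particular paper's parameterisation:

```python
import numpy as np

def delta_rule_memory(keys, values, queries, beta=0.5):
    """Treat the memory as an online least-squares regression from keys to values.

    Each step takes one gradient step on ||S @ k - v||^2 (the delta rule),
    so poorly-predicted associations receive larger updates than a plain
    additive memory would give them.
    keys, queries: (T, d_k); values: (T, d_v). Returns (T, d_v) read-outs.
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d_v, d_k))                 # associative memory / regression weights
    outputs = np.empty_like(values)
    for t, (k, v, q) in enumerate(zip(keys, values, queries)):
        error = v - S @ k                    # prediction error for this association
        S = S + beta * np.outer(error, k)    # test-time gradient step
        outputs[t] = S @ q                   # read the memory with the query
    return outputs
```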

Memory in Transformers (1): Linear Attention

Transformers use a KV cache as their working memory. This scales linearly with context size, when it should scale with information content. Linear attention seeks to compress this cache into a finite hidden state, which can be understood as a selective associative memory.
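A minimal sketch of that compression, with an illustrative feature map `phi` standing in for whichever kernel a given linear-attention variant uses:

```python
import numpy as np

def linear_attention(queries, keys, values, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Causal linear attention with a running hidden state instead of a KV cache.

    S accumulates sum_t phi(k_t) v_t^T and z accumulates sum_t phi(k_t),
    so memory is constant in sequence length (the finite hidden state).
    queries/keys: (T, d_k); values: (T, d_v). Returns (T, d_v).
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d_k, d_v))   # associative memory: feature-key -> value
    z = np.zeros(d_k)          # running normaliser
    out = np.empty_like(values)
    for t in range(len(keys)):
        k, q = phi(keys[t]), phi(queries[t])
        S += np.outer(k, values[t])
        z += k
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out
```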

The Unreasonable Effectiveness of Reasoning

Completely unfinished scribbles on understanding reasoning models.

The Blessings of Dimensionality

High-dimensional space has some unintuitive properties that complicate the separation of information. It turns out, however, that this is actually a blessing in disguise for real-world data. This is your antidote to statistical learning theory.
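One such property, sketched numerically: random directions become nearly orthogonal as the dimension grows (the dimensions and sample counts below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_pairs=1000):
    """Average |cosine similarity| between pairs of random Gaussian vectors."""
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.abs(cos).mean()

for d in (2, 10, 100, 10_000):
    print(d, round(mean_abs_cosine(d), 3))  # shrinks roughly like 1/sqrt(d)
```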

Diffusion Models

Some maths behind diffusion models, connecting the ELBO to the training objective. As with most things... you are probably better off writing the [code](https://github.com/victorfiz/stable_diffusion/blob/main/pipeline/diffusion.py) than reading the maths.
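For flavour, a sketch of the simplified noise-prediction objective that the ELBO reduces to; `eps_model` and the schedule are placeholders, not the repo's actual interface:

```python
import numpy as np

def ddpm_training_step(x0, eps_model, alphas_cumprod, rng=np.random.default_rng()):
    """Simplified DDPM objective: predict the noise added at a random timestep.

    x0: (batch, ...) clean images; eps_model(x_t, t) -> predicted noise;
    alphas_cumprod: (T,) cumulative product of the noise schedule.
    Returns the mean-squared error between true and predicted noise.
    """
    T = len(alphas_cumprod)
    t = rng.integers(0, T, size=x0.shape[0])
    a_bar = alphas_cumprod[t].reshape(-1, *([1] * (x0.ndim - 1)))
    eps = rng.standard_normal(x0.shape)
    # Forward process in closed form: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)
```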