Post-Training
A short introduction to RLHF and post-training: PDF
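To make the post-training reading above concrete, here is a minimal sketch of one common preference-tuning objective, direct preference optimization (DPO), written in PyTorch. The function name and the assumption that log-probabilities are already summed per sequence are illustrative, not taken from the linked introduction.

```python
# Minimal sketch of a DPO-style preference loss (illustrative, not from the linked PDF).
# Inputs are per-sequence log-probabilities under the policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Push the policy to prefer the chosen response over the rejected one
    # more strongly than the reference model does, scaled by beta.
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()
```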
Inference
Lil'Log: Inference Optimization – Distillation, Quantization, Pruning, Sparsity, Mixture-of-Experts, Architectural Optimization
Deep Dive: Optimizing LLM inference
Assisted Generation: a new direction toward low-latency text generation – Hugging Face (a minimal code sketch follows this list)
LLM Transformer Inference Guide – Baseten
Fast LLM Inference From Scratch – Andrew Chan
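The Hugging Face post above covers assisted (speculative) generation, where a small draft model proposes several tokens cheaply and the larger target model verifies them in a single forward pass. A minimal sketch using the `assistant_model` argument of transformers' `generate` is below; the gpt2/gpt2-xl pairing is illustrative, and any two models sharing a tokenizer work.

```python
# Minimal sketch of assisted generation with Hugging Face transformers.
# Model names are illustrative; the draft and target models must share a tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")    # large target model
assistant = AutoModelForCausalLM.from_pretrained("gpt2")   # small draft model

inputs = tokenizer("Assisted generation speeds up decoding by", return_tensors="pt")
# The assistant drafts candidate tokens; the target model verifies them in one
# forward pass and keeps the longest accepted prefix, reducing decode latency.
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```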
Scaling Training
Scaling GPU clusters and data-parallelism (a minimal code sketch follows this list)
How to Scale Your Model – A Systems View of LLMs on TPUs
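As a companion to the scaling guides above, here is a minimal data-parallel training sketch using PyTorch DistributedDataParallel. The tiny linear model, batch size, and hyperparameters are placeholders; the general pattern (one process per GPU, gradients averaged across ranks during the backward pass) is what the guides discuss in depth.

```python
# Minimal data-parallel training sketch with PyTorch DDP (illustrative).
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                      # one process per GPU
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

model = torch.nn.Linear(1024, 1024).cuda(rank)       # stand-in for a real model
model = DDP(model, device_ids=[rank])                # wraps model for gradient all-reduce
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    x = torch.randn(32, 1024, device=f"cuda:{rank}") # each rank sees its own data shard
    loss = model(x).pow(2).mean()                    # placeholder loss
    loss.backward()                                  # DDP averages gradients across ranks
    opt.step()
    opt.zero_grad()

dist.destroy_process_group()
```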