Foundations
1. What is Language?
- Tokens: Atomic units of text (words/subwords)
- Syntax: Rules for sentence structure
- Semantics: Meaning representation
2. Neural Networks Basics
- Perceptrons → MLPs
- Activation Functions (ReLU, Softmax)
- Backpropagation & Gradient Descent
- Matrix Operations (Dot products, Attention as similarity scoring)
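To make the MLP/backprop items concrete, here is a minimal sketch in PyTorch (the layer sizes and toy data are arbitrary choices for illustration, not from the outline):

```python
import torch
import torch.nn as nn

# A tiny 2-layer MLP: linear -> ReLU -> linear (softmax is applied inside the loss)
mlp = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))

x = torch.randn(8, 4)                 # batch of 8 toy inputs
y = torch.randint(0, 3, (8,))         # toy class labels
loss = nn.CrossEntropyLoss()(mlp(x), y)

loss.backward()                       # backpropagation: fills p.grad for every parameter
with torch.no_grad():                 # one manual gradient-descent step
    for p in mlp.parameters():
        p -= 0.01 * p.grad
```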
3. Probability & Statistics
- Conditional Probability (P(word|context))
- Entropy & Cross-Entropy Loss
- Maximum Likelihood Estimation (MLE)
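A quick illustration of how P(word | context), cross-entropy, and MLE connect (the 5-word vocabulary and logits are made up for the example):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 0.5, -1.0, 0.0, 1.5])   # model scores over a toy 5-word vocab
probs = F.softmax(logits, dim=-1)                    # P(word | context)

target = torch.tensor(0)                             # index of the word that actually followed
nll = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
# cross_entropy == -log P(target | context); minimizing it over a corpus is exactly MLE
assert torch.isclose(nll, -probs[0].log())
```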
Core LLM Architecture
1. The Transformer Block
- Self-Attention Mechanism:
  - Query, Key, Value matrices (Q, K, V)
  - Scaled Dot-Product Attention: \( \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \) (sketched in code below)
- Multi-Head Attention
- Positional Encoding (Sinusoidal/Learned)
- Layer Normalization & Residual Connections
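A minimal sketch of the scaled dot-product attention written above (the shapes are illustrative; a full multi-head layer would add the Q/K/V projections and an output projection):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ V

# Toy shapes: batch=2, heads=4, sequence length=10, head dim d_k=16
Q = K = V = torch.randn(2, 4, 10, 16)
out = scaled_dot_product_attention(Q, K, V)   # -> (2, 4, 10, 16)
```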
2. Tokenization
- Byte-Pair Encoding (BPE)
- WordPiece vs SentencePiece
- Vocabulary Size Tradeoffs
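If the Hugging Face `transformers` library is available, BPE behavior can be inspected directly; the GPT-2 tokenizer is just one convenient example of a byte-level BPE vocabulary:

```python
from transformers import AutoTokenizer

# GPT-2 uses byte-level BPE with a ~50k vocabulary
tok = AutoTokenizer.from_pretrained("gpt2")

print(tok.tokenize("Tokenization splits rare words into subwords."))
print(tok.encode("Tokenization splits rare words into subwords."))
# Vocabulary size tradeoff: a larger vocab gives shorter sequences but a bigger embedding matrix
print(tok.vocab_size)
```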
3. Embeddings
- Token Embeddings (d_model dimensions)
- Contextual Embeddings (vs static word2vec)
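Token embeddings are just a learned lookup table of shape (vocab_size, d_model); a minimal sketch with arbitrary sizes:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 768
embed = nn.Embedding(vocab_size, d_model)     # one d_model-dim vector per token id

token_ids = torch.tensor([[15, 42, 7, 42]])   # note: token 42 appears twice
vectors = embed(token_ids)                    # -> (1, 4, 768)

# Static lookup: both occurrences of token 42 get identical vectors here.
# Contextual embeddings only arise after these vectors pass through attention layers.
assert torch.equal(vectors[0, 1], vectors[0, 3])
```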
Training LLMs
1. Pre-training vs Fine-tuning
- Pre-training:
  - Masked Language Modeling (BERT-style)
  - Autoregressive Modeling (GPT-style; loss sketched below)
  - Next Sentence Prediction (original BERT; dropped in later variants such as RoBERTa)
- Fine-tuning:
  - Instruction Tuning
  - RLHF (Reinforcement Learning from Human Feedback)
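The autoregressive objective reduces to next-token cross-entropy with shifted labels; a sketch using a stand-in model (any module mapping token ids to logits would do):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
# Stand-in "language model": embedding + linear head (a real model would add transformer blocks)
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

tokens = torch.randint(0, vocab_size, (2, 12))      # batch of 2 sequences, length 12
logits = model(tokens)                              # -> (2, 12, vocab_size)

# Predict token t+1 from positions <= t: drop the last logit, drop the first label
shift_logits = logits[:, :-1, :]
shift_labels = tokens[:, 1:]
loss = F.cross_entropy(shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1))
```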
2. Data Pipeline
- Corpus Collection (Common Crawl, Books, Code)
- Cleaning & Deduplication
- Batch Construction (Dynamic Padding, Attention Masks)
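Batch construction with dynamic padding and an attention mask, sketched in plain PyTorch (pad id 0 is an arbitrary choice here):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three tokenized examples of different lengths
examples = [torch.tensor([5, 9, 2]), torch.tensor([7, 1]), torch.tensor([3, 3, 3, 8])]

# Dynamic padding: pad only to the longest sequence in this batch, not a global max length
input_ids = pad_sequence(examples, batch_first=True, padding_value=0)   # -> (3, 4)
attention_mask = (input_ids != 0).long()                                # 1 = real token, 0 = padding

print(input_ids)
print(attention_mask)   # passed to the model so attention ignores padded positions
```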
3. Optimization
- AdamW Optimizer (with weight decay)
- Learning Rate Schedules (Cosine, Warmup)
- Mixed Precision Training (FP16/FP32)
- Gradient Checkpointing
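A sketch of AdamW with linear warmup followed by cosine decay (the step counts and hyperparameters are placeholders, not recommendations):

```python
import math
import torch

model = torch.nn.Linear(10, 10)                     # stand-in for the LLM
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, total_steps = 2_000, 100_000

def lr_lambda(step):
    # Linear warmup, then cosine decay toward zero (a common LLM schedule)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# training loop: loss.backward(); opt.step(); sched.step(); opt.zero_grad()
```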
4. Hardware Considerations
- GPU/TPU Utilization
- Model Parallelism (Tensor/Pipeline)
- Memory Optimization (Flash Attention)
Advanced Concepts
1. Attention Variants
- Sparse Attention (Longformer)
- Linear Attention (the "Transformers are RNNs" formulation)
- Memory-Augmented Attention
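Sparse attention can be illustrated with a banded mask that lets each token attend only to a local window (the window size is arbitrary; real implementations like Longformer avoid materializing the full matrix):

```python
import torch

seq_len, window = 8, 2

# mask[i, j] == True where token i may attend to token j (|i - j| <= window)
idx = torch.arange(seq_len)
local_mask = (idx[None, :] - idx[:, None]).abs() <= window

print(local_mask.int())
# Dense attention would masked_fill(~local_mask, -inf) before the softmax;
# true sparse kernels skip the masked positions entirely, giving O(n * window) cost.
```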
2. Scaling Laws
- Chinchilla Scaling: Compute vs Data vs Model Size
- Emergent Abilities at Scale
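The Chinchilla result is often summarized as roughly 20 training tokens per parameter at compute-optimal scale, with training compute approximated by C ≈ 6·N·D; a back-of-envelope sketch (the model size is just an example):

```python
# Rough compute-optimal sizing in the spirit of the Chinchilla paper
params = 7e9                       # example model size: 7B parameters
tokens = 20 * params               # ~20 tokens per parameter heuristic
flops = 6 * params * tokens        # common approximation: C ~ 6 * N * D

print(f"tokens: {tokens:.2e}, training FLOPs: {flops:.2e}")
# -> tokens: 1.40e+11, training FLOPs: 5.88e+21
```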
3. Model Compression
- Quantization (FP32 → INT8)
- Pruning (Removing “unimportant” weights)
- Distillation (Teacher → Student models)
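A minimal sketch of symmetric per-tensor INT8 quantization, the simplest form of the FP32 → INT8 step (real toolchains use per-channel scales and calibration data):

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor quantization: w ≈ scale * q, with q in [-127, 127]."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(4, 4)                    # stand-in FP32 weight matrix
q, scale = quantize_int8(w)
w_hat = q.float() * scale                # dequantized approximation

print((w - w_hat).abs().max())           # worst-case error is about scale / 2
```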
4. Alignment & Safety
- RLHF Pipeline
- Constitutional AI
- Toxicity Mitigation
Applications
1. Text Generation
- Sampling Strategies (Greedy, Beam Search, Top-k)
- Temperature & Repetition Penalty
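Temperature and top-k sampling applied to one step of next-token logits (the logits and k are toy values):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=50):
    logits = logits / temperature                      # <1 sharpens, >1 flattens the distribution
    k = min(top_k, logits.size(-1))
    topk_vals, topk_idx = torch.topk(logits, k)        # keep only the k highest-scoring tokens
    probs = F.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)   # sample within the truncated set
    return topk_idx[choice]

logits = torch.randn(1000)                             # toy vocabulary of 1000 tokens
print(sample_next_token(logits).item())
```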
2. Downstream Tasks
- Question Answering
- Summarization
- Code Generation
3. Retrieval-Augmented Generation (RAG)
- Vector Databases (FAISS, Pinecone)
- Dense vs Sparse Retrieval
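The core of dense retrieval is nearest-neighbor search over embedding vectors; a numpy sketch standing in for what FAISS or a vector database does at scale (the embeddings here are random placeholders rather than real encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 384))          # pretend encoder outputs for 1000 chunks
query_embedding = rng.normal(size=384)

# Cosine similarity = dot product of L2-normalized vectors
docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
query = query_embedding / np.linalg.norm(query_embedding)

scores = docs @ query
top_ids = np.argsort(-scores)[:5]                      # 5 most similar chunks
print(top_ids, scores[top_ids])
# These chunks would be inserted into the prompt to ground the model's answer.
```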
Challenges
1. Hallucinations
- Factual Inconsistency
- Mitigation: Grounding with Knowledge Bases
2. Computational Costs
- Training: Millions of GPU-hours
- Inference Latency (KV Caching, Speculative Decoding)
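KV caching speeds up autoregressive inference by storing each layer's keys and values so a new token only computes attention against the cache instead of re-encoding the whole prefix; a shape-level sketch (dimensions are illustrative):

```python
import torch

d_k, n_heads = 64, 8
k_cache = torch.empty(1, n_heads, 0, d_k)    # (batch, heads, seq_len_so_far, d_k)
v_cache = torch.empty(1, n_heads, 0, d_k)

for step in range(5):                         # decode 5 tokens one at a time
    # In a real model these come from projecting the single newest token's hidden state
    k_new = torch.randn(1, n_heads, 1, d_k)
    v_new = torch.randn(1, n_heads, 1, d_k)
    k_cache = torch.cat([k_cache, k_new], dim=2)
    v_cache = torch.cat([v_cache, v_new], dim=2)
    # Attention for the new token uses the full cache: q_new @ k_cache^T, then @ v_cache

print(k_cache.shape)   # torch.Size([1, 8, 5, 64]) -- grows by one position per decoded token
```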
3. Ethical Concerns
- Bias Amplification
- Environmental Impact
Tools & Frameworks
1. Libraries
- Hugging Face Transformers
- PyTorch Lightning
- TensorFlow/JAX
2. Visualization & Interpretability
- Attention Head Maps
- Embedding Projectors (UMAP/t-SNE)
3. Deployment
- ONNX Runtime
- Triton Inference Server
- Quantized Models (GGML)
Math Deep Dives (Optional)
1. Attention Math
- \( \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \) Derivation
- Gradient Flow in Attention
2. Loss Functions
- Perplexity: \( \exp\left(-\frac{1}{N} \sum_{i=1}^N \log p(x_i \mid x_{<i})\right) \)
- KL Divergence for Distillation
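Perplexity as defined above is the exponential of the mean next-token cross-entropy, and distillation typically minimizes KL divergence between teacher and student token distributions; a sketch of both (the logits are random placeholders):

```python
import torch
import torch.nn.functional as F

vocab, seq_len = 100, 20
logits = torch.randn(seq_len, vocab)                 # student/model logits, one row per position
targets = torch.randint(0, vocab, (seq_len,))

# Perplexity = exp(mean negative log-likelihood of the observed tokens)
nll = F.cross_entropy(logits, targets)
perplexity = torch.exp(nll)

# Distillation loss: KL(teacher || student) over temperature-softened token distributions
teacher_logits = torch.randn(seq_len, vocab)
T = 2.0
kl = F.kl_div(
    F.log_softmax(logits / T, dim=-1),               # student log-probs
    F.log_softmax(teacher_logits / T, dim=-1),       # teacher log-probs (hence log_target=True)
    reduction="batchmean",
    log_target=True,
) * T * T
print(perplexity.item(), kl.item())
```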
Experimental Design
1. Ablation Studies
- Removing positional encoding
- Varying attention heads
2. Evaluation Metrics
- BLEU, ROUGE
- Human Evaluation Protocols
Current Research Frontiers
- Mixture of Experts (MoE)
- Multimodal LLMs
- Energy-Efficient Training
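Mixture of Experts replaces a single feed-forward block with several expert MLPs plus a router that sends each token to its top-k experts; a minimal top-2 routing sketch (sizes and the dense looping are purely illustrative; real MoE layers use batched, sparse dispatch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_experts, top_k = 64, 4, 2
experts = nn.ModuleList(nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                      nn.Linear(4 * d_model, d_model)) for _ in range(n_experts))
router = nn.Linear(d_model, n_experts)

x = torch.randn(10, d_model)                          # 10 tokens
gate = F.softmax(router(x), dim=-1)                   # routing probabilities per token
weights, chosen = torch.topk(gate, top_k, dim=-1)     # each token picks its top-2 experts
weights = weights / weights.sum(dim=-1, keepdim=True) # renormalize over the chosen experts

out = torch.zeros_like(x)
for i in range(top_k):                                # dense loop for clarity, not efficiency
    for e in range(n_experts):
        sel = chosen[:, i] == e
        if sel.any():
            out[sel] += weights[sel, i, None] * experts[e](x[sel])
```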