Foundations

1. What is Language?

  • Tokens: Atomic units of text (words/subwords)
  • Syntax: Rules for sentence structure
  • Semantics: Meaning representation

2. Neural Networks Basics

  • Perceptrons → MLPs
  • Activation Functions (ReLU, Softmax)
  • Backpropagation & Gradient Descent
  • Matrix Operations (Dot products, Attention as similarity scoring)
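
To make these pieces concrete, here is a minimal NumPy sketch of a two-layer MLP forward pass with ReLU and softmax, plus a dot product used as a similarity score (the same operation attention builds on). All shapes and values are illustrative, not tied to any particular model.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))            # one input vector of 8 features
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)

h = relu(x @ W1 + b1)                  # hidden layer (an MLP is stacked perceptrons)
probs = softmax(h @ W2 + b2)           # output distribution over 4 classes

# Dot product as similarity: the same operation attention uses to score
# how well a query matches a key.
q, k = rng.normal(size=4), rng.normal(size=4)
similarity = q @ k
print(probs, similarity)
```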

3. Probability & Statistics

  • Conditional Probability (P(word|context))
  • Entropy & Cross-Entropy Loss
  • Maximum Likelihood Estimation (MLE)
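
A toy worked example ties these three ideas together: estimating P(word|context) by counting (which is exactly MLE for a bigram model) and then computing the cross-entropy of that model on a short sequence. The corpus and numbers below are made up purely for illustration.

```python
import math
from collections import Counter

# Toy corpus: estimate P(word | previous word) by maximum likelihood (counting).
corpus = "the cat sat on the mat the cat ate".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p_cond(word, context):
    return bigrams[(context, word)] / unigrams[context]

print(p_cond("cat", "the"))   # 2/3 under MLE

# Cross-entropy: average negative log-probability of each next word given its context.
test = ["the", "cat", "sat"]
nll = -sum(math.log(p_cond(w, c)) for c, w in zip(test, test[1:]))
print(nll / (len(test) - 1))  # cross-entropy in nats per token
```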

Core LLM Architecture

1. Transformer Architecture

  • Self-Attention Mechanism:
    • Query, Key, Value matrices (Q, K, V)
    • Scaled Dot-Product Attention: \( \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \) (see the code sketch after this list)
  • Multi-Head Attention
  • Positional Encoding (Sinusoidal/Learned)
  • Layer Normalization & Residual Connections
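
The attention formula above translates almost line-for-line into code. Below is a minimal NumPy sketch of scaled dot-product attention for a single head; the random projection matrices stand in for the learned Q/K/V projections, and all sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, seq, seq)
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return weights @ V                                 # weighted sum of values

rng = np.random.default_rng(0)
batch, seq_len, d_model = 2, 5, 16
x = rng.normal(size=(batch, seq_len, d_model))

# In a real layer Q, K, V come from learned projections of x; here random
# projection matrices stand in for them.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (2, 5, 16)
```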

2. Tokenization

  • Byte-Pair Encoding (BPE; see the toy merge sketch after this list)
  • WordPiece vs SentencePiece (the latter a library implementing BPE and unigram-LM tokenization)
  • Vocabulary Size Tradeoffs
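
The core of BPE is a greedy loop that repeatedly merges the most frequent adjacent symbol pair. The sketch below runs that loop on a tiny made-up word-frequency table (the classic low/lower/newest/widest toy example); real tokenizers add byte-level handling, special tokens, and far larger corpora.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Greedy BPE on a toy word-frequency table (characters as initial symbols)."""
    vocab = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])   # merge the pair
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        vocab = merged
    return merges

print(bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=4))
```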

3. Embeddings

  • Token Embeddings (d_model dimensions)
  • Contextual Embeddings (vs static word2vec)
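
In code, a token embedding is just a lookup into a vocab_size × d_model table. The PyTorch sketch below (with made-up sizes) shows the lookup and the sense in which the table alone is static: identical token ids map to identical vectors until later transformer layers mix in context.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64               # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[5, 42, 7, 42]])   # a batch of one 4-token sequence
vectors = embed(token_ids)                   # shape (1, 4, 64)

# With a static table (word2vec-style) both occurrences of token 42 get the
# same vector; a transformer's later layers mix in context, producing
# contextual embeddings that differ by position and neighbors.
print(vectors.shape, torch.equal(vectors[0, 1], vectors[0, 3]))  # True here
```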

Training LLMs

1. Pre-training vs Fine-tuning

  • Pre-training:
    • Masked Language Modeling (BERT-style)
    • Autoregressive Modeling (GPT-style; loss sketch after this list)
    • Next Sentence Prediction (BERT-era objective, dropped in many later models)
  • Fine-tuning:
    • Instruction Tuning
    • RLHF (Reinforcement Learning from Human Feedback)
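
For the autoregressive (GPT-style) objective mentioned above, the training loss is ordinary cross-entropy with the labels shifted by one position. A minimal PyTorch sketch, with random logits standing in for a real model's output:

```python
import torch
import torch.nn.functional as F

# Pretend logits from a tiny causal LM: batch of 1, sequence of 5 tokens,
# vocabulary of 10. At position t the model predicts token t+1, so logits
# and labels are shifted by one.
torch.manual_seed(0)
logits = torch.randn(1, 5, 10)            # (batch, seq_len, vocab)
tokens = torch.randint(0, 10, (1, 5))     # the input token ids

shift_logits = logits[:, :-1, :]          # predictions for positions 0..3
shift_labels = tokens[:, 1:]              # targets are the *next* tokens 1..4

loss = F.cross_entropy(
    shift_logits.reshape(-1, 10),         # flatten to (batch*seq, vocab)
    shift_labels.reshape(-1),
)
print(loss)                               # average next-token cross-entropy
```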

2. Data Pipeline

  • Corpus Collection (Common Crawl, Books, Code)
  • Cleaning & Deduplication
  • Batch Construction (Dynamic Padding, Attention Masks)
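
Dynamic padding and attention masks are easiest to see on a concrete batch. The PyTorch sketch below pads three made-up examples to the longest length in the batch and builds the matching mask:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three tokenized examples of different lengths.
examples = [torch.tensor([5, 8, 2]),
            torch.tensor([7, 1]),
            torch.tensor([9, 4, 3, 6, 2])]

# Dynamic padding: pad only up to the longest sequence in *this* batch,
# rather than to a global maximum length.
pad_id = 0
input_ids = pad_sequence(examples, batch_first=True, padding_value=pad_id)

# Attention mask: 1 for real tokens, 0 for padding, so attention and the loss
# can ignore the padded positions.
attention_mask = (input_ids != pad_id).long()
print(input_ids)
print(attention_mask)
```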

3. Optimization

  • AdamW Optimizer (with weight decay)
  • Learning Rate Schedules (Warmup + Cosine Decay; see the sketch after this list)
  • Mixed Precision Training (FP16/BF16 compute with FP32 master weights)
  • Gradient Checkpointing
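
A typical (illustrative) PyTorch setup combining these pieces: AdamW with weight decay and a linear-warmup-then-cosine learning-rate schedule. The hyperparameters and the dummy model/loss are placeholders, not a recommendation.

```python
import math
import torch

model = torch.nn.Linear(16, 16)           # stand-in for a transformer

# AdamW: Adam with decoupled weight decay, the usual choice for transformers.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)

warmup_steps, total_steps = 100, 1000

def lr_lambda(step):
    # Linear warmup followed by cosine decay toward zero.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    loss = model(torch.randn(8, 16)).pow(2).mean()   # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```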

4. Hardware Considerations

  • GPU/TPU Utilization
  • Model Parallelism (Tensor/Pipeline)
  • Memory Optimization (Flash Attention)

Advanced Concepts

1. Attention Variants

  • Sparse Attention (Longformer)
  • Linear Attention (“Transformers are RNNs”, Katharopoulos et al., 2020)
  • Memory-Augmented Attention

2. Scaling Laws

  • Chinchilla Scaling: Compute vs Data vs Model Size (rule-of-thumb sketch after this list)
  • Emergent Abilities at Scale
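
As a rough rule of thumb from the Chinchilla paper (Hoffmann et al., 2022), a compute-optimal model is trained on roughly 20 tokens per parameter, with training compute often approximated as C ≈ 6·N·D FLOPs for N parameters and D tokens. A back-of-the-envelope sketch; the outputs are estimates, not measurements.

```python
def chinchilla_estimate(n_params):
    # ~20 tokens per parameter (Chinchilla heuristic), compute ≈ 6 * N * D FLOPs.
    tokens = 20 * n_params
    flops = 6 * n_params * tokens
    return tokens, flops

for n in (1e9, 7e9, 70e9):
    tokens, flops = chinchilla_estimate(n)
    print(f"{n/1e9:.0f}B params -> ~{tokens/1e12:.1f}T tokens, ~{flops:.2e} FLOPs")
```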

3. Model Compression

  • Quantization (FP32 → INT8)
  • Pruning (Removing “unimportant” weights)
  • Distillation (Teacher → Student models)
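
Distillation trains the student to match the teacher's temperature-softened output distribution, usually via a KL-divergence term. A minimal PyTorch sketch with random logits standing in for both models:

```python
import torch
import torch.nn.functional as F

# Temperature T > 1 spreads probability mass so the "dark knowledge" in the
# teacher's non-argmax classes is visible to the student.
torch.manual_seed(0)
teacher_logits = torch.randn(4, 100)      # (batch, vocab) from the large model
student_logits = torch.randn(4, 100)      # from the small model
T = 2.0

kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),   # student log-probs
    F.softmax(teacher_logits / T, dim=-1),       # teacher probs (the target)
    reduction="batchmean",
) * (T * T)                                # standard T^2 scaling of the term
print(kd_loss)
```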

4. Alignment & Safety

  • RLHF Pipeline
  • Constitutional AI
  • Toxicity Mitigation

Applications

1. Text Generation

  • Sampling Strategies (Greedy, Beam Search, Top-k, Top-p/Nucleus)
  • Temperature & Repetition Penalty
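
Temperature and top-k are small transformations of the final logits before sampling. A minimal PyTorch sketch; the function name and default values are just for illustration.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=50):
    """Sample one token id from logits using temperature and top-k filtering."""
    logits = logits / temperature                     # <1 sharpens, >1 flattens
    top_vals, top_idx = torch.topk(logits, top_k)     # keep only the k best logits
    probs = F.softmax(top_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)  # sample within the top-k
    return top_idx[choice]

torch.manual_seed(0)
vocab_logits = torch.randn(1000)                      # fake logits over 1000 tokens
print(sample_next_token(vocab_logits))
```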

2. Downstream Tasks

  • Question Answering
  • Summarization
  • Code Generation

3. Retrieval-Augmented Generation (RAG)

  • Vector Indexes & Databases (FAISS, Pinecone)
  • Dense vs Sparse Retrieval
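
A dense-retrieval core fits in a few lines: embed chunks, index them, and search by similarity. The sketch below uses FAISS with an exact inner-product index and random placeholder embeddings so it runs standalone; a real pipeline would compute the embeddings with an actual embedding model and feed the retrieved text into the prompt.

```python
import faiss                      # pip install faiss-cpu
import numpy as np

d = 384                           # embedding dimension (illustrative)
rng = np.random.default_rng(0)

# Placeholder document embeddings; a real RAG system would produce these with
# an embedding model over the document chunks.
doc_vecs = rng.normal(size=(10_000, d)).astype("float32")
faiss.normalize_L2(doc_vecs)      # normalize so inner product = cosine similarity

index = faiss.IndexFlatIP(d)      # exact inner-product search
index.add(doc_vecs)

query = rng.normal(size=(1, d)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)   # top-5 most similar chunks
print(ids, scores)                # their text would then be added to the prompt
```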

Challenges

1. Hallucinations

  • Factual Inconsistency
  • Mitigation: Grounding with Knowledge Bases

2. Computational Costs

  • Training: Millions of GPU-hours
  • Inference Latency (KV Caching, Speculative Decoding)

3. Ethical Concerns

  • Bias Amplification
  • Environmental Impact

Tools & Frameworks

1. Libraries

  • Hugging Face Transformers (usage example after this list)
  • PyTorch Lightning
  • TensorFlow/JAX
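
The quickest way to poke at these libraries is the Hugging Face `pipeline` API; the snippet below loads the small `gpt2` checkpoint (chosen here only to keep the download light) and generates a continuation. The sampling settings are illustrative.

```python
from transformers import pipeline

# High-level API: downloads a model + tokenizer and wires them together.
generator = pipeline("text-generation", model="gpt2")
out = generator(
    "Attention is all you",
    max_new_tokens=20,
    do_sample=True,
    temperature=0.8,
)
print(out[0]["generated_text"])
```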

2. Visualization Tools

  • Attention Head Maps
  • Embedding Projectors (UMAP/t-SNE)

3. Deployment

  • ONNX Runtime
  • Triton Inference Server
  • Quantized Models (GGML/GGUF)

Math Deep Dives (Optional)

1. Attention Math

  • \( \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \) Derivation
  • Gradient Flow in Attention

2. Loss Functions

  • Perplexity: \( \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i)\right) \) (worked example after this list)
  • KL Divergence for Distillation
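
Perplexity is just the exponentiated average negative log-probability per token, so it can be computed directly from the per-token probabilities a model assigns. A toy example with made-up probabilities:

```python
import math

# Perplexity = exp(average negative log-probability per token).
token_probs = [0.20, 0.05, 0.50, 0.10]   # model's probability for each observed token
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(nll)
print(perplexity)   # ≈ 6.7: as uncertain as choosing among ~7 equally likely tokens
```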

Experimental Design

1. Ablation Studies

  • Removing positional encoding
  • Varying attention heads

2. Evaluation Metrics

  • BLEU, ROUGE
  • Human Evaluation Protocols

Current Research Frontiers

  • Mixture of Experts (MoE)
  • Multimodal LLMs
  • Energy-Efficient Training