[full_width]
Functional LLM concepts
Introduction
There are 20 core functional concepts in LLMs. We introduce them below and link them together in one common illustration; details on each concept follow. Enjoy learning!
A. Core Model Capacity & Architecture (What the model is)
- Mixture of Experts (MoE): Sparse expert routing to scale model capacity without linear compute growth.
B. Training & Learning Strategies (How the model learns)
- Curriculum Learning: Structured data ordering from easy to hard for better convergence and stability.
- Contrastive Learning: Representation learning via similarity and dissimilarity objectives.
- Self-Training (Pseudo-Labeling): Bootstrapping supervision using model-generated labels.
- Knowledge Distillation (Teacher–Student Training): Transferring behavior and capabilities from a large model to a smaller one.
- Parameter-Efficient Fine-Tuning (PEFT / LoRA / Adapters): Adapting large models with minimal trainable parameters.
- Prompt Tuning / Soft Prompts: Steering model behavior using learned prompt embeddings instead of weight updates.
C. Model Optimization & Compression (Making models smaller/cheaper)
- Pruning: Removing low-importance weights or components.
- Quantization: Lowering numerical precision to reduce memory and latency.
- Model Compression (Combined Pipeline): Systematic shrinking using distillation + pruning + quantization.
- Weight Tying / Parameter Sharing: Reusing parameters across layers/modules to reduce footprint.
- Checkpoint Averaging (EMA / SWA): Averaging weights for stability and generalization.
D. Systems & Memory Optimization (Making training/inference feasible)
- Gradient Checkpointing: Recomputation to trade compute for memory during training.
- Activation Checkpointing / Recomputation: Selective activation storage to reduce peak VRAM usage.
E. Knowledge Access & Control at Inference (What the model can use)
- Retrieval-Augmented Generation (RAG): Injecting external knowledge at query time instead of storing all facts in weights.
F. Inference Speed & Serving Efficiency (Making it fast & scalable)
- Speculative Decoding: Drafting tokens with a small model and verifying with a large one to cut latency.
- Early-Exit Networks: Exiting at intermediate layers when confidence is high.
- Inference Caching (KV Cache Optimization): Reusing attention states to avoid recomputation in long contexts or repeated queries.
- Dynamic Batching: Packing variable-length requests to maximize GPU utilization.
G. Deployment & Lifecycle (Operationalizing models)
- Knowledge Transfer (Generalized Teacher–Student Transfer): Broad umbrella concept for transferring capabilities across model sizes, domains, or modalities.
Details on all twenty concepts
(click any heading in the left panel to see details)
- [vtab]
- (1) Core Model Capacity & Architecture
- Mixture of Experts (MoE) - Mixture of Experts is an architectural technique where a model contains many specialized subnetworks called experts, but only a small subset is activated for each token. A routing network decides which experts to use dynamically at inference and training time. This allows the model to scale parameter count massively without proportionally increasing compute cost per token. MoE models achieve high capacity and specialization while keeping inference efficient. The trade-off is system complexity: expert load balancing, communication overhead, and routing instability can hurt training stability. MoE is especially valuable for large-scale multilingual and domain-diverse foundation models.
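As a minimal sketch of the routing idea, assuming scalar "tokens" and four hypothetical expert functions (real MoE layers route vectors through neural-network experts), top-k routing can be illustrated in plain Python:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(token, experts, router_weights, top_k=2):
    # Router scores: a toy product of the scalar token with a per-expert weight.
    scores = [w * token for w in router_weights]
    probs = softmax(scores)
    # Sparse routing: only the top_k experts run; the rest are skipped entirely.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    # Renormalized weighted sum of the selected experts' outputs.
    return sum(probs[i] / norm * experts[i](token) for i in top)

# Four toy experts, each a different transformation of the input.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: x * x, lambda x: -x]
router_weights = [0.9, 0.1, -0.5, 0.2]
out = moe_forward(3.0, experts, router_weights, top_k=2)
```

With four experts and top_k=2, only half the experts execute per token; the load-balancing and communication issues mentioned above arise because real routers must keep all experts roughly equally utilized across a batch.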
- (2) Training & Learning Strategies
- Curriculum Learning - Curriculum learning structures training data from simpler examples to harder ones instead of using random sampling. The idea mirrors human learning: models first grasp basic patterns, then progressively learn complex structures. In LLMs, curricula may involve starting with short sequences, clean text, or high-confidence data, and gradually introducing longer contexts, noisy data, or harder reasoning tasks. This improves convergence speed, stability, and generalization. Poorly designed curricula can bias models toward easy distributions and delay exposure to rare but important patterns. Effective curricula require careful dataset staging and adaptive difficulty scheduling.
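A toy sketch of the staging idea, using sequence length as a stand-in difficulty measure (real curricula use richer signals such as perplexity, noise level, or task complexity):

```python
def curriculum_schedule(examples, difficulty, n_stages=3):
    """Order examples easy-to-hard and split into progressive training stages."""
    ordered = sorted(examples, key=difficulty)
    stage_size = -(-len(ordered) // n_stages)  # ceiling division
    # Stage s trains on all data up to its difficulty cutoff, so easy
    # examples stay in the mix as harder ones are introduced.
    return [ordered[: stage_size * (s + 1)] for s in range(n_stages)]

# Toy corpus where difficulty = number of tokens.
corpus = ["a b", "a b c d e f", "a", "a b c", "a b c d"]
stages = curriculum_schedule(corpus, difficulty=lambda s: len(s.split()), n_stages=3)
```

Keeping earlier (easier) data inside each later stage is one way to avoid the bias-toward-easy-distributions problem noted above.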
- Contrastive Learning - Contrastive learning trains models to bring semantically similar representations closer while pushing dissimilar ones apart in embedding space. It is widely used in multimodal models (text-image, audio-text) and representation pretraining. In LLM pipelines, contrastive objectives help align sentence embeddings, retrieval encoders, and multimodal representations. This improves semantic search, clustering, and retrieval quality. The challenge is constructing high-quality positive and negative pairs without introducing spurious correlations. Hard negatives improve representation quality but increase training instability. Contrastive learning is foundational for building high-quality embedding models used in RAG and semantic indexing systems.
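The core objective can be sketched as a toy InfoNCE loss over hand-picked 2-D vectors (hypothetical embeddings; production systems compute this over batches of learned, high-dimensional embeddings):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive (InfoNCE) loss: pull the positive close, push negatives away."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)
    denom = sum(math.exp(l - m) for l in logits)
    # Cross-entropy with the positive pair at index 0.
    return -(logits[0] - m - math.log(denom))

anchor = [1.0, 0.0]
positive = [0.9, 0.1]                      # semantically similar pair
negatives = [[0.0, 1.0], [-1.0, 0.2]]      # dissimilar examples
loss = info_nce(anchor, positive, negatives)
```

A well-aligned positive yields a near-zero loss; swapping in a dissimilar "positive" drives the loss up, which is exactly the training pressure that shapes embedding space.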
- Self-Training (Pseudo-Labeling) - Self-training expands labeled data by letting a trained model generate labels for unlabeled examples, which are then reused as training targets. In LLMs, this is used to bootstrap instruction datasets, domain adaptation, and synthetic supervision pipelines. High-confidence predictions are filtered to reduce noise. This method is powerful for low-resource domains and languages, but error accumulation is a major risk: model biases can reinforce themselves over generations. Careful filtering, diversity constraints, and teacher–student separation help reduce collapse. Self-training is widely used in speech recognition, translation, and domain-specific LLM adaptation.
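A schematic of the confidence-filtering loop, with a hypothetical `toy_model` standing in for a real trained classifier:

```python
def pseudo_label(model, unlabeled, threshold=0.8):
    """Keep only high-confidence model predictions as new training pairs."""
    new_data = []
    for x in unlabeled:
        label, confidence = model(x)
        if confidence >= threshold:   # filter to limit noisy self-supervision
            new_data.append((x, label))
    return new_data

# Hypothetical model: tags questions vs. statements with a crude confidence.
def toy_model(text):
    is_q = text.strip().endswith("?")
    return ("question" if is_q else "statement", 0.95 if is_q else 0.6)

unlabeled = ["How are you?", "The sky is blue", "What time is it?"]
augmented = pseudo_label(toy_model, unlabeled)
```

Only the two confident predictions survive the filter; the low-confidence example is dropped rather than risked as a noisy label, which is the main defense against the error accumulation described above.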
- Knowledge Distillation (Teacher–Student Training) - Knowledge distillation transfers capabilities from a large teacher model to a smaller student model by matching outputs, logits, or intermediate representations. This allows deployment of compact, cheaper models while retaining much of the teacher’s performance. Distillation is central to building edge LLMs, domain-specialized assistants, and low-latency inference systems. Techniques include soft-label distillation, response imitation, and feature alignment. Distilled models inherit biases and limitations of the teacher, so teacher quality matters greatly. Distillation is often combined with quantization and pruning to create production-grade compressed models for cost-sensitive environments.
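The soft-label objective can be sketched with toy logits: a temperature-softened KL divergence between teacher and student distributions (real distillation sums this over tokens and often mixes in a hard-label term):

```python
import math

def softmax_t(logits, t):
    scaled = [l / t for l in logits]
    m = max(scaled)
    es = [math.exp(s - m) for s in scaled]
    z = sum(es)
    return [e / z for e in es]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) over temperature-softened distributions.
    p = softmax_t(teacher_logits, temperature)   # soft targets
    q = softmax_t(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]
close_student = [2.9, 1.1, 0.1]   # nearly matches the teacher
far_student = [0.0, 0.0, 3.0]     # disagrees with the teacher
loss_close = distill_loss(close_student, teacher)
loss_far = distill_loss(far_student, teacher)
```

The temperature spreads probability mass over non-argmax classes, so the student learns the teacher's relative preferences, not just its top answer.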
- Parameter-Efficient Fine-Tuning (PEFT / LoRA / Adapters) - PEFT adapts large pretrained models by training only a small subset of parameters instead of full fine-tuning. Methods like LoRA insert low-rank update matrices into attention layers, while adapters add small trainable modules. This drastically reduces memory usage, training cost, and catastrophic forgetting. PEFT enables multi-task customization where one base model serves many domains with different adapter weights. Performance is often close to full fine-tuning for many tasks, but extreme domain shifts may still require full updates. PEFT is now standard practice for enterprise customization of foundation models.
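A minimal numeric sketch of the LoRA idea: the frozen weight `W` is untouched, and only the low-rank factors `A` and `B` (here rank 1) would be trained. Matrices are plain nested lists for illustration; real implementations operate on framework tensors inside attention layers:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x*W + alpha * x*A*B, where only A and B are trainable."""
    base = matmul(x, W)               # frozen pretrained projection
    delta = matmul(matmul(x, A), B)   # rank-r update, r << d
    return [[b + alpha * d for b, d in zip(br, dr)] for br, dr in zip(base, delta)]

# d=3 model dim, rank r=1: 3*1 + 1*3 = 6 trainable numbers instead of 9.
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # frozen identity, for illustration
A = [[0.1], [0.2], [0.0]]               # d x r
B = [[1.0, 0.0, 0.0]]                   # r x d
y = lora_forward([[1.0, 1.0, 1.0]], W, A, B)
```

The savings grow with model size: for hidden dimension d and rank r, LoRA trains 2dr parameters per matrix instead of d², and different adapters can be swapped onto the same frozen base.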
- Prompt Tuning / Soft Prompts - Prompt tuning learns continuous embeddings prepended to model inputs to steer behavior without changing core weights. Unlike hand-written prompts, soft prompts are trainable vectors optimized for tasks or domains. This allows lightweight task adaptation and rapid experimentation. Prompt tuning is memory efficient and avoids catastrophic forgetting, making it suitable for multi-tenant systems. However, expressivity is limited compared to full fine-tuning or adapters, and performance may plateau on complex reasoning tasks. Soft prompts are especially useful for classification, style control, and lightweight domain adaptation in constrained deployment environments.
- (3) Model Optimization & Compression
- Pruning - Pruning removes low-importance weights, neurons, attention heads, or entire layers to reduce model size and inference cost. Structured pruning removes components in blocks, making hardware acceleration easier than unstructured pruning. Pruning can significantly shrink models with limited performance loss if guided by importance metrics or sensitivity analysis. Over-aggressive pruning harms reasoning depth and robustness. Pruned models may require retraining or distillation to recover performance. Pruning is often used in mobile and edge deployment scenarios where memory and latency budgets are strict. It complements quantization and distillation in compression pipelines.
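A toy sketch of unstructured magnitude pruning, the simplest importance metric (production pruning typically uses sensitivity analysis and structured blocks, as noted above):

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    threshold = flat[k - 1] if k > 0 else -1.0
    # Ties at the threshold may prune slightly more than the requested fraction.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

W = [[0.9, -0.05, 0.4], [0.01, -0.7, 0.1]]
pruned = magnitude_prune(W, sparsity=0.5)
```

The large-magnitude weights survive while the small ones are zeroed; in a real model the zeros only pay off with sparse kernels or when pruning is structured.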
- Quantization - Quantization reduces numerical precision of weights and activations from FP16/FP32 to INT8 or INT4, cutting memory footprint and speeding up inference. Modern LLM quantization techniques preserve accuracy using calibration, smooth quantization, or group-wise scaling. Quantization-aware training improves robustness, while post-training quantization offers fast deployment. Lower precision increases sensitivity to outliers and can degrade long-context reasoning or arithmetic precision. Hardware support varies by platform. Quantization is a key enabler for consumer GPUs and edge devices, allowing large models to run within limited memory and power budgets.
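A minimal sketch of symmetric per-tensor INT8 quantization (real pipelines add calibration data, per-group scales, and outlier handling):

```python
def quantize_int8(weights):
    """Symmetric quantization: w ~= scale * q, with q an integer in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
```

Each value is stored as a single byte plus one shared scale, and the round-trip error is bounded by half the quantization step; outliers inflate the scale, which is why group-wise scaling helps.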
- Model Compression (Combined Pipeline) - Model compression is a systematic pipeline combining distillation, pruning, and quantization to produce compact, deployable models. Each method addresses different inefficiencies: distillation transfers knowledge, pruning removes redundancy, and quantization reduces precision overhead. Compression pipelines are tuned for target hardware, latency budgets, and accuracy requirements. Poorly tuned compression can compound errors and severely degrade performance. Compression is essential for scaling LLM deployment economically, especially for consumer applications and large-scale inference workloads. Production-grade compressed models often require careful evaluation on domain-specific benchmarks to ensure reliability.
- Weight Tying / Parameter Sharing - Weight tying reuses parameters across layers or components, reducing memory footprint and improving consistency in learned representations. Classic examples include sharing input and output embeddings or tying transformer layer weights. Parameter sharing reduces overfitting and improves parameter efficiency, but limits expressivity and depth-specific specialization. In LLMs, partial sharing can stabilize training while controlling model size. Over-sharing can collapse representational diversity and hurt reasoning capacity. This technique is often used in multilingual models and resource-constrained architectures where memory efficiency is critical without heavy reliance on compression methods.
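The classic input/output embedding tie can be sketched with one shared matrix (toy 3-token vocabulary, 2-D hidden states; real models tie large embedding tables inside the architecture definition):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class TiedLM:
    """Input embedding and output projection share one matrix E (weight tying)."""
    def __init__(self, embeddings):
        self.E = embeddings              # vocab_size x hidden, stored once
    def embed(self, token_id):
        return self.E[token_id]          # input side: row lookup
    def logits(self, hidden):
        # Output side reuses the same rows as a transposed projection.
        return [dot(hidden, row) for row in self.E]

lm = TiedLM([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
h = lm.embed(1)
scores = lm.logits(h)
```

Because both directions read from the same object, an update to `E` during training affects embedding and un-embedding simultaneously, halving the parameter count for these two components.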
- Checkpoint Averaging (EMA / SWA) - Checkpoint averaging combines weights from multiple training checkpoints to improve stability and generalization. Exponential Moving Average (EMA) smooths parameter updates during training, while Stochastic Weight Averaging (SWA) averages final checkpoints across epochs. These methods reduce sensitivity to noise and sharp minima, often improving robustness and downstream performance. EMA is widely used in large-scale training for stable convergence. The cost is additional memory for shadow weights and slight training overhead. Averaging does not reduce model size but improves deployment reliability by producing smoother, better-generalizing parameter configurations.
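A minimal EMA sketch over plain parameter lists (the "shadow weights" mentioned above; frameworks apply the same update to every tensor in the model):

```python
class EMA:
    """Exponential moving average of parameters (shadow weights)."""
    def __init__(self, params, decay=0.99):
        self.decay = decay
        self.shadow = list(params)  # copy of the initial weights
    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        self.shadow = [self.decay * s + (1 - self.decay) * p
                       for s, p in zip(self.shadow, params)]

ema = EMA([0.0, 0.0], decay=0.9)
for step in range(3):            # three "training" steps toward [1, -1]
    ema.update([1.0, -1.0])
```

After n steps toward a fixed target, the shadow weights have moved a fraction 1 - decay^n of the way; the smoothing is what damps noisy late-training updates.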
- (4) Systems & Memory Optimization
- Gradient Checkpointing - Gradient checkpointing saves memory during training by discarding intermediate activations and recomputing them during backpropagation. This trades compute for memory, enabling larger models or longer sequences to fit on limited GPUs. It is essential for training frontier-scale LLMs and long-context models. The overhead is additional compute time during backward passes, increasing training cost. Checkpoint placement affects performance trade-offs. Gradient checkpointing is often combined with distributed parallelism and memory-optimized optimizers to train large models efficiently under hardware constraints.
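The recompute-instead-of-store trade can be sketched with toy scalar "layers" (real frameworks do this per layer block on tensors, with autograd hooks rather than explicit dictionaries):

```python
def forward_with_checkpoints(x, layers, every=2):
    """Store activations only every `every` layers; the rest are discarded."""
    saved = {0: x}
    h = x
    for i, layer in enumerate(layers, start=1):
        h = layer(h)
        if i % every == 0:
            saved[i] = h          # checkpoint: kept in memory
    return h, saved

def recompute(saved, layers, target):
    """Rebuild a discarded activation from the nearest earlier checkpoint."""
    start = max(i for i in saved if i <= target)
    h = saved[start]
    for layer in layers[start:target]:
        h = layer(h)              # extra compute paid during the backward pass
    return h

layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
out, saved = forward_with_checkpoints(0.0, layers, every=2)
act3 = recompute(saved, layers, 3)   # needed by backprop but never stored
```

Only three of five activations are held in memory; activation 3 is rebuilt on demand from checkpoint 2, which is exactly the compute-for-memory exchange described above.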
- Activation Checkpointing / Recomputation - Activation checkpointing selectively stores only a subset of activations, recomputing others when needed. It reduces peak VRAM usage and enables training deeper networks or longer sequences. This is critical for multimodal and long-context LLMs with high memory pressure. Poor checkpointing strategies can significantly increase training time. Fine-grained control allows balancing memory savings and recomputation overhead. This technique is foundational to modern distributed training stacks, making otherwise infeasible model configurations trainable on real-world GPU clusters.
- (5) Knowledge Access & Control at Inference
- Retrieval-Augmented Generation (RAG) - RAG augments LLMs with external retrieval systems that fetch relevant documents at inference time. Instead of encoding all knowledge into model weights, RAG enables dynamic, updatable knowledge access. This improves factual accuracy, reduces hallucinations, and allows models to stay current. RAG systems require high-quality embeddings, retrieval pipelines, and prompt integration strategies. Latency, retrieval noise, and context window limits are key challenges. RAG shifts LLMs from static knowledge containers to interactive reasoning systems grounded in live or curated knowledge bases.
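The retrieve-then-prompt flow can be sketched with word-overlap scoring as a stand-in for embedding similarity (real systems use dense retrievers over a vector index):

```python
def retrieve(query, documents, k=2):
    """Rank documents by naive word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, documents):
    # Retrieved passages are injected into the context instead of the weights.
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "The Eiffel Tower is in Paris.",
    "Photosynthesis converts light into chemical energy.",
    "Paris is the capital of France.",
]
prompt = build_prompt("Where is the Eiffel Tower?", docs)
```

The irrelevant document never reaches the model, while the knowledge base can be updated without retraining; retrieval quality and context-window budget are where real systems spend their effort.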
- (6) Inference Speed & Serving Efficiency
- Speculative Decoding - Speculative decoding uses a small draft model to propose token sequences that a larger model verifies in parallel. If accepted, multiple tokens are emitted per step, reducing latency. This accelerates inference without changing final outputs. The efficiency gain depends on alignment between draft and target models. Poor draft quality increases rejection rates and negates benefits. Speculative decoding is particularly useful for chat and real-time applications with strict latency constraints. It requires careful orchestration of verification steps and optimized serving infrastructure.
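The draft-then-verify loop can be sketched with toy greedy "models" over integer tokens (real systems verify all drafted positions in one batched forward pass of the large model):

```python
def speculative_decode(draft_next, target_next, prefix, n_draft=4):
    """Draft n tokens cheaply, then keep the longest prefix the target agrees with."""
    drafted = []
    ctx = list(prefix)
    for _ in range(n_draft):
        t = draft_next(ctx)       # cheap model proposes
        drafted.append(t)
        ctx.append(t)
    accepted = []
    ctx = list(prefix)
    for t in drafted:
        expected = target_next(ctx)   # expensive model verifies
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # replace first mismatch, then stop
            break
    return accepted

# Toy models: target always emits last token + 1; draft drifts on larger values.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] < 2 else ctx[-1] + 2
out = speculative_decode(draft, target, [0], n_draft=4)
```

The output matches what the target model alone would produce, but multiple tokens can be committed per verification step; a drifting draft model just lowers the acceptance rate, never the output quality.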
- Early-Exit Networks - Early-exit networks allow inference to terminate at intermediate layers when confidence is high, reducing average computation per query. This exploits the observation that many inputs are easy and do not require full-depth processing. Early exits improve latency and throughput in high-volume systems. Confidence calibration is critical; premature exits can degrade accuracy and reasoning depth. This technique is useful in classification, filtering, and triage tasks. It is less suitable for complex reasoning or generation tasks where later layers contribute significantly to coherence and planning.
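A toy sketch of the exit rule, with hypothetical scalar layers and a crude confidence function (real early-exit models attach small trained classifier heads at intermediate depths):

```python
def early_exit_forward(x, layers, classifier, threshold=0.9):
    """Run layer by layer; exit as soon as the intermediate head is confident."""
    h = x
    for depth, layer in enumerate(layers, start=1):
        h = layer(h)
        label, conf = classifier(h)
        if conf >= threshold:
            return label, depth   # easy input: skip the remaining layers
    return label, depth           # hard input: full depth was needed

layers = [lambda h: h * 2] * 3
# Toy head: sign decides the label, magnitude proxies for confidence.
classifier = lambda h: ("pos" if h > 0 else "neg", min(1.0, abs(h) / 4))
easy = early_exit_forward(3.0, layers, classifier)   # confident after one layer
hard = early_exit_forward(0.3, layers, classifier)   # never clears the threshold
```

The easy input exits at depth 1 while the hard one consumes all three layers, which is where the average-latency savings come from; miscalibrated confidences would turn early exits into accuracy losses.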
- Inference Caching (KV Cache Optimization) - KV cache optimization stores attention key–value tensors from previous tokens to avoid recomputation during autoregressive generation. This drastically reduces compute for long contexts and multi-turn conversations. Efficient cache management enables long-context LLMs to scale in production. Challenges include memory overhead, cache eviction policies, and multi-user isolation. Cache fragmentation and memory bandwidth become bottlenecks at scale. KV caching is foundational to real-time chat systems, streaming generation, and long-context applications, enabling practical deployment of large autoregressive transformers.
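The cache's role can be sketched with a toy single-head attention over 2-D keys and values (real caches hold per-layer, per-head tensors and add eviction and paging logic):

```python
import math

class KVCache:
    """Append-only store of per-token keys/values for autoregressive decoding."""
    def __init__(self):
        self.keys, self.values = [], []
    def append(self, k, v):
        # Each token's key/value is computed once, then reused at every step.
        self.keys.append(k)
        self.values.append(v)
    def attend(self, query):
        # Toy attention: softmax over dot products with all cached keys.
        scores = [sum(q * k for q, k in zip(query, key)) for key in self.keys]
        m = max(scores)
        ws = [math.exp(s - m) for s in scores]
        z = sum(ws)
        dim = len(self.values[0])
        return [sum(w / z * v[d] for w, v in zip(ws, self.values)) for d in range(dim)]

cache = KVCache()
cache.append([1.0, 0.0], [1.0, 0.0])   # token 1's key/value
cache.append([0.0, 1.0], [0.0, 1.0])   # token 2's key/value
out = cache.attend([10.0, 0.0])        # new token attends over the cache
```

Each decoding step only computes the new token's query against stored keys instead of re-encoding the whole prefix, turning quadratic recomputation into linear-per-step work at the cost of cache memory.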
- Dynamic Batching - Dynamic batching groups incoming requests of varying lengths into shared batches to maximize GPU utilization and throughput. It reduces idle compute and improves cost efficiency for inference servers. Batching must balance latency and throughput; aggressive batching increases wait times. Padding inefficiency and sequence length variance complicate batching strategies. Modern serving systems use adaptive schedulers to batch similar-length requests. Dynamic batching is essential for high-traffic LLM APIs and enterprise deployments, enabling predictable performance under fluctuating load conditions.
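A greedy sketch of length-aware batching under a padded-token budget (production schedulers also weigh arrival times and latency targets, and modern servers batch continuously per decoding step):

```python
def dynamic_batches(requests, max_batch_tokens=16):
    """Pack length-sorted requests into batches under a padded-token budget.
    Batch cost = batch_size * longest_sequence, since shorter ones are padded."""
    batches, current = [], []
    for req in sorted(requests, key=len):
        candidate = current + [req]
        padded = len(candidate) * len(max(candidate, key=len))
        if current and padded > max_batch_tokens:
            batches.append(current)     # close the batch, start a new one
            current = [req]             # a lone oversized request still gets a batch
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches

requests = ["hi", "hello", "a" * 12, "hey", "howdy"]
batches = dynamic_batches(requests, max_batch_tokens=16)
```

Sorting by length keeps similar-sized requests together, so padding waste stays low; the long outlier is isolated rather than forcing every short request to pad up to its length.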
- (7) Deployment & Lifecycle
- Knowledge Transfer - Knowledge transfer broadly refers to moving capabilities across models, domains, or modalities using teachers, ensembles, or curriculum-based supervision. This includes distillation, cross-modal alignment, domain adaptation, and continual learning. It allows organizations to reuse expensive training investments across multiple products and platforms. Effective knowledge transfer reduces data and compute requirements for new domains. Risks include bias propagation and loss of domain-specific nuance. Knowledge transfer is central to building scalable AI ecosystems where foundational models seed specialized models for enterprise, edge, and regional deployments.
IN ONE DIAGRAM
Let us see all twenty ideas together in one illustration, tying together the details presented above.
~ ~ ~
