Laws That Break: Where Large Language Model Scaling Expectations Fail
Jun 23, 2026
For years, the artificial intelligence community operated on a simple, comforting belief: bigger is better. If you wanted a smarter large language model, you just threw more money at it: more parameters, more data, more compute. The math seemed to promise a straight line upward. But that line has cracks in it. In fact, some of those cracks are wide enough to drive a truck through.
We’ve seen several "laws" of AI scaling break down under real-world pressure. The rules that worked for early experiments failed when we tried to build production systems. The assumptions that held for training loss collapsed when we cared about actual user experience. Today, I want to walk you through where these expectations fail and why understanding these failures is critical if you’re building or buying AI solutions in 2026.
The Chinchilla Shock: When Bigger Was Actually Worse
Let’s start with the most famous broken law. Back in 2020, Kaplan et al. at OpenAI published a paper that became the bible for AI engineers. It claimed that loss (a measure of error) scales predictably as a power law with model size, dataset size, and compute. The implication was clear: scale everything up according to the fitted formula, and you get predictable improvements.
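To make the power-law claim concrete, here is a minimal sketch of a single-variable form, with loss as a function of parameter count alone. The function name `power_law_loss` and the constants `N_C` and `ALPHA` are illustrative assumptions rather than values taken from this article; only the shape of the curve matters here.

```python
# Illustrative power-law loss curve: L(N) = (N_C / N) ** ALPHA.
# N_C and ALPHA are assumed placeholder constants, not fitted values.
N_C = 8.8e13   # hypothetical scale constant (in parameters)
ALPHA = 0.076  # hypothetical power-law exponent


def power_law_loss(n_params: float) -> float:
    """Predicted loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA


if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"{n:.0e} params -> predicted loss {power_law_loss(n):.3f}")
    # Under a power law, every 10x in parameters multiplies loss by the
    # same constant factor, 10 ** -ALPHA, regardless of starting size.
    print(f"loss ratio per 10x: {10 ** -ALPHA:.3f}")
```

The constant ratio per 10x is why these curves look like straight lines on log-log axes, and why each further improvement costs ten times more than the last.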
Then, in 2022, came DeepMind’s Chinchilla, a 70-billion parameter model that redefined optimal compute allocation. This wasn’t just a new model; it was a correction to the industry’s navigation system. DeepMind trained Chinchilla on the same compute budget as their previous 280-billion parameter model, Gopher. But they changed the ratio. Instead of making the model huge and keeping data small, they made the model smaller but fed it significantly more data: 1.4 trillion tokens versus Gopher’s 300 billion.
The result? Chinchilla outperformed both Gopher and OpenAI’s 175-billion parameter GPT-3. It was smaller, cheaper to run, and smarter. This shattered the assumption that larger models automatically perform better. The old scaling recipe fell short because it grew the model much faster than the data and so missed the optimal balance between model size and data volume. For every 10x increase in compute, you shouldn’t just make the model bigger. You need to increase both model size and data size by roughly the same proportion (about 3.2x each, since √10 ≈ 3.16). Ignore this balance, and you waste billions of dollars on inefficient models.
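The balance described above can be sketched in a few lines. This sketch leans on two widely used rules of thumb that are not stated in this article and should be read as assumptions: training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and the compute-optimal ratio is roughly 20 tokens per parameter. The function names are hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C ~= 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: rough compute-optimal ratio


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute under the 6 * N * D rule of thumb."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal_split(flops: float) -> tuple[float, float]:
    """Return (params, tokens) that spend `flops` at the assumed ratio."""
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


if __name__ == "__main__":
    # Model sizes and token counts as quoted in the article.
    gopher = training_flops(280e9, 300e9)
    chinchilla = training_flops(70e9, 1.4e12)
    print(f"Gopher ~{gopher:.2e} FLOPs, Chinchilla ~{chinchilla:.2e} FLOPs")

    n1, d1 = compute_optimal_split(chinchilla)
    n2, d2 = compute_optimal_split(10 * chinchilla)
    print(f"10x compute -> params x{n2 / n1:.2f}, tokens x{d2 / d1:.2f}")
```

Under these assumptions the two training runs land in the same ballpark (about 5.0e23 and 5.9e23 FLOPs), and a 10x budget increase scales both parameters and tokens by √10.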
Production Reality vs. Training Theory
Here’s where it gets tricky. The Chinchilla law tells you how to minimize training loss efficiently. But do you care about training loss? Probably not. You care about how the model performs when your customers use it. This is the gap between "training optimality" and "production quality."
In production, models need to generalize well and handle edge cases. To achieve this, researchers discovered that you often need to "overtrain" models. This means giving them more data than the Chinchilla-optimal calculation suggests. Meta’s LLaMA models, for example, were explicitly overtrained: they received substantially more training data than strict Chinchilla optimality would dictate. Subsequent studies have examined scaling behavior in this overtraining regime, with dataset sizes pushed up to 32x beyond the compute-optimal allocation.
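A rough way to quantify overtraining is as a multiplier on the compute-optimal token count. The sketch below assumes a baseline of about 20 tokens per parameter, a common rule of thumb that is not stated in this article, and uses a hypothetical 7B-parameter model; the function names are illustrative.

```python
TOKENS_PER_PARAM_OPTIMAL = 20.0  # assumption: rough compute-optimal ratio


def overtraining_multiplier(n_params: float, n_tokens: float) -> float:
    """How many times more tokens than the assumed optimal allocation."""
    return n_tokens / (TOKENS_PER_PARAM_OPTIMAL * n_params)


def tokens_for_multiplier(n_params: float, multiplier: float) -> float:
    """Token budget implied by a given overtraining multiplier."""
    return multiplier * TOKENS_PER_PARAM_OPTIMAL * n_params


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7B-parameter model
    for m in (1, 4, 32):
        tokens = tokens_for_multiplier(n_params, m)
        print(f"{m:>2}x -> {tokens:.2e} tokens")
```

At the 32x end of that range, the hypothetical 7B model would see roughly 4.5 trillion tokens instead of the 140 billion the baseline suggests.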
This creates a major planning headache. If you follow the efficient scaling laws strictly, your model might be cheap to train but mediocre in practice. If you optimize for production performance, you blow past the theoretical efficiency frontier. The "law" breaks because the objective function changes. You aren’t just minimizing error anymore; you’re maximizing robustness and inference-time accuracy.
| Regime | Primary Goal | Data Allocation | Efficiency Trade-off |
|---|---|---|---|
| Chinchilla Optimal | Minimize Training Loss | Balanced with Model Size | High Compute Efficiency, Lower Generalization |
| Overtrained (Production) | Maximize Inference Performance | Up to 32x More Data | Lower Compute Efficiency, Higher Robustness |
| Test-Time Scaling | Complex Reasoning Accuracy | N/A (Inference Compute) | High Latency, High Accuracy |
Reinforcement Learning: The Wild West
If pretraining scaling laws are shaky, reinforcement learning (RL) scaling laws are nonexistent. Pretraining has "rigorous scaling laws" validated across orders of magnitude. RL does not. Why? Because RL introduces high variance. During RL training, single tokens can dominate loss expressions or cause numerical instability. This is especially bad when training large language models on long sequences or using Mixture-of-Experts architectures.
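To see how a single token can dominate a loss expression, consider a toy policy-gradient term weighted by per-token importance ratios. The sketch below also shows one common mitigation, PPO-style ratio clipping, which this article does not itself describe; all numbers and function names are hypothetical.

```python
import math


def token_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new / pi_old from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def clipped_terms(ratios, advantages, eps=0.2):
    """PPO-style clipped surrogate term for each token."""
    terms = []
    for r, a in zip(ratios, advantages):
        r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        terms.append(min(r * a, r_clipped * a))
    return terms


def largest_share(terms):
    """Fraction of the total magnitude contributed by the largest term."""
    total = sum(abs(t) for t in terms)
    return max(abs(t) for t in terms) / total


if __name__ == "__main__":
    # Hypothetical log-probs: the last token's probability jumped sharply.
    logp_old = [-1.00, -2.00, -0.50, -6.00]
    logp_new = [-0.99, -2.02, -0.50, -1.00]
    advantages = [1.0, 1.0, 1.0, 1.0]

    ratios = token_ratios(logp_new, logp_old)
    raw = [r * a for r, a in zip(ratios, advantages)]
    clipped = clipped_terms(ratios, advantages)
    print(f"unclipped: largest token share = {largest_share(raw):.2f}")
    print(f"clipped:   largest token share = {largest_share(clipped):.2f}")
```

In this toy case one token out of four carries almost the entire unclipped objective. Clipping bounds its influence, but the clipping range is itself one of the settings that tends to be tuned empirically rather than predicted.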
You can’t just plug numbers into a formula and predict RL outcomes. Best practices for RL are often anecdotal and dependent on specific training settings. Researchers have to test design choices the hard way, by running massive, expensive experiments and seeing what doesn’t crash. This creates a bottleneck that limits iteration speed. Unlike pretraining, where you can estimate in advance what doubling compute will buy you, RL scaling is unpredictable. A method that works for a 7B parameter model might completely fail for a 70B model. This lack of predictive power makes RL integration one of the riskiest parts of modern LLM development.
Safety Doesn't Scale Linearly
Another area where scaling laws break is safety. We assume that as models get smarter, they also get safer, or at least that safety properties scale predictably alongside capabilities. This is false. Adversarial attacks, like jailbreaks, follow entirely different mathematical relationships.
Research shows that adversarial prompt-injection attacks can shift attack success rates from slow polynomial growth to exponential growth in the number of inference-time samples. Think of it like this: standard capability scaling is a steady hill climb. Jailbreak scaling is a cliff. Short injected prompts act like weak magnetic fields, yielding power-law scaling. Long injected prompts act like strong magnetic fields, producing exponential scaling. This means that as you scale up your model’s capabilities, its vulnerability to sophisticated attacks can skyrocket in ways that traditional scaling metrics don’t capture. Safety is not a free rider on capability; it requires separate, non-linear engineering efforts.
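The difference between the two regimes is easier to feel with numbers. The sketch below compares two assumed functional forms for how an attacker's failure rate shrinks with the number of samples `k`: a power law and an exponential. These forms and their constants are illustrative assumptions, not the fitted curves from the research mentioned above.

```python
import math


def failure_rate_power_law(k: int, b: float = 0.5) -> float:
    """Assumed weak regime: attacker failure rate shrinks like k ** -b."""
    return k ** -b


def failure_rate_exponential(k: int, c: float = 0.5) -> float:
    """Assumed strong regime: failure rate shrinks like exp(-c * (k - 1))."""
    return math.exp(-c * (k - 1))


if __name__ == "__main__":
    # Both curves are normalized to a failure rate of 1.0 at k = 1.
    print(f"{'samples':>7}  {'power-law success':>18}  {'exponential success':>20}")
    for k in (1, 2, 4, 8, 16, 32, 64):
        weak = 1.0 - failure_rate_power_law(k)
        strong = 1.0 - failure_rate_exponential(k)
        print(f"{k:>7}  {weak:>18.4f}  {strong:>20.4f}")
```

With these placeholder constants, the exponential curve is essentially saturated after a couple dozen samples while the power-law curve is still climbing, which is the gap a defender has to plan for.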
Test-Time Compute: The New Frontier
Finally, let’s talk about test-time scaling. This is an emerging paradigm where we apply more compute at inference time rather than training time. Instead of training a bigger model, we let the existing model "think" longer. It performs multiple inference passes, working through complex problems step-by-step. This approach drives demand for accelerated computing but shifts the efficiency frontier. The scaling relationships for inference-time computation differ fundamentally from training-time allocations. You trade latency for accuracy. While this doesn’t "break" the idea of scaling, it breaks the expectation that all scaling happens during training. The value of compute moves from the factory floor (training) to the storefront (inference).
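One simple way to spend extra inference compute is to sample several answers and take a majority vote. This is a swapped-in illustration, not necessarily the mechanism any particular system uses. The sketch assumes a binary-answer task, independent samples, and a hypothetical single-pass accuracy `p`; real samples are correlated, so actual gains would be smaller.

```python
from math import comb


def majority_vote_accuracy(p_correct: float, n_samples: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Assumes a binary-answer task and an odd n_samples (no ties).
    """
    if n_samples % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    needed = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


if __name__ == "__main__":
    p = 0.6  # hypothetical single-pass accuracy
    for n in (1, 3, 9, 27):
        acc = majority_vote_accuracy(p, n)
        print(f"{n:>2} passes (cost x{n}): accuracy {acc:.3f}")
```

Cost and latency grow linearly with the number of passes while accuracy rises with diminishing returns, which is the latency-for-accuracy trade described above. Note that if single-pass accuracy is below 50%, voting under these assumptions makes things worse, not better.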
What This Means for Your Strategy
The repeated discovery of broken scaling laws teaches us a meta-lesson: scaling laws hold precisely only for narrowly defined objectives, typically training loss in ideal conditions. Real-world constraints (finite data, inference requirements, safety needs, RL instability) systematically violate laboratory predictions. Different domains follow distinct scaling relationships that cannot be unified into a single framework. The gap between what scales predictably and what matters in practice creates perpetual friction.
If you’re investing in AI, don’t rely on universal scaling promises. Validate domain-specific returns. If you’re building models, remember that Chinchilla optimality is a starting point, not a finish line. Overtrain for production stability. Budget for the unpredictability of RL. And never assume that safety scales along for the ride.
Why did the Chinchilla paper change the AI industry?
The Chinchilla paper showed that previous scaling recommendations were suboptimal. For a given compute budget, scaling model size and dataset size in roughly equal proportion yields better results than putting most of the budget into model size. This meant many existing large models were undertrained relative to their size, wasting computational resources.
Do scaling laws apply to Reinforcement Learning?
No, reliable scaling laws for RL do not currently exist. RL training suffers from high variance and instability, particularly with long sequences and large models. Predictions based on pretraining scaling laws often fail in RL contexts, requiring empirical testing rather than mathematical prediction.
What is "overtraining" in the context of LLMs?
Overtraining refers to providing a model with more data than the Chinchilla-optimal calculation suggests. While less computationally efficient for minimizing training loss, overtraining improves generalization and inference-time performance, which is critical for production-quality applications.
How do jailbreak attacks scale compared to model capabilities?
Jailbreak attacks can scale exponentially with inference-time samples, especially with long prompt injections, whereas standard capability improvements typically follow slower power-law scaling. This divergence means safety vulnerabilities can grow faster than helpful capabilities as models scale.
Is test-time scaling more efficient than training larger models?
It depends on the use case. Test-time scaling trades latency for accuracy by allowing models to reason longer. It is highly effective for complex reasoning tasks but increases inference costs and response times, making it less suitable for high-throughput, low-latency applications.