
Efficient Sharding and Data Loading for Petabyte-Scale LLM Datasets

February 16, 2026

Training a large language model on a petabyte of data isn’t just a matter of throwing more GPUs at the problem. If you try to load all that data into memory at once, your system will crash before the first batch finishes. The real challenge isn’t storage; it’s data flow. How do you get data from a cloud bucket to a GPU fast enough to keep it busy? How do you split it so no single node gets overloaded? And how do you shuffle it so the model doesn’t memorize patterns from the order it sees data? This is where sharding and smart data loading become the backbone of scalable LLM training.

Why Sharding Isn’t Optional Anymore

Imagine you’re training a model with 70 billion parameters. Each parameter stored as a 32-bit float works out to about 280 GB just for the weights. But training isn’t inference. During training you also need to store gradients and optimizer states, which roughly triples the memory footprint. So now you’re looking at nearly 1 TB of memory for the training state alone. Even the most powerful GPU on the market today can’t hold that. Enter sharding.

Sharding breaks the model’s training state (parameters, gradients, and optimizer states) into smaller pieces and spreads them across multiple GPUs. Each GPU handles only a fraction. This isn’t just about memory. It’s about enabling training at all. Without sharding, models larger than a few billion parameters simply can’t be trained on existing hardware.

The scaling factor is brutal: memory needed ≈ 3 to 5 times the model size. Half-precision formats (bfloat16) cut that in half, but even then, models like DeepSeek-R1-671B still demand more than any single machine can offer. Sharding makes it possible.
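
As a rough sanity check, here is a minimal Python sketch of that back-of-the-envelope estimate. The per-parameter byte counts are illustrative assumptions (an Adam-style optimizer with fp32 moments pushes the multiplier higher), not figures from any particular framework.

    # Back-of-the-envelope training-memory estimate (activations excluded).
    # The byte counts are illustrative assumptions; real optimizers such as
    # Adam, which keeps two fp32 moments, can push the multiplier higher.

    def training_memory_gb(n_params, param_bytes=4, grad_bytes=4, optim_bytes=4):
        total_bytes = n_params * (param_bytes + grad_bytes + optim_bytes)
        return total_bytes / 1e9

    full = training_memory_gb(70e9)          # ~840 GB for a 70B model, about 3x the weights
    print(f"unsharded: {full:.0f} GB")
    print(f"sharded 32 ways: {full / 32:.0f} GB per GPU")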

How Sharding Works in Practice

Sharding isn’t one technique; it’s a family of strategies. The two most common are sharded data parallelism and tensor parallelism, and they’re often used together.

In sharded data parallelism, each GPU gets a different shard of the optimizer state and model parameters. Gradients are computed locally, then synced across shards. This reduces memory per GPU without sacrificing the benefits of data parallelism. For example, training a GPT-NeoX-65B model on 64 ml.p4d.24xlarge instances uses degree-2 sharded data parallelism. That means each copy of the model state is sharded across pairs of GPUs, roughly halving the memory pressure on each.
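
The SageMaker model parallelism library exposes this through its own configuration; as an open-source point of reference, PyTorch's FullyShardedDataParallel (FSDP) implements the same idea. Below is a minimal sketch, assuming a torchrun launch, NCCL, and a toy placeholder model rather than GPT-NeoX.

    # Minimal sharded-data-parallel sketch using PyTorch FSDP.
    # Launch with: torchrun --nproc_per_node=8 train.py
    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def main():
        dist.init_process_group("nccl")
        local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
        torch.cuda.set_device(local_rank)

        # Placeholder model; real runs wrap transformer blocks via an auto-wrap policy.
        model = torch.nn.Sequential(
            torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
        ).cuda()

        # FSDP shards parameters, gradients, and optimizer state across all ranks;
        # full layers are materialized only transiently during forward/backward.
        model = FSDP(model)
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

        x = torch.randn(8, 4096, device="cuda")      # each rank sees its own slice of the batch
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()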

Tensor parallelism takes it further. Instead of splitting the whole model across nodes, it splits individual layers across multiple GPUs. A single attention layer might be split across 64 GPUs. This is essential when you’re working with long sequences (say, 512 tokens per sample) because each sequence requires a lot of intermediate activations that eat up memory.
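
To make the splitting concrete, here is a toy single-process illustration with no real multi-GPU communication: a linear layer's weight is chunked column-wise, each chunk produces a slice of the output, and concatenating the slices reproduces the unsharded result. Real tensor-parallel implementations (Megatron-style layers, for example) do the same thing across devices with collective operations; the sizes below are arbitrary.

    # Toy column-parallel split of a linear layer, on one process for clarity.
    # Real tensor parallelism places each chunk on a different GPU and uses
    # all-gather / reduce collectives instead of a local torch.cat.
    import torch

    hidden, out_dim, tp_degree = 512, 2048, 4
    x = torch.randn(8, hidden)
    full_weight = torch.randn(out_dim, hidden)

    # Split the output dimension into tp_degree shards.
    shards = full_weight.chunk(tp_degree, dim=0)

    # Each "device" computes only its slice of the output.
    partial_outputs = [x @ w.T for w in shards]

    # Concatenating the slices matches the unsharded layer.
    assert torch.allclose(torch.cat(partial_outputs, dim=-1), x @ full_weight.T, atol=1e-5)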

The most powerful setups combine both. A cluster of 1,536 GPUs (192 nodes, 8 GPUs each) might use 32-way sharded data parallelism and 64-way tensor parallelism. That means each shard handles 1/32 of the optimizer state, and each tensor layer is split across 64 GPUs. The result? A global batch size of 768, with each batch containing 3 million tokens. That’s the kind of scale needed to train state-of-the-art models.

Where the Data Lives: Storage Architecture

You can’t train on data that’s stuck in a cloud bucket. You need to move it fast. That’s why petabyte-scale training pipelines use a tiered storage approach.

At the bottom, you have object storage: AWS S3, Google Cloud Storage, or Azure Blob. These are cheap, durable, and can hold petabytes. But they’re slow to read from, especially when thousands of GPUs are asking for data at once.

In the middle, you have distributed file systems like HDFS, CephFS, or WekaIO. These act like a single, high-speed filesystem that spans dozens or hundreds of servers. They’re mounted directly on your training nodes and serve as a caching layer. Data is copied from S3 into this layer before training starts, so GPUs read from local disks, not over the network.
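
A minimal staging script for that copy step might look like the sketch below, using boto3. The bucket name, prefix, and destination mount point are placeholders, and in practice teams often use parallel copy tools (for example, aws s3 sync) instead.

    # Stage training shards from S3 onto a fast local or distributed filesystem
    # before training starts. Bucket, prefix, and destination are placeholders.
    import os
    import boto3

    BUCKET = "my-training-data"          # placeholder
    PREFIX = "pretrain/shards/"          # placeholder
    DEST = "/mnt/weka/shards"            # assumed mount point of the distributed filesystem

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    os.makedirs(DEST, exist_ok=True)
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            local_path = os.path.join(DEST, os.path.basename(key))
            if not os.path.exists(local_path):   # skip shards already cached
                s3.download_file(BUCKET, key, local_path)

The point is that once this finishes, training nodes read shards from the mounted filesystem rather than hitting S3 directly.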

Above that, you have data lakes or lakehouses built on Apache Iceberg, Delta Lake, or Hudi. These aren’t storage layers; they’re metadata engines. They track schema changes, manage versioning, and ensure you’re always training on the right version of your dataset. They make it possible to roll back to a previous version if your model starts performing poorly.

The key is to keep the hot data (what’s being used right now) on fast storage, and the cold data (what’s archived) on cheap storage. This balance keeps costs down and speeds up training.

Figure: tiered storage architecture, with data flowing from S3 through a distributed filesystem to the GPU nodes.

How Data Gets to the GPU: Loading Frameworks

Sharding the data is only half the battle. The other half is loading it without starving your GPUs.

If your data loader can’t feed batches fast enough, your GPUs sit idle. That’s a waste of millions of dollars in compute. So the loading pipeline has to be as optimized as the sharding strategy.

Frameworks like NVIDIA DALI, WebDataset, and TensorFlow’s tf.data are built for this. They don’t just read files; they prefetch, decode, and shuffle in parallel.

For example, WebDataset uses .tar.lz4 shards: compressed archives with thousands of samples inside. Each shard is a single file. When a node starts training, it pulls a few shards into memory, decodes them in parallel, and feeds batches to the GPU. While one batch is training, the next is already being decoded. This pipelining keeps the GPU busy.
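
A minimal WebDataset pipeline looks roughly like this. The shard URL pattern, batch size, and the assumption that each sample carries a "txt" entry are placeholders; the point is the shape of the pipeline: shard shuffle, then a sample shuffle buffer, then background decoding and prefetch.

    # Minimal WebDataset loading pipeline: shard-level shuffle, parallel decode,
    # sample-level shuffle buffer, and background prefetch via DataLoader workers.
    import webdataset as wds
    from torch.utils.data import DataLoader

    shard_urls = "/mnt/weka/shards/train-{000000..000999}.tar"   # placeholder pattern

    dataset = (
        wds.WebDataset(shard_urls, shardshuffle=True)  # shuffle shard order first
        .shuffle(1000)                                 # then shuffle within a sample buffer
        .decode()                                      # decode raw bytes in the worker
        .to_tuple("txt")                               # each sample contributes its "txt" entry
    )

    # Workers decode the next batches while the GPU trains on the current one.
    loader = DataLoader(dataset, batch_size=32, num_workers=8, prefetch_factor=4)

    for (texts,) in loader:
        pass  # tokenize and feed the batch to the model here

While one batch is on the GPU, the DataLoader workers are already pulling and decoding the next shards, which is exactly the pipelining described above.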

Shuffling is another big challenge. If you shuffle within each shard, you get local randomness but not global randomness. The solution? Shuffle the order of shards first, then shuffle samples within each shard. Tools like AIStore do this automatically, ensuring every GPU sees a globally randomized sequence.
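
The same two-level scheme is easy to express by hand: shuffle the shard list globally with a seed shared across workers, then push samples through a bounded shuffle buffer. A minimal sketch, where read_samples() is a hypothetical shard reader:

    # Two-level shuffle: global shard-order shuffle, then a bounded in-memory
    # sample shuffle buffer. Shard names and read_samples() are hypothetical.
    import random

    def read_samples(shard):
        # Stand-in reader: a real implementation would decode the shard's contents.
        yield from range(3)

    def shuffled_stream(shards, epoch, buffer_size=10_000, seed=1234):
        rng = random.Random(seed + epoch)
        order = list(shards)
        rng.shuffle(order)                        # level 1: global shard order

        buffer = []
        for shard in order:
            for sample in read_samples(shard):
                buffer.append(sample)
                if len(buffer) >= buffer_size:
                    idx = rng.randrange(len(buffer))
                    yield buffer.pop(idx)         # level 2: sample-level shuffle buffer
        rng.shuffle(buffer)
        yield from buffer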

Batch Size: Start Small, Scale Smart

One of the most common mistakes is starting with a huge batch size. That’s like trying to run a marathon before you’ve learned to walk.

Best practice? Start with batch size 1 per GPU. Increase it slowly until you hit an out-of-memory error. Then back off by 10%. This gives you the largest batch size your hardware can handle without crashing.
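
A hedged sketch of that probe is below. train_step() is a hypothetical stand-in for one full forward/backward/optimizer step at a given batch size, the doubling-then-back-off policy is illustrative, and torch.cuda.OutOfMemoryError assumes a reasonably recent PyTorch.

    # Probe the largest per-GPU batch size: grow until an out-of-memory error,
    # then back off roughly 10%. train_step() is a hypothetical stand-in for
    # one forward/backward/optimizer step on a real batch.
    import torch

    def find_max_batch_size(train_step, start=1, limit=4096):
        batch_size, last_good = start, None
        while batch_size <= limit:
            try:
                train_step(batch_size)
                last_good = batch_size
                batch_size *= 2                    # grow until we hit OOM
            except torch.cuda.OutOfMemoryError:
                torch.cuda.empty_cache()
                break
        if last_good is None:
            raise RuntimeError("Even batch size 1 does not fit; increase sharding.")
        return max(1, int(last_good * 0.9))        # back off for headroom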

If you can’t even train with batch size 1, you need more sharding. Increase the sharded data parallelism degree, or combine it with tensor parallelism. The goal isn’t to use the biggest batch possible; it’s to use the biggest batch that trains stably and converges well.

Larger batch sizes improve convergence and generalization. But only if they’re supported by your hardware. Pushing beyond what your sharding setup can handle leads to instability, not speed.

Real-World Data Sizes: Less Is More

There’s a myth that bigger datasets always mean better models. Not true.

Llama-3.1-8B has been fine-tuned on just 1 million tokens. DeepSeek-R1-Distill-Qwen-14B used 50 million. Llama-3.3-70B-Instruct? Only 119 million. These aren’t typos; they’re fine-tuning and adaptation token counts, not pretraining corpora, and they’re proof that targeted data beats brute-force volume.

Domain adaptation is changing the game. A cybersecurity LLM trained on 118.8 million tokens outperformed models trained on 2.77 billion tokens. Why? Because the data was relevant. It wasn’t just more data; it was better data.

This means you don’t need petabytes to train a powerful model. You need the right petabytes. Sharding lets you handle massive datasets, but smart curation lets you avoid them altogether.


Custom Sharding: Tailoring Data to Your Task

Not all data is created equal. Sometimes, you don’t need to keep every label separate.

With ImageNet, you might have thousands of classes: ‘tench’, ‘goldfish’, ‘bass’, ‘salmon’. But if you’re building a fish classifier, do you really need to distinguish between all of them? Probably not.

Custom sharding lets you group similar classes. Tools like ishard let you define mapping rules: map ‘tench’ and ‘goldfish’ to ‘fish’. Now your shards are smaller, easier to load, and your model learns broader patterns.

This isn’t just for images. Text data can be grouped by topic, language, or source. Code data can be grouped by language or framework. The point is: sharding isn’t just about splitting; it’s about organizing.
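
As an illustration of grouping while writing shards, here is a sketch using WebDataset's ShardWriter. The label map and sample source are placeholders, and the real ishard tool expresses the same mapping through its own configuration rather than this code.

    # Group fine-grained labels into coarser classes while writing shards.
    # iter_samples() and the label map are placeholders.
    import webdataset as wds

    LABEL_MAP = {"tench": "fish", "goldfish": "fish", "bass": "fish", "salmon": "fish"}

    def iter_samples():
        # Stand-in source: a real pipeline would read (key, image_bytes, label) records.
        yield "000001", b"...jpeg bytes...", "tench"
        yield "000002", b"...jpeg bytes...", "goldfish"

    with wds.ShardWriter("grouped-%06d.tar", maxcount=10_000) as sink:
        for key, image_bytes, label in iter_samples():
            sink.write({
                "__key__": key,
                "jpg": image_bytes,
                "cls": LABEL_MAP.get(label, label).encode("utf-8"),
            })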

Scaling the Pipeline: Storage Must Keep Up

The biggest bottleneck in petabyte-scale training isn’t the GPU. It’s the storage.

If your distributed filesystem can’t deliver 10 GB/s across 100 nodes, your GPUs will starve. That’s why scalable storage isn’t a luxury; it’s a requirement.

Modern systems scale by adding nodes. Add more storage servers, and your throughput increases. Add more network switches, and your bandwidth grows. This horizontal scaling means you can grow your training pipeline as your model grows.

The goal? Make storage invisible. When a GPU asks for data, it should get it in milliseconds, not seconds. That’s only possible with a well-designed, horizontally scalable storage layer.

What’s Next: The Future of Sharding

Sharding and data loading are evolving fast. New frameworks like SageMaker Model Parallelism v1.15+ support PyTorch 1.13.1 and beyond, making it easier to combine sharded and tensor parallelism without custom code.

Research is moving toward adaptive sharding, where the system automatically adjusts shard size and distribution based on GPU utilization, network load, and data access patterns. Imagine a system that notices one shard is slower to load and redistributes its data on the fly.

We’re also seeing more integration between data curation and sharding. Tools are emerging that don’t just split data; they analyze it, clean it, and shard it in a way that maximizes learning efficiency.

The future isn’t just bigger models. It’s smarter pipelines. And that starts with how you handle the data.

What’s the difference between sharded data parallelism and tensor parallelism?

Sharded data parallelism splits the model’s parameters, gradients, and optimizer states across multiple GPUs to reduce memory per device. Each GPU trains on a different shard of the data, but the full model is reconstructed during forward and backward passes. Tensor parallelism, on the other hand, splits individual layers (like attention heads or MLPs) across multiple GPUs. This reduces memory for intermediate activations, which is critical for long sequences. They’re often used together: sharded data parallelism handles memory for parameters, while tensor parallelism handles memory for activations.

Do I need petabytes of data to train a good LLM?

No. Recent models like Llama-3.1-8B and DeepSeek-R1-Distill-Qwen-14B have been fine-tuned on datasets under 100 million tokens. What matters more than size is quality and relevance. A focused dataset with high-quality, domain-specific data often outperforms a massive but noisy one. Sharding lets you handle petabytes, but smart data selection lets you avoid needing them.

Why can’t I just use a single high-end GPU for training?

Training requires storing the model, gradients, and optimizer states simultaneously. This triples the memory needed compared to inference. Even a 70B model needs nearly 1 TB of GPU memory to train. No single GPU today has that much memory. Sharding distributes this load across many GPUs, making training possible.

How do I avoid GPU idling during training?

Use a prefetching data loader like WebDataset or NVIDIA DALI. These tools decode and shuffle data in the background while the GPU is training. They also keep multiple batches ready in memory. Combine this with a high-speed distributed filesystem (like WekaIO or CephFS) to serve data from local disks instead of over the network. This keeps the data pipeline full and the GPU busy.

What’s the best way to shuffle data across shards?

Don’t shuffle within each shard alone; that creates local randomness but not global randomness. Instead, shuffle the order of the shards first, then shuffle samples within each shard. Tools like AIStore automate this by globally shuffling shard names and using client-side shuffle buffers. This ensures every GPU sees a truly randomized sequence, improving model generalization.

Can I train on data stored only in S3?

Technically yes, but you’ll be bottlenecked. S3 is slow for high-throughput, low-latency reads. When thousands of GPUs request data simultaneously, S3 can’t keep up. Use S3 as your archive, then copy data into a distributed filesystem like HDFS or WekaIO before training. This gives you the cost savings of object storage with the speed of local storage.

How do I know if my sharding setup is working?

Monitor GPU utilization. If it’s consistently below 80%, your data loader is too slow. Check your data pipeline’s throughput; aim for at least 5 to 10 GB/s per node. Use profiling tools like PyTorch Profiler or NVIDIA Nsight Systems to trace delays. If data loading takes longer than computation, you need faster storage, better sharding, or a more efficient loader.
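
One hedged way to see where the time goes is PyTorch’s built-in profiler. In the sketch below, train_one_step() is a hypothetical stand-in for fetching a batch from your loader and running forward/backward on it.

    # Profile a handful of steps to see whether data loading (CPU) or compute
    # (GPU) dominates. Replace train_one_step() with your real training step.
    import torch
    from torch.profiler import profile, ProfilerActivity

    def train_one_step():
        # Stand-in step: replace with a real batch fetch + forward/backward.
        device = "cuda" if torch.cuda.is_available() else "cpu"
        x = torch.randn(64, 1024, device=device, requires_grad=True)
        (x @ x.T).sum().backward()

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities) as prof:
        for _ in range(10):
            train_one_step()

    # Large CPU-side dataloader ops alongside idle CUDA time point to a starved GPU.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))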

Is sharding only for training, or does it help with inference too?

Sharding is primarily for training. Inference can often run on a single GPU using quantization or offloading. Sharding isn’t needed because inference doesn’t require storing gradients or optimizer states. However, some large inference systems use model parallelism (similar to tensor parallelism) to split very large models across multiple GPUs for faster response times.