Rotary Position Embeddings and ALiBi: How Modern LLMs Handle Position Without Learned Embeddings
March 16, 2026
Most people think transformers need to learn where each word sits in a sentence. That's not true anymore. The latest big language models like Llama, Falcon, and GPT-NeoX don't use traditional positional embeddings at all. Instead, they use smarter, simpler tricks built right into the attention mechanism. Two of the biggest breakthroughs here are Rotary Position Embeddings and ALiBi. They solve the same problem in completely different ways, and both are changing how we build models that handle long texts.
Why Position Matters More Than You Think
Transformers work by comparing every word to every other word. But if you don't tell the model which word came first, second, or tenth, it treats "The cat sat" and "Sat the cat" as the same thing. That's called permutation invariance, and it's a disaster for language. Early transformers added fixed sinusoidal patterns to the word embeddings; later models swapped those for a learned vector per position. Think of it like tagging each word with a unique signature based on where it sits. But both schemes had big flaws. They didn't scale: if you trained on 2,048 tokens and tried to run on 4,096, performance dropped hard. Learned embeddings also needed a separate lookup table for every possible position, which blew up memory and made caching inefficient. The real shift came when researchers realized: don't mix position with meaning. Instead of adding position info to the word vector, change how attention scores are calculated. That's where RoPE and ALiBi come in. They don't touch the input embeddings. They live inside attention.

Rotary Position Embeddings: Rotating Vectors to Remember Order
Imagine taking a 2D vector and spinning it around a circle. The angle of rotation tells you where it is in the sequence. That's the core idea behind Rotary Position Embeddings (RoPE). Instead of adding a number to the embedding, RoPE applies a rotation matrix to the query and key vectors inside attention. Each position gets its own rotation angle, built from a cosine-sine pair. For example, one dimension pair might rotate by 0° at position 0, 10° at position 1, 20° at position 2, and so on. Here's why this is genius: when you compute the dot product between a query and a key, the result naturally depends on the difference between their rotation angles. So if a query at position 5 looks at a key at position 10, the dot product reflects a 5-step gap. No lookup tables. No extra parameters. Just math. RoPE was introduced in the RoFormer paper, adopted early by GPT-J and GPT-NeoX, popularized by Llama, and quickly became the go-to for open-source models. It's elegant, stable, and works across modalities: not just text, but also images and audio. You can train on 8K tokens and infer on 100K with a simple trick: scale the positions (and hence the rotation angles) down by the ratio of training length to inference length, a technique known as position interpolation. That's why models like Mistral and Qwen handle 128K context without retraining from scratch. But RoPE isn't perfect. It's computationally heavier than ALiBi. The rotations require extra elementwise operations, and while they're fast on modern GPUs, they still add overhead. More importantly, RoPE's theoretical strength doesn't always translate to real-world extrapolation. If you push it too far beyond its training length, performance can dip.

ALiBi: Biasing Attention with Distance
ALiBi (Attention with Linear Biases) takes a completely different path. Instead of rotating vectors, it adds a simple linear penalty to attention scores. The farther apart two tokens are, the more you subtract from their attention score before the softmax. Here's how it works: for every query-key pair, you calculate their distance. Say the query is at position 10 and the key is at position 3. The distance is 7. You multiply that by a per-head slope (say 0.5) and subtract the result from the attention score. So attention scores naturally favor nearby tokens. No extra vectors. No memory growth. No learnable parameters. ALiBi was introduced by Press et al. and deployed at scale in models like BLOOM and MPT, where it quickly proved its worth. It's faster to train than RoPE because there's less computation per layer. It also extrapolates better: a model trained on 4K tokens can handle 16K or even 32K without tuning. That's because ALiBi doesn't rely on learned patterns; it uses a fixed, interpretable rule: more distance means less relevance. But ALiBi had one early flaw: when you extended context, the bias grew so strong it drowned out the actual content signals. In 2023, researchers addressed this with dynamic slope scaling. The trick? Multiply the slope by L/L′, where L is the training length and L′ is the new inference length. So if you trained on 4K and now run on 16K, you scale the slope by 4K/16K = 0.25. That keeps attention scores in the right range. This tweak made ALiBi a strong choice for long-context applications.
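The whole recipe above, including dynamic slope scaling, fits in a few lines. Here is a minimal NumPy sketch using the standard ALiBi slope schedule (a geometric sequence per head); the function and parameter names are illustrative, not from any particular library:

```python
import numpy as np

def alibi_bias(seq_len, n_heads, train_len=None):
    """Per-head linear distance penalties to add to attention scores.
    If train_len is given, apply dynamic slope scaling (slope * L / L')."""
    # Standard ALiBi schedule: for 8 heads, slopes are 1/2, 1/4, ..., 1/256.
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    if train_len is not None:
        slopes = slopes * (train_len / seq_len)  # e.g. 4K -> 16K scales by 0.25
    pos = np.arange(seq_len)
    # distance[i, j] = how many tokens key j sits behind query i
    # (positive for past tokens; a causal mask hides the future ones)
    distance = pos[:, None] - pos[None, :]
    bias = -slopes[:, None, None] * distance[None, :, :]
    return bias  # shape (n_heads, seq_len, seq_len); add to scores pre-softmax

bias = alibi_bias(seq_len=8, n_heads=4)
print(bias[0, 7, 0])  # -1.75: head-0 slope 0.25 times a 7-token gap
```

In a real attention layer this tensor is simply added to the query-key score matrix before the softmax, which is why ALiBi needs no extra vectors and no learned weights.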
Side-by-Side: RoPE vs ALiBi
| Feature | Rotary Position Embeddings (RoPE) | Attention with Linear Biases (ALiBi) |
|---|---|---|
| How it works | Rotates query and key vectors using trigonometric functions | Adds linear distance penalty to attention scores |
| Learnable parameters | Zero | Zero |
| Memory overhead | Constant | Constant |
| Extrapolation strength | Good with scaling, but degrades beyond training length | Excellent, especially with dynamic slope scaling |
| Training speed | Slower due to rotation operations | Faster, simpler math |
| Implementation complexity | Higher (requires a custom attention kernel) | Lower (just add a bias term) |
| Adopted in | GPT-J, GPT-NeoX, Llama, Llama 2, Falcon, Qwen | BLOOM, MPT, some vision transformers |
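To make the first table row concrete, here is a hedged NumPy sketch of RoPE's rotation (`rope_rotate` and its arguments are illustrative names, not a real library API). It checks the key property from the RoPE section: the query-key dot product depends only on the relative offset between positions, not on their absolute values:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate vector x (even dimension d) for sequence position pos,
    treating dims (0,1), (2,3), ... as independent 2-D planes."""
    d = x.shape[-1]
    # One rotation frequency per 2-D pair, decaying geometrically.
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin  # standard 2-D rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same 5-step gap at two different absolute positions:
s_a = rope_rotate(q, 5) @ rope_rotate(k, 10)
s_b = rope_rotate(q, 105) @ rope_rotate(k, 110)
print(np.isclose(s_a, s_b))  # True: scores match for equal offsets
```

Context-length scaling via position interpolation amounts to passing a compressed position here, e.g. `rope_rotate(q, pos * train_len / new_len)`, so angles stay inside the range seen during training.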
Why Both Are Better Than Old-Style Positional Embeddings
Before RoPE and ALiBi, models used learned positional embeddings. Each position had its own vector. So if you trained on 10K tokens, you needed 10K vectors. That's not just memory-heavy; it's inflexible. Change the context length? You need to retrain. RoPE and ALiBi don't do that. They're parameter-free. The same math works for 1K, 10K, or 100K tokens. That's why modern models can handle context lengths that were unthinkable five years ago. They also make attention more interpretable. With RoPE, you can see how rotation angles encode distance. With ALiBi, you can literally plot the bias and see how attention drops off linearly. That's huge for debugging and improving models.
Which One Should You Use?
If you're building a general-purpose LLM and want to match Llama's performance, RoPE is your safe bet. It's battle-tested, widely supported, and integrates smoothly with existing attention code. If you're focused on long-context tasks (think legal documents, research papers, or codebases), ALiBi with dynamic scaling is the better choice. It trains faster, extrapolates better, and uses less memory during inference. Some teams are even combining them. For example, a model might use ALiBi for cross-attention layers and RoPE for self-attention. The field is moving toward hybrid designs.

The Bigger Picture: Position Is a Separate Dimension
The real win here isn't the math. It's the mindset shift. We used to think position was just another feature to add to the word vector. Now we know: position is its own dimension. It's not about what the word means; it's about where it sits. That's why these methods work so well. They keep semantics clean and position explicit. This is why vision transformers are starting to use ALiBi too. In images, position matters just as much as color or texture. ALiBi's linear bias extends naturally to 2D grids. RoPE's rotation works well for sequences, but ALiBi's simplicity shines in spatial data.

What's Next?
Researchers are already pushing beyond these two. New variants like RoPE with dynamic scaling and ALiBi with adaptive slopes are being tested. There's also work on combining them with recurrent components, like state-space models. The goal? Build models that handle ultra-long sequences (think 1M tokens) with zero loss in quality. One thing's clear: the age of learned positional embeddings is over. The future belongs to methods that are mathematically elegant, computationally efficient, and infinitely scalable. RoPE and ALiBi aren't just improvements; they're the new standard.

Do RoPE and ALiBi need to be trained?
No. Both RoPE and ALiBi are completely parameter-free. They don’t add any learnable weights to the model. RoPE uses fixed rotation matrices based on cosine and sine functions. ALiBi uses fixed linear slopes. The model learns how to use these position signals during training, but the position encoding itself is not updated.
Can I use ALiBi with models that originally used RoPE?
Technically yes, but it’s not straightforward. You’d need to retrain the model because the attention patterns learned under RoPE won’t transfer directly to ALiBi. The two methods encode position differently, so the model’s weights are optimized for one or the other. Switching mid-training or after training will hurt performance unless you retrain from scratch.
Why do some models still use sinusoidal embeddings?
Mostly for legacy reasons or small-scale experiments. Sinusoidal embeddings were used in the original Transformer paper and are still found in older codebases. But they don’t scale well. If you’re building a real-world model today, you’d use RoPE or ALiBi instead. No major open-source LLMs released after 2023 use traditional positional embeddings.
Which is faster: RoPE or ALiBi during inference?
ALiBi is faster. It adds a simple linear term to attention scores-just one subtraction per attention pair. RoPE requires rotating query and key vectors using sine and cosine operations, which involve more floating-point calculations. In practice, ALiBi reduces attention layer latency by 5-15% depending on context length and hardware.
Are RoPE and ALiBi used outside of language models?
Yes. RoPE has been adapted for vision transformers, speech recognition, and even geospatial data. ALiBi is being used in multimodal models that combine text and images because its linear bias works naturally in 2D grids. Both are becoming standard in any transformer-based system that needs to handle ordered sequences.