Scaling Context Size for Llama 4 Model Training


Summary

Scaling context size for Llama 4 model training means increasing the amount of text the model can process at once, allowing it to handle longer conversations or documents. This process is challenging because it dramatically increases computational needs and often requires innovative strategies to keep training practical and affordable.

  • Explore architectural changes: Try techniques like sliding window attention to reduce computation requirements when expanding context size, making longer sequences more manageable.
  • Utilize memory-saving methods: Adopt tools such as gradient checkpointing and Cut Cross Entropy to lower hardware demands, enabling training with much longer contexts.
  • Simulate and fine-tune: Consider using synthetic data or test-time training to extend context windows and refine the model’s performance when handling extensive input.
Summarized by AI based on LinkedIn member posts
  • Anis ZAKARI, Head of AI Engineering

    It seems that even for companies like Meta, OpenAI, or Google, training an LLM from scratch with a very long context window, such as 128k to 1M tokens, isn’t practical, because:

    - The cost is prohibitively high. For instance, training LLaMA 2 with a 4k context window was approximately twice as expensive as training LLaMA 1 with a 2k context window (even though LLaMA 2 was trained on 40% more tokens). If this extrapolation holds, extending the context window from 8k to 128k would mean at least a 16x increase in cost.
    - There are hardly any long-sequence datasets publicly available. In other words, you would need to create a dataset of sequences up to 128k or 1M tokens for the training to be meaningful. Currently, most people likely use synthetic data to stitch sequences together and somehow maintain coherence, but this approach is challenging and incredibly tedious.

    So, how do you produce long-context LLMs without breaking the bank? I suspected that for such “LongLLMs,” pre-training was done in two stages: the first stage uses a shorter context window on the majority of the dataset, and the second uses much less data but an extended context window.

    Yesterday, I took the time to read the LLaMA 3 405B paper and came across this sentence: “We pre-train a model with 405B parameters on 15.6T tokens using a context window of 8K tokens. This standard pre-training stage is followed by a continued pre-training stage that increases the supported context window to 128K tokens.” So, there it is!

    Even though this approach works, I suspect the model may retain a bias towards the shorter sequences seen during the initial training, potentially affecting its ability to handle long sequences with the same accuracy, thus leaving room for hallucinations and loss of information. The question now is: how do you properly evaluate long-context LLMs? Such evaluations would obviously be quite costly and time-consuming.

    I wanted to perform some empirical tests to evaluate how accurate the LLM is at different sequence lengths (I’m expecting to see a decline in accuracy after 20-30k tokens) but couldn’t find any benchmarks or evaluations that were satisfying. Any ideas?
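The two-stage schedule above can be put into rough numbers. A minimal back-of-envelope sketch, assuming attention cost per trained token grows linearly with context length, and assuming a hypothetical 0.8T-token budget for the continued pre-training stage (the paper gives the 15.6T figure but not a stage-2 token count):

```python
def attention_cost(tokens, ctx):
    """Relative attention cost: each trained token attends over ~ctx positions,
    so total cost scales like (training tokens) x (context length)."""
    return tokens * ctx

stage1 = attention_cost(15.6e12, 8_000)    # bulk pre-training at 8K (from the paper)
stage2 = attention_cost(0.8e12, 128_000)   # continued pre-training at 128K (hypothetical budget)
naive  = attention_cost(16.4e12, 128_000)  # same total corpus trained entirely at 128K

print(f"two-stage costs {100 * (stage1 + stage2) / naive:.0f}% of naive long-context training")
```

Under these assumptions the staged recipe spends roughly a tenth of what training the whole corpus at 128K would, which is the economic argument the post is making.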

  • Philipp Schmid, AI Developer Experience at Google DeepMind (prev: Tech Lead at Hugging Face, AWS ML Hero 🤗)

    Llama 3 extended to almost 100,000-token context! ✅ By combining PoSE and continued pre-training on the Llama 3 8B base for 300M tokens, the community (Wing Lian) managed to extend the context from 8k to 64k. 🚀 Applying RoPE scaling afterward led to a supported context window of close to 100,000 with perfect recall. 🤯🚀

    PoSE can extend the context window of LLMs by simulating long inputs within a fixed context window during training. It chunks the document into smaller pieces and simulates them as “long” versions, which significantly reduces memory and time overhead while maintaining performance.

    𝗜𝗻𝘀𝗶𝗴𝗵𝘁𝘀:
    🚫 Don’t increase rope_theta during pre-training
    🚀 Rank-stabilized LoRA converged much quicker than regular LoRA
    ⬆️ Increased the RoPE theta to extend the context to ~90k
    ➕ Adapters can be merged with any Llama 3 model to extend the context

    Llama 3 8B 64k: https://lnkd.in/d6cprxvT
    Original Thread: https://lnkd.in/dnVn8vKu
    PoSE Paper: https://lnkd.in/dmtDNwwe
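The core PoSE trick is that the model only ever trains on a short window of tokens, but the position ids it sees are stretched out to cover the target length. A minimal sketch of that idea (chunking plus random position skips; function name and chunking details are illustrative, not the reference implementation):

```python
import random

def pose_position_ids(train_len, target_len, num_chunks=2, rng=random):
    """Simulate positions up to target_len while only training on train_len
    tokens: split the window into chunks and shift later chunks' position
    ids forward by a random skip, so the model sees large relative
    distances without ever attending over a long sequence."""
    chunk = train_len // num_chunks
    ids, offset = [], 0
    budget = target_len - train_len        # total positions we may skip over
    for c in range(num_chunks):
        ids.extend(range(offset + c * chunk, offset + (c + 1) * chunk))
        if c < num_chunks - 1:
            skip = rng.randint(0, budget)  # random gap between chunks
            offset += skip
            budget -= skip
    return ids

# Train on an 8K window but expose position ids spanning up to 64K:
ids = pose_position_ids(8_192, 65_536, num_chunks=2)
print(len(ids), max(ids) < 65_536)
```

Only 8,192 tokens are ever attended over per example, yet the RoPE positions range across the full 64K target, which is why the memory and time overhead stays at short-context levels.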

  • Anshuman Mishra, ML @ Zomato

    You’re in a Machine Learning interview at Meta, and the interviewer asks: ‘Why is scaling context length so hard? What’s the fundamental bottleneck?’

    Here’s how you answer. Don’t say ‘memory limitations’ or ‘GPU constraints.’ Wrong framing. The real answer is the O(n²) complexity of self-attention. Every token must attend to every other token. Double your context? You 4x your compute. It’s not linear scaling, it’s quadratic pain.

    NOTE: if you want to read this kind of content daily, consider subscribing to my free newsletter: https://lnkd.in/gCPD6fUz

    Here’s the math that breaks everything:
    → 8K context ≈ 64M attention computations
    → 128K context ≈ 16B computations
    → 1M context ≈ 1T computations
    Your compute cost doesn’t grow with n, it grows with n². This is why throwing money at the problem doesn’t solve it.

    The architectural constraint everyone misses:
    - Full attention means an 8×8 matrix for 8 tokens, a 128K×128K matrix for 128K tokens.
    - Memory isn’t just storing tokens, it’s storing every possible token relationship.
    - At 1M tokens, you need to materialize a 1M×1M attention matrix. That’s 1 trillion float values.

    This is why context length is the great divider:
    - GPT-4, Llama = 128-200K max (hitting the O(n²) wall)
    - Claude, Gemini 1M = heavy optimization + unknown architecture tricks + $$$
    - Most open source = 32K or less (the practical ceiling)
    - Startups = can’t afford to compete on context

    ‘So how do you actually scale past 128K?’ The interviewer leans forward. This is where Sliding Window Attention (SWA) can help. Instead of every token attending to every token, each token only attends to a fixed window around it. Complexity drops from O(n²) to O(n×w). Suddenly 1M tokens becomes feasible.

    The clever part: local attention + deep layers = global understanding. Token 1 doesn’t directly see token 100K, but through 32 layers of propagation, information flows across the entire sequence. Like how CNNs build from edges to objects: each layer has a local receptive field, but the stack creates global vision.

    The implementation trick that makes it work: split Q and K into overlapping chunks (size 2w, overlap w) and do attention within chunks only, as one PyTorch matmul operation. Yes, you compute 2x more than theoretically optimal, but you go from ‘impossible’ to ‘runs on one GPU.’ That’s the tradeoff that matters.

    The answer that gets you hired: ‘Scaling context is hard because attention is fundamentally O(n²). Full attention at 1M tokens is computationally intractable for most. The solution isn’t more hardware, it’s architectural changes like SWA that break the quadratic bottleneck. Gemini likely uses SWA variants + massive optimization. The math dictates the limits.’

    The follow-up that makes you stand out: ‘The interesting question isn’t “how do we get 1M context”, it’s “do we need it?”’

    #machinelearning #rag #meta #interview #hiring #cs #ai #datascience #jobs #llm #chatgpt
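The sliding-window idea above can be made concrete with a small sketch. This is a naive per-token loop that shows the O(n×w) restriction and verifies it matches full O(n²) attention when the window covers the whole sequence; production code would use the overlapping-chunk matmul trick the post describes rather than a Python loop:

```python
import numpy as np

def sliding_window_attention(q, k, v, w):
    """Each query attends only to keys within +/- w positions: O(n * w) work
    total instead of O(n^2). A readability-first sketch, not a fast kernel."""
    n, d = q.shape
    out = np.empty_like(v)
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)   # local window around token i
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)
        p = np.exp(scores - scores.max())           # softmax over the window only
        out[i] = (p / p.sum()) @ v[lo:hi]
    return out

def full_attention(q, k, v):
    n, d = q.shape
    s = q @ k.T / np.sqrt(d)                        # materializes the n x n matrix
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
# When w >= n the window spans everything, so the two must agree exactly:
assert np.allclose(sliding_window_attention(q, k, v, 16), full_attention(q, k, v))
```

With w fixed (say 4K) the cost grows linearly in n, which is what makes million-token contexts tractable; the deep-layer stacking then recovers long-range information flow, as the post argues.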

  • Daniel Han, Co-founder @ Unsloth AI

    Ultra long context fine-tuning is here! Unsloth can now do 89K-token contexts on an 80GB GPU for Meta Llama 3.3 (70B) in 4bit, 13x longer than HF+FA2!

    1. We worked with the Cut Cross Entropy (CCE) authors from Apple to make it work in Unsloth AI. CCE is like FA2, but for cross entropy: via on-the-fly matrix mults, you never have to materialize the logits at all, cutting VRAM usage by a lot.
    2. CCE also skips computation on small probabilities in the gradient, which boosts its performance whilst maintaining fantastic accuracy. CCE alone pushes 70B to 1.85x longer context lengths, and 8B to 3.6x longer lengths.
    3. By combining CCE with Unsloth’s gradient checkpointing algo, which smartly offloads activations to system RAM, we can multiply the effects of CCE and GC. Unsloth’s GC alone pushes 70B to 7x longer lengths and 8B to 3.3x longer lengths. In total, 12-13x longer lengths are possible.
    4. We found CCE to use a bit more VRAM than our own custom CE Loss kernel on sequences < 1024, so we now auto-dispatch between them.
    5. System RAM usage for Unsloth gradient checkpointing = Number of Layers * Hidden Size * 2 bytes (bf16) * Sequence Length * Batch Size. Llama 70B has 80 hidden layers and 8192 dim, so an 89K context needs 89K * 80 * 8192 * 2 bytes ≈ 109GB of RAM. Llama 3.1 8B has 32 layers and 4096 dim: a 128K context will need 32GB of system RAM, and 342K needs 84GB. Settings: 4bit QLoRA, rank=32, all linear layers, padded seq lens.

    * Blog post: https://lnkd.in/gyhD8in8
    * Cut Cross Entropy: https://lnkd.in/ghevkbFd
    * Unsloth repo: https://lnkd.in/gyaDBTxK
    * Llama 3.3 70B uploads: https://lnkd.in/gXQcEgpg
    * Unsloth GC blog: https://lnkd.in/gB2bn-TK
    * Finetune Llama 3.2 1/3B for free in a Google Colab: https://lnkd.in/gwHrmXjU
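The system-RAM formula from the post is easy to check directly. A small helper (the function name is mine; the formula is the one stated above) reproduces the quoted ~109GB / ~32GB / ~84GB figures:

```python
def gc_offload_ram_gib(layers, hidden, seq_len, batch=1, bytes_per=2):
    """RAM needed to offload bf16 activations for gradient checkpointing:
    layers * hidden size * 2 bytes (bf16) * sequence length * batch size."""
    return layers * hidden * bytes_per * seq_len * batch / 2**30

print(f"{gc_offload_ram_gib(80, 8192, 89_000):.1f} GiB")   # Llama 3.3 70B at 89K, about 109
print(f"{gc_offload_ram_gib(32, 4096, 128_000):.1f} GiB")  # Llama 3.1 8B at 128K, about 32
print(f"{gc_offload_ram_gib(32, 4096, 342_000):.1f} GiB")  # Llama 3.1 8B at 342K, about 84
```

Note the formula is per micro-batch of offloaded activations; doubling the batch size doubles the system-RAM requirement.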

  • Jure Leskovec, Professor at Stanford Computer Science and Co-Founder at Kumo.ai

    LLM memory is missing something fundamental. During pre-training, Llama 70B compresses the entire internet into 140GB of model weights. But just putting Steve Jobs’ Wikipedia page into the context window creates an 80GB key-value cache. If we want models that can efficiently reason over millions of tokens of context, we cannot simply dump everything into a context window. We need to continue training models at test time, using the long context as training data to compress massive amounts of information directly into the model weights.

    Incredibly excited to share work led by my student Arnuv Tandon, in partnership with NVIDIA AI, that has been over a year in the making: End-to-End Test-Time Training for Long Context. As the title suggests, we continue training language models at test time using the same next-token prediction objective as pre-training, allowing our model to scale with context length like full attention without maintaining a key and value for every token in the sequence. With linear complexity, our method is 2.7x faster than full attention at 128K tokens while achieving better performance.

    We believe test-time training is the key to unlocking a future with long-horizon agents, robots with human-like memory, and truly personal AI with your own model weights.

    Read the full paper: https://lnkd.in/g3f2BFcx
    Read the NVIDIA blog post: https://lnkd.in/gnWvS3Uk
