When Meta released Llama 4, the long-awaited next generation of its open-source model, debates emerged on social media about whether this marks the end of the retrieval-augmented generation (RAG) era, given the model’s 10-million-token context window. The large context window allows the model to process significantly larger amounts of information in a single query, raising several questions about the necessity of RAG.
LLAMA4 has 10M context window 😳
Why even RAG anymore? pic.twitter.com/fcPGV3hgKj
— Peter Yang (@petergyang) April 5, 2025
Shorter-context models often rely on external retrieval to access data. However, Llama 4’s larger context enables it to handle more information internally, thereby reducing the need for external sources when reasoning over or processing static data. But is this sufficient to signal the end of RAG?
Leave RAG Alone, Please
Several developers and industry experts rallied to defend RAG, which has faced many challenges. On cost, pushing 10 million tokens into a context window will not be cheap: it will exceed a dollar per query and take ‘tens of seconds’ to generate a response, as noted by Marco D’Alia, a software architect, on X.
People are saying the 10 million context size of @meta Llama 4 means RAG is dead.
I have two questions for you:
1) Do you want to spend $1+ for each message?
2) Do you want to wait a VERY long time on every message to process all those tokens?
— Tristan Rhodes (@tristanbob) April 5, 2025
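The cost argument is easy to check with back-of-the-envelope arithmetic. The sketch below is illustrative only: the price of $0.15 per million input tokens is a hypothetical placeholder rather than a quoted rate from any provider, and the 4,000-token size of a RAG prompt is likewise an assumption.

```python
# Rough per-query input cost. The price below is a hypothetical placeholder,
# not a real provider quote.
PRICE_PER_MILLION_TOKENS_USD = 0.15


def input_cost_usd(prompt_tokens: int,
                   price_per_million: float = PRICE_PER_MILLION_TOKENS_USD) -> float:
    """Return the input-token cost of a single query in US dollars."""
    return prompt_tokens / 1_000_000 * price_per_million


full_context = input_cost_usd(10_000_000)  # fill the entire 10M-token window
rag_prompt = input_cost_usd(4_000)         # assumed size of a retrieved-chunks prompt

print(f"full 10M-token context: ${full_context:.2f} per query")
print(f"4,000-token RAG prompt: ${rag_prompt:.4f} per query")
print(f"ratio: {full_context / rag_prompt:,.0f}x")
```

Under these assumed numbers, filling the window on every message costs more than a dollar, while a retrieval-based prompt costs a small fraction of a cent. The exact figures will shift with real pricing, but the gap of several orders of magnitude comes from the token counts alone.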
Moreover, many emphasised that longer context windows were never meant to replace RAG, whose main job is to add only the relevant chunks of information to the input.
“RAG isn’t about solving for a finite context window, it’s about filtering for signal from a noisy dataset. No matter how big and powerful your context window gets, removing junk data from the input will always improve performance,” said Jamie Voynow, a machine learning engineer, on X.
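To make the idea of filtering for signal concrete, here is a minimal sketch of the retrieval step. It is a toy: it scores chunks with bag-of-words cosine similarity from the standard library, whereas production RAG systems typically use learned embeddings and a vector index. The function names and sample chunks are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return up to k chunks most similar to the query, dropping zero-score ones."""
    q = tokenize(query)
    scored = [(cosine(q, tokenize(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]


# Hypothetical document chunks: two relevant, two pure noise.
chunks = [
    "Llama 4 Scout supports a context window of 10 million tokens.",
    "The cafeteria menu changes every Tuesday.",
    "Groq serves Llama 4 with a context limit of roughly 130,000 tokens.",
    "Parking permits must be renewed each year.",
]
top = retrieve("What context limit does Groq offer for Llama 4?", chunks)
prompt = "Answer using only this context:\n" + "\n".join(top)
print(prompt)
```

Only the chunks that share vocabulary with the question reach the prompt; the unrelated ones are dropped before the model sees them, which is the filtering Voynow describes.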
Gokul JS, a founding engineer at Aerotime, summarised the entire debate with a simple analogy: “Imagine handing someone a dense page of text, taking it away, then asking questions. They’ll remember bits, not everything,” he said in a post on X. He added that LLMs are no different in such situations and that the ability to handle more context does not guarantee an accurate response.
Furthermore, a 10-million-token context window is large, but it may not cover every use case. Granted, RAG use cases have certainly diminished with time, given how easily most AI models now retrieve information from a few PDFs, but several practical use cases will need to go beyond that.
“Most enterprises have terabytes of documents. No context window can encompass a pharmaceutical company’s 50K+ research papers and decades of regulatory submissions,” said Skylar Payne, a former ML systems engineer at Google and LinkedIn.
It might make sense if we’re talking about how gpt-3.5 used to have 4k context and we needed RAG for an arxiv paper but we don’t have to now.
Back to this time: Even with 10M context, we’ll probably still RAG for arxiv papers from 2025 alone, and I’m not sure loading 10M worth…
— Eugene Yan (@eugeneyan) April 6, 2025
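The scale mismatch can be estimated with rough arithmetic. The sketch below assumes about 4 bytes of English text per token, a common rule of thumb that varies by tokenizer, and takes 1 TB of plain text as an illustrative corpus size; real enterprise archives contain PDFs, images and other formats, so this is only an order-of-magnitude estimate.

```python
# Back-of-the-envelope: how many 10M-token windows does 1 TB of text fill?
BYTES_PER_TOKEN = 4                 # assumption: rough rule of thumb for English text
CONTEXT_WINDOW_TOKENS = 10_000_000  # Llama 4's advertised context window
CORPUS_BYTES = 10**12               # 1 TB of plain text, an illustrative figure

corpus_tokens = CORPUS_BYTES // BYTES_PER_TOKEN
windows_needed = -(-corpus_tokens // CONTEXT_WINDOW_TOKENS)  # ceiling division

print(f"corpus size: {corpus_tokens:,} tokens")
print(f"10M-token windows needed: {windows_needed:,}")
```

Even under these generous assumptions, a single terabyte of text is tens of thousands of times larger than the window, so some selection step has to decide what goes in.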
Moreover, AI models have knowledge cutoffs. This means they cannot answer queries that depend on the latest real-time information unless it is retrieved dynamically, which requires RAG.
In addition, anyone planning to run Llama 4 on inference providers like Groq or Together AI will find that these services offer a context limit significantly lower than 10 million tokens. Groq offers approximately 130,000 tokens for both the Llama 4 Scout and Maverick. Together AI provides about 300,000 tokens for the Llama 4 Scout and approximately 520,000 tokens for the Llama 4 Maverick.
LLMs Perform Poorly Beyond 32,000 Tokens
Furthermore, a study found that LLMs exhibited a decline in performance once the context reached 32,000 tokens. Although it did not include the Llama 4 model, the study indicated that at 32k tokens, 10 out of 12 tested AI models performed below half their short-context baseline. Even OpenAI’s GPT-4o, one of the top performers, dropped from a baseline score of 99.3% to 69.7%.
“Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information,” read the study.
The study also noted that conflicting information within the context can confuse the AI model, making it necessary to apply a filtering step to remove irrelevant or misleading content. “That’s usually not a problem with RAG, but if we indiscriminately put everything in the context, we’ll also need a filtering step,” said D’Alia, who cited the above study to back his arguments.
All things considered, Meta’s Llama 4 is indeed a big step forward in open-source AI.
Artificial Analysis, a platform that evaluates AI models, said that the Llama 4 Maverick beats Claude 3.7 Sonnet but trails DeepSeek-V3 while being more efficient. Meanwhile, the Llama 4 Scout offers performance parity with GPT-4o mini.
On the MMLU-Pro benchmark, which evaluates LLMs on reasoning-focused questions, the Llama 4 Maverick scored 80%, matching Claude 3.7 Sonnet (80%) and narrowly ahead of OpenAI’s o3-mini (79%).
On the GPQA Diamond benchmark, which tests AI models on graduate-level science questions, the Llama 4 Maverick scored 60%, level with Gemini 2.0 Flash (60%) and below DeepSeek V3 (66%).
