Meta’s surprise weekend launch of its new Llama 4 AI models has quickly become a case study in the growing tensions between AI marketing claims and real-world performance.
The company released two new models—Scout and Maverick—on Saturday, positioning them as serious challengers to industry leaders like OpenAI’s GPT-4o and Google’s Gemini models.
Shortly after release, Maverick secured the second-place position on LMArena, a respected benchmark site where humans compare outputs from different AI systems. Meta proudly highlighted Maverick’s impressive Elo score of 1417, placing it above OpenAI’s GPT-4o and just below Google’s Gemini 2.5 Pro.
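For context, arena-style leaderboards score models in the spirit of chess Elo ratings: each human vote between two anonymized responses nudges the winner’s rating up and the loser’s down. The sketch below shows the textbook Elo update as rough intuition for what a figure like 1417 means; it is not LMArena’s exact methodology, which has changed over time.

```python
# Minimal sketch of the classic Elo update that arena-style leaderboards are
# loosely based on. Treat this as intuition only; LMArena's published scoring
# has evolved (for example, toward Bradley-Terry-style fitting).

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated ratings after one human vote between two models."""
    e_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Example: one vote in which a 1417-rated model beats a 1400-rated one.
print(elo_update(1417, 1400, a_won=True))
```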
However, this achievement quickly unraveled when AI researchers discovered fine print in Meta’s documentation revealing that the version tested on LMArena wasn’t the same as what’s available to the public. Meta had deployed an “experimental chat version” of Maverick specifically “optimized for conversationality” for benchmark testing.
“Meta’s interpretation of our policy did not match what we expect from model providers,” LMArena posted on X two days after the model’s release. The site has since updated its leaderboard policies to prevent similar situations in the future.
While not explicitly against LMArena’s rules, this approach undermines the value of benchmark rankings as indicators of real-world performance. As independent AI researcher Simon Willison told The Verge, “The model score that we got there is completely worthless to me. I can’t even use the model that they got a high score on.”
Llama 4’s Technical Architecture and Meta’s Claims
Meta describes the new Llama 4 models as “natively multimodal,” built to handle both text and images using an “early fusion” technique. Both models use a mixture-of-experts (MoE) architecture as follows:
Maverick: 400 billion total parameters, with only 17 billion active per token, routed across 128 experts
Scout: 109 billion total parameters, with only 17 billion active per token, routed across 16 experts
This architecture lets the models run on fewer computational resources, because only a fraction of the network’s parameters is used for any given token.
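A toy sketch can make that idea concrete. The Python below implements a deliberately simplified mixture-of-experts layer with made-up dimensions; it is not Meta’s implementation (real MoE models typically add details such as shared experts and softmax-weighted routing), but it shows why each token exercises only a fraction of the total parameter count.

```python
# Toy mixture-of-experts (MoE) routing sketch, for illustration only.
# Each "expert" is a small weight matrix, and a router picks one expert per
# token, so only that expert's parameters do work for that token.
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64        # toy hidden size (assumed for illustration)
N_EXPERTS = 16      # Scout-style expert count; Maverick uses 128
TOP_K = 1           # route each token to a single expert

# Total parameters grow with the number of experts...
experts = [rng.standard_normal((D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)
           for _ in range(N_EXPERTS)]
router = rng.standard_normal((D_MODEL, N_EXPERTS)) / np.sqrt(D_MODEL)

def moe_layer(tokens: np.ndarray) -> np.ndarray:
    """Apply the toy MoE layer: each token only touches its top-k experts."""
    logits = tokens @ router                        # (n_tokens, N_EXPERTS)
    chosen = np.argsort(-logits, axis=1)[:, :TOP_K]
    out = np.zeros_like(tokens)
    for i, token in enumerate(tokens):
        for e in chosen[i]:
            out[i] += token @ experts[e]            # ...but only one expert runs per token
    return out

tokens = rng.standard_normal((4, D_MODEL))
print(moe_layer(tokens).shape)  # (4, 64): full-size output, 1/16 of expert weights used per token
```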
Meta made particularly bold claims about Scout’s 10-million-token context window—a feature that would theoretically allow the model to process huge documents and maintain longer conversations. However, developers quickly found that using even a fraction of this capacity proved challenging due to memory limitations.
According to Willison’s testing, third-party services providing access to Scout limited its context to between 128,000 and 328,000 tokens. Meta’s own example notebook revealed that running a 1.4 million token context requires eight high-end Nvidia H100 GPUs—hardware that costs hundreds of thousands of dollars.
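Much of that memory goes to the key/value (KV) cache that transformer attention keeps for every token in the context. The estimate below uses assumed, illustrative hyperparameters rather than Scout’s published configuration, but it shows how the cache grows linearly with context length and quickly outstrips a single 80 GB H100.

```python
# Back-of-envelope estimate of KV-cache memory for a long-context transformer.
# The hyperparameters are illustrative assumptions, not Scout's actual config;
# the point is only that the cache scales linearly with context length.

def kv_cache_bytes(context_tokens: int,
                   n_layers: int = 48,       # assumed
                   n_kv_heads: int = 8,      # assumed (grouped-query attention)
                   head_dim: int = 128,      # assumed
                   bytes_per_value: int = 2  # bf16
                   ) -> int:
    # Two tensors (keys and values) per layer, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens

for ctx in (128_000, 328_000, 1_400_000, 10_000_000):
    gb = kv_cache_bytes(ctx) / 1e9
    print(f"{ctx:>10,} tokens -> ~{gb:,.0f} GB of KV cache")

# Under these assumptions, a 10-million-token cache alone would need roughly
# 2 TB of memory, far beyond a single 80 GB H100, before counting the model
# weights themselves.
```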
The AI Community’s Response
The AI community’s response to Llama 4 has been lukewarm at best. Developers have reported underwhelming performance, especially for coding tasks and software development. Some users noted that Llama 4 compares unfavorably to innovative competitors like DeepSeek.
When tested with a lengthy document of around 20,000 tokens, Scout produced what Willison described as “complete junk output,” which devolved into repetitive loops, raising questions about the practical usefulness of its massive context window.
Meta has also continued to market Llama 4 as “open source” despite license terms that fall short of commonly accepted open-source definitions. In practice, users must sign in and accept Meta’s license before downloading the models.
Furthermore, the weekend release timing caused a stir in the AI community. When questioned about this unusual schedule on Threads, Meta CEO Mark Zuckerberg simply replied, “That’s when it was ready.”
According to a report from The Information, Meta repeatedly delayed Llama 4’s launch because the model failed to meet internal expectations. Those expectations had risen sharply after Chinese AI startup DeepSeek released a well-received open-weight model.
Llama 4’s Implications for AI Development
Some researchers suggest that the underwhelming performance of Llama 4 points to larger issues in AI development approaches.
On X, researcher Andriy Burkov argued that recent disappointing releases from both Meta and OpenAI “have shown that if you don’t train a model to reason with reinforcement learning, increasing its size no longer provides benefits.”
This observation aligns with growing discussion of the limits of simply scaling up traditional model architectures, as opposed to incorporating newer techniques such as simulated reasoning or building smaller, purpose-built models.
Despite current drawbacks, there remains optimism about future iterations in the Llama 4 family. Willison expressed hope for “a whole family of Llama 4 models at varying sizes,” particularly an improved smaller model that could run effectively on mobile phones.
If nothing else, the Llama 4 release is a reminder that benchmark scores and marketing claims deserve healthy skepticism until they are verified through independent, real-world testing.