
Meta got caught gaming AI benchmarks
Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash "across a broad range of widely reported benchmarks."
Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In Meta's press release, the company highlighted Maverick's Elo score of 1417, which placed it above OpenAI's GPT-4o and just under Gemini 2.5 Pro. (A higher Elo score means the model wins more often when going head-to-head with competitors in the arena.)
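For context on what a score like 1417 means: Elo ratings map directly onto expected win rates. Here is a minimal sketch of the standard Elo expected-score formula in Python; the 1417 figure is from Meta's announcement, the 1380 opponent rating is hypothetical, and LMArena's exact rating variant may differ from this textbook version.

def elo_expected_score(rating_a: float, rating_b: float) -> float:
    # Standard Elo model: probability that A beats B, where a
    # 400-point gap corresponds to roughly 10-to-1 odds.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Maverick's 1417 against a hypothetical competitor rated 1380:
print(elo_expected_score(1417, 1380))  # ~0.55, i.e. wins ~55% of votes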
The achievement seemed to position Meta's open-weight Llama 4 as a serious challenger to the state-of-the-art closed models from OpenAI, Anthropic, and Google. Then, AI researchers digging through Meta's documentation discovered something unusual.
In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn't the same as what's available to the public. According to Meta's own materials, it deployed an "experimental chat version" of Maverick to LMArena that was specifically "optimized for conversationality." …
Read the full story at The Verge.
