Grok 4 benchmarks leak with 45% score on Humanity's Last Exam

Grok 4 will be SOTA, according to the leaked benchmarks: 35% on HLE, 45% with reasoning; 87-88% on GPQA; 72-75% on SWE-bench (for Grok 4 Code)

Alexey Shabanov

6 Jul 2025 · 1 min read

xAI’s upcoming Grok 4 model, which had been hinted at for release after July 4th, still hasn’t appeared despite mounting anticipation. References to Grok 4 have surfaced in the xAI console, pointing to internal versions dated June 29th and July 2nd; these are likely incremental builds rather than official release dates. Meanwhile, newly uncovered xAI documentation now mentions Grok 4 benchmarks, raising the question of whether these are genuine results or placeholders. If real, Grok 4 could set a new standard for large language models, especially given reported benchmark leaps such as a 35% score (and 45% with extra compute) on the Humanity's Last Exam (HLE) benchmark, far ahead of o3 Pro’s previous 26% top score. These numbers would suggest Grok 4 could outperform leaders like Gemini 2.5 Pro, o3 Pro, and Claude Opus 4.

Grok 4 early benchmarks in comparison to other models.

Humanity last exam diff is 🔥

Visualised by @marczierer https://t.co/DiJLwCKuvH pic.twitter.com/cUzN7gnSJX
— TestingCatalog News 🗞 (@testingcatalog) July 4, 2025

The main beneficiaries would be power users and developers already using Grok on the xAI platform, as well as organizations seeking SOTA model performance. The new features and performance improvements would likely surface first in the xAI developer console and API, potentially extending to consumer products if the rollout follows the pattern of prior launches.

The urgency for xAI is evident: Elon Musk previously suggested a post-July 4th launch, and with OpenAI’s rumoured GPT-5 and fresh releases expected soon from Google and Anthropic, xAI faces competitive pressure to land Grok 4 before the market shifts again. While the exact release timing is still up in the air, all signs point to a high probability of Grok 4 becoming available in the coming week.

The company’s strategy has centred on rapid, visible progress: frequent Grok model updates, a push for SOTA benchmark results, and quick responses to the competitive tempo set by other AI labs. If Grok 4’s benchmark claims hold up, they would reinforce xAI’s bid to be seen as a true frontier lab, but everything now hinges on the actual launch.