- The accuracy achieved by the top-scoring AI on the world's toughest benchmark has improved by 183% in just two weeks
- ChatGPT o3-mini now scores up to 13% accuracy, depending on the reasoning setting
- OpenAI Deep Research leads the competition with 26.6% accuracy
The world’s toughest AI exam, Humanity’s Last Exam, launched less than two weeks ago, and we have already seen a big leap in accuracy, with ChatGPT o3-mini and now OpenAI’s Deep Research at the top of the leaderboard.
The AI benchmark, created by experts from around the world, contains some of the most difficult reasoning problems and questions around – it is so difficult that when I previously wrote about Humanity’s Last Exam in the article linked above, I could not even understand one of the questions, much less answer it.
At the time of writing that last article, viral phenomenon DeepSeek R1 sat at the top of the leaderboard with a 9.4% accuracy score when evaluated on text only (not multimodal). Now OpenAI’s o3-mini, launched earlier this week, scores 10.5% accuracy in the standard o3-mini setting and 13% accuracy in o3-mini-high mode, which is more intelligent but takes longer to generate answers.
More impressive, however, is the score of OpenAI’s new AI agent, Deep Research, on the benchmark: the new tool scored 26.6%, a huge 183% increase in top-score accuracy in less than 10 days. It is worth noting that Deep Research has web search capabilities that the other AI models lack, which makes comparisons a little unfair. The ability to search the web is useful for a test such as Humanity’s Last Exam, as it includes some general-knowledge-based questions.
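For the curious, that 183% figure is simply the relative jump from DeepSeek R1’s 9.4% text-only score to Deep Research’s 26.6% – a quick sanity check, sketched here in Python:

```python
def relative_increase(old: float, new: float) -> float:
    """Percentage increase from an old value to a new value."""
    return (new - old) / old * 100

# From DeepSeek R1's 9.4% to Deep Research's 26.6%:
print(round(relative_increase(9.4, 26.6)))  # → 183
```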
That said, the accuracy of models taking Humanity’s Last Exam is improving steadily, and it makes you wonder how long we will have to wait to see an AI model come close to saturating the benchmark. Realistically, AI should not be able to get close any time soon, but I would not bet against it.
It looks like the latest OpenAI model is doing very well across many topics. My guess is that search especially helps with subjects including medicine, classics and law. pic.twitter.com/x8ilmq1aqs – February 3, 2025
Better, but 26.6% would never have earned me a passing grade
OpenAI Deep Research is an incredibly impressive tool, and I have been blown away by the examples OpenAI showed off when it announced the AI agent. Deep Research can work as your personal analyst, taking the time to conduct intensive research and produce reports and answers that would otherwise take people hours to complete.
While a score of 26.6% on Humanity’s Last Exam is seriously impressive, especially considering how far the benchmark’s leaderboard has come in just a few weeks, it is still a low score in absolute terms – no one would claim to have passed a test with a score of less than 50% in the real world.
Humanity’s Last Exam is an excellent benchmark, and one that will prove invaluable as AI models develop, enabling us to measure how far they have come. How long will we have to wait to see an AI surpass the 50% mark? And which model will be the first to do it?