- GPT-5 scores a low 1.4% on Vectara's Hallucination Leaderboard
- That puts it ahead of GPT-4, which scores 1.8%, and just ahead of GPT-4o at 1.49%
- Grok-4 is much higher at 4.8%, while Gemini-2.5 Pro sits at 2.6%
When OpenAI launched GPT-5 last Thursday, one of the major selling points CEO Sam Altman emphasized was that it was the most powerful, smart, fast, reliable and robust version of ChatGPT the company has ever shipped, and in the presentation OpenAI also stressed that GPT-5 would reduce hallucinations.
When AI makes something up, it is called a hallucination, and while hallucination rates are falling across all LLMs, hallucinations are still surprisingly common and remain one of the main reasons we cannot trust AI to perform a task without human supervision.
Vectara, a RAG-as-a-service and AI agent platform that maintains the industry's leading hallucination leaderboard for foundation and reasoning models, has put OpenAI's claims to the test and found that GPT-5 does indeed rank lower for hallucinations than GPT-4, but only slightly lower than GPT-4o (just 0.09 percentage points lower, in fact).
According to Vectara, GPT-5 has a grounded hallucination rate of 1.4%, compared with 1.8% for GPT-4, 1.69% for GPT-4 Turbo and GPT-4o mini, and 1.49% for GPT-4o.
Spicy Grok
Interestingly, GPT-5's hallucination rate came in slightly higher than that of the GPT-4.5-preview model, which scored 1.2%, and much higher than OpenAI's o3-mini-high reasoning model, the best-performing OpenAI model on the leaderboard with a grounded hallucination rate of 0.795%.
The results of the Vectara tests can be seen on the Hughes Hallucination Evaluation Model (HHEM) Leaderboard, hosted on Hugging Face, which states that "for an LLM, its hallucination rate is defined as the ratio of summaries that are hallucinated to the total number of summaries it generates".
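To make that definition concrete, here is a minimal sketch of how such a rate could be computed from per-summary judgments. The boolean judgments below are hypothetical examples; Vectara's leaderboard derives them with its HHEM evaluation model rather than with this code.

```python
# Minimal sketch: a grounded hallucination rate from per-summary judgments.
# The judgments here are illustrative booleans, not Vectara's actual pipeline.

def hallucination_rate(is_hallucinated: list[bool]) -> float:
    """Ratio of hallucinated summaries to total summaries generated."""
    if not is_hallucinated:
        raise ValueError("need at least one judged summary")
    return sum(is_hallucinated) / len(is_hallucinated)

# Example: 14 flagged summaries out of 1,000 gives a 1.4% rate,
# matching the figure reported for GPT-5.
judgments = [True] * 14 + [False] * 986
print(f"{hallucination_rate(judgments):.2%}")  # -> 1.40%
```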
GPT-5, however, hallucinates much less than its competition, with Gemini-2.5 Pro coming in at 2.6% and Grok-4 much higher at 4.8%.
xAI, the maker of Grok, recently drew heavy criticism for the new "spicy" mode in Grok Imagine, an AI video generator that appeared willing to create deepfake topless videos of celebrities such as Taylor Swift even when nudity had not been requested, despite the system supposedly including filters and moderation to prevent actual nudity or sexual content.
‘I lost my best friend’
OpenAI faced an almost immediate backlash when it removed GPT-4 and all its variants, such as GPT-4o and GPT-4o mini, from Plus accounts with the introduction of GPT-5. Many users were incensed that OpenAI gave no warning that the older models were being removed, with some Reddit users saying they had "lost their only friend overnight".
It also means that, from a hallucination standpoint, GPT-5 has replaced one of the most reliable versions of ChatGPT (GPT-4.5).
Sam Altman quickly posted on X, "We definitely underestimated how much some of the things people like in GPT-4o matter to them, even though GPT-5 performs better in most ways," and promised to bring GPT-4o back for Plus users for a limited period, adding, "We will watch usage as we think about how long to offer legacy models."