AIs like ChatGPT fall apart in classic ‘Stroop’ psychological test – and this could stand in the way of achieving artificial general intelligence

A new study tasked AIs with tackling the ‘Stroop’ test
GPT and Claude performed very poorly compared to humans
There are nuances here, but in general, researchers argue that improving this side of AIs is critical to achieving artificial general intelligence

A recently published study has highlighted a limitation of big-name AI models such as ChatGPT. It has also caused some controversy, because the primary piece of research uses now-outdated versions of these models, but there are nuances to that, and it doesn’t make the findings irrelevant.

I’ll get into that, but first let’s look at the study itself, which was highlighted on Reddit (‘New study reveals top AI models completely fail the classic ‘Stroop’ psychological attention test’) and published via Oxford University Press in the journal PNAS Nexus.

The research consists of testing the so-called ‘Stroop effect’ with GPT-4o and Claude 3.5 Sonnet. As mentioned, these are not the cutting-edge versions of these AIs (large language models, or LLMs) – but they were at the time the initial study was conducted.

Analysis: another necessary step on the road to AGI?

If you’ve scanned through the Reddit thread, you’ve undoubtedly noticed that, as mentioned at the beginning, commenters have fired a lot of flak at this study for its use of outdated versions of GPT and Claude.

In fact, these older LLMs are called “state of the art” at one point by the authors – and of course, as already mentioned, they were cutting edge when the main study was conducted. Still, this is unfortunate phrasing that should have been updated and adjusted now that the paper has just been published (after peer review and so on).

However, the researchers did perform tests on GPT-5, Claude Opus 4.1 and Gemini 2.5 Pro in September 2025, although this is somewhat buried in the paper. The more recent testing found that these models offered only “minor” improvements over their predecessors, and that they still exhibited “ongoing managerial attention deficits, consistent with our extensive analysis of previous transformer models” (as did Gemini 2.5 Pro, which was a new addition here).

Admittedly, a smaller sample size was used, but the researchers still claim that their study overall reflects a fundamental limitation which is “inherent to the architectural limitations of transformer-based LLMs”.

The authors note one caveat: in ‘Thinking’ mode, GPT-5 can write and then run code to ensure it performs the Stroop test flawlessly – and similar functionality can be used by other LLMs – but this is essentially the AI working (cleverly) around its inadequacies. Of course, that doesn’t change the way it works or reasons more broadly.

The researchers note that transformer architecture innovations for LLMs are focused on improving memory capacities, which do not address “the core limitations of attentional mechanisms, specifically the need for sophisticated alerting, orienting, and executive control networks to enable cognitive flexibility.”

The ultimate goal is effective goal-directed behavior, and the study notes, “Future [LLM] development may benefit from implementing more sophisticated executive control systems that can handle decisional conflicts through structured, goal-directed processing rather than relying solely on enhanced memory capacities.”

The authors argue that “incorporating executive control mechanisms similar to those in biological attention is critical to achieving artificial general intelligence [AGI].”
