- The report finds that AI coding assistants fail one in four structured output tasks
- Even advanced proprietary models only reach approximately 75% accuracy
- Open source AI models fare worse, averaging closer to 65% reliability
The promise of artificial intelligence as a tireless coding assistant has hit a significant roadblock: new research shows these tools fail far more often than their marketing suggests.
A recent study from the University of Waterloo found that artificial intelligence struggles with software development, with even the most advanced models failing one in four structured output tasks.
The research evaluated 11 large language models across 18 different structured formats and 44 tasks to test how well the systems could follow predefined rules, and found a clear difference between performance on text-based tasks and output involving multimedia or complex structures.
Benchmarking reveals a worrying reliability gap
While text-related tasks were generally handled with moderate success, tasks requiring image, video or website generation proved far more problematic.
Accuracy in these areas dropped sharply, raising questions about how these AI tools can be safely integrated into professional workflows.
“With this kind of study, we want to measure not only the syntax of the code — that is, whether it follows the set rules — but also whether the outputs produced for different tasks were accurate,” said Dongfu Jiang, a doctoral student and co-first author of the study.
Structured output, which AI companies including OpenAI, Google and Anthropic introduced to force responses into predictable formats such as JSON, XML or Markdown, was intended to make AI responses more reliable for developers.
The Waterloo research suggests that this approach has yet to deliver the level of reliability that developers demand.
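In practice, structured output means constraining a model to a schema and then checking that the reply actually conforms. The sketch below is illustrative only: the `response` string is a made-up stand-in for a model reply, not output from any model in the study, and the field names are hypothetical. It shows the kind of validation the researchers suggest developers still cannot skip.

```python
import json

# Hypothetical schema a developer might ask a model to follow.
REQUIRED_FIELDS = {"name": str, "version": str, "dependencies": list}

def validate_response(raw: str) -> list[str]:
    """Return a list of problems found in a supposedly structured reply."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems

# A made-up reply: valid JSON, but a required field is missing --
# the kind of error the benchmark counts as a failed task.
response = '{"name": "my-app", "version": "1.0"}'
print(validate_response(response))  # ['missing field: dependencies']
```

Even when the syntax is valid JSON, the content can still break the agreed schema, which is why syntax-only checks understate the failure rate the study measured.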
Waterloo’s benchmarking revealed that even the most advanced proprietary models only reached around 75% accuracy, while open source alternatives performed closer to 65%.
These results suggest that despite improvements, AI systems still make significant errors that cannot be ignored in professional development environments.
The report emphasized the need for human oversight, noting, “Developers can have these agents working for them, but they still need significant human oversight.”
Although structured output is a step forward from free-form natural language responses, errors are still common.
The technology is not yet robust enough to function independently in complex development scenarios.
One can reasonably question whether the industry’s enthusiasm for AI and vibe coding assistants has outstripped the actual capabilities of the underlying technology.
Even the most advanced models show a significant failure rate on structured tasks, revealing a wide gap between marketing claims and actual performance.
Therefore, for now, developers should treat these tools as experimental aids rather than standalone replacements.