- Microsoft researchers find that current LLMs aren't reliable for long-running tasks
- More interactions and less structure significantly reduce benchmark performance
- “Python is the only domain where most models are ready”
New research by a trio of Microsoft researchers has revealed a fundamental problem that could block effective agentic AI: most AI models can't reliably handle long-running workflows.
To quantify the problem, the researchers introduced a new benchmark, DELEGATE-52, which provides metrics across 52 domains, including coding, accounting, science and more.
Ultimately, the paper concluded that current LLMs “introduce sparse but serious errors that silently corrupt documents and are amplified over prolonged interaction.”
AI isn’t that good at long-running tasks yet
The study covers some of the latest AI models, including Gemini 3.1 Pro, Claude 4.6 Opus and GPT-5.4. It found that even these frontier models "corrupt an average of 25% of document content at the end of long runs," with smaller models even more likely to go wrong.
The DELEGATE-52 benchmark uses real documents of roughly 15,000 tokens and introduces 5-10 complex editing tasks in a "round-trip relay simulation" that asks the AI to perform a transformation and then reverse it. This lets the researchers measure how faithfully each model reconstructs a document back to its original form.
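DELEGATE-52 itself hasn't been published as code, but the round-trip idea is easy to picture. Below is a minimal sketch of what such a check could look like, assuming a hypothetical `call_model` helper standing in for whatever LLM API you use; the similarity scoring via `difflib` is our illustration, not necessarily the paper's exact metric.

```python
# Minimal sketch of a round-trip relay check (not the actual DELEGATE-52 code).
import difflib

def call_model(instruction: str, document: str) -> str:
    """Hypothetical stand-in for an LLM call that returns the edited document."""
    raise NotImplementedError("wire up your LLM client here")

def round_trip_score(original: str, transform: str, reverse: str) -> float:
    """Apply an edit, ask the model to undo it, and measure how closely
    the reconstruction matches the original document (1.0 = identical)."""
    edited = call_model(transform, original)
    restored = call_model(reverse, edited)
    return difflib.SequenceMatcher(None, original, restored).ratio()

# Example with a toy task pair:
# score = round_trip_score(doc,
#     "Convert every date to ISO 8601 format.",
#     "Convert every date back to 'Month D, YYYY' format.")
# corruption = 1.0 - score  # rough analogue of the paper's corruption figure
```

A perfect model would score 1.0 on every round trip; the sparse, silent errors the paper describes show up as a score that drifts downward as the relay runs longer.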
Models performed best in highly structured, programmatic domains, with the Microsoft researchers concluding that "Python is the only domain where most models are ready." Conversely, models struggled with natural-language workflows, creative areas and semi-structured documents.
The paper also reveals that the longer the document, the more likely a model is to struggle.
Where frontier models differed was not in their ability to eliminate errors, but in how long they could delay them. The other models tested by the Microsoft researchers included several GPT-5 and GPT-4 generations, Claude variants, Gemini models and one each from Mistral, xAI and Moonshot, for a total of 19 models from six families.
Gemini 3.1 Pro took first place with a DELEGATE-52 score of 80.9% after 20 interactions; Claude 4.6 Opus (73.1%) and GPT-5.4 (71.5%) rounded out the top three, while GPT-5 Nano (10.0%) came last.
In short, the paper concludes that today's AI models are not yet reliable enough to be trusted with long-running, autonomous workflows, highlighting key areas for model developers to focus on and offering another benchmark for measuring model capability.
Via The Register