‘Current LLMs introduce significant errors when editing working documents’: Microsoft researchers find most AI models struggle with long-running tasks – so maybe don’t fully trust them yet


  • Microsoft researchers find that current LLMs are unreliable on long-running tasks
  • More interactions and less structure significantly reduce benchmark performance
  • “Python is the only domain where most models are ready”

New research by a trio of Microsoft employees has revealed a fundamental problem that could block effective agentic AI: most AI models cannot reliably handle long-running workflows.

To quantify their results, the researchers introduced a new benchmark, DELEGATE-52, which provides metrics across 52 sectors, including coding, accounting and science.
