- AI promises a huge revolution for developers, but is it only good for writing code?
- Popular AI models from Anthropic and OpenAI are not good at debugging
- Microsoft’s researchers are open-sourcing their tools to facilitate further research
Although generative AI is increasingly integrated into programming workflows, new research from Microsoft reveals that large language models are still not entirely up to scratch when it comes to debugging.
The research suggests that even advanced models still struggle with debugging tasks that are fairly simple for experienced developers, highlighting the continued importance of human programmers.
AI does seem to have a solid use case, though, with Google now claiming that about 25% of its new code is AI-generated. Meta has also noted the broad adoption of AI for coding.
AI is good at creating code but not at debugging it
The report explores how 11 Microsoft scientists tested nine AI models on SWE-Bench Lite, a popular debugging benchmark. Claude 3.7 Sonnet achieved the highest success rate, a far-from-perfect 48.4%. OpenAI's o1 and o3-mini had lower success rates of 30.2% and 22.1% respectively.
“Even with debugging tools, our simple prompt-based agent rarely resolves more than half of SWE-Bench Lite problems,” the researchers wrote, attributing the suboptimal performance to a lack of data representing sequential decision-making behavior.
However, all hope is not lost. “We think training or fine-tuning LLMs can improve their interactive debugging skills,” they added. The researchers intend to fine-tune an info-seeking model that specializes in collecting the information needed to resolve bugs, and in the meantime they promise to open-source debug-gym to make it easier for others to conduct similar research.
Debug-gym is described as an “environment that allows code-repair agents to access tools for active information-seeking behavior.”
At the moment, artificial intelligence may not bring as much value to developers’ lives as AI companies suggest. “Most developers spend most of their time debugging code,” the researchers wrote, indicating that even if developers benefit from code generation, it may not save them that much time.