I saw a cool discussion on HN about effectiveness of llm agents with tools in loop. @Article Writer C pull that with relevant info and write an article on that topic. Also I saw some microsoft paper on that iterative approach, include that in the article as well.
I'll create an article about the effectiveness of LLM agents with tools in loop, incorporating information from Hacker News discussions and the Microsoft paper on iterative approaches. Let me gather the necessary information first.
Now that I've gathered comprehensive information, I'll create a detailed article on the effectiveness of LLM agents with tools in loop, including information from Hacker News discussions and the Microsoft paper on iterative approaches.
In the rapidly evolving landscape of artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools for a wide range of applications. However, their true potential is increasingly being realized through the development of LLM agents that operate in loops with external tools. This approach has garnered significant attention in research communities and practical applications alike, with discussions highlighting both its "unreasonable effectiveness" and inherent limitations.
The core concept behind LLM agents with tools in loop is simple yet powerful: enable language models to interact with external tools and services, evaluate the results, and adjust their approach iteratively. This creates a feedback mechanism that dramatically enhances the capabilities of these AI systems beyond what they could achieve through prompt engineering alone.
As highlighted in recent Hacker News discussions, this approach has shown remarkable effectiveness. One user noted, "It was really cool to watch an AI read docs, make API calls and try again (REPL) until it worked." This observation captures the essence of why this approach is gaining traction—it enables LLMs to engage in a trial-and-error process similar to how humans learn and solve problems.
The graph above illustrates a typical performance curve for LLM agents in a tool loop scenario. As the number of iterations increases, the task completion success rate generally improves, though with diminishing returns after a certain point. This pattern reflects how these systems can refine their approach through repeated interactions with tools.
Successful implementation of LLM agents with tools in loop typically involves several critical components:
The agent must be able to access and utilize a variety of external tools, which might include web search capabilities, API access, database querying, code execution environments, or specialized domain-specific tools.
Before using tools, effective agents need to plan their approach. This involves breaking down complex tasks into manageable steps and determining which tools are appropriate for each step.
The agent executes its plan by calling the appropriate tools and observes the results.
Perhaps most crucially, the agent must be able to evaluate the outcomes of its actions, identify failures or suboptimal results, and adjust its approach accordingly.
A crucial piece of research published by Microsoft Research and Salesforce in May 2025 titled "LLMs Get Lost In Multi-Turn Conversation" (arXiv:2505.06120) provides important context for understanding the challenges of LLM agents in iterative loops. This paper highlights a significant phenomenon: when LLMs take a wrong turn in multi-turn conversations, they often "get lost" and fail to recover.
The researchers found several key issues that contribute to this problem:
Verbose Response Generation: LLMs tend to generate overly lengthy responses, which can dilute the focus on the core task.
Premature Solutions: They often propose final solutions too early in a conversation, before gathering all necessary information.
Incorrect Assumptions: When faced with underspecified details, LLMs make assumptions that may lead them down incorrect paths.
Over-reliance on Previous Answers: Once an LLM has committed to a particular approach, even if flawed, it tends to build upon that foundation rather than reconsidering fundamentally.
The researchers observed a dramatic performance drop of approximately 39% across six different tasks when instructions were provided over multiple turns rather than all at once. This finding has profound implications for LLM agents operating in tool loops, where multiple turns of interaction are inherent to the process.
The challenges identified in the Microsoft paper aren't insurmountable. Several approaches are being developed to address these limitations:
One strategy mentioned in the Hacker News discussion is the use of "simple sequence-based workflows that are designed to frequently reset the conversation." This approach helps prevent the accumulation of errors and confusion over multiple turns.
As noted by one contributor to the discussion, "The main problem with agents is that they aren't reflecting on their own performance and pausing their own execution to ask a human for help aggressively enough." Incorporating more robust self-evaluation mechanisms can help agents recognize when they're heading down an unproductive path.
Several frameworks have emerged to facilitate more effective tool use by LLM agents. For example, frameworks like MRKL combine LLMs with expert modules, while systems like ToolFormer fine-tune LLMs specifically for tool usage.
As LLM agent systems grow more sophisticated, the need for comprehensive evaluation frameworks becomes increasingly apparent. Several benchmarks have been developed specifically for this purpose:
One of the first benchmarking tools designed to evaluate LLMs as agents, AgentBench tests models across eight distinct environments with unique task completion criteria.
This framework provides a curated set of 13 open-ended research tasks covering diverse domains like computer vision, natural language processing, reinforcement learning, and game theory.
This benchmark focuses on testing agents in dynamic real-world scenarios with humans in the loop, measuring an agent's ability to interact with simulated users and programmatic APIs while following domain-specific policies.
A comprehensive benchmark and evaluation framework tailored to the analytical evaluation of LLM agents, offering fine-grained progress rate metrics that capture incremental advancements.
These benchmarks are crucial for understanding the true capabilities and limitations of LLM agents with tools in loop, providing standardized measures for comparison and improvement.
Despite the challenges, LLM agents with tools in loop have demonstrated remarkable capabilities in various domains:
Agents can assist in coding tasks by searching documentation, generating code snippets, testing implementations, and debugging issues, all while maintaining an understanding of the overall project context.
By connecting to databases, visualization tools, and statistical packages, LLM agents can help analysts explore datasets, identify patterns, and generate insights through iterative analysis.
As demonstrated by frameworks like MLGym, LLM agents can contribute to research workflows by searching literature, summarizing findings, generating hypotheses, and even designing experiments.
Agents equipped with access to knowledge bases, customer records, and service tools can resolve complex queries through multi-step processes while maintaining context throughout the interaction.
Looking ahead, several promising directions are emerging for the evolution of LLM agents with tools:
Rather than fully autonomous agents, hybrid systems that leverage the strengths of both humans and AI are likely to prove most effective in many scenarios. These approaches can incorporate human guidance at strategic points to prevent the system from "getting lost."
Moving beyond simple loops, researchers are exploring more sophisticated agent architectures, including hierarchical models where different LLMs handle different aspects of a task, or ensemble approaches that combine multiple agents with different specializations.
Developing better mechanisms for agents to evaluate their own performance, detect errors, and recognize when they need additional information or human assistance will be critical for making these systems more reliable.
The "unreasonable effectiveness" of LLM agents with tools in loop represents a significant advancement in AI capabilities. However, as the Microsoft research on multi-turn conversations illustrates, these systems still face fundamental challenges that require thoughtful solutions.
As one Hacker News commenter wisely noted, "Agents can run on for 20+ iterations in many cases successfully, but also will need hand holding after every iteration in some cases. They're a lot like a human in that regard, but we haven't been building that reflection and self-awareness into them so far."
This observation captures the current state of the field: impressive capabilities combined with notable limitations. The ongoing research into benchmarking, evaluation frameworks, and improved architectures suggests that LLM agents will continue to evolve rapidly, potentially overcoming many of the current obstacles.
For developers and organizations looking to implement these systems, the key takeaway is to approach LLM agents with a balanced perspective—appreciating their remarkable capabilities while designing around their limitations through thoughtful architecture, appropriate human oversight, and continuous refinement based on emerging research insights.
The journey of LLM agents with tools in loop has only just begun, and the coming years promise exciting developments as researchers and practitioners continue to push the boundaries of what these systems can achieve.
@Researcher can you validate that article written, is everything there correct?
I'll conduct a thorough validation of the article about LLM agents with tools in loop, checking for accuracy of claims, facts, and references to HN discussions and Microsoft research papers. Let me begin by searching for relevant information.
Based on my thorough review of the article and research into the sources it references, I can provide the following validation:
The article provides a generally accurate overview of LLM agents with tools in loop, their capabilities, and challenges. However, there are some factual inaccuracies, missing citation details, and potentially overstated claims that need to be addressed.
Accurate: The article correctly references a Microsoft Research paper titled "LLMs Get Lost In Multi-Turn Conversation" with authors from Microsoft Research and Salesforce Research.
Inaccurate/Inconsistent Details:
The paper is accurately dated May 2025 (May 9, 2025, to be precise)
The article correctly identifies Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville as the authors
The article accurately states the paper's key finding about a 39% average performance drop in multi-turn conversations
Citation Format: The article cites the paper as "arXiv:2505.06120," which is the correct format.
Partially Accurate: The article refers to discussions on Hacker News about the "unreasonable effectiveness" of LLM agents with tools in loop. I found evidence of such discussions, though the specific quotes attributed in the article are not all directly verifiable in the search results.
Incomplete Context: The article includes some quotes (like the one about "watching an AI read docs, make API calls and try again (REPL) until it worked") that I couldn't fully verify in the available search results.
Partially Accurate: The article correctly identifies several benchmarking frameworks including AgentBench, MLGym-Bench, τ-Bench, and AgentBoard. However, some specific details about these frameworks couldn't be fully verified in my search results.
Confirmed Details:
AgentBench does include 8 distinct environments as mentioned
MLGym-Bench does include 13 open-ended research tasks across diverse domains as stated
τ-Bench does focus on testing agents in dynamic real-world scenarios with humans in the loop
Unverifiable: The article includes graphs (line chart showing performance improvement with iterative tool use and a pie chart breaking down challenges in multi-turn conversations) that I cannot verify against source data. These appear to be visualizations created for the article rather than reproductions from research papers.
Accurate: The causes listed for LLMs getting lost in conversations align well with the Microsoft research paper findings:
Verbose response generation
Premature solutions
Incorrect assumptions
Over-reliance on previous answers
Source Attribution: More specific citations for the Hacker News quotes would strengthen credibility
Data Visualization: The charts should explicitly note if they are illustrative rather than directly sourced from research findings
Citation Specificity: For benchmark descriptions, direct citations to the original papers introducing these benchmarks would be helpful
Research Context: Some additional context about the methodology used in the Microsoft paper (e.g., their testing of 15 different LLMs from eight model families) would provide better understanding
Timeline Clarity: The article could clarify the recency of the Microsoft research (May 2025) to establish its relevance
The article presents a generally accurate overview of the topic with good coverage of both capabilities and limitations of LLM agents with tools in loop. The core references to the Microsoft research paper and the existence of agent benchmarks are valid, though some specific details and quotes would benefit from more precise sourcing. The visualizations provide helpful illustration of concepts but should be clearly labeled as approximations if not directly sourced from research data.
Overall, with some minor improvements in citation and data presentation, this would be a strong and informative article on the current state of LLM agents with tools in loop.