
Scaling AI Agent Systems: When More Agents Mean More Trouble


January 28, 2026

By Yubin Kim, Research Intern, and Xin Liu, Senior Research Scientist, Google Research

In the rapidly evolving landscape of artificial intelligence, AI agents—systems capable of reasoning, planning, and acting—are emerging as a powerful paradigm for real-world applications. From coding assistants to personal health coaches, the industry is shifting from single-shot question answering to sustained, multi-step interactions. This shift introduces a new layer of complexity: unlike isolated predictions, a single error early in an agent's workflow can cascade through every subsequent step. It compels us to look beyond standard accuracy and ask: how do we actually design these systems for optimal performance?

The Heuristic of “More Agents Is Better”

Practitioners often rely on heuristics, such as the assumption that “more agents are better,” believing that adding specialized agents will consistently improve results. For example, a recent study titled “More Agents Is All You Need” reported that Large Language Model (LLM) performance scales with agent count. Similarly, collaborative scaling research found that multi-agent collaboration “…often surpasses each individual through collective reasoning.” However, our new paper, “Towards a Science of Scaling Agent Systems,” challenges this assumption.

Through a large-scale controlled evaluation of 180 agent configurations, we derive the first quantitative scaling principles for AI agent systems. Our findings reveal that the “more agents” approach often hits a ceiling and can even degrade performance if not aligned with the specific properties of the task.

Defining “Agentic” Evaluation

To understand how agents scale, we first defined what makes a task “agentic.” Traditional static benchmarks measure a model’s knowledge but don’t capture the complexities of deployment. We argue that agentic tasks require three specific properties:

Sustained Multi-Step Interactions

Agents must engage in sustained, multi-step interactions with an external environment. For instance, a financial planning agent must navigate multiple steps, from data collection to investment recommendation, to provide a comprehensive plan.

Iterative Information Gathering Under Partial Observability

Agents often operate under partial observability, meaning they don’t have complete information about the environment. They must iteratively gather information to make decisions. For example, a web navigation agent must click through pages to find the required information.

Adaptive Strategy Refinement Based on Environmental Feedback

Agents must refine their strategies based on feedback from the environment. For example, a tool-use agent must adjust its plan based on the success or failure of its actions.
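The three properties above amount to a single control loop: observe, act, and adapt until the task is done. The sketch below makes that loop concrete with a toy, partially observable environment; `GridEnv`, its methods, and the trivial policy are all invented for illustration and are not from the paper.

```python
# Minimal sketch of an agentic control loop over a toy environment.
# GridEnv and the hard-coded policy are illustrative stand-ins.

class GridEnv:
    """Toy partially observable environment: the goal position is hidden."""
    def __init__(self, goal=3):
        self.goal, self.pos = goal, 0

    def observe(self):
        # Partial observability: reveal only a direction hint, never the goal.
        return "right" if self.pos < self.goal else "done"

    def step(self, action):
        if action == "move_right":
            self.pos += 1

def run_agent(env, max_steps=10):
    history = []                      # sustained multi-step interaction
    for _ in range(max_steps):
        obs = env.observe()           # iterative information gathering
        if obs == "done":             # strategy adapts to environment feedback
            break
        action = "move_right"
        env.step(action)
        history.append((obs, action))
    return history

trace = run_agent(GridEnv())
print(len(trace))                     # number of interaction steps taken
```

Even this toy loop shows why single-step accuracy is an incomplete metric: a wrong action at step one changes every observation that follows.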

Evaluating Canonical Architectures

We evaluated five canonical architectures across four diverse benchmarks:

Single-Agent System (SAS)

A solitary agent executing all reasoning and acting steps sequentially with a unified memory stream. This architecture is simple but may struggle with complex tasks due to its lack of parallelization.

Independent Agents

Multiple agents working in parallel on sub-tasks without communicating, aggregating results only at the end. This architecture offers maximal parallelization but may miss out on potential synergies between agents.
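The independent-agents pattern can be sketched in a few lines: workers run concurrently on sub-tasks without exchanging messages, and outputs are merged only at the end. The `agent` function below is a placeholder for a real model call; the names and the join-based aggregation are assumptions for illustration.

```python
# Sketch of the "independent agents" pattern: parallel workers,
# no inter-agent communication, aggregation only at the end.

from concurrent.futures import ThreadPoolExecutor

def agent(sub_task):
    # Placeholder for a real LLM call; each agent "solves" its
    # sub-task in isolation.
    return f"answer:{sub_task}"

def independent_agents(sub_tasks):
    with ThreadPoolExecutor(max_workers=len(sub_tasks)) as pool:
        results = list(pool.map(agent, sub_tasks))  # no messages between agents
    return " | ".join(results)                      # single aggregation step

print(independent_agents(["search", "summarize", "verify"]))
```

Because the workers never talk to each other, this maximizes parallelism but, as noted above, forfeits any synergy an agent might have gained from a peer's intermediate findings.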

Centralized Agents

A “hub-and-spoke” model where a central orchestrator delegates tasks to workers and synthesizes their outputs. This architecture provides a balance between parallelization and coordination but can become a bottleneck if the orchestrator is overwhelmed.
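A hub-and-spoke system can be sketched as three roles: the orchestrator decomposes the task, workers execute sub-tasks, and the orchestrator synthesizes the results. Here `decompose`, `worker`, and `synthesize` are hypothetical stand-ins for model calls, invented for this sketch.

```python
# Sketch of a centralized (hub-and-spoke) agent system.
# decompose/worker/synthesize are stand-ins for real model calls.

def decompose(task):
    return [f"{task}:part{i}" for i in range(3)]

def worker(sub_task):
    return sub_task.upper()

def synthesize(outputs):
    # The orchestrator is the single point of synthesis -- and the
    # potential bottleneck described in the text.
    return "; ".join(outputs)

def centralized(task):
    sub_tasks = decompose(task)                # orchestrator plans
    outputs = [worker(s) for s in sub_tasks]   # workers execute in parallel
    return synthesize(outputs)                 # orchestrator merges

print(centralized("plan"))
```

Every result funnels back through `synthesize`, which is exactly where the architecture bottlenecks once the number of workers or the size of their outputs outgrows what the orchestrator can process.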

Decentralized Agents

A peer-to-peer mesh where agents communicate directly with one another to share information and reach consensus. This architecture is highly flexible but can suffer from communication overhead and consensus delays.
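To make the consensus cost tangible, the toy sketch below models a full mesh where each agent holds a numeric estimate, broadcasts it to every peer each round, and averages. Real systems exchange claims and rationales rather than numbers; the averaging rule here is purely an illustrative assumption.

```python
# Toy sketch of decentralized (peer-to-peer) coordination: a full mesh
# converging to consensus by averaging. Purely illustrative.

def consensus_round(estimates):
    # Every agent receives every peer's estimate and averages:
    # one round costs O(n^2) messages in a full mesh.
    mean = sum(estimates) / len(estimates)
    return [mean for _ in estimates]

estimates = [1.0, 5.0, 9.0]
for _ in range(2):                 # each extra round adds latency and cost
    estimates = consensus_round(estimates)
print(estimates)                   # all agents agree after the exchange
```

The quadratic message count per round is the "communication overhead" named above, and the number of rounds needed to agree is the "consensus delay."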

Hybrid Agents

A combination of hierarchical oversight and peer-to-peer coordination to balance central control with flexible execution. This architecture aims to leverage the strengths of both centralized and decentralized approaches.

Results: The Myth of “More Agents”

To quantify the impact of model capabilities on agent performance, we evaluated our architectures across three leading model families: OpenAI GPT, Google Gemini, and Anthropic Claude. The results reveal a complex relationship between model capabilities and coordination strategy.

While performance generally trends upward with more capable models, multi-agent systems are not a universal solution. They can either significantly boost or unexpectedly degrade performance depending on the specific configuration. For instance, decentralized architectures performed exceptionally well on parallelizable tasks like web navigation but struggled on sequential tasks like financial reasoning.

Predictive Model for Optimal Architecture

To address this variability, we introduced a predictive model that identifies the optimal architecture for 87% of unseen tasks. The model considers task properties, model capabilities, and agent configurations to recommend the best approach. This tool empowers practitioners to make informed decisions about agent system design.
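In spirit, such a predictor maps task properties and model capability to a recommended architecture. The rules, feature names, and threshold below are invented for illustration only; the paper's actual model is fit to evaluation data, not hand-written.

```python
# Hypothetical sketch of an architecture recommender. The features,
# thresholds, and rules are invented; the real model is learned from data.

def recommend_architecture(parallelizable, needs_coordination, model_strength):
    if not parallelizable:
        return "single-agent"        # sequential tasks gain little from more agents
    if needs_coordination:
        # Weaker models benefit from an explicit orchestrator;
        # stronger models can handle looser hybrid coordination.
        return "centralized" if model_strength < 0.7 else "hybrid"
    return "independent"             # parallel, loosely coupled sub-tasks

print(recommend_architecture(parallelizable=True,
                             needs_coordination=False,
                             model_strength=0.8))   # -> independent
```

The useful idea is the interface, not the rules: given measurable task properties and a model-capability estimate, the system design decision becomes a prediction problem rather than a heuristic.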

Pros and Cons of Multi-Agent Systems

Multi-agent systems offer several advantages, such as parallelization, fault tolerance, and specialized expertise. However, they also come with challenges like communication overhead, consensus delays, and the risk of cascading errors. Understanding these trade-offs is crucial for designing effective agent systems.

Conclusion

Our research provides a comprehensive guide to scaling AI agent systems. By challenging the heuristic of “more agents is better” and introducing quantitative scaling principles, we hope to empower practitioners to design more effective and efficient agent systems. As AI agents continue to evolve, we believe that a science of scaling will play a pivotal role in their success.

FAQ

Q: Why is scaling agent systems important?

A: Scaling agent systems is crucial for improving performance, efficiency, and robustness in real-world applications. It helps practitioners design systems that can handle complex, multi-step tasks effectively.

Q: What are the key properties of agentic tasks?

A: Agentic tasks require sustained multi-step interactions, iterative information gathering under partial observability, and adaptive strategy refinement based on environmental feedback.

Q: Which architecture is best for all tasks?

A: There is no one-size-fits-all architecture. The best architecture depends on the specific properties of the task, the capabilities of the model, and the desired trade-offs between parallelization, coordination, and communication overhead.

Q: How can practitioners use our predictive model?

A: Practitioners can use our predictive model to identify the optimal architecture for their specific task. The model considers task properties, model capabilities, and agent configurations to recommend the best approach.

Q: What are the challenges of multi-agent systems?

A: Multi-agent systems face challenges like communication overhead, consensus delays, and the risk of cascading errors. Understanding these trade-offs is crucial for designing effective agent systems.


