Multi-Agent Harness Engineering and Design

A frame for applying harness engineering principles to multi-agent systems: state, tools, memory, handoffs, evaluation, and maintenance as models improve.

Harness engineering is becoming a standard way to talk about useful agent systems. The model is placed inside an environment with tools, context, state, permissions, feedback, and evaluation. Coding agents made this visible because the surrounding harness is hard to ignore: file access, search, terminal execution, patch application, tests, diffs, logs, and review loops.

Multi-agent systems need the same discipline, with extra surfaces. The harness now has to manage more than one model loop. It has to manage division of work, shared state, handoffs, tool boundaries, memory scopes, evidence, and coordination failure.

This overlaps with amplification architecture because both topics ask when multiple agents add real capability. The difference is emphasis. Amplification architecture is the design intent. Multi-agent harness engineering is the execution layer that makes the intent inspectable.

What the harness controls

Anthropic’s “Building effective agents” separates workflows from agents: workflows follow predefined code paths, while agents dynamically direct their own tool use and process. OpenAI’s agent guide describes runs as loops with exit conditions, then places multi-agent systems into manager and decentralized handoff patterns.

Those patterns are useful, but the harness has to make them operational. In a single-agent system, the harness controls the loop, tool calls, context, guardrails, and output. In a multi-agent system, it also controls:

  • which agent owns which part of the task
  • which state is local, shared, or durable
  • which tools each agent can call
  • when one agent can hand off to another
  • what artifact must be produced before the next step
  • which memories can be read or written
  • which evidence is enough to end the run
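
The control points above can be made concrete as data the harness checks on every step. The following is a minimal sketch, not a reference implementation: RoleSpec, Harness, the tool names, and the artifact names are all hypothetical, and memory scopes and completion evidence are left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleSpec:
    """Operating conditions for one role. All field names are illustrative."""
    name: str
    owns: str                      # part of the task this role owns
    tools: frozenset[str]          # tools the role may call
    hands_off_to: frozenset[str]   # roles it may hand off to
    required_artifact: str         # artifact due before the handoff


class Harness:
    """Hypothetical harness that enforces tool and handoff boundaries."""

    def __init__(self, roles: list[RoleSpec]):
        self.roles = {role.name: role for role in roles}

    def may_call(self, role: str, tool: str) -> bool:
        return tool in self.roles[role].tools

    def may_hand_off(self, source: str, target: str, artifacts: set[str]) -> bool:
        spec = self.roles[source]
        return target in spec.hands_off_to and spec.required_artifact in artifacts


harness = Harness([
    RoleSpec("planner", "plan", frozenset({"read_repo"}),
             frozenset({"implementer"}), "plan.md"),
    RoleSpec("implementer", "patch", frozenset({"read_repo", "apply_patch"}),
             frozenset({"reviewer"}), "patch.diff"),
    RoleSpec("reviewer", "verdict", frozenset({"read_repo", "run_tests"}),
             frozenset(), "verdict.json"),
])

print(harness.may_call("planner", "apply_patch"))                   # False
print(harness.may_hand_off("planner", "implementer", set()))        # False: no plan yet
print(harness.may_hand_off("planner", "implementer", {"plan.md"}))  # True
```

The design choice is that the role is defined by what the harness allows, so a change in role boundaries shows up as a diff in configuration rather than in prompt wording.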

This is where many role-based systems stay too shallow. A planner, implementer, and reviewer are useful only if the harness gives them different operating conditions. The planner needs constraints and acceptance criteria. The implementer needs a target environment, tools, and patch permissions. The reviewer needs tests, diffs, source access, and a verdict schema. The handoff between them should be an artifact, not a vague message.
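
One way to read "an artifact, not a vague message" is a typed record that the harness validates at the boundary. The sketch below assumes a planner-to-implementer handoff; PlanHandoff, its fields, and accept_handoff are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PlanHandoff:
    """Hypothetical planner-to-implementer artifact; the fields are assumptions."""
    goal: str
    constraints: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)

    def missing(self) -> list[str]:
        """Names of fields the implementer would otherwise have to guess at."""
        gaps = []
        if not self.goal.strip():
            gaps.append("goal")
        if not self.constraints:
            gaps.append("constraints")
        if not self.acceptance_criteria:
            gaps.append("acceptance_criteria")
        return gaps


def accept_handoff(handoff: PlanHandoff) -> PlanHandoff:
    """Reject an incomplete artifact at the boundary instead of downstream."""
    gaps = handoff.missing()
    if gaps:
        raise ValueError(f"incomplete handoff, missing: {', '.join(gaps)}")
    return handoff


vague = PlanHandoff(goal="Make login faster")
print(vague.missing())  # ['constraints', 'acceptance_criteria']
```

A reviewer verdict could follow the same shape, with a status field and the evidence that supports it.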

The multi-agent literature gives examples of this move. AutoGen treats agent behavior and conversation patterns as programmable. MetaGPT encodes standard operating procedures into prompt sequences and intermediate outputs. Voyager shows a different but related pattern: a long-running agent can preserve executable skills and use environment feedback to refine future behavior.

The shared point is that useful agent systems need more than prompts. They need stateful operating conditions.

Evals for the surrounding system

Model evals measure the model in isolation. They do not cover the system around it.

A multi-agent harness needs evals for coordination. Did the manager delegate to the right specialist? Did a handoff preserve enough context? Did two agents duplicate work because memory was poorly scoped? Did a reviewer produce independent evidence or only a second opinion? Did the system stop because the task was complete, because the budget expired, or because an agent produced a confident final answer?
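
The last question, why the run stopped, is easy to record explicitly. This is a small sketch under the assumption that the harness logs whether a completion gate passed and how much of a step budget was used; StopReason, RunSummary, and the field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class StopReason(Enum):
    GATE_PASSED = "gate_passed"            # completion evidence was checked
    BUDGET_EXPIRED = "budget_expired"      # the run used up its step budget
    UNVERIFIED_CLAIM = "unverified_claim"  # an agent said "done" without evidence


@dataclass
class RunSummary:
    """Hypothetical end-of-run record; field names are assumptions."""
    gate_passed: bool
    steps_used: int
    step_budget: int


def classify_stop(run: RunSummary) -> StopReason:
    if run.gate_passed:
        return StopReason.GATE_PASSED
    if run.steps_used >= run.step_budget:
        return StopReason.BUDGET_EXPIRED
    return StopReason.UNVERIFIED_CLAIM


print(classify_stop(RunSummary(gate_passed=False, steps_used=7, step_budget=20)))
# StopReason.UNVERIFIED_CLAIM
```

Counting each reason across runs gives a rough view of how often the system ends on evidence rather than on confidence.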

It also needs evals for tools. OpenAI’s guide separates data tools, action tools, and orchestration tools. That distinction matters in multi-agent systems because tool access becomes part of role design. A source agent may need retrieval and citation tools. A writer may need drafts and style memory. A deployment agent may need logs and command execution. A finance agent may need strict action permissions.

The harness should test whether each tool contract is clear enough for the agent that uses it. Tool overload is a real failure mode. OpenAI’s guide notes that the issue is often tool similarity or overlap, not raw tool count. If two tools look interchangeable, the harness should fix the tool surface before adding another agent.
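
A cheap first check for look-alike tools is to compare their descriptions. The sketch below uses Jaccard similarity over description tokens, which is a crude lexical proxy and not something the guide prescribes; the tool names, descriptions, and the 0.5 threshold are assumptions. Tool-selection errors observed in real traces would be stronger evidence.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two descriptions, from 0.0 to 1.0."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def overlapping_tools(descriptions: dict[str, str], threshold: float = 0.5):
    """Return (tool, tool, score) for pairs at or above the threshold."""
    pairs = []
    for left, right in combinations(sorted(descriptions), 2):
        score = jaccard(descriptions[left], descriptions[right])
        if score >= threshold:
            pairs.append((left, right, score))
    return pairs


# Hypothetical tool surface for illustration.
tools = {
    "search_docs": "search internal documentation by keyword",
    "find_docs": "search internal documentation by keyword or phrase",
    "run_tests": "execute the project test suite",
}

flagged = overlapping_tools(tools)
print([(a, b) for a, b, _ in flagged])  # [('find_docs', 'search_docs')]
```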

Memory also needs evals. The memory survey on LLM agents describes reading, writing, and reflection as core memory operations. In a multi-agent harness, those operations need scope checks. An agent that can record a local task observation should not automatically update organization memory. A reviewer that can store a verdict should preserve the evidence that produced it. A planner that reads stale project assumptions should surface uncertainty rather than continue silently.
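
Two of these scope checks, write permission and staleness, can be sketched in a few lines. The scope names, the ScopedMemory class, the grant table, and the 14-day freshness window are all assumptions made for illustration; evidence preservation for verdicts is not modeled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class MemoryEntry:
    scope: str            # e.g. "task", "project", or "organization"
    text: str
    written_at: datetime


class ScopedMemory:
    """Hypothetical memory store with per-agent write grants."""

    def __init__(self, write_grants: dict[str, set[str]], max_age: timedelta):
        self.write_grants = write_grants
        self.max_age = max_age
        self.entries: list[MemoryEntry] = []

    def write(self, agent: str, entry: MemoryEntry) -> None:
        if entry.scope not in self.write_grants.get(agent, set()):
            raise PermissionError(f"{agent} may not write to {entry.scope} memory")
        self.entries.append(entry)

    def read(self, scope: str, now: datetime) -> list[tuple[MemoryEntry, bool]]:
        """Return (entry, is_stale) pairs so the reader can surface uncertainty."""
        return [(e, now - e.written_at > self.max_age)
                for e in self.entries if e.scope == scope]


utc = timezone.utc
now = datetime(2025, 1, 31, tzinfo=utc)
memory = ScopedMemory(
    write_grants={"implementer": {"task"},
                  "lead": {"task", "project", "organization"}},
    max_age=timedelta(days=14),
)
memory.write("lead", MemoryEntry("project", "API v1 is frozen",
                                 datetime(2025, 1, 1, tzinfo=utc)))
memory.write("implementer", MemoryEntry("task", "test_login is flaky", now))

for entry, stale in memory.read("project", now):
    print(entry.text, "(stale)" if stale else "(fresh)")  # API v1 is frozen (stale)
```

Returning the staleness flag with each read, rather than filtering stale entries out, leaves the decision to the agent or the harness policy while keeping the uncertainty visible.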

The eval target is the surrounding system:

  • handoff completeness
  • evidence quality
  • memory precision
  • stale-memory rate
  • permission correctness
  • duplicate work
  • tool-selection errors
  • reviewer independence
  • completion-gate quality

These are practical measurements. They tell the builder whether the harness is helping the agents coordinate or only adding movement.
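
Several of these measurements fall out of a run trace directly. The sketch below computes three of them from a list of events; the event shape (the type, complete, agent, tool, args, and allowed keys) is a hypothetical logging format, and duplicate work is approximated as identical tool calls made by more than one agent.

```python
from collections import defaultdict


def coordination_metrics(events: list[dict]) -> dict:
    handoffs = [e for e in events if e["type"] == "handoff"]
    calls = [e for e in events if e["type"] == "tool_call"]

    # Group agents by identical (tool, args) calls to spot repeated work.
    agents_by_call = defaultdict(set)
    for call in calls:
        agents_by_call[(call["tool"], call["args"])].add(call["agent"])

    def ratio(hits: int, total: int):
        return hits / total if total else None  # None means nothing to measure

    return {
        "handoff_completeness": ratio(sum(h["complete"] for h in handoffs), len(handoffs)),
        "permission_correctness": ratio(sum(c["allowed"] for c in calls), len(calls)),
        "duplicate_work": sum(1 for agents in agents_by_call.values() if len(agents) > 1),
    }


# Hypothetical trace for illustration.
events = [
    {"type": "handoff", "complete": True},
    {"type": "handoff", "complete": False},
    {"type": "tool_call", "agent": "a", "tool": "search", "args": "q1", "allowed": True},
    {"type": "tool_call", "agent": "b", "tool": "search", "args": "q1", "allowed": True},
    {"type": "tool_call", "agent": "b", "tool": "deploy", "args": "prod", "allowed": False},
]

metrics = coordination_metrics(events)
print(metrics["handoff_completeness"], metrics["duplicate_work"])  # 0.5 1
```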

The useful direction

Harnesses age. Stronger models absorb some scaffolding, and that is healthy. A pattern that was necessary for one model generation may become overhead for the next.

Rich Sutton’s “The Bitter Lesson” gives the broader warning: systems that depend too heavily on fixed human structure can lose to more general methods as computation and learning improve. In agent engineering, the implication is practical. The harness should be maintained as the model changes. Remove scaffolding when the model no longer needs it. Keep the parts that connect the model to the environment: state, tools, memory, permissions, evidence, and review.

Multi-agent harnesses should be especially adaptive. The builder should expect roles to merge, split, or disappear. A separate planning agent may become unnecessary when the model can plan inside the implementation loop. A separate source agent may remain useful because it owns citation provenance and retrieval logs. A separate reviewer may remain useful because it has different tools, test authority, and risk constraints.

The role has to earn its place in the harness. The system should be able to show why the role exists.

This gives a maintenance loop:

  1. Run the system on real tasks.
  2. Record state, handoffs, tool calls, memory reads, memory writes, and verdicts.
  3. Inspect repeated failures or repeated friction.
  4. Promote useful patterns into tools, memory, procedures, or agent contracts.
  5. Retire roles that no longer add evidence, context reach, or permission separation.
  6. Re-evaluate after model upgrades.
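
Step 5 can be approximated as a check over recorded role data. This is a rough sketch, assuming the harness can report, per role, how many evidence items it produced, which tools it holds, and which context sources it reads; the dictionary keys and role names are hypothetical, and a flagged role is only a candidate for human review.

```python
def retirement_candidates(roles: dict[str, dict]) -> list[str]:
    """Roles with no evidence, no exclusive tools, and no exclusive context."""
    candidates = []
    for name, role in roles.items():
        others = [r for n, r in roles.items() if n != name]
        other_tools = set().union(*(r["tools"] for r in others))
        other_context = set().union(*(r["context"] for r in others))
        earns_place = (
            role["evidence_items"] > 0             # adds evidence
            or role["tools"] - other_tools         # permission separation
            or role["context"] - other_context     # context reach
        )
        if not earns_place:
            candidates.append(name)
    return candidates


# Hypothetical per-role summary built from recorded runs.
roles = {
    "planner": {"tools": {"read_repo"}, "context": {"repo"}, "evidence_items": 0},
    "implementer": {"tools": {"read_repo", "apply_patch"}, "context": {"repo"},
                    "evidence_items": 0},
    "reviewer": {"tools": {"read_repo", "run_tests"}, "context": {"repo", "ci_logs"},
                 "evidence_items": 4},
}

print(retirement_candidates(roles))  # ['planner']
```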

The best multi-agent systems will probably look less like static teams and more like maintained operating environments. Agents may change. The harness should preserve the parts that make work legible: what happened, why it happened, who touched which state, what evidence was produced, and what should be different next time.

References