Human error, AI error… what if the real bug lies somewhere in between?

The day it took just 45 minutes to bring down a Wall Street giant

On August 1, 2012, as Wall Street opened, the servers of Knight Capital (at the time one of the largest U.S. market makers) began placing millions of erratic orders across roughly 150 securities. By the time the teams identified and cut off the source of the problem, 45 minutes had passed and the bill had risen to approximately $440 million. The company, on the brink of bankruptcy, was acquired a few months later.

A failure with no single culprit:

The cause? Not a market crash, not a hack, and not a rogue trader. During a deployment, an old test module that had been left in the code (a fragment called Power Peg, designed years earlier for simulation environments) was accidentally reactivated on one of the eight production servers. Seven servers were running the new code; the eighth was simulating test behavior against real markets, with real orders and real millions.

The most troubling aspect of this story is that it’s hard to pinpoint exactly where the mistake was made. Was it the technician who forgot to include a server during deployment? The test code that was never cleaned up? The lack of a post-deployment verification procedure? Or the system that failed to establish any safeguards between simulated behavior and a live market? Taken individually, each link in the chain committed only a forgivable mistake. It was their combination, and the absence of any mechanism to contain it, that led to the disaster.

Fourteen years later, the same question now applies to artificial intelligence. We spend our time debating whether humans or AI make more mistakes, backed by polls and armed with benchmarks. But this comparison misses the point: the systems we build are no longer human or artificial. They are hybrid. AI co-pilots assist doctors, software agents exchange tasks without supervision, and entire decision-making chains combine models, interfaces, and operators. What we need to compare is not error rates but the nature and causes of errors. Above all, we must understand the category that emerges at the interface: coordination errors, those that belong to no one but affect everyone. Let’s break down the four categories.

Human error: a matter of human nature, fatigue, and mental shortcuts

Cognitive psychology, particularly since the work of James Reason, distinguishes several well-defined types of human errors:

Slips: the intention is good, but the execution goes awry. The nurse intends to administer 0.5 mg but dispenses 5 mg at the end of a 12-hour shift. The developer intends to deploy in the staging environment but pushes to production.
Lapses: a step is forgotten. We forget one of the eight servers during a deployment, a bandage at the end of a procedure, or a debug flag before going live.
Omissions: we fail to do what we should have done, often because nothing in the interface or process indicated that it needed to be done.
Errors in judgment: the situation is correctly perceived but misjudged. A risk is underestimated, a margin is overestimated, or one tells oneself, “It’ll be fine.”
Biased decisions: confirmation bias leads a decision-maker to disregard data that contradicts their hypothesis; anchoring bias causes them to stick to the first estimate they hear.

A gradual and context-dependent error:

The causes are deeply biological and contextual: fatigue, stress, emotion, distraction, time pressure, and above all, the limits of our working memory (approximately four items that can be processed simultaneously, according to estimates from cognitive science research). Humans are not defective machines: we are a system optimized by evolution for cognitive frugality, one that takes shortcuts (the famous heuristics) that are remarkably effective most of the time… and disastrous the rest of the time.

Key point: Human error is gradual and context-dependent. A tired person makes slightly more mistakes, while a well-rested person makes slightly fewer. And people generally sense when they are in a state of uncertainty—which, as we will see, is not at all the case for their artificial counterparts.

AI error: confident, abrupt, and statistical

AI systems produce errors of a different kind:

Misclassification: the system categorizes an entry incorrectly. In 2020, Robert Williams was arrested in Detroit based on facial recognition that mistakenly identified him as a suspect; systems at the time had significantly higher error rates for faces that were underrepresented in the training data.
Hallucination: the model generates plausible but false information with complete confidence. In 2023, in the case of Mata v. Avianca, New York lawyers filed a brief citing case law entirely fabricated by ChatGPT (including rulings, case numbers, and citations).
Overfitting: the model has memorized its training data instead of identifying generalizable patterns. Excellent in the lab, mediocre in the real world.
Fragility: even a minor change (a few altered pixels, a slightly different wording) can completely overturn the prediction.
Distribution shift: the world changes, but the model doesn’t. Zillow learned this the hard way in 2021: its real estate valuation model continued to buy homes at prices out of step with a market that was turning. The cost of this venture: over $500 million in write-downs and the shutdown of the Zillow Offers business.

The causes are also specific: biased or incomplete data (the model cannot learn what it has not been shown), poorly defined objectives (optimizing clicks is not the same as optimizing satisfaction), lack of context (the model doesn’t know what it doesn’t know), and lack of metacognition: a classifier can display 99% confidence in an absurd answer.

This is the fundamental difference from human error: whereas humans gradually lose their edge and begin to doubt themselves, AI fails abruptly and without any warning signs. A tired radiologist hesitates; a model operating outside its training domain responds with the same confidence as it would within its area of expertise. An AI error is not simply a faster version of a human error: it is a different kind of error altogether.
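A minimal sketch of why a classifier's confidence says nothing about familiarity, assuming a plain softmax output layer. The logits below are invented for illustration and come from no real model; the point is only that the output always sums to 1 and has no built-in way to answer “none of the above.”

```python
import math

def softmax(logits):
    """Turn raw scores into values that sum to 1 (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes (illustrative numbers, not from a real model).
familiar_input = [2.0, 1.5, 0.3]    # an ambiguous but in-domain case
unfamiliar_input = [9.0, 1.0, 0.5]  # an input unlike anything seen in training

confidence_familiar = max(softmax(familiar_input))
confidence_unfamiliar = max(softmax(unfamiliar_input))

print(f"familiar:   {confidence_familiar:.2f}")
print(f"unfamiliar: {confidence_unfamiliar:.2f}")
```

The reported confidence reflects how far one score sits above the others, not whether the input resembles the training data. That is why a separate out-of-distribution check is usually needed before the number can be treated as a signal of doubt.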

Humans + AI: When a bug creeps into the interface

One might think that combining a vigilant human with a high-performance AI offers the best of both worlds. The literature on human factors paints a more nuanced picture: collaboration creates its own patterns of failure that neither humans nor machines would produce on their own.

Automation bias: When faced with an algorithmic recommendation, humans tend to set aside their own judgment. Studies on diagnostic support systems have shown that clinicians may follow an incorrect recommendation even though they would have made the correct diagnosis without assistance. AI does not replace doctors; if implemented poorly, it can actually make them less effective.
Overtrust and undertrust: the issue isn’t whether or not to trust, but how to calibrate that trust. Too much trust, and we validate hallucinations (the lawyers in the Avianca case). Too little, and we ignore relevant alerts, until one day, weary of false positives, we disable the system (the well-known phenomenon of alert fatigue).
Misinterpreting the results: Is a score of 0.87 a probability? A similarity index? Users often ascribe a meaning to the number that the model does not intend.
The opacity of reasoning: it is impossible to challenge a recommendation when you do not understand the logic behind it. Faced with a black box, a bank advisor or doctor has only two options: to obey or to ignore it. Neither of these constitutes collaboration.
Ambiguous handoffs and gaps in accountability: who is responsible when the system does “almost” everything on its own? Accidents involving driver-assistance systems illustrate this gray area: the manufacturer claims the driver was supposed to remain alert, while the driver believed the machine was in control. In between, no one was really at the wheel.

What these failures have in common: none of them comes from a defective component. The model may be accurate and the human competent, yet the interaction mediocre. Recent research on human-AI complementarity in fact shows a counterintuitive result: in certain configurations, the human + AI pair performs less well than the AI on its own, because the collaboration layer is poorly designed. The interaction is an emergent property of the system, not the sum of the errors of its parts. This is exactly what happened at Knight Capital: the deployment was only the trigger; it was the interface between the human and the automated that determined the scope of the damage.

AI + AI: the coordination error, or the bug with no one to blame

That leaves the most recent frontier: multi-agent systems, where several specialized AIs exchange tasks, summaries, and decisions: one agent reads emails, another queries the database, and a third drafts the response. This is the architecture toward which modern automation workflows are converging, and it gives rise to a new kind of failure:

Semantic drift: Agent A summarizes a document with a slight approximation; Agent B treats this summary as an established fact; Agent C bases a decision on it. Each link is “almost correct,” but the final result is wrong. It’s the game of telephone, API style.

Stale state: two agents are working on different versions of the same reality: one on this morning’s inventory, the other on last night’s. Each is correct within its own reference frame; the system is wrong.

Cascading errors: an erroneous output becomes the input for ten downstream processes. The Flash Crash of May 6, 2010, remains the most spectacular illustration of this: trading algorithms reacting to the reactions of other algorithms wiped out approximately $1 trillion in market capitalization in a matter of minutes, before a rebound that was almost as rapid. No algorithm was “broken”; it was their feedback loop that went haywire.

Duplicate work and wasted effort: without a clear contract, two agents handle the same ticket while a second ticket falls through a routing gap and is never processed. Anyone who has experienced a poorly managed corporate reorganization will recognize the pattern.

Ambiguous roles: does the “verification agent” verify the facts, the format, or compliance? If every agent assumes that someone else will handle it, no one will.

The key point here is this: these failures are not model errors. You could replace each agent with a model twice as powerful, but the semantic drift and the stale state would remain. What is missing is the coordination infrastructure: interface contracts, state synchronization, verification loops. In short, everything that distributed software engineering took thirty years to learn and that we are rediscovering, sometimes painfully, with agents that communicate in natural language, that is, in the most ambiguous format ever invented.
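As one illustration of what state synchronization can mean in practice, here is a minimal sketch of a version check borrowed from optimistic concurrency control. The `VersionedStore` class, the `StaleStateError` exception, and the inventory figures are all hypothetical names and values chosen for the example, not part of any particular agent framework.

```python
from dataclasses import dataclass

class StaleStateError(Exception):
    """Raised when an agent acts on a version of the state that is no longer current."""

@dataclass
class VersionedStore:
    value: dict
    version: int = 0

    def read(self):
        # Return a copy of the state plus the version it corresponds to.
        return dict(self.value), self.version

    def write(self, new_value, based_on_version):
        # Reject writes computed from an outdated snapshot.
        if based_on_version != self.version:
            raise StaleStateError(
                f"write based on v{based_on_version}, current is v{self.version}"
            )
        self.value = dict(new_value)
        self.version += 1
        return self.version

inventory = VersionedStore({"widgets": 10})

# Agent A and agent B both read the same snapshot (version 0).
snapshot_a, version_a = inventory.read()
snapshot_b, version_b = inventory.read()

# Agent A reserves 3 widgets and commits first.
snapshot_a["widgets"] -= 3
inventory.write(snapshot_a, based_on_version=version_a)

# Agent B's view is now stale: its write is refused instead of silently overwriting A's.
snapshot_b["widgets"] -= 8
try:
    inventory.write(snapshot_b, based_on_version=version_b)
    b_committed = True
except StaleStateError:
    b_committed = False
```

Neither agent is “wrong” here; the store simply refuses to let the second one act on this morning's numbers once they have changed. A more capable model would not have helped, which is the point of the paragraph above.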

There is an additional aspect specific to automation: the speed at which it spreads. A human error spreads at the pace of humans: a misdirected email, a misplaced document, a few people affected before anyone notices. An error in an automated system spreads at the pace of machines: millions of orders in 45 minutes at Knight Capital, millions of users affected in a matter of seconds when a notification system goes haywire. Tomorrow, an AI agent will be able to execute its instruction perfectly in the wrong context (wrong permission, outdated state, production environment instead of a sandbox), and its action will reach its maximum audience before anyone has had time to blink. The danger isn’t always the agent’s intelligence: it’s the scope of action entrusted to it without safety boundaries, what reliability engineers call the blast radius.

The next task: designing the system, not just the model

If we accept this classification, one conclusion is clear: the quest for model accuracy, however necessary it may be, addresses only one of the four categories of errors. The other three fall under system design. The task at hand looks like this:

Explicit roles: every actor (human or AI) knows what they are expected to produce, for whom, and what falls outside their scope.

Structured handoffs: a handoff conveys context, confidence levels, and assumptions, not just a raw result. The aviation industry codified its control transfer checklists after decades of accidents; agent systems will need to do the same.

Uncertainty reporting: a model that says “I don’t know” or “low confidence, single source” is far more useful than a model that is slightly more accurate but always categorical.

Audit trails: when the final result is incorrect, we need to be able to trace it back up the chain and identify the link where the information became corrupted. Without traceability, there is no diagnosis; without diagnosis, there is no improvement.

State synchronization: a shared, time-stamped, versioned source of truth. This is the ABCs of data engineering, and it is now a safety requirement.

Controlling the scope of operations: strict separation of environments, permissions granted only as needed, phased deployments, and confirmation commensurate with the impact of the action. In other words, everything that turns a potentially major incident into a non-event seen by only three internal testers.

Human intervention points: not a human who approves everything (which would defeat the purpose of automation), but defined thresholds (critical issues, low confidence, out-of-distribution situations) where the system must escalate.

Verification loops: a monitoring agent is only useful if its monitoring role is clearly defined, equipped with the necessary tools, and independent of the agent being monitored.
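Several items on this list (structured handoffs, uncertainty reporting, audit trails, human intervention points) can be sketched together as a single handoff envelope. What follows is a hedged illustration, not a reference design: the `Handoff` fields, the `relay` and `next_step` helpers, the 0.7 threshold, and the rule that confidence never rises downstream are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """What one actor passes to the next: the result plus what is needed to judge it."""
    producer: str
    result: str
    confidence: float                                # self-reported, between 0.0 and 1.0
    assumptions: list = field(default_factory=list)
    trace: list = field(default_factory=list)        # audit trail of earlier producers

CONFIDENCE_FLOOR = 0.7  # hypothetical threshold, to be tuned per use case

def next_step(handoff, critical=False):
    """Decide whether the chain may continue or must escalate to a human."""
    if critical:
        return "escalate: critical action"
    if handoff.confidence < CONFIDENCE_FLOOR:
        return "escalate: low confidence"
    return "continue"

def relay(previous, producer, result, confidence):
    """Build the next handoff while keeping assumptions and the audit trail."""
    return Handoff(
        producer=producer,
        result=result,
        # Assumed rule: confidence cannot rise above that of the input it relies on.
        confidence=min(confidence, previous.confidence),
        assumptions=list(previous.assumptions),
        trace=previous.trace + [previous.producer],
    )

summary = Handoff(
    producer="summarizer",
    result="Contract renews on 1 March",
    confidence=0.6,
    assumptions=["date format is day-month"],
)
draft = relay(summary, producer="drafter", result="Renewal reminder email", confidence=0.95)
decision = next_step(draft)
```

In this run the drafting agent is individually confident, but because the summary it relies on was not, the envelope carries the doubt forward and the chain escalates rather than sending the email. The `trace` field is what lets someone later walk back up the chain to the link where the approximation entered.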

None of this constitutes fundamental AI research. It involves systems engineering, usability, and governance. It may be less spectacular than a new foundation model, but this is likely where the next gains in reliability will be made.

Conclusion

What if the right question were no longer “Who makes more mistakes, humans or machines?” but “Who fills the gap between the two?” Humans will remain prone to fatigue, distraction, and bias. That is the price of their frugal intelligence. AI will remain statistical, confident, and blind to its own limitations. That is the price of its power. Neither will ever be “corrected.” However, the fabric that connects them—interfaces, contracts, protocols, environments, institutions—is entirely within our control.

Tomorrow’s reliability will therefore not come from isolated intelligence—whether human or artificial—but from a well-designed socio-technical system in which humans, models, interfaces, and organizations monitor and correct one another. Knight Capital did not fail because a human forgot about a server, nor because a program made a miscalculation: the company failed because nothing in between was designed to correct the error. At a time when we are entrusting AI agents with ever-broader scopes of action, this is precisely the lesson that would be costly to learn a second time.

(Note to project managers: No, “it was the AI agent that went off track semantically” won’t be any more of an excuse than “it was the intern’s fault.” In both cases, the real question will be: Who designed the handoff? Sorry.)

FAQ

Does AI make fewer mistakes than humans?

That’s not the right question to ask. On narrow, stable tasks, a well-trained model can achieve a lower error rate than a human expert. But the errors are not of the same nature: humans gradually deteriorate and experience doubt, while AI fails abruptly and with certainty. Comparing the rates without comparing the nature of the errors is like comparing the frequency of colds to that of power outages.

What exactly is a coordination error?

This is a failure that cannot be attributed to any single component: every person, every model, and every service functioned “correctly” within its own scope, but their interaction produced an incorrect or harmful result. Semantic drift between agents, outdated shared state, context-less handoffs, diluted accountability: the bug lies in the interactions, not in the individual components.

How can you reduce automation bias within a team?

Three approaches work well in practice: requiring the system to disclose its confidence level and sources rather than providing a definitive answer; training users on the cases where the system makes mistakes (not just those where it excels); and conducting regular unassisted exercises to maintain independent competence. A human who no longer exercises their judgment loses it and can no longer serve as a safety net.

Are multi-agent systems ready for production?

Technically, yes: agent orchestration frameworks are proliferating and working. The real question is the maturity of the surrounding infrastructure: traceability of agent interactions, state synchronization, interface contracts, and human escalation points. A prudent rule is to start with scopes that have a low blast radius (internal, reversible tasks with a limited audience) before expanding the scope of action.

What is "blast radius" and how can it be limited?

The blast radius refers to the maximum extent of damage that a failed action can cause before it is contained. It is limited by well-known software engineering mechanisms: strict separation of environments (test, acceptance, production), minimal permissions, gradual deployments (canary releases), action limits (amounts, volumes, audience), and emergency kill switches. The principle is simple: assess the potential consequences of an error before asking whether it will occur.
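Three of these mechanisms (environment separation, action limits, and a kill switch) fit in a small pre-execution check. This is a minimal sketch under stated assumptions: the `Guard` class, its fields, and the sandbox policy are hypothetical and only illustrate the idea of bounding an action before asking whether it is correct.

```python
from dataclasses import dataclass

@dataclass
class Guard:
    """Checks a proposed action against its allowed scope before it is executed."""
    environment: str             # the only environment this agent may act in
    max_recipients: int          # cap on the audience of a single action
    kill_switch_on: bool = False

    def check(self, target_env, recipients):
        """Return (allowed, reason) for a proposed action."""
        if self.kill_switch_on:
            return False, "kill switch engaged"
        if target_env != self.environment:
            return False, f"not permitted in {target_env}"
        if recipients > self.max_recipients:
            return False, f"audience {recipients} exceeds cap {self.max_recipients}"
        return True, "ok"

# Hypothetical policy: this agent may only act in the sandbox, on at most 3 recipients.
guard = Guard(environment="sandbox", max_recipients=3)

small_test = guard.check("sandbox", recipients=3)
wrong_env = guard.check("production", recipients=3)
too_wide = guard.check("sandbox", recipients=2_000_000)

# An operator engages the kill switch: every action is refused from then on.
guard.kill_switch_on = True
after_kill = guard.check("sandbox", recipients=1)
```

The guard knows nothing about whether the agent's instruction is right. It only bounds how far a wrong one can travel, which is what the principle above asks for.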