AI Agents That Make Things Up Will Cost Your Business More Than You Think
I've spent years building automation tools for ISP operators. Network fault resolution. Compliance filing. Equipment provisioning. I've seen what happens when an automated system gives a confident, plausible, wrong answer at 2am. In network operations, a wrong answer doesn't waste time — it takes services down.
The AI agent industry has a hallucination problem. Not a "we're working on it" problem — a structural, mathematically inevitable problem. A 2025 proof confirmed that hallucinations cannot be fully eliminated under current LLM architectures. They're a consequence of how these systems work: next-token prediction optimized for plausibility, not truth.
The benchmark data is clear. Even the best models show meaningful error rates on controlled tests: Gemini-2.0-Flash at 0.7% on summarization, Claude at roughly 3%. On open-ended factual queries, rates climb dramatically. On domain-specific technical queries — exactly what you need for network operations — rates of 10-20% are common.
For a chatbot generating marketing copy, a 3% error rate is manageable. For an agent executing CLI commands on live network equipment, it is not.
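To see why a per-response rate that looks tolerable becomes a problem once an agent takes many actions, it helps to compound it. The following is a minimal sketch, assuming every action fails independently at the same rate (a simplification, since real errors are often correlated). The function name and the action counts are illustrative and not taken from any benchmark.

```python
def prob_at_least_one_error(per_action_error: float, actions: int) -> float:
    """Chance that at least one of `actions` independent actions is wrong."""
    if not 0.0 <= per_action_error <= 1.0:
        raise ValueError("per_action_error must be between 0 and 1")
    if actions < 0:
        raise ValueError("actions must be non-negative")
    # P(no errors) = (1 - p)^n, so P(at least one error) is its complement.
    return 1.0 - (1.0 - per_action_error) ** actions


if __name__ == "__main__":
    # 3% is the figure quoted above; the action counts are assumptions.
    for n in (1, 10, 100):
        print(f"{n:>3} actions: {prob_at_least_one_error(0.03, n):.1%}")
```

Under these assumptions, a 3% per-action rate means roughly a 26% chance of at least one wrong action in 10, and over 95% in 100.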
THE AGENT HALLUCINATION PROBLEM IS DIFFERENT FROM THE CHATBOT PROBLEM
When a chatbot hallucinates, a human reads the output and catches the error. The cost is a wasted interaction. When an agent hallucinates and takes action based on that hallucination, the cost is the action itself.
An agent that misidentifies an OLT fault and executes the wrong CLI sequence doesn't just produce a wrong answer. It executes a command on live infrastructure. It potentially cascades a single device fault into a service outage for hundreds of subscribers.
This is the safety argument for agentic AI that the industry is not taking seriously enough. Agents that act autonomously need reliability guarantees that single-model inference cannot provide. The error rate acceptable for text generation is not acceptable for autonomous action.
CONSENSUS MODELING: THE ENGINEERING ANSWER
The engineering solution comes from distributed systems design — the domain I've spent my career in. When a single node gives you an answer you can't fully trust, you don't just accept it. You ask multiple independent nodes and take the answer that has consensus.
Applied to AI agents, this means running the same query against multiple independent model instances, comparing their outputs, and requiring agreement before taking action. A single model might hallucinate. The probability that two independent models produce the same hallucination is dramatically lower. Three independent models with a majority-vote requirement approach the reliability threshold that production operations require.
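The effect of a majority vote can be put in numbers. This is a rough sketch, assuming each of three models is wrong independently with the same probability p. That is a strong assumption: models trained on overlapping data can share failure modes, so real gains may be smaller. The function name is hypothetical.

```python
def majority_wrong_probability(p: float) -> float:
    """P(at least 2 of 3 independent models are wrong), each wrong with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    # Exactly two wrong: 3 * p^2 * (1 - p). All three wrong: p^3.
    return 3 * p**2 * (1 - p) + p**3


if __name__ == "__main__":
    for p in (0.03, 0.10, 0.20):
        print(f"single-model error {p:.0%} -> majority wrong {majority_wrong_probability(p):.2%}")
```

With p = 0.10 this gives 0.028, and with p = 0.20 it gives 0.104. Both figures are upper bounds on acting on a hallucination under these assumptions, because the wrong models must also agree on the same wrong answer to win the vote.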
Research from Amazon's UAF (Uncertainty-Aware Fusion) framework demonstrated an 8% accuracy improvement over single-model approaches by combining multiple LLMs weighted by their accuracy and self-assessment quality. Self-consistency checking — asking a model to verify its own answer — reduces hallucination rates by up to 65% in some benchmark configurations.
XSI LodeStone implements a multi-agent quorum architecture for exactly this reason. Actions affecting live operational systems require consensus across multiple sub-agents before execution. The cost is latency. The benefit is reliability. For operations where a wrong action means service disruption, that trade-off is not a close call.
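The quorum idea itself is small enough to sketch. What follows is an illustration only and not XSI LodeStone's actual implementation. The function names, the whitespace normalization, and the placeholder action strings are all assumptions made for the example.

```python
from collections import Counter
from typing import Optional, Sequence


def normalize(proposal: str) -> str:
    """Collapse whitespace so trivially different outputs compare equal."""
    return " ".join(proposal.split())


def quorum_decision(proposals: Sequence[str], quorum: int) -> Optional[str]:
    """Return the proposal backed by at least `quorum` agents, else None."""
    if quorum < 1:
        raise ValueError("quorum must be at least 1")
    if not proposals:
        return None
    counts = Counter(normalize(p) for p in proposals)
    winner, votes = counts.most_common(1)[0]
    # No action is returned unless enough agents independently agree.
    return winner if votes >= quorum else None


if __name__ == "__main__":
    # Placeholder strings, not real CLI commands.
    print(quorum_decision(["ACTION_A", " ACTION_A ", "ACTION_B"], quorum=2))
    print(quorum_decision(["ACTION_A", "ACTION_B", "ACTION_C"], quorum=2))
```

Setting the quorum above half the number of agents guarantees that at most one proposal can win. When the function returns None, the safe default is to take no action and escalate.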
THE PERFORMANCE-RELIABILITY TRADE-OFF IN PRACTICE
Consensus modeling is not free. Running three model instances takes roughly three times the compute and produces results in roughly 1.5-2x the time of single inference. On an on-premises GB10-based appliance running Nemotron, a single inference takes approximately 200-400ms for typical operational queries. A three-node consensus check takes 300-600ms.
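One reason a three-node check can finish in well under three times the single-inference time is that the calls can run concurrently. The sketch below simulates that with sleeps standing in for model calls. It is an idealized case and not a measurement: on a single appliance the instances contend for the same accelerator, so real overhead will be higher than this shows. All names and latencies here are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_inference(latency_s: float) -> float:
    """Stand-in for one model call; sleeps instead of running a model."""
    time.sleep(latency_s)
    return latency_s


def fan_out(latencies_s: list[float]) -> float:
    """Run all simulated calls concurrently and return wall-clock seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(latencies_s)) as pool:
        # Consume the iterator so every call has finished before timing stops.
        list(pool.map(simulated_inference, latencies_s))
    return time.perf_counter() - start


if __name__ == "__main__":
    latencies = [0.20, 0.30, 0.40]  # assumed per-instance latencies in seconds
    elapsed = fan_out(latencies)
    print(f"sequential: ~{sum(latencies):.2f}s, concurrent: {elapsed:.2f}s")
```

In this idealized form the wall-clock time tracks the slowest call and not the sum of all three.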
For most operational automation tasks — fault diagnosis, compliance filing, network queries — that latency is completely acceptable. An operator previously spending 15 minutes diagnosing a fault at 2am is now waiting 500ms. The absolute latency is irrelevant next to the improvement.
The trade-off that actually matters is accuracy versus speed, not accuracy versus cost. On-premises inference carries no per-query fee: once the hardware is paid for, you can run three model instances for the same marginal dollar cost as running one. The on-premises model changes the optimization entirely: when inference is free at the margin, you optimize for reliability, not cost efficiency.
Cloud AI changes this calculus. At $3-$25 per million output tokens, a three-model consensus check costs roughly three times as much as a single inference. For a business processing thousands of operational queries daily, that difference compounds into real budget. This is one of several reasons why agentic workloads that require consensus modeling tend toward on-premises deployment.
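The cloud arithmetic is easy to run for your own workload. The sketch below uses the $3-$25 per million output tokens range quoted above. The query volume and tokens per query are assumptions chosen for illustration, and input-token charges are ignored to keep the example short.

```python
def daily_output_cost(
    queries_per_day: int,
    output_tokens_per_query: int,
    usd_per_million_tokens: float,
    instances: int = 1,
) -> float:
    """Daily output-token spend in USD when each query runs on `instances` models."""
    tokens = queries_per_day * output_tokens_per_query * instances
    return tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # Assumed workload: 5,000 queries per day, 500 output tokens each.
    for price in (3.0, 25.0):
        single = daily_output_cost(5_000, 500, price)
        consensus = daily_output_cost(5_000, 500, price, instances=3)
        print(f"${price:>5.2f}/M tokens: single ${single:.2f}/day, consensus ${consensus:.2f}/day")
```

For this assumed workload the daily spend is $7.50 versus $22.50 at the low end of the price range, and $62.50 versus $187.50 at the high end.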
SUB-AGENTS: SPECIALIZATION REDUCES ERROR RATES
The second structural approach to hallucination control is specialization. A generalist model asked about Nokia CLI commands for a specific OLT model will have a higher error rate than a model fine-tuned specifically on Nokia carrier documentation.
XSI LodeStone's Skill Library architecture implements this through specialized sub-agents: purpose-built agents with access to curated, vendor-specific documentation and command libraries for Nokia, Calix, and Adtran equipment. The Nokia sub-agent doesn't use general world knowledge to guess a CLI command sequence — it retrieves from a validated, carrier-grade reference set.
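The key behavior of a retrieval-backed sub-agent is that it returns a validated procedure or nothing at all. Here is a minimal sketch of that pattern. It is not the Skill Library's actual API: the class names, fields, and placeholder entries are hypothetical, and no real vendor commands are shown.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ValidatedProcedure:
    vendor: str
    task: str
    steps: tuple[str, ...]
    source_doc: str  # reference the procedure was validated against


class SkillLibrary:
    """Lookup table of curated procedures; it never generates commands itself."""

    def __init__(self, procedures: Iterable[ValidatedProcedure]) -> None:
        self._index = {(p.vendor.lower(), p.task.lower()): p for p in procedures}

    def lookup(self, vendor: str, task: str) -> Optional[ValidatedProcedure]:
        # Returning None forces the caller to escalate instead of guessing.
        return self._index.get((vendor.lower(), task.lower()))


if __name__ == "__main__":
    library = SkillLibrary([
        ValidatedProcedure("Nokia", "example-task", ("<validated step 1>",), "<reference doc>"),
    ])
    print(library.lookup("nokia", "example-task"))
    print(library.lookup("Calix", "example-task"))  # not in the library -> None
```

The design choice that matters is the None path: a request that falls outside the curated set is refused or escalated, and is never answered from general model knowledge.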
This is not a "nice to have" for enterprise AI. This is how you get error rates low enough for production operations. The hallucination problem doesn't disappear, but the surface area shrinks dramatically.
This is the combination that production agentic AI requires: specialization to reduce the error surface, consensus to catch the errors that remain, and human oversight for actions above a defined risk threshold. Not one of these alone. All three together.
WHAT THIS MEANS FOR DEPLOYMENT DECISIONS
If you're deploying AI agents in a context where agent errors have operational consequences — network operations, financial transactions, compliance filings, medical decisions — single-model inference is not sufficient. The benchmark error rates, even for the best models, are too high for autonomous action in production environments.
The appropriate architecture combines consensus modeling across multiple independent model instances, specialization through domain-specific Skill Libraries, graduated autonomy with human oversight gates for high-risk actions, and audit trails for every agent action — inputs, reasoning, outputs, consensus results.
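The consensus, oversight gate, and audit trail can be sketched together as a single decision function. This is one possible shape and not a reference design. The risk score, the 0.7 threshold, the quorum of 2, and all field names are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from enum import Enum


class Decision(str, Enum):
    EXECUTE = "execute"
    NEEDS_HUMAN = "needs_human"
    REJECT = "reject"


@dataclass
class AuditRecord:
    timestamp: str
    query: str
    reasoning: str
    proposed_action: str
    votes_for: int
    votes_total: int
    risk_score: float
    decision: str


def gate(query: str, reasoning: str, proposed_action: str, votes_for: int,
         votes_total: int, risk_score: float, quorum: int = 2,
         human_risk_threshold: float = 0.7) -> AuditRecord:
    """Decide whether to execute, escalate, or reject, and record why."""
    if votes_for < quorum:
        decision = Decision.REJECT          # no consensus: never act
    elif risk_score >= human_risk_threshold:
        decision = Decision.NEEDS_HUMAN     # consensus, but too risky to automate
    else:
        decision = Decision.EXECUTE
    return AuditRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        query=query, reasoning=reasoning, proposed_action=proposed_action,
        votes_for=votes_for, votes_total=votes_total,
        risk_score=risk_score, decision=decision.value,
    )


if __name__ == "__main__":
    record = gate("<query>", "<reasoning>", "<action>", 3, 3, risk_score=0.9)
    print(json.dumps(asdict(record), indent=2))  # one audit-log entry
```

Every call produces a record, including rejected and escalated ones, so the audit trail covers what the agent declined to do as well as what it did.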
This is not an argument against AI agents. It is an argument for deploying them with engineering discipline proportional to the stakes of the decisions they make. An agent automating marketing personalization operates at a different risk level than an agent resetting an OLT. They need different reliability architectures.
Build accordingly.
Rhyan J. Neble | Founder & CEO, Extended Systems Intelligence | rneble@xtendedsystems.com | xsilodestone.ai
Q&A with Rhyan
Extended questions from discussions — answered in full.
Even the best models show meaningful error rates: Gemini-2.0-Flash at 0.7% on summarization, Claude at roughly 3%. On open-ended factual queries, rates climb dramatically. On domain-specific technical queries—exactly what you need for network operations—rates of 10-20% are common. For a chatbot, 3% is manageable; for an agent executing CLI commands on live equipment, it is not.
When a chatbot hallucinates, a human reads and catches the error; the cost is a wasted interaction. When an agent hallucinates and takes action based on it, the cost is the action itself. An agent that misidentifies an OLT fault and executes the wrong CLI sequence potentially cascades a single device fault into a service outage for hundreds of subscribers.
Running the same query against multiple independent model instances and requiring agreement before action dramatically reduces hallucination probability. Research from Amazon's UAF framework showed an 8% accuracy improvement by combining multiple LLMs weighted by accuracy. XSI LodeStone requires consensus across multiple sub-agents before actions affecting live systems execute.
Yes. With cloud AI at $3-25 per million output tokens, a three-model consensus check costs roughly 3x as much as a single inference. On-premises inference is essentially free at the margin, so you can run three model instances for the same dollar cost as one. This changes the optimization from accuracy vs. cost efficiency to accuracy vs. speed, a much better trade-off.
Common Questions
Search-ready answers to the questions we hear most often.
Hallucination is when a model produces a confident, plausible answer that is factually incorrect. A 2025 proof confirmed hallucinations cannot be fully eliminated under current LLM architectures—they're a consequence of next-token prediction optimized for plausibility rather than truth. Error rates range from <1% on simple summarization to 10-20% on domain-specific technical queries.
When a chatbot hallucinates, a human reads the output and catches it; the cost is a wasted interaction. When an agent hallucinates and acts on it, the cost is the autonomous action itself. An agent that misidentifies an OLT fault and executes the wrong CLI sequence potentially cascades a single device problem into a service outage affecting hundreds of subscribers.
Gemini-2.0-Flash shows a 0.7% error rate on summarization; Claude roughly 3%. On open-ended factual queries, rates climb dramatically. On domain-specific technical queries, exactly what network operations need, rates of 10-20% are common. For context: at the upper end of that range, as many as 1 in 5 network operations queries could be hallucinated.
Consensus modeling runs the same query against multiple independent model instances and requires agreement before action. If one model might hallucinate, the probability that two independent models produce identical hallucinations is dramatically lower. Research shows an 8% accuracy improvement from combining models, with self-consistency checking reducing hallucination rates by up to 65% in some configurations.
Running three model instances costs roughly 3x the compute and produces results in 1.5-2x the time of single inference. On-premises, this is acceptable: single inference takes ~200-400ms, a consensus check ~300-600ms. For fault diagnosis, this latency is irrelevant next to the reliability improvement. Cloud pricing makes consensus 3x as expensive, creating economic pressure toward single-model deployment with higher risk.
A generalist model asked about Nokia CLI commands for a specific OLT has a higher error rate than a model fine-tuned on Nokia carrier documentation. Specialized sub-agents with access to curated, vendor-specific documentation dramatically shrink the hallucination surface area. Combined with consensus and human oversight gates, this creates the production reliability that agentic AI requires.