There Are Nearly 2.8 Million Models on Hugging Face. Your ISP Needs Exactly the Right Six.

I spent five years at ETI Software Solutions watching ISP operations teams struggle with complexity. I learned something important: the problems they face aren't new, but the tools to solve them have been stuck in the past. That's what led to XSI LodeStone.

But here's what surprised me during development: the hardest problem wasn't building AI agents that could do ISP work. It was deciding which AI models to use. The paradox of choice is real. Hugging Face alone hosts nearly 2.8 million models, growing by over 4,000 per day. Ollama has thousands. NVIDIA NGC is growing. Together AI, Replicate, and two dozen other platforms offer their own curated collections. And that's just the open ecosystem. Every cloud provider has proprietary models too. For an ISP operator evaluating AI solutions, this isn't abundance. It's paralysis.

THE WRONG QUESTION

When I talk to ISP leadership about AI, they usually ask: "Which model should we use?" It's the wrong question. The right question is: "Which model should we use for each specific operational task, and how do we validate that choice against our actual infrastructure, our actual NOC operations, and our actual failure scenarios?"

Here's why this matters. I've seen demos where a large model (120B parameters, cutting-edge benchmark scores) performs beautifully on a vendor's laptop. The same model, compressed down to 7B or 13B to run on edge hardware, degrades in ways that are unpredictable and domain-specific. A general-purpose benchmark doesn't tell you how the model performs when asked to generate a specific OLT CLI sequence under a specific fault condition on your specific equipment. It doesn't tell you whether it hallucinates. It doesn't tell you whether it understands the difference between similar but critical configuration commands. That gap between the benchmark and production is where ISP operators get hurt.

THE QUANTIZATION TRAP

Let me be concrete. A model benchmarked at full 120B parameters might score 92% on a standard reasoning benchmark. When you compress it to fit a smaller appliance (distilling it down to something like 7B parameters, quantizing its weights to lower precision, or both), that score doesn't drop to 91%. It doesn't drop linearly at all. The degradation is non-linear and domain-specific. In some domains, the compressed version is still useful. In others, it's unreliable. And you often don't know which until you test it against real work.

Strictly speaking, quantization lowers the numeric precision of the weights and leaves the parameter count unchanged, while distillation produces a model with fewer parameters. Vendors often apply both, and the risk is the same either way: compression makes a model small enough to run on less capable hardware, but it can and often does result in degraded performance, degraded accuracy, and increased hallucination. What performs brilliantly in a vendor demo running the full 120B model will behave differently when deployed as the compressed version that actually runs in your data center.

So how do you solve this? You don't try to make one model do everything. And you don't skip the validation work.

Characteristic	Full-Precision 120B	Compressed 7B (distilled and/or quantized)
Public Benchmark Score	92%	Estimated 85-88% (non-linear)
Inference Speed	Slower, requires robust hardware	Fast, edge-compatible
Domain-Specific Performance	Reliable across all domains	Unpredictable, varies by task
Hallucination Rate	Lower	Higher (compression degrades accuracy)
CLI Generation Accuracy	Tested, validated in production	Requires independent validation
Hardware Requirements	Requires significant compute	Runs on LodeStone appliance hardware
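To make "validation work" concrete, here is a minimal sketch of a per-domain comparison between two variants of a model. Everything in it is an assumption for illustration: the `Case` structure, the two stand-in model functions, and the placeholder command strings (which are not real vendor CLI syntax). In practice the callables would wrap real inference, and the cases would come from vetted operational data.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Case(NamedTuple):
    domain: str    # e.g. "network_management"
    prompt: str    # the operational task given to the model
    expected: str  # the known-good answer, e.g. a vetted command


def normalize(text: str) -> str:
    """Collapse whitespace so formatting differences don't count as errors."""
    return " ".join(text.split())


def accuracy_by_domain(model: Callable[[str], str],
                       cases: List[Case]) -> Dict[str, float]:
    """Exact-match accuracy per domain for one model variant."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case.domain] += 1
        if normalize(model(case.prompt)) == normalize(case.expected):
            hits[case.domain] += 1
    return {domain: hits[domain] / totals[domain] for domain in totals}


# Hypothetical stand-ins for the full and compressed variants.
def full_model(prompt: str) -> str:
    return {"disable port 3": "port 3 shutdown", "sum 2 2": "4"}.get(prompt, "")


def compressed_model(prompt: str) -> str:
    # Fine on the generic case, subtly wrong on the domain-specific one.
    return {"disable port 3": "port 3 reset", "sum 2 2": "4"}.get(prompt, "")


CASES = [
    Case("network_management", "disable port 3", "port 3 shutdown"),
    Case("general", "sum 2 2", "4"),
]

full_scores = accuracy_by_domain(full_model, CASES)
compressed_scores = accuracy_by_domain(compressed_model, CASES)
for domain in full_scores:
    print(domain, full_scores[domain], compressed_scores[domain])
```

The point of scoring per domain rather than overall is that an aggregate number can hide exactly the failure that matters: here the compressed stand-in is perfect on the generic case and wrong on the network command.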

ARCHITECTURE OVER BENCHMARKS

XSI LodeStone ships with six carefully selected domain models plus an orchestrator, each chosen for a specific operational role. The orchestrator, and the one model we don't compress, is NVIDIA's Nemotron 120B Mixture of Experts model. This is the system that manages the onboarding wizard, routes tasks to the right agents, handles escalation logic, and coordinates multi-step workflows. It's the brain that decides what to do. This is the one model where you cannot afford degradation.

MoE architecture is key here. Mixture of Experts means only a fraction of the 120B parameters activate on any given query. You get the reasoning depth of a 120B model with inference speed closer to a 30-40B model. That's why it works locally on a single LodeStone appliance without requiring massive computational overhead.

Then we have domain-specific models for each operational skill:

Development — code generation, API integration, script automation. This model needs to understand your ISP's ecosystem and generate working scripts, not plausible-looking ones.
QA and Testing — validation, test generation, regression analysis. It needs to understand test coverage and failure modes, not just generate code that compiles.
Customer Service — subscriber interaction, ticket routing, knowledge base queries. It needs to understand tone, context, and when to escalate. Getting this wrong doesn't fail silently; it fails in front of customers.
DevOps — infrastructure monitoring, deployment automation, log analysis. It needs to parse complex log structures and recommend actions, not generate plausible-sounding infrastructure advice.
Network Management — this is where domain expertise matters most. The model needs to understand hardware vendor CLI syntax across the platforms you actually run, understand the difference between similar commands that have very different outcomes, and diagnose faults from network telemetry.
Content Creation — compliance documentation, report generation, communication drafting. It needs to generate correct content first, then polished content, not polished-sounding hallucinations.
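The routing idea behind these six domains can be sketched in a few lines. This is only an illustration under stated assumptions: in the platform described here the routing decision is made by the orchestrator model, whereas this sketch substitutes a hypothetical keyword table (`KEYWORDS`) so the control flow is visible. The escalation rule (hand off when zero or several domains match) is likewise an assumption, not a documented behavior.

```python
import re

# Hypothetical keyword table standing in for the orchestrator model's judgment.
KEYWORDS = {
    "development": ("script", "api"),
    "qa_testing": ("test", "regression"),
    "customer_service": ("subscriber", "ticket"),
    "devops": ("deploy", "log"),
    "network_management": ("olt", "fault"),
    "content_creation": ("report", "compliance"),
}


def route(task: str) -> str:
    """Return the single matching domain, or "escalate" if the match is
    missing or ambiguous."""
    words = set(re.findall(r"[a-z0-9]+", task.lower()))
    matches = [domain for domain, kws in KEYWORDS.items() if words & set(kws)]
    return matches[0] if len(matches) == 1 else "escalate"


print(route("Diagnose the fault on OLT 7"))
print(route("Generate a compliance report from the deploy log"))
```

The second call touches two domains at once, so the sketch escalates instead of guessing. Whatever makes the routing decision, an explicit "not sure" path is what keeps a misrouted task from failing silently.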

Each of these domains got months of evaluation, testing against real ISP operational data, validation on real equipment, and iteration. The model selection for each domain wasn't "which is cheapest" or "which scores highest on a public benchmark." It was "which performs most reliably on the actual work an ISP operator needs to do." That curation process is the product. The hard work isn't running inference. Any vendor can download a model from Hugging Face and run it on hardware. The hard work is the months of evaluation, the benchmarking against real operational scenarios, the edge case testing, the validation that a model trained on public internet data can generate correct commands for enterprise network infrastructure. That domain knowledge, that validation, that curated architecture — that's what cannot be replicated by downloading a model from Hugging Face and hoping it works.

The curation process is the product. Not the models themselves, but the domain expertise, validation, testing, and iteration that went into selecting the right model for each operational task.

— Why model selection matters more than model capability

THE ON-PREMISES ADVANTAGE

Here's something that changes the entire calculus: when inference is free at the margin, you can afford to run the right model for each task instead of the cheapest model that sort-of works.

Cloud AI forces a cost optimization. You pay per token, so you're incentivized to use the smallest model that can possibly handle the work, even if a larger model would be significantly more reliable. You're trying to minimize cost, which means you're accepting reliability tradeoffs.

On-premises hardware changes that equation. Once the hardware is paid for, the marginal cost of inference is close to zero: you pay for power and upkeep, not per token. That means you can afford to use the right tool for the job (the Nemotron 120B MoE for orchestration, the specialized model for network management, the validated model for code generation) without economic pressure to downgrade to a cheaper alternative.

For an ISP operator, this is the leverage you get from on-premises deployment. Not just latency. Not just data sovereignty. But the architectural freedom to optimize for reliability instead of cost-per-token.
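The economics above come down to simple arithmetic. The sketch below is a back-of-the-envelope calculation with hypothetical numbers: the hardware price and the per-million-token price are placeholders, not quotes from any vendor, and the calculation deliberately ignores power, cooling, and staffing.

```python
def cloud_cost(tokens: int, price_per_million: float) -> float:
    """Cumulative cloud spend for a given token volume."""
    return tokens / 1_000_000 * price_per_million


def breakeven_tokens(hardware_cost: float, price_per_million: float) -> float:
    """Token volume at which cloud spend equals the up-front hardware cost."""
    return hardware_cost / price_per_million * 1_000_000


# Placeholder figures for illustration only.
HARDWARE_COST = 50_000.0      # one-time appliance cost
PRICE_PER_MILLION = 10.0      # cloud price per million tokens

print(cloud_cost(2_000_000, PRICE_PER_MILLION))
print(breakeven_tokens(HARDWARE_COST, PRICE_PER_MILLION))
```

Past the break-even volume, each additional token on local hardware costs only its share of power and upkeep, which is why the pressure to pick the smallest workable model goes away.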

WHAT TO ASK YOUR AI VENDOR

If you're evaluating AI solutions for ISP operations, here's what to ask:

How many models are running? If they say "one," ask why. A single general-purpose model is unlikely to be optimal across development, network management, and customer service.
Which models and why? Ask for specifics. Hugging Face model IDs. Training data. Benchmarks if they have them. But more importantly: ask about validation against your specific equipment and your specific operational tasks. Public benchmarks tell you one thing. Performance on your network equipment tells you what matters.
What happens when the model is compressed? How does performance degrade when the model is optimized for edge deployment? Have they tested this? Do they have a validation framework?
How is the orchestration layer built? Is it a single model or a multi-component system? How reliable is routing logic? What happens when a task is escalated?
What is the curation process? This is the real differentiator. Not the models themselves, but the domain expertise, the validation, the testing, the iteration that went into selecting the right model for each operational task.

Most vendors skip this work. They download a model, run some benchmarks, and call it a day. That's why most AI solutions underperform in production. The complexity of model selection isn't something to ignore. It's something to lean into. It's where the real value lives.

SCALING THE ARCHITECTURE

Here's the practical reality of running seven models — one orchestrator plus six domain specialists — on local hardware: they don't all need to be in memory simultaneously. A NOC operator isn't generating code, handling a customer ticket, and writing compliance documentation in the same second. The orchestrator routes each task to the right domain, and only that domain model needs to be active. The XSI LodeStone platform manages this automatically. The orchestrator stays resident in memory at all times — it's the one model that must always be ready. Domain models load on demand when the orchestrator routes a task, and the platform manages memory allocation intelligently. The most frequently called models stay hot; others swap in as needed. On NVMe storage, a domain model loads in seconds, not minutes. But here's where it gets interesting: LodeStone appliances are designed to scale.

STARTER (Single Appliance): Orchestrator + 2-3 hot domain models. Others swap on demand.
GROWTH (Multi-Appliance): Add appliances. All 6 domains stay resident. Zero swap latency.
DISTRIBUTED (Edge + Central): Edge handles latency-critical. Central NOC handles batch and async.
ENTERPRISE (Full Precision): All models at full precision. Multi-tenant. Massive headroom.

A single LodeStone appliance is the entry point. Add a second, and the platform automatically detects the new hardware, redistributes domain models across both units, and eliminates swap latency entirely. Deploy an appliance at the network edge for latency-critical operations like fault diagnosis, with a central appliance at the NOC handling batch workloads like compliance reporting and content generation. The platform detects the topology and routes accordingly. For operators with enterprise-scale infrastructure, the architecture scales to rack-level hardware where every model runs at full precision with headroom for fine-tuning and multi-tenant deployment. No quantization compromises. No swap latency. The same orchestrator, the same domain models, the same curation — just more room to run.
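One way to picture the redistribution step is as a packing problem: given model sizes and appliance memory, find a layout where everything stays resident. The sketch below uses a simple greedy heuristic (largest model first, onto the appliance with the most free memory). It is an assumption for illustration, not the platform's actual placement logic, and the sizes in `MODELS_GB` are hypothetical.

```python
from typing import Dict, List

# Hypothetical memory footprints in GB.
MODELS_GB = {
    "orchestrator": 60,
    "development": 15,
    "qa_testing": 15,
    "customer_service": 15,
    "devops": 15,
    "network_management": 15,
    "content_creation": 15,
}


def place(models: Dict[str, int], appliances: Dict[str, int]) -> Dict[str, List[str]]:
    """Greedy placement: largest model first, onto the emptiest appliance.

    Raises ValueError if some model cannot stay resident anywhere.
    """
    free = dict(appliances)
    layout: Dict[str, List[str]] = {name: [] for name in appliances}
    for model, size in sorted(models.items(), key=lambda kv: -kv[1]):
        target = max(free, key=lambda name: free[name])
        if free[target] < size:
            raise ValueError(f"no room to keep {model} resident")
        layout[target].append(model)
        free[target] -= size
    return layout


two_units = place(MODELS_GB, {"unit_a": 128, "unit_b": 128})
for unit, resident in two_units.items():
    print(unit, resident)
```

With these placeholder sizes, one 128 GB unit cannot hold all seven models at once (the call raises `ValueError`), while two units can. That mirrors the single-appliance versus multi-appliance distinction described above.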

The architecture scales from a single appliance running two hot models to a distributed fleet with edge and central nodes — same orchestrator, same curation, same validated models. You grow the hardware; the platform adapts automatically.

— LodeStone scalability by design

This is also where vertical portability becomes real. The LodeStone architecture — the orchestrator, the domain slots, the model management layer — is the platform. The models within each slot are what change. Today those slots are filled with telecom-specific models validated for ISP operations. But the same six-slot architecture works for financial services, where QA becomes regulatory compliance validation and Network Management becomes risk modeling. It works for healthcare, where Customer Service becomes clinical NLP and Content Creation becomes medical documentation. The platform is vertical-agnostic. The curation is vertical-specific. That's the business, and that's the moat.

Deep Dive

Q&A with Rhyan

Extended questions from discussions — answered in full.

MoE architecture is crucial for orchestration because it provides the reasoning depth of a large model (120B parameters) with the inference speed of a much smaller model (30-40B equivalent). Only a fraction of parameters activate on any given query, which means you get sophisticated decision-making for task routing, escalation logic, and workflow coordination without massive computational overhead. This allows the orchestrator to run locally on a LodeStone appliance without requiring external API calls or expensive cloud inference, reducing latency and improving reliability for time-critical ISP operations.

Validation is the core of the curation process and cannot be skipped. Each model, particularly the network management model, is tested against real ISP operational data and validated on actual equipment. This means running the model against real OLT and access platform CLI sequences from the hardware vendors you actually deploy — verifying it understands the difference between similar commands that have very different outcomes. We test edge cases, fault scenarios, and real network telemetry to ensure the model can diagnose problems accurately. Public benchmarks don't capture this domain-specific performance—only hands-on testing against real infrastructure does. A model might score well on general reasoning benchmarks but fail when asked to generate a specific CLI sequence or diagnose a fault condition on your equipment.

The appliance benefits from the on-premises advantage: you have full control over when and how to update models. When new models are released, you can evaluate them against your specific operational tasks and equipment. You're not forced into a cloud-based release cycle or automatic updates that might degrade performance. You can validate new models in a testing environment before deploying them to production, ensuring they maintain the reliability and domain expertise you've already established. This is different from cloud-based AI where you're at the mercy of vendor update schedules and performance can change without your input.
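A staged update policy like this can be reduced to a small gate: a candidate model is promoted only if it does not regress on any domain you already validate. The sketch below assumes per-domain scores are already available as dictionaries. The function name, the score format, and the `tolerance` parameter are hypothetical.

```python
from typing import Dict


def should_promote(current: Dict[str, float],
                   candidate: Dict[str, float],
                   tolerance: float = 0.0) -> bool:
    """True only if the candidate matches or beats the current model in
    every validated domain (within the given tolerance)."""
    return all(
        candidate.get(domain, 0.0) >= score - tolerance
        for domain, score in current.items()
    )


current_scores = {"network_management": 0.90, "customer_service": 0.80}
better = {"network_management": 0.95, "customer_service": 0.80}
regressed = {"network_management": 0.85, "customer_service": 0.90}

print(should_promote(current_scores, better))
print(should_promote(current_scores, regressed))
```

The second candidate improves customer service but loses ground on network management, so it is held back. A domain missing from the candidate's results counts as a zero score, which means an untested domain blocks promotion rather than slipping through.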

A single LodeStone appliance has 128GB of unified memory. The orchestrator — which must always stay resident — takes a significant portion of that. After the orchestrator, there's enough memory to keep two to three domain models loaded simultaneously, with others swapping in on demand. The platform manages this automatically: the orchestrator routes a task, the model swap manager loads the required domain model in seconds via NVMe, runs inference, and manages memory. For most ISP operations, this is seamless — a NOC operator isn't running code generation and customer service tasks in the same second. When you need all six domains hot simultaneously with zero swap latency, you add a second LodeStone appliance. The platform auto-detects the new hardware and redistributes models across both units. This scales further: edge appliances for latency-critical network operations, a central appliance at the NOC for batch workloads, and enterprise-grade hardware for full-precision deployment of every model with headroom to spare.
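The swap behavior described here resembles a small cache with one pinned entry. The sketch below is an assumption-laden illustration, not the platform's model swap manager: the class name `ModelResidency` is hypothetical, and it uses least-recently-used eviction as a simple stand-in for whatever policy keeps frequently called models hot.

```python
from collections import OrderedDict


class ModelResidency:
    """Tracks which domain models are loaded; the orchestrator is pinned."""

    def __init__(self, pinned: str, max_hot: int):
        self.pinned = pinned          # always resident, never evicted
        self.max_hot = max_hot        # how many domain models fit alongside it
        self.hot = OrderedDict()      # insertion order doubles as recency order
        self.loads = 0                # number of loads from storage so far

    def acquire(self, model: str) -> str:
        if model == self.pinned:
            return "resident"
        if model in self.hot:
            self.hot.move_to_end(model)    # mark as most recently used
            return "hot"
        if len(self.hot) >= self.max_hot:
            self.hot.popitem(last=False)   # evict the least recently used
        self.hot[model] = True
        self.loads += 1
        return "loaded"


memory = ModelResidency(pinned="orchestrator", max_hot=2)
for name in ["network_management", "customer_service",
             "network_management", "devops", "orchestrator"]:
    print(name, memory.acquire(name))
```

In this run the repeated network management request is served hot, and the later DevOps request evicts customer service, the model that had gone longest without use. The `loads` counter gives a rough measure of how often swap latency is being paid.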

The LodeStone platform — the Nemotron orchestrator, the six domain slots, the model swap manager, the hardware detection layer — is vertical-agnostic. What changes per industry are the models within each slot and the validation criteria. For financial services, the QA slot becomes regulatory compliance validation (SOX, PCI-DSS, AML/KYC), Customer Service becomes financial NLP, and Network Management is replaced by risk modeling. For healthcare, Customer Service becomes clinical NLP, QA becomes medical coding validation (ICD-10, CPT), and Content Creation shifts to clinical documentation. The orchestrator stays the same. The model management layer stays the same. The on-premises deployment story actually gets stronger in both verticals — FinTech has regulatory audit requirements and healthcare has HIPAA mandates that make cloud AI architecturally problematic. The curation work is what changes, and that's where the domain expertise lives.
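The slot remapping can be expressed as a base configuration plus per-vertical overrides. The sketch below takes the slot roles from the description above. The dictionary layout, the key names, and the `slots_for` helper are assumptions made for illustration.

```python
# Base roles for the six domain slots in the ISP deployment.
ISP_SLOTS = {
    "development": "development",
    "qa_testing": "QA and testing",
    "customer_service": "customer service",
    "devops": "DevOps",
    "network_management": "network management",
    "content_creation": "content creation",
}

# Only the slots that change per vertical are listed.
OVERRIDES = {
    "financial_services": {
        "qa_testing": "regulatory compliance validation",
        "customer_service": "financial NLP",
        "network_management": "risk modeling",
    },
    "healthcare": {
        "customer_service": "clinical NLP",
        "qa_testing": "medical coding validation",
        "content_creation": "clinical documentation",
    },
}


def slots_for(vertical: str) -> dict:
    """Same six slots everywhere; overrides swap the role inside a slot."""
    return {**ISP_SLOTS, **OVERRIDES.get(vertical, {})}


for slot, role in slots_for("healthcare").items():
    print(slot, "->", role)
```

Keeping the overrides separate from the base makes the claim checkable: the set of slots never changes, only what fills them.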

The initial deployment uses XSI's curated model selection—the result of months of testing and validation across real ISP environments. However, the architecture is designed around the principle that operators own their infrastructure and data. The curation process is transparent: we specify exactly which models are running, their sources, and the validation results. Operators can see what's been tested and why. Advanced operators with domain expertise can work with XSI to evaluate alternative models, but this requires the same rigorous validation process. This isn't about locking customers in; it's about ensuring that any model running in production has been properly tested against real operational scenarios and real equipment—not just downloaded from Hugging Face and hoping for the best.

People Frequently Ask

Common Questions

Search-ready answers to the questions we hear most often.

A Mixture of Experts (MoE) model is a neural network architecture where different sections of the network (called "experts") specialize in different types of problems. Instead of all parameters activating on every query, a routing mechanism selects which experts are needed for the specific task. This means for any given input, only a fraction of the total parameters are used. So a 120B parameter MoE model might activate only 30-40B parameters per query, giving you the reasoning depth of a very large model with the speed and efficiency of a much smaller one. This is particularly valuable for orchestration tasks that require sophisticated reasoning but need to run locally without excessive compute requirements.
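A toy version of the routing mechanism makes the idea easier to see. The sketch below is a simplification under stated assumptions: the "experts" are plain functions rather than neural sub-networks, and the gate scores are passed in by hand, whereas a real MoE layer computes them from the input with a learned router. It shows top-k selection with softmax weights over the chosen experts only.

```python
import math
from typing import Callable, Dict, List


def top_k_gate(scores: List[float], k: int) -> Dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in chosen)            # subtract for stability
    exps = {i: math.exp(scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


def moe_forward(x: float, experts: List[Callable[[float], float]],
                scores: List[float], k: int) -> float:
    """Run only the selected experts and blend their outputs."""
    weights = top_k_gate(scores, k)
    return sum(weight * experts[i](x) for i, weight in weights.items())


# Four toy experts; only two are evaluated per input.
EXPERTS = [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x ** 2]
GATE_SCORES = [0.1, 2.0, -1.0, 2.0]   # hand-picked for the example

print(top_k_gate(GATE_SCORES, k=2))
print(moe_forward(3.0, EXPERTS, GATE_SCORES, k=2))
```

With `k=2` out of four experts, half of the expert computation is skipped for this input. The same principle, applied to a large model with many experts, is what lets a model with a high total parameter count run with a much smaller active parameter count per query.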

Model quantization is the process of reducing the precision of a model's parameters, commonly from 16-bit or 32-bit floating point down to 8-bit or lower precision. This makes the model smaller, faster, and able to run on less powerful hardware. However, the accuracy cost of quantization isn't linear. A model that scores 92% on a benchmark at full precision might not degrade to 91% when quantized: the degradation is often non-linear and domain-specific. In some domains the quantized version performs well; in others it becomes unreliable. This is why ISP operators need to validate quantized models against real operational tasks and equipment, not just trust that public benchmark scores will hold up.
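A minimal sketch of what quantization does to individual weights, assuming symmetric per-tensor int8 quantization (one common scheme among several; production toolchains use more elaborate variants). The weight values are made up, and `weight_memory_gb` counts weights only, ignoring activations and caches.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the stored integers."""
    return [value * scale for value in quantized]


def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight storage: parameters times bytes per parameter."""
    return params_billion * bits / 8


WEIGHTS = [0.5, -1.27, 0.003, 1.0]    # made-up example values
q, scale = quantize_int8(WEIGHTS)
restored = dequantize(q, scale)

print(q)
print([round(abs(a - b), 4) for a, b in zip(WEIGHTS, restored)])
print(weight_memory_gb(7, 16), weight_memory_gb(7, 4))
```

Note what happens to the small weight 0.003: it rounds to zero and is lost entirely, while the larger weights survive almost unchanged. Errors like this are not spread evenly across a model's behavior, which is one intuition for why degradation shows up in some domains and not in others.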

XSI LodeStone deploys six carefully selected domain models plus an orchestrator, each chosen for a specific operational role. The orchestrator uses NVIDIA's Nemotron 120B MoE model (the only model deployed at full size). Domain-specific models handle development, QA and testing, customer service, DevOps, network management, and content creation. This multi-model approach is fundamentally different from single-model solutions. A single general-purpose model is unlikely to perform optimally across all these different operational tasks, especially when compressed for edge deployment.

NVIDIA Nemotron is a family of large language models developed by NVIDIA, available in both dense and Mixture of Experts (MoE) variants. The 120B Nemotron MoE model is used by XSI LodeStone as the orchestration layer—the decision-making component that routes tasks to the right agents, handles escalation logic, and coordinates multi-step workflows. Nemotron was trained on a high-quality dataset designed to improve instruction following and reasoning, making it suitable for enterprise applications that require reliable decision-making and task coordination.

Model curation is selecting the right pre-trained model for a specific task based on rigorous validation and testing. It answers the question: "Which existing model performs most reliably on this operational task?" Fine-tuning is taking a pre-trained model and adapting it by training it on domain-specific data. Curation is lighter-weight—it doesn't require retraining—but it requires extensive testing against real operational scenarios. The curation process behind XSI LodeStone involved months of evaluation against real ISP infrastructure, real operational tasks, and real failure scenarios. This domain expertise and validation work is what creates the value, not the models themselves.

One of the advantages of on-premises deployment is that you have complete control over updates. Once a model is deployed locally, it runs entirely on your infrastructure and doesn't require internet connectivity to operate. Updates can be staged, tested, and deployed on your schedule through your own deployment processes. This is different from cloud-based AI where you're dependent on the vendor's update schedule and have less control over when changes take effect. For ISP operators managing mission-critical network operations, this local control is essential.

The LodeStone platform is designed for scalable deployment from day one. A single LodeStone appliance runs the orchestrator plus two to three domain models simultaneously, swapping others on demand. Adding a second appliance lets the platform auto-detect the new hardware and redistribute models so all six domains stay resident with zero swap latency. For larger operators, LodeStone appliances can be deployed at the network edge for latency-critical operations like fault diagnosis, with a central appliance at the NOC handling batch and async workloads. Enterprise-grade hardware provides enough memory to run every model at full precision with headroom for fine-tuning and multi-tenant deployment. The platform detects the hardware topology and adapts automatically.

The LodeStone platform architecture — the Nemotron orchestrator, six domain model slots, model swap management, and hardware scaling — is vertical-agnostic. The current deployment is optimized for ISP and telecom operations because that is where the deepest curation and validation work has been done. However, the same architecture applies to other regulated industries. For financial services, domain models shift to regulatory compliance, financial NLP, and risk modeling. For healthcare, they shift to clinical NLP, medical coding validation, and clinical decision support. The on-premises deployment advantage gets even stronger in these verticals, where regulatory requirements (SOX, HIPAA) make cloud AI architecturally problematic. The platform stays the same; the curated models change per vertical.

About the Author

Rhyan J. Neble

Founder & CEO, Extended Systems Intelligence
