Beware of Confident Liars: Why Your Data Matters More Than Your Model Size
There's a common refrain in AI circles: open-source LLMs can't compete, small models are garbage, and frontier providers like Anthropic or OpenAI will always outperform everyone else.
That's increasingly untrue, unless your task is writing a poem or an essay about the French Revolution.
For real business automation, speed and clarity matter more than raw model capability. Being able to run your AI locally is a big deal if you want to actually own your own destiny. And your data quality is, and will always be, the biggest factor behind "AI success."
Most agents don't need a giant model. A clean pipeline plus a focused 1-4 billion parameter model is often enough to execute a clear task reliably. Frontier models are overkill for most workflows, and defaulting to them can push your organisation in the wrong direction long term.
Sure, big models can be better in many ways. But if your retrieval and your data are mediocre, you don't get intelligence. You get a model making mistakes confidently.
The Research Backs This Up
Salesforce's AI research team puts it plainly: "There's no substitute for hundreds of billions of parameters when you want to be everything to everyone. But in the enterprise, this ability is almost entirely moot." Their findings show that small language models designed for specific, well-defined tasks can easily outperform larger models (Salesforce Blog).
IBM's experience confirms this. Their Granite models cost between 3 and 23 times less than frontier models while matching or outperforming similarly-sized competitors on key benchmarks (IBM Think). For a mid-sized enterprise running 50,000 daily requests, that could mean the difference between a $15,000/month API bill and a $1,500 one.
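The arithmetic behind that claim is worth making explicit. A rough sketch, where the per-token prices and token counts are illustrative assumptions rather than quoted rates:

```python
def monthly_api_cost(requests_per_day: int, tokens_per_request: int,
                     price_per_million_tokens: float, days: int = 30) -> float:
    """Rough monthly spend for a token-priced API."""
    tokens = requests_per_day * tokens_per_request * days
    return tokens / 1_000_000 * price_per_million_tokens

# Illustrative prices: a frontier model at $10 per million tokens
# vs a small model at $1 per million. 50K requests/day, ~1K tokens each.
frontier = monthly_api_cost(50_000, 1_000, 10.0)
small = monthly_api_cost(50_000, 1_000, 1.0)

print(f"frontier: ${frontier:,.0f}/month, small: ${small:,.0f}/month")
# frontier: $15,000/month, small: $1,500/month
```

At that volume the ratio, not the exact prices, is the point: a 10x per-token difference compounds into five figures a month.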
The evidence from domain-specific applications is even more striking:
- In medicine, a 7 billion parameter diabetes model achieved 87.2% accuracy on clinical tests, while GPT-4 showed 79.17% and Claude 3.5 managed 80.13%
- In legal applications, a model with just 200 million parameters achieves 77.2% accuracy in contract analysis, approaching GPT-4's 82.4%
- For identifying unfair terms in user agreements, small models have outperformed both GPT-3.5 and GPT-4 on the F1 metric
The CEO of Hugging Face has predicted that up to 99% of use cases could be addressed using small language models.
Gartner's analysts name small language models with 500 million to 20 billion parameters as the sweet spot for businesses that want to adopt generative AI without breaking the bank. Deloitte's guidelines suggest opting for a small language model within the 5-50 billion parameter range to initiate a smoother AI journey (Instinctools).
Small Models for Agentic AI
NVIDIA Research recently published a position paper arguing that small language models are "sufficiently powerful, inherently more suitable, and necessarily more economical for many invocations in agentic systems, and are therefore the future of agentic AI" (NVIDIA Research).
Their argument is straightforward: agentic systems perform a small number of specialised tasks repetitively and with little variation. You don't need a model that can hold a general conversation about philosophy when the task is "extract the invoice number from this PDF."
Their recommendations for organisations:
- Prioritise SLMs for cost-effective deployment, particularly where real-time or on-device inference is required
- Design modular agentic systems using a heterogeneous model approach: SLMs for routine, narrow tasks and LLMs reserved for complex reasoning
- Leverage SLMs for rapid specialisation through fine-tuning, enabling faster iteration cycles
Beyond cost, small models also unlock use cases that cloud APIs simply cannot serve: environments where sensitive data cannot leave the device, offline or air-gapped operations, latency-critical applications requiring sub-100ms responses, and deployments at scale where 50K+ daily requests make API costs prohibitive.
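A heterogeneous setup can be as simple as a router in front of two model endpoints. The sketch below is a minimal illustration with stubbed model calls; the task names, the routing set, and both model functions are hypothetical placeholders for whatever endpoints you actually run:

```python
from typing import Callable

def slm_extract(prompt: str) -> str:
    # Stub for a small, fine-tuned model handling narrow, repetitive tasks.
    return f"[SLM] {prompt}"

def llm_reason(prompt: str) -> str:
    # Stub for a frontier model reserved for open-ended reasoning.
    return f"[LLM] {prompt}"

# Routine, well-defined tasks the small model is fine-tuned for.
ROUTINE_TASKS = {"extract_invoice_number", "classify_ticket", "redact_pii"}

def route(task: str, prompt: str,
          slm: Callable[[str], str] = slm_extract,
          llm: Callable[[str], str] = llm_reason) -> str:
    """Send routine tasks to the SLM; escalate everything else to the LLM."""
    return slm(prompt) if task in ROUTINE_TASKS else llm(prompt)

print(route("extract_invoice_number", "Invoice INV-1234 ..."))  # handled by the SLM
print(route("draft_strategy_memo", "Summarise the trade-offs ..."))  # escalated
```

In a real system the routing decision might itself be a small classifier, but the structure is the same: cheap, fast models on the hot path, expensive ones behind an explicit escalation.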
Not Everything Needs an LLM
Here's something that often gets lost in the hype: generative models like GPT are not the right architecture for every task. Understanding and analysis models (like BERT) dominate for classification, sentiment analysis, named entity recognition, and semantic search. They're often hundreds of times faster, significantly cheaper, can run on CPU, and are frequently more accurate for these specific tasks.
The right tool depends on the job:
| Task | Best Approach |
|---|---|
| Text classification | Understanding models (BERT) |
| Sentiment analysis | Understanding models (BERT) |
| Named entity recognition | Understanding models (BERT) |
| Semantic search | Sentence Transformers |
| Chatbot / conversation | Generative models (LLMs) |
| Text generation | Generative models (LLMs) |
| Code completion | Generative models (LLMs) |
Anyone claiming generative AI is the only way forward clearly hasn't looked at the full picture.
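The table above can be encoded directly as configuration. A minimal sketch; the task keys and the default fallback are illustrative choices, not a standard taxonomy:

```python
# Task-to-architecture mapping from the table above, as a lookup
# you could drop into a service's configuration.
MODEL_FOR_TASK = {
    "text_classification": "understanding (BERT)",
    "sentiment_analysis": "understanding (BERT)",
    "named_entity_recognition": "understanding (BERT)",
    "semantic_search": "sentence transformers",
    "chatbot": "generative (LLM)",
    "text_generation": "generative (LLM)",
    "code_completion": "generative (LLM)",
}

def best_approach(task: str) -> str:
    # Default to a generative model only for tasks we haven't classified.
    return MODEL_FOR_TASK.get(task, "generative (LLM)")
```

Making the mapping explicit forces the useful question at design time: is this task actually generation, or is it classification wearing a chat interface?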
Open Source Has Caught Up
The narrative that proprietary models will always dominate is crumbling.
Meta's Llama 4 Maverick and Scout have been reported to outperform earlier proprietary models across various benchmarks, especially in coding, reasoning, and multilingual capabilities (Shakudo). Llama 4 Scout now offers a 10 million token context window, making it viable for private deployment on complex enterprise tasks.
DeepSeek's R1 model dropped with open weights and outperformed Claude and o1 on core reasoning benchmarks (Botpress). Their latest DeepSeek-V3 series is now among the best open-source options for reasoning and agentic workloads (BentoML).
The practical implication matters: when your models run on your infrastructure, your data stays your data. That's not a philosophical point. It's a competitive advantage.
But even these capable open models share a fundamental limitation with their proprietary counterparts.
The Hallucination Problem Doesn't Scale Away
OpenAI's own research team recently published findings explaining why language models hallucinate: "Language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertainty" (OpenAI).
The key insight is counterintuitive. Larger models, while generally more capable, also tend to hallucinate with what researchers call "confident nonsense." Model scaling alone does not eliminate hallucination but rather amplifies it in certain contexts (Frontiers in AI).
Research from the University of Oxford published in Nature demonstrates how LLMs can give different answers each time they're asked a question, even with identical wording, confidently generating plausible but imaginary information. The authors note that "the most dangerous failures of AI come when a system does something bad but is confident" (University of Oxford).
The real-world consequences are mounting. In 2024, Air Canada was ordered to pay damages and honour a bereavement fare policy that was hallucinated by a support chatbot. The tribunal rejected Air Canada's defence that the chatbot was a "separate legal entity responsible for its own actions." In October 2025, hallucinations including non-existent academic sources and a fake quote from a federal court judgement were discovered in an A$440,000 report written by Deloitte and submitted to the Australian government (Wikipedia).
Data Quality Is Your Actual Differentiator
It doesn't matter how smart your model is. If the data isn't there to back it up, it's just guesswork.
Doug Robinson, executive director of the National Association of State CIOs, puts it well: "If you can't trust the quality, integrity and reliability of your data, you can't trust the results of the analysis" (Greene Barrett).
Companies that invest in cleaning up their data and building robust retrieval pipelines get dramatically better results than those that simply upgrade to a bigger model.
What This Looks Like in Practice
At MakerX, we treat model selection as an engineering decision, not a default to the biggest option available. The choice compounds across every request.
On a CRM platform, we run sentence transformers in-memory for semantic search and intent matching. No external API call. Sub-50ms responses. A lightweight model handles routing, and a more capable model only comes in for final response generation, where it actually adds value.
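The shape of that in-memory search loop is simple. The sketch below substitutes a hashed bag-of-words embedding for a real sentence-transformer encoder (the `toy_embed` function is a stand-in to keep the example self-contained, not what runs in production), just to make the retrieval step concrete:

```python
import hashlib
import math

def toy_embed(text: str, dims: int = 256) -> list[float]:
    """Hashed bag-of-words embedding: a stand-in for a sentence-transformer
    encoder, so the retrieval loop below is runnable as-is."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def search(query: str, corpus: list[str]) -> str:
    """Return the corpus entry most similar to the query.
    Everything stays in memory; no external API call."""
    query_vec = toy_embed(query)
    return max(corpus, key=lambda doc: cosine(query_vec, toy_embed(doc)))

docs = ["reset my password", "update billing address", "cancel subscription"]
print(search("how do I reset the password", docs))
```

Swap `toy_embed` for a proper encoder and precompute the corpus vectors once at startup, and this is the whole hot path: an embed, a dot product, an argmax.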
On a healthcare engagement involving clinical documentation, we used lightweight models for PII redaction and found that a compact model matched the quality of frontier models at a fraction of the cost and latency.
The pattern holds: right tool, right size. Understanding models for search and classification. Small language models for most generation tasks. Frontier models reserved for the few problems that genuinely demand them.
The payoff compounds: lower operating costs, faster responses, higher throughput, and no dependency on a single provider's uptime or pricing changes.
Choosing the Right Tool
The specific models leading benchmarks today will change. Frontier models improve, open-source alternatives catch up, and new architectures emerge. But the underlying principles remain:
Use frontier models when:
- Tasks require broad world knowledge or complex reasoning across multiple domains
- You're prototyping and need flexibility before optimising
- The task genuinely benefits from capabilities that only appear at scale
Use small, focused models when:
- You have a well-defined, specific task
- Speed and cost efficiency matter
- You need to run locally for privacy or data sovereignty
- Your data is clean and your retrieval pipeline is solid
Consider understanding models (BERT-style) when:
- The task is classification, sentiment analysis, NER, or semantic search
- You need sub-100ms latency
- You're running at scale and cost matters
Regardless of architecture:
- Invest in data quality before model capability
- Test for hallucinations in your specific domain
- Remember that a confident wrong answer is worse than an uncertain right one
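That last point can be made operational in your evals. A minimal domain-eval sketch, where the scoring weights, the gold answers, and the stub predictions are all illustrative assumptions:

```python
# Score a model's answers against gold labels, penalising confident
# wrong answers more heavily than honest abstentions.
ABSTAIN = "I don't know"

def score_answer(predicted: str, gold: str) -> int:
    if predicted == gold:
        return 1   # correct
    if predicted == ABSTAIN:
        return 0   # uncertain, but honest
    return -2      # confident and wrong: the dangerous case

def evaluate(predictions: dict[str, str], gold: dict[str, str]) -> int:
    return sum(score_answer(predictions[q], gold[q]) for q in gold)

gold = {"capital_of_france": "Paris", "invoice_total": "$120.00"}
model_a = {"capital_of_france": "Paris", "invoice_total": ABSTAIN}    # hedges
model_b = {"capital_of_france": "Paris", "invoice_total": "$210.00"}  # guesses

print(evaluate(model_a, gold), evaluate(model_b, gold))  # 1 -1
```

Under accuracy alone both models score 50%; an asymmetric rubric like this surfaces the difference that actually matters in production.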
The Bottom Line
The argument that only frontier models matter is looking increasingly outdated. For real business automation, the model choice is rarely the bottleneck. Your data pipeline is.
If you're evaluating where AI fits in your operations, we'd be happy to chat.