Large language models have become remarkably powerful, but their tendency to generate confident-sounding falsehoods — known as hallucinations — remains a persistent and costly problem. While the industry has experimented with various error-catching techniques, a new startup called Probably believes it has found a more rigorous solution. The company announced today that it has raised $9 million in seed funding from Andreessen Horowitz to bring its approach to market.

Building a ‘mech suit’ for data science

Probably’s first product is a data science tool designed to produce quick, verifiable answers from complex datasets. Each result comes with a citation and an audit trail, a practice that is becoming more common among AI-powered analytics tools. But the core innovation lies in what founder Peter Elias describes as a “data science mech suit” — an elaborate harness system that prevents errors from ever reaching the user.

The system works by having the LLM generate a first-pass answer, which is then checked by a deterministic validator. Any result that does not match the dataset is bounced back rather than shown to the user. Crucially, the LLM has been trained specifically to work with this validator, and the entire system is optimized for both speed and accuracy. Elias noted that this approach allows the system to run on significantly smaller AI models, reducing token costs substantially.
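
The generate-then-validate pattern described above can be illustrated with a minimal sketch. This is not Probably's implementation, which has not been published; the toy dataset, the function names, and the retry limit are all assumptions made for illustration, and a stub function stands in for the LLM call.

```python
from typing import Callable, Optional

# Hypothetical toy dataset standing in for "the dataset" in the article.
SALES = [
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 80},
    {"region": "north", "amount": 50},
]


def validate_total(region: str, claimed: int) -> bool:
    """Deterministic check: recompute the total directly from the data."""
    actual = sum(row["amount"] for row in SALES if row["region"] == region)
    return claimed == actual


def answer_with_validation(
    region: str,
    generate: Callable[[str, int], int],
    max_attempts: int = 3,  # assumed retry budget, not from the article
) -> Optional[int]:
    """Ask the generator for an answer and only return one that validates."""
    for attempt in range(max_attempts):
        candidate = generate(region, attempt)
        if validate_total(region, candidate):
            return candidate
        # A rejected candidate is bounced back: loop and ask again.
    return None  # nothing validated, so nothing reaches the user


def fake_model(region: str, attempt: int) -> int:
    """Stand-in for an LLM call: wrong on the first try, right afterwards."""
    return 999 if attempt == 0 else 170


if __name__ == "__main__":
    print(answer_with_validation("north", fake_model))
```

In this sketch the validator never trusts the generator: a wrong answer is either replaced by one that matches the data or withheld entirely, which is the property the article attributes to the harness.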

Smaller models, lower costs, higher accuracy

One of the most striking findings from Probably’s development process is that the quality of the harness engineering can compensate for the power of the underlying model. “What we learned building this was that the better your harness engineering is, the weaker the model can be,” Elias said. “If you can refine the context enough, the model does not have to work very hard to do the right thing. Basically, it’s an exercise in reducing ambiguity.”

This allows Probably’s tool to run on a model that is “four classes weaker than the frontier models,” meaning it can operate on local hardware — a desktop computer rather than a data center. This dramatically reduces token costs at a time when many enterprises are reassessing their AI budgets amid rising expenses.

Implications for precision-sensitive industries

While the initial product is focused on data science, Elias sees the same engine being extended to other “precision-sensitive use cases,” including accounting and medical services. The approach is notable because it does not require a more powerful LLM, but rather a more disciplined system around it. Elias pointed out that the largest AI labs have not pursued this path, which he attributes to business models that do not reward reducing the number of corrections a user must make.

“I think it’s really interesting that the big AI labs have not even attempted to do this,” Elias said. “They’re incentivized not to, because they make money the more times you have to correct the model.”

Conclusion

Probably’s approach represents a shift in thinking about LLM reliability: instead of trying to build a perfect model, it focuses on building a disciplined system around a good-enough model. With $9 million in seed funding and a clear focus on verifiability, the company is positioning itself as a contender in the growing market for trustworthy enterprise AI. The challenge now will be proving that its deterministic validation layer can scale across different industries without introducing new bottlenecks.

FAQs

Q1: What does Probably’s product do?
It is a data science tool that uses a combination of an LLM and a deterministic validator to produce accurate, cited answers from complex datasets, with an audit trail for each result.

Q2: How does Probably reduce hallucinations?
It uses a deterministic validator that checks every LLM-generated answer against the original dataset and rejects any result that does not match. The LLM is trained to work with this validator, and the whole system is optimized for both speed and accuracy.

Q3: Why can Probably use smaller models?
The harness engineering around the model reduces ambiguity, which allows a less powerful model to produce accurate results. This also allows the system to run on local hardware, cutting token costs significantly.

Probably raises $9M to build a reliability layer for LLMs, targeting 99.99% accuracy

Keshav Aggarwal
