The Confidence Problem

The tool was certain. The employees were certain. The outputs were wrong.

May 27, 2026

The team had a tool for analyzing social media.

Posts, replies, blogs, comments across platforms: the tool ingested it all and returned sentiment analysis, trend identification, and thematic summaries that informed campaign direction and product decisions. It was significantly faster than manual analysis. The outputs were clean and clearly presented. The team trusted them.

The tool did not understand sarcasm. It had limited semantic range. It missed irony, nuance, and the specific kind of performative enthusiasm that social media produces in abundance and that means something quite different from genuine enthusiasm. The outputs looked authoritative. They were systematically wrong in ways that were invisible to the people relying on them.

Campaigns got built on that signal. Product direction got shaped by it. Nobody questioned the outputs because the tool had never given them a reason to. It was confident. They were confident. The confidence was shared and mutually reinforcing and not connected to anything that was actually true about what the tool could reliably do.

This is the confidence problem. Not that AI tools produce incorrect outputs, though they do. That AI tools produce confident outputs, and confidence is contagious in ways that have nothing to do with accuracy.

Why confidence travels

There is a well-documented human tendency to treat fluency as a signal of correctness. A clearly written argument feels more true than a muddled one even when the underlying logic is identical. A confident speaker is more persuasive than a hesitant one even when they know less. This is not a failure of critical thinking. It is how human cognition works under normal operating conditions, calibrated over a long time to treat fluency and confidence as reasonable proxies for reliability.

AI tools are extraordinarily fluent. They produce outputs that are well-structured, clearly presented, and delivered without hesitation or qualification. They do not say “I am not sure about this” or “this is outside my reliable range.” They produce the output and move on.

When employees interact with outputs like that repeatedly, the evaluation step (is this correct, is this reliable, is this within the tool’s actual competence) gets replaced by the action step. The tool said so. Now what do we do?

That is not laziness or credulity. It is a rational adaptation to a tool that has never visibly failed, combined with time pressure, organizational expectations around productivity, and a format that signals authority. The marketing, product, and go-to-market (GTM) teams using the social media analysis tool were not being careless. They were being efficient in exactly the way the tool was designed to encourage.

The calibration layer this requires

Regular readers of Ground Truth will recognize the connection to the confidence calibration audit framework introduced in the first series, which addresses whether an AI system’s expressed confidence is actually calibrated to its accuracy. That framework is the right starting point and it addresses the problem at the model level. A well-calibrated system is better than a poorly calibrated one. It will surface uncertainty more honestly and give employees better signal about when to trust the output and when to verify it.
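To make “calibrated to its accuracy” concrete, here is a minimal sketch of one common check, expected calibration error. It is not necessarily the audit framework from the first series, and it assumes two things the story above does not guarantee: that the tool exposes a confidence score for each output, and that a human has reviewed a sample of outputs for correctness. The names Prediction and expected_calibration_error, and the sample data, are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    confidence: float  # the tool's stated confidence, assumed to lie in [0, 1]
    correct: bool      # whether a human reviewer judged the output correct


def expected_calibration_error(predictions, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    if not predictions:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for p in predictions:
        if not 0.0 <= p.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(p.confidence * n_bins), n_bins - 1)
        bins[index].append(p)

    total = len(predictions)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(p.confidence for p in bucket) / len(bucket)
        accuracy = sum(p.correct for p in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_confidence - accuracy)
    return ece


# Invented sample: four sarcastic posts, each scored at 0.95 confidence,
# two of which a reviewer judged to be classified correctly.
sample = [
    Prediction(0.95, True),
    Prediction(0.95, False),
    Prediction(0.95, False),
    Prediction(0.95, True),
]
print(round(expected_calibration_error(sample), 2))  # 0.45
```

A score near zero means stated confidence tracks observed accuracy. On the invented sample, the tool claims 0.95 and is right half the time, so the gap is 0.45. A check like this describes the model, not the people using it.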

It is not sufficient on its own.

Even a perfectly calibrated AI system can produce overconfident employees if the organizational context, the interaction design, and the measurement infrastructure do not account for how people respond to authoritative outputs over time. Calibration addresses what the tool communicates about its own certainty. It does not address what happens to human judgment when that certainty becomes the default reference point for decision-making.

The problem with the social media tool was not only a technical one. It returned outputs without appropriate caveats about its semantic limitations, which is a calibration failure, but the deeper problem was that nobody had designed for the human side of the equation. Nobody had asked what happens to employee judgment when they interact with confident outputs day after day. Nobody had built any infrastructure for knowing whether employees could evaluate the outputs critically or whether they had quietly stopped trying.

That is a different problem from model calibration. It requires a different kind of measurement.

What RAG changes and what it does not

Organizations deploying systems based on retrieval-augmented generation (RAG), tools that retrieve and synthesize from organizational knowledge bases rather than generating from general training data, sometimes treat the architecture as a confidence solution. The thinking goes: if the tool is grounded in our actual documents, our policies, our content, the outputs will be more reliable and employees will be right to trust them more.

This is partially true and worth examining carefully.

RAG does meaningfully reduce certain categories of hallucination and out-of-domain error. A tool grounded in your organization’s actual knowledge base is less likely to confidently produce information that has nothing to do with your organizational reality. That is a genuine improvement.

What RAG does not solve is the retrieval quality problem. A RAG system is only as good as what it retrieves, and retrieval failures are often invisible to the employee receiving the output. The system found something. It synthesized it confidently. The employee has no way of knowing whether the retrieved content was the right content, the most current content, or the most relevant content for the specific question at hand. The output looks the same whether the retrieval was excellent or subtly wrong.
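One way to make retrieval quality less invisible is to show the employee what was retrieved and how well it matched, alongside the answer. The sketch below is an assumption-heavy illustration, not a description of any particular RAG product: it assumes the retriever returns a similarity score between 0 and 1 and a last-updated date for each source, and the thresholds are arbitrary placeholders. RetrievedChunk and retrieval_warnings are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RetrievedChunk:
    source: str          # document the chunk came from
    similarity: float    # retriever's relevance score, assumed to lie in [0, 1]
    last_updated: date   # when the source document was last revised


def retrieval_warnings(chunks, today, min_similarity=0.75, max_age_days=365):
    """Return plain-language warnings to display next to a RAG answer.

    The thresholds are placeholders; real values would need tuning
    against the organization's own documents and questions.
    """
    if not chunks:
        return ["No source documents were retrieved for this answer."]

    warnings = []
    best = max(chunk.similarity for chunk in chunks)
    if best < min_similarity:
        warnings.append(
            f"Weak match: best similarity {best:.2f} is below {min_similarity:.2f}."
        )
    for chunk in chunks:
        age_days = (today - chunk.last_updated).days
        if age_days > max_age_days:
            warnings.append(
                f"Possibly stale: {chunk.source} was last updated {age_days} days ago."
            )
    return warnings


# Invented example: one weakly matched, old policy document.
chunks = [RetrievedChunk("pricing-policy.md", 0.62, date(2023, 1, 15))]
for warning in retrieval_warnings(chunks, today=date(2026, 5, 27)):
    print(warning)
```

Flags like these only catch weak or stale retrievals. They cannot tell the employee whether a strongly matched, recent document was actually the right one for the question, which is why the judgment problem remains.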

An employee who learns to trust RAG outputs without understanding retrieval quality is not developing domain judgment. They are developing a sophisticated form of tool dependency that feels like expertise and breaks in specific, hard-to-anticipate ways when the retrieval fails quietly.

What this means for measurement

The confidence problem has a specific implication for how you design capability measurement in AI enablement programs.

You cannot measure capability by asking employees how confident they feel. Confidence is the thing you are trying to disaggregate from capability, not a proxy for it. An employee who feels very confident and an employee who is genuinely capable may look identical on a self-report measure and be doing entirely different things cognitively.

You cannot measure capability by measuring output quality alone either, at least not without comparison conditions. An employee producing high-quality outputs with AI assistance may be exercising genuine judgment or may be efficiently laundering the tool’s outputs into a format that passes organizational scrutiny. The output looks the same. The capability behind it is different.

What you actually need are measures that get at judgment specifically: the ability to evaluate an output critically, to identify when it is wrong or incomplete or outside the tool’s reliable range, and to make decisions that go beyond what the tool has already decided. Those are harder to measure than completion rates or satisfaction scores. They require more deliberate study design. They are also the only measures that will tell you what you actually need to know.
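As a rough illustration of what a judgment measure could look like, suppose employees review a set of tool outputs in which some are known to be wrong, and each employee either flags an output as unreliable or accepts it. This is one possible scoring sketch under that assumption, not the study design the next part builds. ReviewItem and judgment_rates are hypothetical names, and the data is invented.

```python
from dataclasses import dataclass


@dataclass
class ReviewItem:
    output_is_wrong: bool    # ground truth, set by whoever assembled the review set
    employee_flagged: bool   # did the employee flag the output as unreliable?


def judgment_rates(items):
    """Return (hit_rate, false_alarm_rate) for one employee's reviews.

    hit_rate: share of wrong outputs the employee flagged.
    false_alarm_rate: share of sound outputs the employee flagged anyway.
    """
    wrong = [item for item in items if item.output_is_wrong]
    sound = [item for item in items if not item.output_is_wrong]
    if not wrong or not sound:
        raise ValueError("review set needs both wrong and sound outputs")
    hit_rate = sum(item.employee_flagged for item in wrong) / len(wrong)
    false_alarm_rate = sum(item.employee_flagged for item in sound) / len(sound)
    return hit_rate, false_alarm_rate


# Invented example: four wrong outputs (one caught), four sound outputs (none flagged).
reviews = (
    [ReviewItem(True, True)]
    + [ReviewItem(True, False)] * 3
    + [ReviewItem(False, False)] * 4
)
print(judgment_rates(reviews))  # (0.25, 0.0)
```

Both rates matter. An employee who accepts everything scores zero on both, and an employee who flags everything scores one on both. Neither is exercising judgment, and neither pattern would show up in a self-reported confidence score.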

The next part builds that study design in detail. The confidence problem is the reason the measurement has to be as careful as it does.

The marketing team built campaigns on outputs the tool was never equipped to produce reliably. The tool was confident. The team was confident. The campaigns went out.

Confidence without calibration is not capability. It is a very specific kind of risk that most organizations are currently running without knowing it, and measuring it requires asking questions that feel uncomfortable precisely because the metrics you already have look fine.

That is the point.

This is the second part in The Capability Question, a series on measuring what AI enablement actually does to people. Start with the series introduction here. Next: What a Rigorous Study Actually Looks Like.

Ground Truth is written by Shannon Coleman, PhD, a product strategy leader and researcher with a background in psychology and quantitative methods. Her work spans govtech, healthtech, fintech, and consumer product contexts, with particular depth in enterprise AI platforms. Portfolio: shannonlcoleman.com

