Why can’t AI just admit when it doesn’t know? Uncertainty, calibration, and how to fix it
In a recent Reddit thread, a user asks why tools like Gemini, ChatGPT, Blackbox AI and Perplexity still struggle to say “I don’t know”. It’s a fair question—and one that matters as we start using AI for research, coding, and decisions that carry real risk.
Confident-sounding hallucinations feel worse than an honest “I’m not sure.”
Here’s what’s going on under the hood, why the problem persists, and how teams can design systems that know their limits.
How large language models handle uncertainty
Modern AI assistants are large language models (LLMs) built on the transformer architecture—a neural network that predicts the next token (a fragment of text) given context. They generate text one token at a time based on probabilities, not ground-truth verification. That means they have token-level uncertainty, but not a robust sense of when an overall answer is reliable.
Two important concepts:
- Calibration: Whether a model’s confidence matches reality (e.g., “80% confident” answers are right ~80% of the time). LLMs tend to be poorly calibrated out of the box.
- Alignment: The process (often via reinforcement learning from human feedback, or RLHF) of making the model helpful, harmless and honest. Alignment can unintentionally reward confident-sounding answers over cautious ones.
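Calibration can be quantified. A common metric is expected calibration error (ECE): bin predictions by confidence, then compare each bin’s average confidence with its actual accuracy. Here’s a minimal sketch in Python—the toy predictions at the end are invented purely to illustrate an overconfident model:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, compare each bin's
    average confidence against its accuracy, and take the
    bin-size-weighted average of the gaps."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Each prediction lands in exactly one bin (1.0 goes in the top bin).
        in_bin = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(avg_conf - accuracy)
    return ece

# Toy example: a model that claims 90% confidence but is right half the time.
confs = [0.9, 0.9, 0.9, 0.9]
hits = [1, 0, 1, 0]
print(expected_calibration_error(confs, hits))  # 0.4 -> poorly calibrated
```

A perfectly calibrated model scores 0; the gap between stated confidence (0.9) and observed accuracy (0.5) here is exactly what “poorly calibrated out of the box” means.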
Even if a model “knows it’s uncertain” at the token level, the way we decode text and the incentives used during training can still push it to sound sure of itself.
Why models dodge “I don’t know”
Training and feedback nudge models toward confidence
- Instruction tuning and RLHF: Systems are tuned to be helpful and complete tasks. In datasets and feedback, decisive answers often score higher than deferrals, so “I’m not sure” becomes rarer.
- Decoding choices: Settings like temperature and top-p sampling influence how deterministic or verbose a model is. Defaults often favour fluent, assertive prose.
Product UX and business incentives
- Refusals frustrate users: Teams optimise for fewer “I can’t answer that” moments, even if that increases the risk of mild hallucination.
- Benchmarks reward coverage: Many public benchmarks value answering widely, not abstaining accurately when unsure.
Architecture limits
- No built-in truth checking: LLMs don’t verify facts by default. Without tools or retrieval, they synthesise from patterns in training data.
- Token confidence ≠ answer confidence: A smooth, confident paragraph can be stitched from high-probability tokens while still being wrong.
Practical fixes: prompting, product design and system architecture
Good news: we can improve abstention and calibration without waiting for the next model release. Here’s what works in practice.
| Technique | What it does | Where it helps |
|---|---|---|
| Answerability classifier (selective prediction) | Pre-model or post-model check determines if the question should be answered or deferred | Prevents confident nonsense on novel or ambiguous queries |
| Retrieval-augmented generation (RAG) | Fetches documents first, then generates with citations | Grounds answers in sources; enables “no results found” abstentions |
| Structured uncertainty | Require a confidence score and a reason for uncertainty | Makes model uncertainty explicit and auditable |
| Logprobs thresholds | Use token probabilities to trigger abstention | Simple, works well for short-form classification or extraction |
| Temperature scaling | Post-hoc calibration on validation data | Improves probability calibration for classification tasks |
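The logprobs-threshold row of the table can be sketched in a few lines. Assuming your model API returns per-token log probabilities (most major APIs expose these), you can abstain whenever the average falls below a threshold tuned on validation data. The threshold value and example answers below are illustrative, not recommendations:

```python
import math

def should_abstain(token_logprobs, threshold=-1.0):
    """Abstain when the mean token log-probability is below a tuned
    threshold. The -1.0 default is purely illustrative; tune it on
    validation data for your task."""
    avg = sum(token_logprobs) / len(token_logprobs)
    return avg < threshold

def answer_or_defer(text, token_logprobs):
    # Wrap the model's draft answer: return it only if confidence clears the bar.
    if should_abstain(token_logprobs):
        return "I don't know - confidence below threshold."
    return text

# A confident answer: every token was near probability 1.
confident = [math.log(0.95), math.log(0.9), math.log(0.97)]
# A shaky answer: several low-probability tokens.
shaky = [math.log(0.2), math.log(0.3), math.log(0.5)]

print(answer_or_defer("Paris", confident))  # returns "Paris"
print(answer_or_defer("Lyon?", shaky))      # abstains
```

As the table notes, this works best for short-form classification or extraction—for long answers, a fluent paragraph can keep per-token probabilities high even when the claim is wrong, which is why it pairs well with retrieval or an answerability check.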
Helpful implementation patterns:
- Prompts: Instruct models to abstain when unsure, ask clarifying questions first, and provide sources. Require a final line labelled “Confidence: X%”.
- Tools-first generation: Force retrieval or a calculator before the model can answer, especially for finance, health, or legal use cases.
- Selective evaluation: Measure “accuracy @ 95% coverage” and “coverage at target accuracy”, not just raw accuracy, so abstention is rewarded.
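The selective-evaluation idea can be sketched as: rank predictions by confidence, answer only the top fraction, and score accuracy on what you kept. The toy data here is invented for illustration—the point is that raw accuracy and accuracy-at-coverage reward different behaviour:

```python
def accuracy_at_coverage(confidences, correct, coverage=0.95):
    """Answer only the most confident `coverage` fraction of queries,
    abstain on the rest, and report accuracy on the answered set."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: -pair[0])
    kept = ranked[: max(1, int(len(ranked) * coverage))]
    return sum(c for _, c in kept) / len(kept)

# Toy data: the low-confidence prediction is also the wrong one.
confs = [0.99, 0.95, 0.9, 0.85, 0.4]
hits = [1, 1, 1, 1, 0]
print(accuracy_at_coverage(confs, hits, coverage=1.0))  # 0.8: raw accuracy
print(accuracy_at_coverage(confs, hits, coverage=0.8))  # 1.0: abstention pays off
```

A system that abstains on its shakiest 20% of queries looks worse on raw accuracy benchmarks but better on the metric that actually matters in production—which is exactly why measuring only raw accuracy discourages “I don’t know”.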
If you’re building internal workflows—say, QA or reporting inside Google Sheets—add guardrails and abstentions before you trust outputs. For a simple integration walkthrough, see my guide to connecting ChatGPT to Google Sheets.
What this means for UK teams: risk, compliance and trust
In the UK, the stakes are high in regulated domains. The ICO’s guidance on AI and data protection expects transparency, human oversight, and accuracy appropriate to the context. In finance, law, and healthcare (think NHS triage and admin), a model that confidently guesses is a liability.
Practical considerations:
- Procurement: Ask vendors how their systems abstain, how calibration is measured, and whether you can set logprobs thresholds or require citations.
- Auditing: Keep records of prompts, sources, and confidence scores. Useful for compliance and internal QA.
- Data scope: RAG with explicit, approved corpora helps you limit where the model can “know” things and when it must say “no data”.
Will next-gen AIs be better at knowing their limits?
Progress is likely. Vendors are adding uncertainty outputs, tool-use by default, chain-of-verification, and better calibration techniques. Expect models to get more conservative when they lack sources, and product defaults to favour citing or abstaining over bluffing.
But it’s not just a model problem—it’s a system problem. You’ll get the biggest gains by combining a strong base model with answerability checks, retrieval, and clear product rules about when to say “I don’t know”.
Further reading
- InstructGPT and the role of RLHF: “Training language models to follow instructions”
- Calibration and temperature scaling: “On Calibration of Modern Neural Networks”
- Retrieval-augmented generation (RAG): “Retrieval-Augmented Generation for Knowledge-Intensive NLP”
- OpenAI text generation and logprobs: API guide
- UK government – AI Safety Institute overview: gov.uk