Why can’t AI just admit when it doesn’t know? Uncertainty, calibration, and how to fix it
In a recent Reddit thread, a user asks why tools like Gemini, ChatGPT, Blackbox AI and Perplexity still struggle to say “I don’t know”. It’s a fair question—and one that matters as we start using AI for research, coding, and decisions that carry real risk.
Confident-sounding hallucinations feel worse than an honest “I’m not sure.”
Here’s what’s going on under the hood, why the problem persists, and how teams can design systems that know their limits.
How large language models handle uncertainty
Modern AI assistants are large language models (LLMs) built on the transformer architecture—a neural network that predicts the next token (a fragment of text) given context. They generate text one token at a time based on probabilities, not ground-truth verification. That means they have token-level uncertainty, but not a robust sense of when an overall answer is reliable.
Two important concepts:
- Calibration: Whether a model’s confidence matches reality (e.g., “80% confident” answers are right ~80% of the time). LLMs tend to be poorly calibrated out of the box.
- Alignment: The process (often via reinforcement learning from human feedback, or RLHF) of making the model helpful, harmless and honest. Alignment can unintentionally reward confident-sounding answers over cautious ones.
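Calibration can be quantified. A common metric is expected calibration error (ECE): bin predictions by confidence, then compare each bin’s average confidence with its actual accuracy. Here’s a minimal sketch in Python—the toy predictions at the end are invented purely to illustrate an overconfident model:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, compare each bin's
    average confidence against its accuracy, and take the
    bin-size-weighted average of the gaps."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Each prediction lands in exactly one bin (1.0 goes in the top bin).
        in_bin = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(avg_conf - accuracy)
    return ece

# Toy example: a model that claims 90% confidence but is right half the time.
confs = [0.9, 0.9, 0.9, 0.9]
hits = [1, 0, 1, 0]
print(expected_calibration_error(confs, hits))  # 0.4 -> poorly calibrated
```

A perfectly calibrated model scores 0; the gap between stated confidence (0.9) and observed accuracy (0.5) here is exactly what “poorly calibrated out of the box” means.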
Even if a model “knows it’s uncertain” at the token level, the way we decode text and the incentives used during training can still push it to sound sure of itself.
Why models dodge “I don’t know”
Training and feedback nudge models toward confidence
- Instruction tuning and RLHF: Systems are tuned to be helpful and complete tasks. In datasets and feedback, decisive answers often score higher than deferrals, so “I’m not sure” becomes rarer.
- Decoding choices: Settings like temperature and top-p sampling influence how deterministic or verbose a model is. Defaults often favour fluent, assertive prose.
Product UX and business incentives
- Refusals frustrate users: Teams optimise for fewer “I can’t answer that” moments, even if that increases the risk of mild hallucination.
- Benchmarks reward coverage: Many public benchmarks value answering widely, not abstaining accurately when unsure.
Architecture limits
- No built-in truth checking: LLMs don’t verify facts by default. Without tools or retrieval, they synthesise from patterns in training data.
- Token confidence ≠ answer confidence: A smooth, confident paragraph can be stitched from high-probability tokens while still being wrong.
Practical fixes: prompting, product design and system architecture
Good news: we can improve abstention and calibration without waiting for the next model release. Here’s what works in practice.
| Technique | What it does | Where it helps |
|---|---|---|
| Answerability classifier (selective prediction) | Pre-model or post-model check determines if the question should be answered or deferred | Prevents confident nonsense on novel or ambiguous queries |
| Retrieval-augmented generation (RAG) | Fetches documents first, then generates with citations | Grounds answers in sources; enables “no results found” abstentions |
| Structured uncertainty | Require a confidence score and a reason for uncertainty | Makes model uncertainty explicit and auditable |
| Logprobs thresholds | Use token probabilities to trigger abstention | Simple, works well for short-form classification or extraction |
| Temperature scaling | Post-hoc calibration on validation data | Improves probability calibration for classification tasks |
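The logprobs-threshold row of the table can be sketched in a few lines. Assuming your model API returns per-token log probabilities (most major APIs expose these), you can abstain whenever the average falls below a threshold tuned on validation data. The threshold value and example answers below are illustrative, not recommendations:

```python
import math

def should_abstain(token_logprobs, threshold=-1.0):
    """Abstain when the mean token log-probability is below a tuned
    threshold. The -1.0 default is purely illustrative; tune it on
    validation data for your task."""
    avg = sum(token_logprobs) / len(token_logprobs)
    return avg < threshold

def answer_or_defer(text, token_logprobs):
    # Wrap the model's draft answer: return it only if confidence clears the bar.
    if should_abstain(token_logprobs):
        return "I don't know - confidence below threshold."
    return text

# A confident answer: every token was near probability 1.
confident = [math.log(0.95), math.log(0.9), math.log(0.97)]
# A shaky answer: several low-probability tokens.
shaky = [math.log(0.2), math.log(0.3), math.log(0.5)]

print(answer_or_defer("Paris", confident))  # returns "Paris"
print(answer_or_defer("Lyon?", shaky))      # abstains
```

As the table notes, this works best for short-form classification or extraction—for long answers, a fluent paragraph can keep per-token probabilities high even when the claim is wrong, which is why it pairs well with retrieval or an answerability check.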
Helpful implementation patterns:
- Prompts: Instruct models to abstain when unsure, ask clarifying questions first, and provide sources. Require a final line labelled “Confidence: X%”.
- Tools-first generation: Force retrieval or a calculator before the model can answer, especially for finance, health, or legal use cases.
- Selective evaluation: Measure “accuracy @ 95% coverage” and “coverage at target accuracy”, not just raw accuracy, so abstention is rewarded.
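The selective-evaluation idea can be sketched as: rank predictions by confidence, answer only the top fraction, and score accuracy on what you kept. The toy data here is invented for illustration—the point is that raw accuracy and accuracy-at-coverage reward different behaviour:

```python
def accuracy_at_coverage(confidences, correct, coverage=0.95):
    """Answer only the most confident `coverage` fraction of queries,
    abstain on the rest, and report accuracy on the answered set."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: -pair[0])
    kept = ranked[: max(1, int(len(ranked) * coverage))]
    return sum(c for _, c in kept) / len(kept)

# Toy data: the low-confidence prediction is also the wrong one.
confs = [0.99, 0.95, 0.9, 0.85, 0.4]
hits = [1, 1, 1, 1, 0]
print(accuracy_at_coverage(confs, hits, coverage=1.0))  # 0.8: raw accuracy
print(accuracy_at_coverage(confs, hits, coverage=0.8))  # 1.0: abstention pays off
```

A system that abstains on its shakiest 20% of queries looks worse on raw accuracy benchmarks but better on the metric that actually matters in production—which is exactly why measuring only raw accuracy discourages “I don’t know”.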
If you’re building internal workflows—say, QA or reporting inside Google Sheets—add guardrails and abstentions before you trust outputs. For a simple integration walkthrough, see my guide to connecting ChatGPT to Google Sheets.
What this means for UK teams: risk, compliance and trust
In the UK, the stakes are high in regulated domains. The ICO’s guidance on AI and data protection expects transparency, human oversight, and accuracy appropriate to the context. In finance, law, and healthcare (think NHS triage and admin), a model that confidently guesses is a liability.
Practical considerations:
- Procurement: Ask vendors how their systems abstain, how calibration is measured, and whether you can set logprobs thresholds or require citations.
- Auditing: Keep records of prompts, sources, and confidence scores. Useful for compliance and internal QA.
- Data scope: RAG with explicit, approved corpora helps you limit where the model can “know” things and when it must say “no data”.
Will next-gen AIs be better at knowing their limits?
Progress is likely. Vendors are adding uncertainty outputs, tool-use by default, chain-of-verification, and better calibration techniques. Expect models to get more conservative when they lack sources, and product defaults to favour citing or abstaining over bluffing.
But it’s not just a model problem—it’s a system problem. You’ll get the biggest gains by combining a strong base model with answerability checks, retrieval, and clear product rules about when to say “I don’t know”.
Further reading
- InstructGPT and the role of RLHF: “Training language models to follow instructions”
- Calibration and temperature scaling: “On Calibration of Modern Neural Networks”
- Retrieval-augmented generation (RAG): “Retrieval-Augmented Generation for Knowledge-Intensive NLP”
- OpenAI text generation and logprobs: API guide
- UK government – AI Safety Institute overview: gov.uk