From AI true believer to sceptic: what this Reddit post is really saying
A widely shared Reddit post argues that large language models (LLMs) have a reliability crisis. The author says previously “perfect” automations now fail, vendor updates quietly break workflows, and guardrails cost more than hiring humans. They also worry about the lack of auditability while AI systems touch hiring, healthcare and credit.
“Nothing is reliable.”
“It’s like building on quicksand.”
“You can’t version-lock intelligence that doesn’t actually understand what it’s doing.”
The post is frank, frustrated and, in places, hyperbolic. But the underlying issues – non-determinism, model drift, hallucinations and weak governance – are real. Here’s a balanced take and what it means for UK organisations.
Why LLM reliability feels so slippery: version drift, hallucinations and non-determinism
Version drift: your working prompt breaks overnight
Cloud LLMs change. Providers retrain, fine-tune and deprecate models with limited notice. Even if an API identifier stays the same, behaviour can shift. That’s a problem when your logic depends on very specific outputs or refusal patterns. The Redditor claims past workflows “ran perfectly” and are now “useless” on newer models – the specific versions aren’t verified, but the pattern is familiar to many teams.
Mitigations exist (see below), but you can’t truly freeze a black-box model whose owner can update it at will. That’s the trade-off for using managed AI services.
Hallucinations and context limits
LLMs predict text; they don’t “know” in the human sense. Hallucinations – confident-sounding but false statements – crop up more when prompts are vague, when the model is asked to infer facts beyond provided sources, or when the context window (the text the model can read in a single request) isn’t well curated. Bigger context windows help, but they don’t guarantee accurate recall of everything in them.
Non-determinism and reproducibility
Even at low temperature (a setting that reduces randomness), responses can vary. Some providers offer a seed parameter to stabilise sampling, but it’s not a silver bullet and isn’t universal. If your workflow needs exact reproducibility, unconstrained generation is a poor foundation.
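One way to see how much this matters for a given prompt is to measure it. The sketch below is a minimal illustration, assuming a hypothetical `generate` callable that wraps whatever provider call you use; `flaky_stub` is a stand-in that cycles through canned replies, not a real model.

```python
import itertools
from collections import Counter
from typing import Callable


def agreement_rate(generate: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Share of runs that returned the most common output (1.0 means fully consistent)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    outputs = [generate(prompt).strip() for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs


# Hypothetical stand-in for a model call: it ignores the prompt and cycles
# through canned replies so the example behaves the same way on every run.
_replies = itertools.cycle(["approve", "approve", "decline", "approve"])


def flaky_stub(prompt: str) -> str:
    return next(_replies)
```

Running a check like this on a handful of representative prompts gives a rough sense of whether a task is stable enough to automate, or whether it needs tighter constraints first.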
Does this mean LLMs are “rotting from the inside”?
Not quite. We are likely seeing diminishing returns from scaling alone and lots of rushed productisation. But it’s also true that teams shipping LLM features without software engineering rigour are rediscovering why testing, contracts and change control matter. AI isn’t exempt from those basics – in fact, it needs more of them.
Real gains are still happening in focused niches: question answering grounded in your documents (RAG – retrieval-augmented generation), code review support, meeting summarisation, call notes, and internal search. They work when inputs are constrained, outputs are validated, and humans are kept in the loop.
UK implications: governance, audit and legal risk
For UK organisations, the governance gaps highlighted in the post are more than an annoyance – they touch data protection and accountability duties.
- Automated decisions: Under UK GDPR, individuals have rights related to solely automated decisions with legal or similarly significant effects, including the right to human intervention. Build for that from day one.
- Data protection and DPIAs: The ICO expects Data Protection Impact Assessments for high-risk AI uses and clear purpose limitation, data minimisation and transparency. See the ICO’s guidance on AI and data protection.
- Public sector transparency: The UK Algorithmic Transparency Recording Standard sets expectations for documenting automated systems in government.
- Vendor risk: Check data retention, training-on-your-data settings, model change policies and exit plans. Sign a robust data processing agreement if personal data is involved.
Useful references: ICO guidance on AI and data protection, and the Algorithmic Transparency Recording Standard.
How to build robust LLM workflows despite drift and hallucinations
1) Pin, test and stage
- Version pinning: Where possible, select a specific model version, not a moving alias. Track prompts and system instructions as versioned config.
- Shadow deployments: Run new models and prompts against a held-out test set, and where you can alongside live traffic without serving their output, before promotion. Compare accuracy, refusal rates, cost and latency.
- Fallbacks: Maintain a safe fallback model and a simpler, deterministic path when confidence is low or validation fails.
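The pinning and promotion steps above can be sketched as a small regression gate. This is an illustration under stated assumptions, not a definitive implementation: the model identifiers are placeholders rather than real provider version strings, and `stub_current` and `stub_candidate` are hypothetical stand-ins for real API calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptConfig:
    """Versioned prompt settings, intended to live in source control."""
    model_id: str  # a dated snapshot name rather than a moving alias
    prompt_version: str
    system_prompt: str
    temperature: float = 0.0


# Any callable that takes a config and an input text and returns the reply.
ModelFn = Callable[[PromptConfig, str], str]


def pass_rate(model: ModelFn, config: PromptConfig, golden: list) -> float:
    """Fraction of (input, expected) pairs where the reply matches exactly."""
    if not golden:
        raise ValueError("golden set must not be empty")
    hits = sum(1 for text, expected in golden if model(config, text).strip() == expected)
    return hits / len(golden)


def should_promote(current_rate: float, candidate_rate: float, tolerance: float = 0.02) -> bool:
    """Allow promotion only if the candidate is within tolerance of the current model."""
    return candidate_rate >= current_rate - tolerance


# Placeholder identifiers (assumptions for illustration only).
CURRENT = PromptConfig("example-model-2024-01-01", "v3", "Classify the message.")
CANDIDATE = PromptConfig("example-model-2024-06-01", "v3", "Classify the message.")

GOLDEN = [
    ("I want a refund", "refund"),
    ("Can I get my money back?", "refund"),
    ("Where is my parcel?", "other"),
    ("Refund please", "refund"),
]


def stub_current(config: PromptConfig, text: str) -> str:
    lowered = text.lower()
    return "refund" if "refund" in lowered or "money back" in lowered else "other"


def stub_candidate(config: PromptConfig, text: str) -> str:
    # Simulates a regression: the newer model no longer recognises "money back".
    return "refund" if "refund" in text.lower() else "other"


current_rate = pass_rate(stub_current, CURRENT, GOLDEN)
candidate_rate = pass_rate(stub_candidate, CANDIDATE, GOLDEN)
print(current_rate, candidate_rate, should_promote(current_rate, candidate_rate))
```

With these stubs the candidate misses one of the four golden cases, so the gate blocks promotion. Exact string matching suits classification-style tasks; free-text tasks would need a looser comparison.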
2) Constrain generation and validate outputs
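- Low temperature and structured outputs: Keep temperature low and use JSON or function/tool calling to steer outputs towards a fixed schema. Validate every response with a strict parser, because not every structured mode guarantees schema compliance.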
- Deterministic post-processing: Keep parsing, business rules and calculations outside the model in normal code where possible.
- Guardrails: Add rule-based checks (regex, schema, allowlists) and require citations to provided sources for claims.
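A strict parser of the kind described above might look like the following sketch. It uses only the Python standard library; the two-field schema (`category`, `urgent`) and the allowed categories are assumptions made up for the example.

```python
import json

# Hypothetical allowlist for an illustrative support-triage task.
ALLOWED_CATEGORIES = {"billing", "technical", "other"}
EXPECTED_KEYS = {"category", "urgent"}


class OutputValidationError(ValueError):
    """Raised when model output does not match the expected schema."""


def parse_triage(raw: str) -> dict:
    """Parse model output and reject anything that deviates from the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputValidationError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise OutputValidationError("expected a JSON object")
    if set(data) != EXPECTED_KEYS:
        # Report both missing and unexpected keys.
        raise OutputValidationError(f"key mismatch: {sorted(set(data) ^ EXPECTED_KEYS)}")
    category = data["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise OutputValidationError(f"category not allowed: {category!r}")
    if not isinstance(data["urgent"], bool):
        raise OutputValidationError("urgent must be true or false")
    return data
```

Anything that raises `OutputValidationError` can then be retried, sent to a fallback path or queued for a person, rather than passed downstream.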
3) Ground the model with RAG, don’t ask it to “just know”
- Retrieval first: Pull the smallest set of relevant passages, then ask the model to answer “based only on” those.
- Short contexts: Pack only what’s needed. Long, noisy contexts increase error rates and cost.
- Provenance: Return source links and snippets so a human can audit the answer.
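The grounding pattern above can be sketched without committing to any retrieval library. The code below assumes retrieval has already happened and the passages arrive as a dictionary keyed by a source ID; the prompt wording and the square-bracket citation format are illustrative choices, not a standard.

```python
import re

# Matches citations written as [source-id].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def build_grounded_prompt(question: str, passages: dict) -> str:
    """Build a prompt that restricts the model to the supplied passages."""
    sources = "\n".join(f"[{sid}] {text}" for sid, text in passages.items())
    return (
        "Answer based only on the sources below. "
        "Cite the source ID in square brackets after each claim. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def check_citations(answer: str, passages: dict) -> tuple:
    """Return (ok, unknown_ids): ok needs at least one citation and no unknown IDs."""
    cited = set(CITATION.findall(answer))
    unknown = cited - set(passages)
    return bool(cited) and not unknown, unknown
```

This check only confirms that the answer cites sources that were actually supplied. It does not prove the cited passage supports the claim, which is why returning the snippets for human audit still matters.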
4) Human-in-the-loop beats blind automation
- Exception queues: Route low-confidence or high-impact cases to human review. Log reasons and iterate on prompts.
- Sampling: Periodically sample outputs for quality and bias, especially for HR, credit or healthcare-adjacent tasks.
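A routing rule for the exception queue and sampling described above could be as simple as the sketch below. Because many APIs do not expose a calibrated confidence score, it assumes validation failure as the stand-in for "low confidence"; the `Result` fields and the 5% sample rate are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    task: str
    output: str
    valid: bool        # did the output pass schema and citation checks?
    high_impact: bool  # e.g. touches HR, credit or healthcare-adjacent decisions


def route(result: Result, sample_rate: float = 0.05,
          rng: Optional[random.Random] = None) -> tuple:
    """Return (destination, reason) so the reason can be logged and reviewed."""
    rng = rng or random.Random()
    if not result.valid:
        return "human_review", "failed validation"
    if result.high_impact:
        return "human_review", "high-impact case"
    if rng.random() < sample_rate:
        return "human_review", "random quality sample"
    return "auto", "passed checks"
```

Logging the reason alongside the destination makes it easier to see which prompts generate the most exceptions and where to iterate.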
5) Observability and drift alerts
- Metrics: Track answer accuracy (against labelled tests), refusal rate, hallucination rate (e.g., failed citation checks), latency and per-task cost.
- Logging: Store prompts, model IDs and outputs, with PII minimised and access restricted. Alert on significant metric shifts.
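As a rough illustration of a drift alert, the sketch below compares a current metrics snapshot with a baseline and flags any rate that has moved by more than a threshold. The metric names follow the list above; the 0.05 threshold is an arbitrary assumption you would tune per task.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float              # share of labelled test cases answered correctly
    refusal_rate: float          # share of requests the model declined
    failed_citation_rate: float  # proxy for hallucination rate


def drift_alerts(baseline: Metrics, current: Metrics, max_shift: float = 0.05) -> list:
    """Return a message for every metric that moved by more than max_shift."""
    alerts = []
    for name in ("accuracy", "refusal_rate", "failed_citation_rate"):
        before = getattr(baseline, name)
        after = getattr(current, name)
        if abs(after - before) > max_shift:
            alerts.append(f"{name} moved from {before:.2f} to {after:.2f}")
    return alerts
```

The check flags movement in either direction on purpose: a sudden drop in refusal rate after a vendor update can be as significant as a drop in accuracy.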
6) Vendor management and portability
- Document dependencies: Keep a list of models, versions and features you rely on (JSON mode, tool calling, vision, etc.).
- Abstraction layer: Use a lightweight interface so you can swap providers or slot in an open-weight model for specific tasks if needed.
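A lightweight interface of the kind suggested above might look like this sketch. The `TextModel` protocol and both adapter classes are hypothetical; a real adapter would wrap a provider SDK or a locally hosted open-weight model behind the same `complete` method.

```python
from typing import Protocol


class TextModel(Protocol):
    """Minimal contract that application code depends on."""
    name: str

    def complete(self, prompt: str) -> str: ...


class FailingHostedStub:
    """Stand-in for a hosted provider adapter that is currently unavailable."""
    name = "hosted-stub"

    def complete(self, prompt: str) -> str:
        raise RuntimeError("provider unavailable")


class LocalStub:
    """Stand-in for a local model adapter."""
    name = "local-stub"

    def complete(self, prompt: str) -> str:
        return f"[local] {prompt[:20]}"


def complete_with_fallback(models: list, prompt: str) -> tuple:
    """Try each model in order; return (model_name, reply) from the first that works."""
    errors = []
    for model in models:
        try:
            return model.name, model.complete(prompt)
        except Exception as exc:  # broad on purpose: any provider failure triggers fallback
            errors.append(f"{model.name}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model name with the reply means logs record which model actually answered, which helps when comparing behaviour across providers.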
Control levers that actually help
| Lever | What it controls | Typical approach |
|---|---|---|
| Temperature | Randomness/creativity | 0.0–0.3 for production tasks |
| Structured output | Schema compliance | JSON or function/tool calling + strict parser |
| Retrieval | Grounding in your data | RAG with top-k passages and citations |
| Seed (if available) | Reproducibility | Fix for tests, not a guarantee in prod |
| Evaluations | Quality over time | Golden datasets, regression tests, drift alerts |
Where LLMs still deliver value – and where they don’t
- Good bets: Drafting, summarising, knowledge retrieval with citations, code suggestions reviewed by engineers, data cleaning suggestions, customer support triage.
- High risk: Solely automated decisions affecting people’s rights or finances; anything requiring strict reproducibility without human oversight.
If you’re automating office tasks and reporting via spreadsheets, start small and observable. I’ve outlined a safe pattern in my guide on connecting ChatGPT to Google Sheets – the principles (validation, logging, fallbacks) apply broadly.
Final thought: frustration justified, solutions available
The Reddit post captures something many teams feel: LLMs are powerful but fickle, and vendor changes can undo your work. That’s not a reason to abandon them; it’s a reason to treat them like probabilistic components that demand testing, constraints and governance.
If you put auditability first, keep humans in the loop, and refuse to ship “magic”, you can get steady value without the quicksand. If you need perfect accuracy and reproducibility, use conventional software – and use LLMs around the edges where they shine.