The Reliability Crisis in LLMs: Version Drift, Hallucinations and How to Build Robust Workflows

LLMs face a reliability crisis from version drift and hallucinations; learn how to build robust workflows to mitigate these issues.

10 May 2026 · by Joshua Thompson · 6 min read · 23 views

From AI true believer to sceptic: what this Reddit post is really saying

A widely shared Reddit post argues that large language models (LLMs) have a reliability crisis. The author says previously “perfect” automations now fail, vendor updates quietly break workflows, and guardrails cost more than hiring humans. They also worry about the lack of auditability while AI systems touch hiring, healthcare and credit.

“Nothing is reliable.”

“It’s like building on quicksand.”

“You can’t version-lock intelligence that doesn’t actually understand what it’s doing.”

The post is frank, frustrated and, in places, hyperbolic. But the underlying issues – non-determinism, model drift, hallucinations and weak governance – are real. Here’s a balanced take and what it means for UK organisations.

Why LLM reliability feels so slippery: version drift, hallucinations and non-determinism

Version drift: your working prompt breaks overnight

Cloud LLMs change. Providers retrain, fine-tune and deprecate models with limited notice. Even if an API identifier stays the same, behaviour can shift. That’s a problem when your logic depends on very specific outputs or refusal patterns. The Redditor claims past workflows “ran perfectly” and are now “useless” on newer models – the specific versions aren’t verified, but the pattern is familiar to many teams.

Mitigations exist (see below), but you can’t truly freeze a black-box model whose owner can update it at will. That’s the trade-off for using managed AI services.

Hallucinations and context limits

LLMs predict text; they don’t “know” in the human sense. Hallucinations – confident-sounding but false statements – crop up more when prompts are vague, when the model is asked to infer facts beyond provided sources, or when the context window (the temporary memory the model reads) isn’t well curated. Bigger context windows help, but they don’t guarantee recall accuracy.

Non-determinism and reproducibility

Even at low temperature (a setting that reduces randomness), responses can vary. Some providers offer a seed parameter to stabilise sampling, but it’s not a silver bullet and isn’t universal. If your workflow needs exact reproducibility, unconstrained generation is a poor foundation.
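
One way to see how much a given prompt varies is to run it several times and count the distinct responses. The sketch below assumes a hypothetical `generate` callable standing in for whatever API call you use; the two stubs only simulate a stable and an unstable model.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def output_distribution(
    generate: Callable[[str], str], prompt: str, runs: int = 5
) -> Counter:
    """Call the model `runs` times and count each distinct (stripped) response."""
    return Counter(generate(prompt).strip() for _ in range(runs))


def is_reproducible(
    generate: Callable[[str], str], prompt: str, runs: int = 5
) -> bool:
    """True only if every run produced exactly the same text."""
    return len(output_distribution(generate, prompt, runs)) == 1


# Stubs for illustration only; a real `generate` would call your provider.
_variants = cycle(["Approved", "Approved.", "Approved"])


def flaky_stub(prompt: str) -> str:
    """Simulates a model whose wording shifts slightly between calls."""
    return next(_variants)


def stable_stub(prompt: str) -> str:
    """Simulates a fully deterministic component."""
    return "Approved"
```

A check like this is cheap to run in a test suite. If exact string equality matters downstream, even a stray full stop counts as a failure, which is a hint to parse and normalise the output rather than compare raw text.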

Does this mean LLMs are “rotting from the inside”?

Not quite. We are likely seeing diminishing returns from scaling alone and lots of rushed productisation. But it’s also true that teams shipping LLM features without software engineering rigour are rediscovering why testing, contracts and change control matter. AI isn’t exempt from those basics – in fact, it needs more of them.

Real gains are still happening in focused niches: question answering grounded on your documents (RAG – retrieval-augmented generation), code review support, meeting summarisation, call notes, and internal search. They work when inputs are constrained, outputs are validated, and humans are kept in the loop.

UK implications: governance, audit and legal risk

For UK organisations, the governance gaps highlighted in the post are more than an annoyance – they touch data protection and accountability duties.

Automated decisions: Under UK GDPR, individuals have rights related to solely automated decisions with legal or similarly significant effects, including the right to human intervention. Build for that from day one.
Data protection and DPIAs: The ICO expects Data Protection Impact Assessments for high-risk AI uses and clear purpose limitation, data minimisation and transparency. See the ICO’s guidance on AI and data protection.
Public sector transparency: The UK Algorithmic Transparency Recording Standard sets expectations for documenting automated systems in government.
Vendor risk: Check data retention, training-on-your-data settings, model change policies and exit plans. Sign a robust data processing agreement if personal data is involved.

Useful references: ICO guidance on AI and data protection, and the Algorithmic Transparency Recording Standard.

How to build robust LLM workflows despite drift and hallucinations

1) Pin, test and stage

Version pinning: Where possible, select a specific model version, not a moving alias. Track prompts and system instructions as versioned config.
Shadow deployments: Test new models and prompts against a held-out test set before promotion. Compare accuracy, refusal rates, cost and latency.
Fallbacks: Maintain a safe fallback model and a simpler, deterministic path when confidence is low or validation fails.
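
As a rough sketch of the pin-and-test idea, the snippet below keeps the prompt and model ID in versioned config and only promotes a candidate model if it does not regress on a small golden set. Everything here is an assumption for illustration: the model ID is a placeholder, the labels are invented, and the two "models" are stubs rather than real API calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptConfig:
    """Versioned prompt configuration, kept in source control."""
    model_id: str        # a pinned, dated version rather than a moving alias
    system_prompt: str
    temperature: float
    config_version: str


# Placeholder values; substitute your provider's real pinned identifier.
CONFIG = PromptConfig(
    model_id="example-model-2026-01-15",
    system_prompt="Classify the ticket as billing, technical or account.",
    temperature=0.0,
    config_version="v3",
)

# Held-out golden set: (input, expected label).
GOLDEN_SET = [
    ("Refund not received after 10 days", "billing"),
    ("App crashes when I open settings", "technical"),
    ("How do I change my email address?", "account"),
]


def accuracy(classify: Callable[[str], str]) -> float:
    """Fraction of golden-set examples the classifier gets right."""
    hits = sum(1 for text, expected in GOLDEN_SET if classify(text) == expected)
    return hits / len(GOLDEN_SET)


def should_promote(
    current: Callable[[str], str],
    candidate: Callable[[str], str],
    max_regression: float = 0.0,
) -> bool:
    """Promote only if the candidate is no worse than current, within tolerance."""
    return accuracy(candidate) >= accuracy(current) - max_regression


def current_model(text: str) -> str:
    """Stub for the model currently in production."""
    lowered = text.lower()
    if "refund" in lowered:
        return "billing"
    if "crash" in lowered:
        return "technical"
    return "account"


def candidate_model(text: str) -> str:
    """Stub for a newer model that has lost the 'technical' label."""
    return "billing" if "refund" in text.lower() else "account"
```

In practice the golden set would be larger and the comparison would also cover refusal rate, cost and latency, but the gate itself can stay this simple.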

2) Constrain generation and validate outputs

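Low temperature and structured outputs: Keep temperature low and use JSON or function/tool calling to steer the model towards a fixed schema. Then validate every response with a strict parser rather than trusting the model to comply.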
Deterministic post-processing: Keep parsing, business rules and calculations outside the model in normal code where possible.
Guardrails: Add rule-based checks (regex, schema, allowlists) and require every factual claim to cite one of the provided sources.
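
The validation step can be sketched with the standard library alone. The schema below (a `category` from an allowlist plus a `confidence` between 0 and 1) is a made-up example, not something from a particular provider; the point is that anything which does not match the contract is rejected rather than patched up.

```python
import json

# Hypothetical contract for a ticket-classification task.
ALLOWED_CATEGORIES = {"billing", "technical", "account"}
REQUIRED_KEYS = {"category", "confidence"}


class OutputValidationError(ValueError):
    """Raised when model output does not match the expected contract."""


def parse_ticket(raw: str) -> dict:
    """Parse model output strictly; raise instead of guessing at intent."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputValidationError(f"not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise OutputValidationError("expected a JSON object")
    if set(data) != REQUIRED_KEYS:
        raise OutputValidationError(f"unexpected keys: {sorted(data)}")

    category, confidence = data["category"], data["confidence"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise OutputValidationError(f"unknown category: {category!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise OutputValidationError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise OutputValidationError("confidence must be between 0 and 1")

    return {"category": category, "confidence": float(confidence)}
```

A caller would typically catch `OutputValidationError` and either retry once or drop to the deterministic fallback path, logging the raw output for later review.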

3) Ground the model with RAG, don’t ask it to “just know”

Retrieval first: Pull the smallest set of relevant passages, then ask the model to answer “based only on” those.
Short contexts: Pack only what’s needed. Long, noisy contexts increase error rates and cost.
Provenance: Return source links and snippets so a human can audit the answer.
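
The three steps above can be sketched as follows. This is a deliberately naive illustration: retrieval is plain keyword overlap rather than embeddings, the documents and IDs are invented, and the model call itself is left out. The parts worth copying are the "based only on" prompt and the check that every citation points at a source that was actually supplied.

```python
import re

# Invented documents for illustration.
DOCS = {
    "policy-1": "Refunds are issued within 14 days of a returned item being received.",
    "policy-2": "Standard delivery takes 3 to 5 working days within the UK.",
    "policy-3": "Gift cards cannot be exchanged for cash.",
}


def tokens(text: str) -> set:
    """Lowercased word set; a stand-in for a real embedding or search index."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, k: int = 2) -> list:
    """Return up to k document IDs that share at least one word with the question."""
    q = tokens(question)
    ranked = sorted(DOCS, key=lambda d: len(q & tokens(DOCS[d])), reverse=True)
    return [d for d in ranked[:k] if q & tokens(DOCS[d])]


def build_prompt(question: str, doc_ids: list) -> str:
    """Pack only the retrieved passages and instruct the model to stay within them."""
    sources = "\n".join(f"[{d}] {DOCS[d]}" for d in doc_ids)
    return (
        "Answer based only on the sources below. "
        "Cite source IDs in square brackets. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def citations_valid(answer: str, doc_ids: list) -> bool:
    """True if the answer cites at least one source and only supplied sources."""
    cited = set(re.findall(r"\[([\w-]+)\]", answer))
    return bool(cited) and cited <= set(doc_ids)
```

An answer that fails `citations_valid` is a reasonable proxy for a possible hallucination: it either makes claims without provenance or cites something it was never shown.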

4) Keep humans in the loop

Exception queues: Route low-confidence or high-impact cases to human review. Log reasons and iterate on prompts.
Sampling: Periodically sample outputs for quality and bias, especially for HR, credit or healthcare-adjacent tasks.
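
A minimal routing sketch for an exception queue is shown below. The task names, the 0.8 threshold and the `confidence` field are assumptions; where the confidence score comes from (a validator, a classifier, agreement between runs) depends on your setup.

```python
from dataclasses import dataclass, field

# Hypothetical list of task types that always get human review.
HIGH_IMPACT_TASKS = {"hiring", "credit", "healthcare"}


@dataclass
class Result:
    task: str
    output: str
    confidence: float  # assumed score in [0, 1] from your own checks
    validated: bool    # did the output pass schema and rule checks?


@dataclass
class ReviewQueue:
    """In-memory stand-in for a real ticketing or review system."""
    items: list = field(default_factory=list)

    def add(self, result: Result, reason: str) -> None:
        # Store the reason so reviewers and prompt authors can see patterns.
        self.items.append((result, reason))


def route(result: Result, queue: ReviewQueue, min_confidence: float = 0.8) -> str:
    """Return 'auto' for safe cases, otherwise queue for human review."""
    if result.task in HIGH_IMPACT_TASKS:
        reason = "high-impact task"
    elif not result.validated:
        reason = "failed validation"
    elif result.confidence < min_confidence:
        reason = "low confidence"
    else:
        return "auto"
    queue.add(result, reason)
    return "human_review"
```

Logging the reason alongside each queued item gives you the data to iterate on prompts: a spike in "failed validation" points at the output contract, while a spike in "low confidence" points at the inputs.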

5) Observability and drift alerts

Metrics: Track answer accuracy (against labelled tests), refusal rate, hallucination rate (e.g., failed citation checks), latency and per-task cost.
Logging: Store prompts, model IDs, and outputs with PII minimised and access-controlled. Alert on significant metric shifts.
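
A drift alert can be as simple as comparing today's metrics with a stored baseline. The sketch below assumes metrics are already expressed as rates between 0 and 1 and uses an arbitrary tolerance of 0.05; both the metric names and the threshold are illustrative.

```python
from statistics import mean


def rate(flags: list) -> float:
    """Share of True values, e.g. failed citation checks over sampled answers."""
    return mean(1.0 if flag else 0.0 for flag in flags)


def drift_alerts(baseline: dict, current: dict, tolerance: float = 0.05) -> list:
    """List every baseline metric that is missing or has moved beyond tolerance."""
    alerts = []
    for name, base in baseline.items():
        now = current.get(name)
        if now is None:
            alerts.append(f"{name}: missing from current metrics")
        elif abs(now - base) > tolerance:
            alerts.append(f"{name}: {base:.2f} -> {now:.2f}")
    return alerts


# Example values, not real measurements.
BASELINE = {"accuracy": 0.92, "refusal_rate": 0.03, "hallucination_rate": 0.02}
TODAY = {
    "accuracy": 0.84,
    "refusal_rate": 0.04,
    # One failed citation check out of four sampled answers.
    "hallucination_rate": rate([True, False, False, False]),
}
```

Calling `drift_alerts(BASELINE, TODAY)` on these example values flags accuracy and hallucination rate but not refusal rate. A fixed tolerance is crude; teams with enough volume may prefer a statistical test, but a crude alert that fires is better than none.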

6) Vendor management and portability

Document dependencies: Keep a list of models, versions and features you rely on (JSON mode, tool calling, vision, etc.).
Abstraction layer: Use a lightweight interface so you can swap providers or slot in an open-weight model for specific tasks if needed.
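
An abstraction layer does not need to be a framework. The sketch below defines a small interface with `typing.Protocol` and tries providers in order. The provider classes are stubs with invented names; real implementations would wrap each vendor's SDK behind the same `complete` method.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface the rest of the codebase is allowed to depend on."""
    name: str

    def complete(self, prompt: str) -> str: ...


class FailingProvider:
    """Stub simulating a primary provider that is down or has changed."""
    name = "primary-stub"

    def complete(self, prompt: str) -> str:
        raise TimeoutError("simulated outage")


class EchoProvider:
    """Stub simulating a simpler fallback, e.g. a self-hosted open-weight model."""
    name = "fallback-stub"

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


def complete_with_fallback(prompt: str, providers: list) -> tuple:
    """Try each provider in order; return (provider name, output) from the first success."""
    errors = []
    for provider in providers:
        try:
            return provider.name, provider.complete(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name with the output matters for auditability: logs should record which model actually answered, not just which one was requested.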

Control levers that actually help

Lever	What it controls	Typical approach
Temperature	Randomness/creativity	0.0–0.3 for production tasks
Structured output	Schema compliance	JSON or function/tool calling + strict parser
Retrieval	Grounding in your data	RAG with top-k passages and citations
Seed (if available)	Reproducibility	Fix for tests, not a guarantee in prod
Evaluations	Quality over time	Golden datasets, regression tests, drift alerts

Where LLMs still deliver value – and where they don’t

Good bets: Drafting, summarising, knowledge retrieval with citations, code suggestions reviewed by engineers, data cleaning suggestions, customer support triage.
High risk: Solely automated decisions affecting people’s rights or finances; anything requiring strict reproducibility without human oversight.

If you’re automating office tasks and reporting via spreadsheets, start small and observable. I’ve outlined a safe pattern in my guide on connecting ChatGPT to Google Sheets – the principles (validation, logging, fallbacks) apply broadly.

Final thought: frustration justified, solutions available

The Reddit post captures something many teams feel: LLMs are powerful but fickle, and vendor changes can undo your work. That’s not a reason to abandon them; it’s a reason to treat them like probabilistic components that demand testing, constraints and governance.

If you put auditability first, keep humans in the loop, and refuse to ship “magic”, you can get steady value without the quicksand. If you need perfect accuracy and reproducibility, use conventional software – and use LLMs around the edges where they shine.

Tagged

Model Agnostic

Last updated

5 July 2026
