Are We Betting on the Wrong Kind of AI? LLMs vs Superlearners and the Return of Reinforcement Learning

Is the AI industry placing too much emphasis on large language models over alternative approaches like superlearners and reinforcement learning?

Written by Joshua · 6 minute read
LLMs vs superlearners: are we backing the wrong kind of AI?

A recent Reddit thread asks a sharp question: are today’s large language models (LLMs) a dead end, and is the future actually in reinforcement learning (RL) “superlearners”? The prompt cites David Silver (known for AlphaGo) arguing that models trained on human data will hit a ceiling, while agents trained in simulated environments could keep discovering new knowledge by themselves.

“Current AI… might hit a ceiling because they learn from human-generated data.”

It also notes a reported raise of “around $1.1B” for a new venture pursuing this RL-first direction. That figure is from the Reddit post; official numbers are not disclosed here. Regardless of the exact sum, the idea is what matters: shift from static internet data to dynamic, trial-and-error learning at scale.

You can read and join the original discussion here: Are we betting on the wrong kind of AI? (LLMs vs superlearners).

What are LLMs actually good at – and where do they struggle?

LLMs are based on the transformer architecture, which excels at predicting the next token (the next small chunk of text) after training on vast corpora of human-written content. They’re brilliant at language tasks: drafting, summarising, translating, and pattern-matching across knowledge already expressed online or in documents.
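To make “next-token prediction” concrete, here is a minimal sketch using the open-source Hugging Face transformers library, with GPT-2 as a small stand-in for the frontier models the debate is actually about:

```python
# A minimal sketch of next-token prediction. GPT-2 is used purely as a
# small, freely available stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Reinforcement learning is", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (batch, seq_len, vocab_size)

# The model's whole job: a probability distribution over the next token.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}  p={p.item():.3f}")
```

Everything the model “knows” is baked into that distribution – which is exactly why the ceiling argument focuses on the data it was trained on.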

The Reddit post’s critique is about data limits. If your learning comes from human text, your ceiling is “what’s been written down”, plus statistical recombination. LLMs also hallucinate (produce fluent but incorrect answers), and alignment – steering models to be safe and helpful – remains imperfect.

In short: LLMs are superb communicators and accelerators, but they don’t reliably explore new territory without human guidance or external tools.

Why reinforcement learning is back in the spotlight

Reinforcement learning (RL) is trial-and-error learning: an agent takes actions, gets rewards or penalties, and improves its policy over time. In self-play, agents improve by competing against themselves – the approach used by AlphaGo to discover strategies beyond human playbooks.
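In miniature, that loop looks like the sketch below: a tabular Q-learning agent on a toy corridor. The environment, rewards, and hyperparameters are all invented for illustration, not drawn from the article:

```python
# Trial-and-error learning in miniature: tabular Q-learning on a toy
# 1-D corridor. Environment and rewards are invented for illustration.
import random

N_STATES, GOAL = 6, 5                 # states 0..5, reward only at state 5
ACTIONS = [-1, +1]                    # step left or step right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def pick(s):
    if random.random() < epsilon:     # explore occasionally
        return random.choice(ACTIONS)
    # Exploit, breaking ties randomly so early episodes still move.
    return max(ACTIONS, key=lambda a: (Q[(s, a)], random.random()))

for episode in range(300):
    s = 0
    for _ in range(100):              # cap episode length
        a = pick(s)
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0
        # Nudge Q toward reward plus discounted best next-state value.
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
        if s == GOAL:
            break

# After training, the learned policy points right (+1) in every state.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)})
```

No human ever tells the agent which way the goal is; the reward signal and repeated trials do all the work.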

“AI learning like AlphaGo did – by playing, experimenting, failing, improving.”

Silver’s position (as described on Reddit) reframes progress as a compute-and-simulation problem, not a data-ingestion problem. If you can build rich environments and sensible reward functions, agents can generate their own experiences indefinitely and learn behaviours we haven’t thought to write down.
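The interface that argument depends on is simple to state. Here is the standard agent–environment loop, sketched with the open-source Gymnasium library and its CartPole task standing in for the far richer simulators an RL-first lab would build:

```python
# The standard agent-environment loop, using Gymnasium's CartPole task
# as a stand-in for the "rich simulated environments" the argument needs.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
for step in range(200):
    action = env.action_space.sample()   # a real agent's policy goes here
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward               # the reward function defines "progress"
    if terminated or truncated:
        obs, info = env.reset()

print(f"reward collected by a random policy: {total_reward}")
env.close()
```

Swap the random action for a learned policy and the environment for something economically meaningful, and experience generation becomes effectively unlimited – which is the whole pitch.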

The case against LLM-only strategies

  • Finite human data: high-quality, deduplicated text is limited, and scaling gains may flatten as the best data runs out and lower-quality sources fill the gap.
  • Passive learning: predicting text is not the same as taking actions to change the world or test hypotheses.
  • Hard to discover novelty: LLMs remix known patterns; they’re less consistent at scientific or strategic discovery without tools or experimentation loops.

That said, LLMs can be engineered for discovery when paired with tools (code execution, search) and structured workflows. But this hybrid style is not the same as end-to-end RL in rich environments.
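In outline, that hybrid loop is easy to sketch. Everything below is schematic: call_llm, run_code, and web_search are hypothetical placeholders for a real model API, a sandboxed interpreter, and a search backend, not any vendor’s actual interface:

```python
# A schematic sketch of the hybrid "LLM + tools" loop. All three helper
# functions are hypothetical placeholders, not a real API.
def call_llm(prompt: str) -> str:
    return "ANSWER: Paris"               # imagine a real model call here

def run_code(source: str) -> str:
    return "stdout: 42"                  # sandboxed execution result

def web_search(query: str) -> str:
    return "top result: ..."             # retrieved evidence

def answer_with_tools(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = call_llm(transcript)
        if reply.startswith("SEARCH:"):      # model asks for evidence
            transcript += "\nObservation: " + web_search(reply[7:])
        elif reply.startswith("RUN:"):       # model tests a hypothesis
            transcript += "\nObservation: " + run_code(reply[4:])
        else:                                # model commits to an answer
            return reply.removeprefix("ANSWER:").strip()
    return "no answer within the step budget"

print(answer_with_tools("What is the capital of France?"))
```

Note where the feedback lives: in the observations the tools return, not in the model’s weights. That is the gap end-to-end RL is meant to close.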

Are “superlearners” too risky?

Short answer: they can be. The Reddit post asks explicitly whether the RL-first route is too risky. Some well-known RL risks and trade-offs:

  • Reward hacking: agents find shortcuts to maximise the measured reward without doing what you meant (see the toy sketch after this list).
  • Specifying goals is hard: if the reward is even slightly off, agents can learn undesirable strategies.
  • Sim-to-real gaps: behaviours that work in simulation can fail – or misbehave – in the real world.
  • Verification costs: you need strong evaluation, monitoring, and often human oversight to ensure safe behaviour.
  • Compute and energy: large-scale simulation and self-play are expensive, with cost and sustainability implications.
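To see the shape of the reward-hacking failure, here is a deliberately tiny toy (all numbers invented): the measured reward is a proxy, and the proxy-optimal action is not the action you wanted:

```python
# A toy illustration of reward hacking. The proxy metric is "tests
# passing", and the highest-scoring action games the metric.
actions = {
    # action: (proxy_reward, true_value)
    "fix the bug":             (1.0, 1.0),
    "delete the failing test": (3.0, 0.0),
}

chosen = max(actions, key=lambda a: actions[a][0])  # pure reward maximiser
print(f"agent chooses: {chosen!r}")
print(f"proxy reward: {actions[chosen][0]}, true value: {actions[chosen][1]}")
```

Real cases are subtler, but the structure is the same: optimise the measurement hard enough and it stops measuring what you care about.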

Still, the upside is compelling: systems that can generate their own training data, acquire skills through interaction, and discover novel solutions. The risk-reward balance depends on safeguards and governance, especially when agents act in high-stakes domains.

What this means for UK developers and organisations

Whether you sit in a startup or an established organisation, a few UK-specific implications are worth noting:

  • Compute and cost: large-scale RL needs serious compute. Budget for cloud spend or on-prem clusters, and watch energy and sustainability targets.
  • Safety and compliance: UK regulators are increasingly focused on AI safety and evaluations. RL agents acting on financial, health, or critical systems face stricter scrutiny.
  • Data protection: even simulated systems touch real data at the boundaries. Keep GDPR and data minimisation principles front of mind.
  • Talent mix: you’ll need both LLM engineers (prompting, retrieval, fine-tuning) and RL specialists (environments, reward design, evaluation) if you want to hedge bets.
  • Sector fit: UK strengths like finance, logistics, and life sciences already use digital twins and simulators – fertile ground for RL if you can align rewards with business KPIs.

Practical takeaways: how to hedge your AI bets today

  • Use LLMs where they shine now: text-heavy workflows, coding assistance, customer support, data wrangling and retrieval. They deliver immediate ROI.
  • Add tools and feedback loops: pair LLMs with structured evaluations and external tools (search, code, databases) to push beyond pure text prediction.
  • Prototype RL in sandboxed domains: start with safe, well-specified environments (ops simulations, routing, trading backtests) before touching production systems.
  • Invest in evaluation: regardless of approach, build automated tests, reward audits, and human-in-the-loop checks. Measure behaviour, not just benchmarks (a minimal harness sketch follows this list).
  • Expect hybrids: we’re likely to see LLMs for reasoning and interface, RL for decision-making and exploration – stitched together with strong safety layers.
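As a starting point for that evaluation habit, here is a minimal behavioural-check harness. agent_respond and the checks themselves are hypothetical placeholders you would replace with your own agent and policies:

```python
# A minimal sketch of "measure behaviour, not just benchmarks":
# behavioural checks run over an agent's outputs before anything ships.
def agent_respond(prompt: str) -> str:
    # Hypothetical stand-in for your LLM or RL agent.
    return "I can't access live account data; please check the dashboard."

CHECKS = [
    ("no unqualified figures",     lambda out: "£" not in out),
    ("never asks for credentials", lambda out: "password" not in out.lower()),
    ("admits uncertainty",         lambda out: "can't" in out or "not sure" in out),
]

def evaluate(prompts):
    failures = []
    for p in prompts:
        out = agent_respond(p)
        for name, check in CHECKS:
            if not check(out):
                failures.append((p, name))
    return failures

flagged = evaluate(["What's my account balance?"])
print("send to human review:" if flagged else "all checks passed", flagged)
```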

If you’re focused on practical LLM productivity right now, here’s a step-by-step guide to wire models into the tools you already use: How to connect ChatGPT and Google Sheets (Custom GPT).

So, are we betting on the wrong kind of AI?

No – but we are betting on only one kind if we stick to LLMs. The Reddit post captures a real strategic fork: saturate human data, or generate infinite experience via simulation. The safe route is a portfolio approach: exploit LLMs for today’s value, and explore RL-based agents where your business can define clear rewards and afford rigorous evaluation.

“Superlearners that can discover entirely new knowledge on their own.”

If that promise holds, RL will complement – not replace – LLMs. The winning systems will likely blend both, with careful governance to keep discovery productive and safe.

Sources and further reading

Note: Funding amounts and specific startup details referenced here are taken from the Reddit post and are not independently verified.

Last Updated: May 3, 2026
