Karpathy vs Sutton on the “Bitter Lesson”: are LLMs summoning ghosts or building animals?
Andrej Karpathy has a new line that’s making the rounds: large language model (LLM) research isn’t about “building animals” that learn from the world – it’s about “summoning ghosts”, distilled from human text and engineering. The phrase comes from his commentary on Rich Sutton’s long-standing “Bitter Lesson”, and a recent podcast exchange where Sutton sketched an alternative path to intelligence.
Below is a plain-English walkthrough of the argument, why it matters, and what it means for teams in the UK deciding where to place their bets.
What is the “Bitter Lesson” and why LLMs may not fit it
Rich Sutton’s Bitter Lesson is often taken as a north star in AI: methods that scale with compute and data ultimately beat hand-designed systems. Many in the LLM world consider transformers and scaling laws to be the poster child of that idea.
Karpathy highlights that Sutton himself is sceptical that current LLMs are truly “bitter-lesson-pilled”. The core critique: today’s LLMs are trained on a finite, human-generated corpus and then further shaped by human-curated fine-tuning and reinforcement learning choices. That human dependency undercuts the purity of the “just add compute” paradigm.
“LLM research is not about building animals. It is about summoning ghosts.”
Sutton’s “animal” view: agents that learn through interaction
In Sutton’s framing, we should build a “child machine” that learns from experience, not from internet-scale imitation. No giant pretraining step, no supervised fine-tuning that “teleoperates” behaviour. The focus is reinforcement learning (RL) – agents act in an environment, receive rewards, and continually update.
- Reinforcement learning (RL): learning via trial and error to maximise reward, not by copying labels in a dataset.
- Intrinsic motivation: signals like curiosity or prediction quality that drive learning even without external rewards.
- Always learning: agents should keep learning at test time, rather than train once and deploy statically.
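The trial-and-error loop described above can be made concrete with a toy sketch. This is an illustrative tabular Q-learning example (not anything from Sutton’s or Karpathy’s writing): an agent in a tiny corridor world earns a reward only by reaching the goal, and must discover this by acting and updating, never by imitating labelled examples. The environment, hyperparameters, and names are all invented for illustration.

```python
import random

# Toy corridor: positions 0..4; reaching position 4 ends the episode.
# The agent learns purely from rewards - no dataset of correct moves.
N_STATES = 5
ACTIONS = [-1, +1]                # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment dynamics: move, clip to bounds, reward 1 only at the goal."""
    nxt = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

def train(episodes=200):
    random.seed(0)
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
            if random.random() < EPSILON:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward = step(state, action)
            # Q-learning update: nudge the estimate towards
            # observed reward + discounted best future value.
            best_next = max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
            state = nxt

train()
# The greedy policy after training: the learned action in each non-goal state.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
```

After training, the greedy policy steps right in every state: the behaviour emerges from interaction with the environment, which is exactly the contrast Sutton draws with imitation-trained LLMs.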
Sutton argues LLM pipelines inject human bias at multiple stages. AlphaZero beating AlphaGo is used as the analogy: AlphaGo was initialised from human expert games, while AlphaZero learned purely from self-play – and the system that learned directly from interaction surpassed the one seeded with human data.
“If we understood a squirrel, we’d be almost done.”
Karpathy’s “ghosts”: practical, engineered, human-shaped intelligence
Karpathy agrees that frontier LLMs are not a “clean” bitter-lesson algorithm. But he makes a pragmatic case for pretraining as a practical answer to the cold-start problem. We don’t have evolutionary timescales or safe, open-ended worlds to learn from scratch; we do have the internet.
“Pretraining is our crappy evolution.”
Pretraining on human text gives billions of parameters a useful starting point, after which more “animal-like” learning (e.g. RL) can refine behaviour. He suggests LLMs might evolve towards Sutton’s agents – or they may remain a distinct species of intelligence: still world-changing, but fundamentally different.
Key terms explained
- Transformer: the neural network architecture behind modern LLMs, using attention to model relationships in sequences.
- Pretraining: self-supervised learning on large text corpora to predict the next token, building general language capability.
- Supervised fine-tuning: training on labelled examples (often human-curated) to make models follow instructions.
- Reinforcement learning (RL): learning via rewards from interacting with an environment.
- “Bitter-lesson-pilled”: a tongue-in-cheek way of saying an approach benefits from scaling compute and data without heavy hand-engineering.
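To make the pretraining objective above tangible, here is a deliberately tiny illustration – a bigram model that predicts the next token purely from counts in a toy corpus. This is a hypothetical sketch, not how any real LLM is built; pretraining a transformer does the same job (predict the next token) with learned representations at vastly larger scale.

```python
from collections import Counter, defaultdict

# Invented toy corpus for illustration only.
corpus = "the cat sat on the mat the cat ate".split()

# Count which token follows which: the entire "model" is this table.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(token):
    """Return the most frequent token observed after `token` in the corpus."""
    return counts[token].most_common(1)[0][0]

print(predict_next("the"))  # "the" is followed by "cat" twice, "mat" once -> "cat"
```

Swap the count table for a transformer with billions of parameters and the corpus for a large slice of the internet, and you have the “ghost”: a distillation of the statistical patterns in human writing.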
Why this debate matters to UK developers and organisations
Data dependence and UK GDPR
LLMs’ reliance on human text raises questions under UK data protection law. If your use case involves personal data or sensitive content, you’ll need to know where data came from, what lawful basis applies, and how outputs might encode bias. Data provenance, data protection impact assessments (DPIAs), and robust prompt/output logging are not optional if you’re in regulated sectors.
Running out of data vs real-world learning
Karpathy relays Sutton’s concern: human text is finite. If progress depends on scale, what happens when we hit the ceiling? One path is richer interaction data (simulations, enterprise workflows, user feedback). Another is to push RL-style continual learning – but that raises safety, privacy, and governance challenges, especially in public services.
Cost, compute, and environmental impact
The “ghost” approach is compute-heavy. For UK teams, cloud costs, latency, and energy considerations are not trivial. A more “animal” approach – learning by doing within your own environment – may reduce dependence on ever-larger base models, but requires careful environment design, reward shaping, and safety controls.
Practical takeaways: choosing ghosts, animals, or a hybrid
When “ghosts” shine
- Knowledge work, summarisation, drafting, and analysis where you want a distilled sense of human writing.
- Fast time-to-value using existing models plus light fine-tuning or prompt engineering.
- Use cases requiring predictable behaviour shaped by human preferences.
When to push for “animal” qualities
- Interactive domains where learning from experience beats imitation (e.g. operations optimisation, simulations, self-play).
- Settings where continuous adaptation is a feature, not a bug – you want the system to learn on the job.
- Projects where you can design safe environments and reward functions, and tolerate exploration.
A realistic middle ground for 2025
- Start with a strong pretrained base for capability and safety.
- Layer in feedback and RL where interaction data is available and risks are controlled.
- Instrument for privacy and compliance from day one: data minimisation, redaction, audit trails, and human-in-the-loop.
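As a rough sketch of the instrumentation point above – assuming nothing about your stack – here is one way to log every model interaction with basic redaction, and to capture explicit user feedback as a reward signal you could later feed into fine-tuning. The redaction pattern, log format, and function names are all illustrative, not a compliance recipe; a real deployment needs legal review, broader redaction, and an append-only store.

```python
import json
import re
import time

# Minimal data-minimisation step: mask email addresses before anything is logged.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(text):
    """Replace email addresses with a placeholder token."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)

audit_log = []  # in production: an append-only, access-controlled store

def record_interaction(prompt, output, reward=None):
    """Log a redacted prompt/output pair plus any human feedback.

    reward: e.g. +1 for a thumbs-up, -1 for a thumbs-down, None if no feedback.
    Accumulated (interaction, reward) pairs are the raw material for the
    RL-style improvement loop described above.
    """
    entry = {
        "ts": time.time(),
        "prompt": redact(prompt),
        "output": redact(output),
        "reward": reward,
    }
    audit_log.append(json.dumps(entry))
    return entry

entry = record_interaction("Email alice@example.com a summary", "Done.", reward=1)
```

The design choice worth noting: redaction happens before logging, so the audit trail itself never holds the raw personal data – data minimisation by construction rather than by later clean-up.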
Strategy questions for UK teams
- Data provenance: can you document your training and fine-tuning data sources against UK GDPR requirements?
- Continual learning: do you want the model to update in production? If yes, how will you ensure safety and governance?
- Bias and fairness: how will you measure and mitigate demographic and domain biases baked into human text?
- Cost control: what’s your plan for model selection, caching, and workload offloading to keep costs predictable?
- Value capture: where does interaction data (clicks, corrections, rewards) flow, and how do you turn it into improvements?
Bottom line: inspiration from animals, utility from ghosts
Sutton is a useful corrective to LLM hubris: intelligence that learns from the world, with intrinsic motivation and continual adaptation, remains the long game. Karpathy’s counter is equally pragmatic: in industry reality, “ghosts” give us leverage now, and we can steer them towards more agentic behaviour where it makes sense.
If you’re building today, start with the tools that work, then iterate towards more interactive learning where it clearly adds value and you can manage the risks. For practical workflow wins, see my guide to connecting ChatGPT to Google Sheets – the kind of “ghost” that quietly pays for itself.
Further reading and sources
- Original Reddit discussion: Andrej Karpathy: LLM research is not about building animals. It is about summoning ghosts.
- Rich Sutton – The Bitter Lesson (primary source): Essay
- DeepMind AlphaZero overview (primary source): Blog