# Is ChatGPT falling behind other AIs? What a developer’s Reddit post tells us
A recent thread asks a pointed question: is ChatGPT slipping compared to Google’s Gemini – especially for coding? The poster reports slower responses, more “imagination” (hallucinations), and – crucially for developers – a smaller context window that makes large code tasks harder.
> “It just feels like it’s falling behind especially to Google.”
You can read the original discussion here: Reddit thread. Below I summarise the concerns, why they might be happening, and what UK developers and teams can do today.
## Speed, accuracy and coding performance: what might be going on
The Redditor highlights three pain points: speed, accuracy, and context window limits for coding tasks. These are common friction points when you push large, complex prompts through any general-purpose model. A few likely contributors:
- Load and latency – Response time varies by provider, model, time of day, and demand. High-traffic periods can slow things down.
- Prompt shape – Long, unstructured inputs tend to be slower and less accurate. Models do better with clear instructions and scoped context.
- Context limits – A model’s “context window” is how much it can consider at once. If you exceed it, the model truncates or drops detail, hurting performance.
- Guardrails and alignment – Stronger safety filters can feel like refusal or “overcautious” behaviour on some tasks.
| Dimension | What the Redditor reports | Why it happens | What to try |
|---|---|---|---|
| Speed | “Slow processing” | Server load, large prompt size, tool calls | Reduce input size, batch tasks, run off-peak, cache/reuse summaries |
| Accuracy | “Inaccurate information” and “increased imagination” | Hallucinations under ambiguous or broad prompts | Constrain scope, demand citations, use retrieval (RAG) for source-grounded answers |
| Context window | “Small context window” for codebases | Hard limit on tokens considered at once | Chunk code, provide repo maps, use file-by-file reviews, adopt RAG/search |
| Token costs | Not discussed | Varies by provider, model and usage | Monitor spend, compress prompts, prefer diffs over full files |
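The “prefer diffs over full files” tip from the table can be sketched with Python’s standard `difflib`: instead of pasting two full versions of a file into a prompt, send only the unified diff (the file name and snippets below are illustrative).

```python
import difflib

def prompt_diff(old: str, new: str, path: str = "app.py") -> str:
    """Return a unified diff suitable for pasting into a prompt
    instead of the full before/after files."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)

old = "def add(a, b):\n    return a - b\n"
new = "def add(a, b):\n    return a + b\n"
print(prompt_diff(old, new))
```

For a one-line fix in a long file, the diff is a few dozen tokens instead of hundreds, which helps with both cost and accuracy.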
## Context windows and long codebases: practical workarounds
The heart of the post is coding with long files or whole apps. If your model can’t hold the entire codebase, you’ll see truncation, missed references and more back-and-forth. Workarounds that help regardless of model:
- Start with a repo map – Give the model a high-level inventory: folders, key files, entry points, dependencies. Keep this short and structured.
- Work in scoped slices – Ask for a plan first. Then iterate file-by-file or component-by-component. Provide only the relevant snippets.
- Use diffs and interfaces – Instead of pasting entire files, share the public interfaces and the diff you want. It reduces tokens and errors.
- Retrieval augmented generation (RAG) – Keep your code indexed in a vector store or searchable knowledge base. Let the model “look up” only what it needs.
- Unit tests as the contract – Paste tests and ask the model to make the code pass. Tests reduce ambiguity and improve correctness.
- Ask for citations within your repo – Request line references to ensure the model is grounding changes in the right places.
These patterns narrow the problem and squeeze more value out of any context window. They also make results easier to review in a proper code workflow.
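A minimal sketch of the chunk-and-retrieve pattern described above. It uses simple keyword overlap in place of a vector store, so it stays self-contained; the function names are illustrative, not any library’s API, and a production setup would swap in embeddings and split on function boundaries rather than line counts.

```python
def chunk_code(source: str, max_lines: int = 40) -> list[str]:
    """Split a file into fixed-size line chunks. Real pipelines
    usually split on function/class boundaries instead."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]

def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Rank chunks by keyword overlap with the query and return the
    top k -- a stand-in for embedding similarity search."""
    terms = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(terms & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

chunks = [
    "def login(user): check password and session",
    "def render(page): return html template",
]
top = retrieve(chunks, "fix the login password bug", k=1)
```

The point of the design is that the model only ever sees the top-k chunks, so the prompt stays well inside the context window regardless of repo size.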
## How to benchmark ChatGPT vs Gemini fairly in 2025
Comparisons can easily go sideways if the setup isn’t controlled. A simple, fair approach:
- Use the same, minimal prompt across models. Avoid vendor-specific features unless you test like-for-like.
- Test fresh sessions to avoid hidden memory. Note cold vs warm start.
- Measure latency from send to first token and to completion.
- Score accuracy against a ground truth: unit tests, docs, or known outputs.
- Record hallucinations and refusals as separate metrics.
- Keep a change log of prompts so you can reproduce results.
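The latency step above can be wrapped in a tiny harness. This is a sketch, not a provider integration: `call_model` is a placeholder you supply (a wrapper around whichever SDK you use), and because the stub below is not streaming, it measures end-to-end completion latency only; time-to-first-token would need a streaming callback.

```python
import statistics
import time

def benchmark(call_model, prompt: str, runs: int = 3) -> dict:
    """Time repeated calls to any model-invoking function and
    report median and worst-case completion latency."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        call_model(prompt)  # same minimal prompt for every model
        latencies.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(latencies),
        "worst_s": max(latencies),
        "runs": runs,
    }

# Stub standing in for a real API call.
result = benchmark(lambda p: time.sleep(0.01), "Review this diff")
```

Run the same harness, with the same prompt, against each model in a fresh session, and log the results alongside your accuracy and hallucination scores.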
For current context limits and pricing, consult each provider’s official documentation rather than social summaries.
## UK perspective: privacy, compliance, costs and availability
For UK organisations, the “best” model isn’t only about quality. It’s also about governance and cost control:
- Data protection and GDPR – Confirm whether prompts and outputs are used for training by default, and get a data processing addendum (DPA) from your vendor. See the providers’ policies: OpenAI policies, Google data governance.
- Data residency – Check where data is processed and stored. Some sectors (finance, healthcare, public sector) have stricter requirements.
- Access controls and logging – Ensure audit trails, SSO and role-based access if you use models with production or sensitive data.
- Cost management – Large prompts and long outputs drive spend. Monitor token usage, set budgets, and use summaries/diffs to keep tokens down.
- Availability in the UK – Both providers operate here, but features and models can roll out at different times. Verify availability for your account type.
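For the cost-management point, a back-of-envelope estimator is often enough to catch runaway prompts before the invoice does. This sketch uses the rough ~4 characters per token heuristic; the per-1k rates passed in are placeholders, so check your provider’s tokenizer and pricing page for real numbers.

```python
def estimate_cost(prompt: str, expected_output_tokens: int,
                  in_price_per_1k: float, out_price_per_1k: float) -> float:
    """Rough spend estimate for one call. Input tokens are
    approximated as len(prompt) / 4; prices are caller-supplied
    placeholders, not real provider rates."""
    input_tokens = len(prompt) / 4
    return (input_tokens / 1000) * in_price_per_1k \
        + (expected_output_tokens / 1000) * out_price_per_1k

# 8,000-character prompt (~2,000 tokens), ~1,000 output tokens,
# with illustrative rates of $0.01 / $0.03 per 1k tokens.
cost = estimate_cost("x" * 8000, expected_output_tokens=1000,
                     in_price_per_1k=0.01, out_price_per_1k=0.03)
```

Even a crude estimate like this makes the “summaries and diffs over full files” advice concrete: halving the prompt roughly halves the input-side spend.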
## My take: pick by task, not by brand
The Reddit post captures a real shift many developers feel: when your task is “digest this huge app and fix it”, the model with the larger usable context and better retrieval tends to feel smarter. On small, well-scoped tasks, differences often narrow.
> “I can give Gemini complete app… ChatGPT… won’t be able to process one file without removing stuff.”
Rather than debating winners, treat models as tools. Use the one that fits the workload – and don’t be shy about a multi-model approach. Keep both in your toolbox, standardise your prompts, add retrieval where it matters, and measure outcomes over impressions.
## Key metrics at a glance
| Metric | ChatGPT (varies by model) | Gemini (varies by model) | Notes |
|---|---|---|---|
| Context window | Not disclosed here | Not disclosed here | Check provider docs for current limits |
| Latency | Depends on load and prompt size | Depends on load and prompt size | Benchmark in your environment |
| Token costs | Varies by model | Varies by model | See pricing pages for up-to-date rates |
## Quick wins you can try today
- Restructure your prompts: plan first, then iterate in small, testable steps.
- Adopt RAG: index your code/docs and retrieve only what’s needed.
- Use tests as the source of truth and ask the model to satisfy them.
- Measure, don’t guess: log latency, correctness and rework across models.
- Automate routine handoffs: if you use Sheets or internal data, see my guide on connecting ChatGPT with Google Sheets via a custom GPT.
## Bottom line
The Reddit thread surfaces a genuine pain point for developers working with large contexts. Whether you favour ChatGPT or Gemini, the winning setup is usually the one that reduces context size, adds retrieval, and tests outputs. For UK teams balancing quality, privacy and cost, that combination matters more than the logo on the model card.