Myths and Merits of LLMs

Introduction: relevance is as important as focus

The central, counterintuitive finding is this: we are focusing on the wrong things. We are impressed by what large language models do well in tasks we find hard — writing fluent essays, generating code — and blind to their failures in tasks we find easy: basic reasoning, simple arithmetic, reliably using new information.

LLMs are not reasoning. They are Thinking Fast — using massive memorisation to predict the next word — when what enterprises need is Thinking Slow: the ability to identify structure, follow constraints, and reason consistently.

An LLM's impressive fluency is a distraction from its critical reliability failures. For enterprise use cases where accuracy is non-negotiable, a reliability rate of only 60–70% is catastrophic.

1. Fluency is not intelligence — the 10-million-book analogy

We are trained to equate linguistic coherence with intelligence. With AI, that instinct is a liability.

An avid human reader will get through perhaps 3,000 books in a lifetime. The LLMs you use today were trained on the equivalent of 10 to 100 million. The model is not "smart". It behaves more like an index of a vast share of what humanity has written. It retrieves and re-patterns that information fluently, but it does not understand it.

Dimension	Human expert (3,000 books)	LLM (10–100M books)
Knowledge breadth	Narrow but deep	Extraordinarily broad but shallow
Reasoning	Genuine causal reasoning	Pattern-matching mimicry
Reliability	High in domain of expertise	60–70% across all domains
New information	Can learn and adapt	Cannot reliably override training (RAG failure)
Failure mode	Admits uncertainty	Confidently wrong (hallucination)

Insight for CEOs. Do not be seduced by fluency. The question to ask of any AI output is not "does this sound right?" but "is this actually correct, and can I verify it?" Build verification into every AI-assisted process.

2. Myth one: AI can "reason" — the 70% calculator

The most dangerous myth is that LLMs perform reliable logical decision-making. They do not. They mimic reasoning they have seen in training data.

Imagine a calculator that gives you the right answer 70% of the time. That is the state of LLM reasoning today. 60–70% is impressive for summarisation. It is catastrophic for finance, logistics, or compliance, where the standard is 100%.
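
A rough illustration of why this matters for multi-step work. The sketch below assumes, purely for the sake of the arithmetic, that every step in a chain succeeds independently with the same probability; real errors are often correlated, so the figures are indicative only. The function name chain_reliability is illustrative and not taken from any library.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumption: steps are independent and share the same accuracy.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


for steps in (1, 3, 5, 10):
    print(f"{steps:>2} steps at 70%: {chain_reliability(0.70, steps):.1%}")
# Prints 70.0%, 34.3%, 16.8% and 2.8% respectively.
```

Under that assumption, a five-step process built on a 70% calculator produces a fully correct result roughly one time in six.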

The two-logics test

Test	Question	LLM result	Why it fails
The Linda problem	Is Linda more likely to be (A) a bank teller, or (B) a bank teller and active in the feminist movement?	Often gets it "right" — by memory, not reasoning	The puzzle is in the training data. The model retrieves the canonical answer without understanding the underlying probability.
The -ING word problem	A novel variant with the same logical structure	Fails	The model cannot generalise the logic to a problem it has not seen.
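
The probability rule behind the Linda problem is the conjunction rule: a group defined by two conditions can never be larger than a group defined by only one of them. The sketch below checks this by counting over a hypothetical population; the counts are invented for illustration, and only the inequality matters.

```python
# Hypothetical population as (is_bank_teller, is_active_feminist) pairs.
# The counts are invented; they are not data from any study.
people = (
    [(True, True)] * 40
    + [(True, False)] * 10
    + [(False, True)] * 600
    + [(False, False)] * 350
)


def probability(predicate) -> float:
    """Share of the population for which the predicate holds."""
    return sum(1 for person in people if predicate(person)) / len(people)


p_a = probability(lambda p: p[0])           # (A) bank teller
p_b = probability(lambda p: p[0] and p[1])  # (B) bank teller AND feminist

print(f"P(A) = {p_a:.2f}, P(B) = {p_b:.2f}")  # P(A) = 0.05, P(B) = 0.04
assert p_b <= p_a  # holds for any counts, because (B) is a subset of (A)
```

However the counts are chosen, option (B) cannot come out more likely than option (A). A system that reasons applies this rule to any variant of the puzzle; a system that only recalls the canonical answer does not.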

Insight for CEOs. LLMs can pass tests they have seen before. They cannot reliably solve novel problems. Any AI system used for decision-making must be tested on your specific, novel use cases — not on published benchmarks.

3. Myth two: AI can "find" your data — the R in RAG is broken

The industry is betting on Retrieval-Augmented Generation: point an LLM at your reports, tables and policies, and trust it to use them reliably. The research shows this is dangerously flawed.

Context vs. training conflict

LLMs struggle to prioritise new information (your data) over old training. Give the model a new fact in context — "Athens emerged as an economic power in the 4th century" — and it often reverts to what it learned in pre-training (the 6th century).

Scenario	What the model was given	What the model said	Implication
No context	None	Correct (from training)	Default behaviour
Conflicting context	"Athens emerged in the 4th century"	Ignored the new fact	Cannot reliably override training with your data
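
A minimal sketch of how a team might probe this behaviour on its own system. Everything here is an assumption for illustration: ask_model stands for whatever function sends a prompt to your model and returns its reply, the prompt wording is arbitrary, and the substring check is a crude stand-in for proper answer grading.

```python
from typing import Callable


def follows_context(
    ask_model: Callable[[str], str],
    context: str,
    question: str,
    context_answer: str,
) -> bool:
    """Return True if the reply contains the answer stated in the context."""
    prompt = (
        f"Context: {context}\n\n"
        f"Using only the context above, answer: {question}"
    )
    reply = ask_model(prompt)
    return context_answer.lower() in reply.lower()


def stubborn_model(prompt: str) -> str:
    """Stub that ignores the context and answers from 'training'."""
    return "Athens emerged as an economic power in the 6th century."


result = follows_context(
    stubborn_model,
    context="Athens emerged as an economic power in the 4th century.",
    question="In which century did Athens emerge as an economic power?",
    context_answer="4th century",
)
print(result)  # False: the stub reverted to its trained answer
```

Replacing the stub with a call to a real model, and the single example with facts drawn from your own documents, gives a simple measure of how often the supplied context wins.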

The data-silo failure

Your information lives across heterogeneous tables and documents. Ask: "What is the highest eligible free-lunch rate for K-12 schools in the most populated county in California?" A human walks four hops: county → districts → schools → rates. Standard RAG retrieves the bookends (the county text, the rates table) but misses the bridge tables that connect them.
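
The four hops can be sketched with plain lookup tables. All names and numbers below are invented for illustration (they are not real California data), and the K-12 filter is omitted to keep the example short. The point is the shape of the query: the two middle tables are the bridges that a retrieval step tends to miss.

```python
# Hypothetical tables; every value is made up for illustration.
county_population = {"County A": 10_000_000, "County B": 3_000_000}
districts_by_county = {
    "County A": ["District 1", "District 2"],
    "County B": ["District 3"],
}
schools_by_district = {
    "District 1": ["School P", "School Q"],
    "District 2": ["School R"],
    "District 3": ["School S"],
}
free_lunch_rate = {
    "School P": 0.62,
    "School Q": 0.81,
    "School R": 0.47,
    "School S": 0.93,
}


def highest_rate_in_most_populated_county() -> tuple[str, str, float]:
    # Hop 1: find the most populated county.
    county = max(county_population, key=county_population.get)
    # Hop 2: county -> districts (bridge table).
    districts = districts_by_county[county]
    # Hop 3: districts -> schools (bridge table).
    schools = [s for d in districts for s in schools_by_district[d]]
    # Hop 4: schools -> rates, then take the maximum.
    best = max(schools, key=lambda school: free_lunch_rate[school])
    return county, best, free_lunch_rate[best]


print(highest_rate_in_most_populated_county())
# ('County A', 'School Q', 0.81)
```

In this toy data the highest rate overall (0.93) belongs to a school in the less populated county. A system that sees only the county text and the rates table, without the bridge tables, has no way to rule that school out.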

Insight for CEOs. Standard RAG is not a reliable enterprise data solution. Before deploying any AI that reads your data, test it on multi-hop queries that require joining information from multiple sources. Most will fail.

4. Myth three: AI will solve problems it has never seen

LLMs are fundamentally interpolators, not extrapolators.

Problem type	LLM performance	What it tells you
From training distribution	80–90%	Excels on familiar patterns
Slightly modified	50–60%	Small changes break performance
Genuinely novel	20–40%	Fails at true generalisation
Multi-constraint optimisation	10–30%	Scheduling, routing, allocation — particularly bad

Insight for CEOs. LLMs are most reliable in summarisation, drafting, and code generation for known patterns, and that is where they deliver dependable value. The use cases where you most need AI to be reliable (novel analysis, complex decisions, optimisation) are exactly where it fails.

5. A framework for safe deployment

Zone	Characteristics	Risk	Approach
Green	Summarisation, drafting, code for known patterns, content	Low	Deploy with human review. Productivity gains are real and reliable.
Yellow	Extraction, classification, Q&A on a single document	Medium	Test rigorously on your data; mandatory human verification.
Red	Multi-document reasoning, optimisation, novel problem-solving, compliance decisions	High	Do not deploy standard LLMs. Use specialised architectures, extensive testing, or human-in-the-loop review instead.
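
For teams that want the framework in a form they can attach to an intake checklist, the table can be expressed as a simple lookup. This is a sketch only: the use-case labels are taken from the table, while the names Zone, ZONE_BY_USE_CASE and classify, and the choice to treat any unlisted use case as Red, are assumptions of this example.

```python
from enum import Enum


class Zone(Enum):
    GREEN = "Deploy with human review"
    YELLOW = "Test rigorously on your data; mandatory human verification"
    RED = "Do not deploy standard LLMs"


# Use-case labels follow the table above.
ZONE_BY_USE_CASE = {
    "summarisation": Zone.GREEN,
    "drafting": Zone.GREEN,
    "code for known patterns": Zone.GREEN,
    "content": Zone.GREEN,
    "extraction": Zone.YELLOW,
    "classification": Zone.YELLOW,
    "q&a on a single document": Zone.YELLOW,
    "multi-document reasoning": Zone.RED,
    "optimisation": Zone.RED,
    "novel problem-solving": Zone.RED,
    "compliance decisions": Zone.RED,
}


def classify(use_case: str) -> Zone:
    """Look up a use case; anything unlisted is treated as Red.

    The Red default is an assumption of this sketch, chosen so that an
    unrecognised use case is never waved through as low risk.
    """
    return ZONE_BY_USE_CASE.get(use_case.strip().lower(), Zone.RED)


print(classify("Summarisation").name)        # GREEN
print(classify("Compliance decisions").name)  # RED
print(classify("Payroll forecasting").name)   # RED (unlisted)
```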

Insight for CEOs. The most dangerous deployments are Red Zone use cases misclassified as Green by teams seduced by a demo. Demand empirical testing on your specific cases before any production deployment.

6. Conclusion: the Thinking Slow imperative

The LLMs available today are extraordinary tools for Thinking Fast — pattern-matching, retrieval, fluent generation. They are fundamentally unreliable for Thinking Slow — logical reasoning, constraint satisfaction, the reliable use of new information.

This is not criticism. It is a precise description of what the technology is, and is not. The organisations that win in the AI era will not be those that deploy AI everywhere. They will be those that deploy it precisely — in the Green Zone where it is reliable, with robust oversight in the Yellow Zone, and with appropriate scepticism about the Red Zone until the technology matures.

Thinking Fast (LLM strength)	Thinking Slow (human strength)
Pattern recognition and retrieval	Causal reasoning and logic
Fluent text generation	Novel problem-solving
Summarisation and synthesis	Constraint satisfaction
Code for known patterns	Reliable use of new information
Content creation and drafting	High-stakes decision-making

Before deploying any LLM in production, ask three questions. (1) Is this a Green Zone use case, or a Yellow Zone use case with mandatory human verification in place? (2) Have we tested it on our own data, not just published benchmarks? (3) Do we have a verification workflow for outputs where errors would be costly? If any answer is no, do not deploy.
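
The three-question gate amounts to a logical AND: one "no" blocks deployment. A minimal sketch, with hypothetical class and field names, of how the answers could be recorded and checked:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentCheck:
    """Answers to the three pre-deployment questions (names are illustrative)."""

    zone_question_passed: bool   # question 1
    tested_on_own_data: bool     # question 2
    verification_workflow: bool  # question 3

    def failed_questions(self) -> list[int]:
        """Return the numbers of the questions answered 'no'."""
        answers = (
            self.zone_question_passed,
            self.tested_on_own_data,
            self.verification_workflow,
        )
        return [i for i, ok in enumerate(answers, start=1) if not ok]

    def may_deploy(self) -> bool:
        """Deploy only if every question was answered 'yes'."""
        return not self.failed_questions()


check = DeploymentCheck(
    zone_question_passed=True,
    tested_on_own_data=False,
    verification_workflow=True,
)
print(check.may_deploy())        # False
print(check.failed_questions())  # [2]
```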

Memo prepared by Álvaro de Nicolás · June 2026. For board and executive use.