Álvaro de Nicolás

Executive memo · AI realism

Myths and Merits of LLMs

Álvaro de Nicolás · June 2026

Introduction: relevance is as important as focus

The central, counterintuitive finding is this: we are focusing on the wrong things. We are impressed by what large language models do well in tasks we find hard — writing fluent essays, generating code — and blind to their failures in tasks we find easy: basic reasoning, simple arithmetic, reliably using new information.

LLMs are not reasoning. They are Thinking Fast — using massive memorisation to predict the next word — when what enterprises need is Thinking Slow: the ability to identify structure, follow constraints, and reason consistently.

An LLM's impressive fluency is a distraction from its critical reliability failures. For enterprise use cases where accuracy is non-negotiable, a 60–70% reliability rate is catastrophic.

1. Fluency is not intelligence — the 10-million-book analogy

We are trained to equate linguistic coherence with intelligence. With AI, that instinct is a liability.

An avid human reader will get through perhaps 3,000 books in a lifetime. The LLMs you use today were trained on the equivalent of 10 to 100 million. The model is not "smart". It behaves like a vast index of what humanity has written. It retrieves and re-patterns that information fluently, but it does not understand it.

Dimension | Human expert (3,000 books) | LLM (10–100M books)
Knowledge breadth | Narrow but deep | Extraordinarily broad but shallow
Reasoning | Genuine causal reasoning | Pattern-matching mimicry
Reliability | High in domain of expertise | 60–70% across all domains
New information | Can learn and adapt | Cannot reliably override training (RAG failure)
Failure mode | Admits uncertainty | Confidently wrong (hallucination)

Insight for CEOs. Do not be seduced by fluency. The question to ask of any AI output is not "does this sound right?" but "is this actually correct, and can I verify it?" Build verification into every AI-assisted process.

2. Myth one: AI can "reason" — the 70% calculator

The most dangerous myth is that LLMs perform reliable logical decision-making. They do not. They mimic reasoning they have seen in training data.

Imagine a calculator that gives you the right answer 70% of the time. That is the state of LLM reasoning today. A 60–70% success rate is impressive for summarisation. It is catastrophic for finance, logistics, or compliance, where the standard is 100%.
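The calculator analogy gets worse once answers are chained. The sketch below shows the arithmetic for a multi-step workflow. It assumes every step succeeds independently with the same probability, which is a simplification introduced here for illustration and not a finding from the research. The function name `chain_reliability` is hypothetical.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes steps succeed independently with the same probability
    (an illustrative assumption, not a measured property of any model).
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be zero or more")
    return per_step ** steps


for steps in (1, 3, 5, 10):
    rate = chain_reliability(0.70, steps)
    print(f"{steps:>2} steps at 70% each: {rate:.1%} correct end to end")
```

Under that assumption, a five-step process built on a 70% step is right about 17% of the time, and a ten-step process under 3% of the time.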

The two-logics test

Test | Question | LLM result | Why it fails
The Linda problem | Is Linda more likely to be (A) a bank teller, or (B) a bank teller and active in the feminist movement? | Often gets it "right", by memory rather than reasoning | The puzzle is in the training data. The model retrieves the canonical answer without understanding the underlying probability.
The -ING word problem | A novel variant with the same logical structure | Fails | The model cannot generalise the logic to a problem it has not seen.
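Both tests rest on the same rule: a combined condition can never be more likely than one of its parts, so "bank teller and feminist" cannot be more probable than "bank teller". The sketch below shows that rule in a word-counting form. It assumes the -ING problem compares words ending in "ing" with words whose second-to-last letter is "n"; the memo does not spell out the exact wording, and the word list is invented for illustration.

```python
# Toy word list, invented for illustration only.
WORDS = [
    "running", "payment", "morning", "invoice",
    "helping", "account", "ledgers", "serving",
]


def ends_in_ing(word: str) -> bool:
    return word.endswith("ing")


def n_second_to_last(word: str) -> bool:
    return len(word) >= 2 and word[-2] == "n"


ing_words = {w for w in WORDS if ends_in_ing(w)}
n_words = {w for w in WORDS if n_second_to_last(w)}

# Every word ending in "ing" also has "n" as its second-to-last letter,
# so the "-ing" set is always a subset of the "n" set and can never be larger.
assert ing_words <= n_words
print(f"{len(ing_words)} words end in -ing; {len(n_words)} have 'n' second to last")
```

A system that has learned the rule, and not just the Linda answer, should apply it to any pair of conditions where one contains the other.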

Insight for CEOs. LLMs can pass tests they have seen before. They cannot reliably solve novel problems. Any AI system used for decision-making must be tested on your specific, novel use cases — not on published benchmarks.

3. Myth two: AI can "find" your data — the R in RAG is broken

The industry is betting on Retrieval-Augmented Generation: point an LLM at your reports, tables and policies, and trust it to use them reliably. The research shows this is dangerously flawed.

Context vs. training conflict

LLMs struggle to prioritise new information (your data) over old training. Give the model a fact in context that contradicts its training, for example "Athens emerged as an economic power in the 4th century", and it often answers with what it learned in pre-training (the 6th century) instead.

Scenario | What the model was given | What the model said | Implication
No context | None | Correct (from training) | Default behaviour
Conflicting context | "Athens emerged in the 4th century" | Ignored the new fact | Cannot reliably override training with your data
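A check for this failure can be automated on your own facts. The sketch below is a minimal illustration, not a reference method: `ask_model` is a hypothetical stand-in for whatever model client you use, the prompt wording is an assumption, and the substring comparison is deliberately crude. A stub model is included so the example runs on its own.

```python
from typing import Callable


def follows_context(
    ask_model: Callable[[str], str],  # hypothetical: takes a prompt, returns text
    context_fact: str,
    question: str,
    context_answer: str,
    training_answer: str,
) -> str:
    """Classify one reply as 'context', 'training', or 'unclear'."""
    prompt = (
        "Answer using only the context below.\n"
        f"Context: {context_fact}\n"
        f"Question: {question}"
    )
    reply = ask_model(prompt).lower()
    has_context = context_answer.lower() in reply
    has_training = training_answer.lower() in reply
    if has_context and not has_training:
        return "context"
    if has_training and not has_context:
        return "training"
    return "unclear"


# Stub standing in for a real model. It ignores the prompt and repeats the
# pre-training answer, which is the failure described in the table above.
def stubborn_stub(prompt: str) -> str:
    return "Athens emerged as an economic power in the 6th century."


result = follows_context(
    stubborn_stub,
    context_fact="Athens emerged as an economic power in the 4th century.",
    question="In which century did Athens emerge as an economic power?",
    context_answer="4th century",
    training_answer="6th century",
)
print(result)  # training
```

Run over a few dozen facts from your own documents that differ from public knowledge, the share of "training" results gives a rough measure of how often your data is being ignored.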

The data-silo failure

Your information lives across heterogeneous tables and documents. Ask: "What is the highest eligible free-lunch rate for K-12 schools in the most populated county in California?" A human works through four linked lookups: identify the county, find its districts, find their schools, then read off the rates. Standard RAG retrieves the bookends (the county text, the rates table) but misses the bridge tables that connect them.
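The chain of lookups can be made concrete with a toy dataset. Everything in the sketch below (county names, districts, schools, rates) is invented for illustration, and the four dictionaries stand in for the separate tables an enterprise would hold. It shows why skipping the bridge tables produces a confident but wrong answer.

```python
# All names and figures are invented for illustration.
county_population = {"County A": 9_000_000, "County B": 3_000_000}
district_county = {
    "District 1": "County A",
    "District 2": "County A",
    "District 3": "County B",
}
school_district = {
    "School X": "District 1",
    "School Y": "District 2",
    "School Z": "District 3",
}
free_lunch_rate = {"School X": 0.62, "School Y": 0.81, "School Z": 0.95}


def highest_rate_in_largest_county() -> tuple[str, float]:
    # Lookup 1: the most populated county.
    county = max(county_population, key=county_population.get)
    # Lookup 2: bridge table from county to districts.
    districts = {d for d, c in district_county.items() if c == county}
    # Lookup 3: bridge table from districts to schools.
    schools = [s for s, d in school_district.items() if d in districts]
    # Lookup 4: read the rates for those schools and take the highest.
    best = max(schools, key=free_lunch_rate.get)
    return best, free_lunch_rate[best]


# What you get from the rates table alone, without the bridge tables.
naive_answer = max(free_lunch_rate, key=free_lunch_rate.get)

print("With bridge tables:", highest_rate_in_largest_county())
print("Rates table only:  ", naive_answer)
```

With the bridge tables the answer is School Y in the larger county. With only the rates table it is School Z, which belongs to the wrong county.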

Insight for CEOs. Standard RAG is not a reliable enterprise data solution. Before deploying any AI that reads your data, test it on multi-hop queries that require joining information from multiple sources. Most will fail.

4. Myth three: AI will solve problems it has never seen

LLMs are fundamentally interpolators, not extrapolators.

Problem type | LLM performance | What it tells you
From training distribution | 80–90% | Excels on familiar patterns
Slightly modified | 50–60% | Small changes break performance
Genuinely novel | 20–40% | Fails at true generalisation
Multi-constraint optimisation | 10–30% | Particularly bad at scheduling, routing, allocation

Insight for CEOs. The use cases where LLMs are most reliable (summarisation, drafting, code generation for known patterns) are the ones where they deliver dependable value today. The use cases where you most need AI to be reliable (novel analysis, complex decisions, optimisation) are exactly where it fails.

5. A framework for safe deployment

Zone | Characteristics | Risk | Approach
Green | Summarisation, drafting, code for known patterns, content creation | Low | Deploy with human review. Productivity gains are real and reliable.
Yellow | Extraction, classification, Q&A on a single document | Medium | Test rigorously on your data; mandatory human verification.
Red | Multi-document reasoning, optimisation, novel problem-solving, compliance decisions | High | Do not deploy standard LLMs on their own. Use specialised architectures, extensive testing, or a human in the loop.

Insight for CEOs. The most dangerous deployments are Red Zone use cases misclassified as Green by teams seduced by a demo. Demand empirical testing on your specific cases before any production deployment.

6. Conclusion: the Thinking Slow imperative

The LLMs available today are extraordinary tools for Thinking Fast — pattern-matching, retrieval, fluent generation. They are fundamentally unreliable for Thinking Slow — logical reasoning, constraint satisfaction, the reliable use of new information.

This is not criticism. It is a precise description of what the technology is, and is not. The organisations that win in the AI era will not be those that deploy AI everywhere. They will be those that deploy it precisely — in the Green Zone where it is reliable, with robust oversight in the Yellow Zone, and with appropriate scepticism about the Red Zone until the technology matures.

Thinking Fast (LLM strength) | Thinking Slow (human strength)
Pattern recognition and retrieval | Causal reasoning and logic
Fluent text generation | Novel problem-solving
Summarisation and synthesis | Constraint satisfaction
Code for known patterns | Reliable use of new information
Content creation and drafting | High-stakes decision-making

Before deploying any LLM in production, ask three questions. (1) Is this a Green Zone use case? (2) Have we tested it on our own data, not just published benchmarks? (3) Do we have a verification workflow for outputs where errors would be costly? If any answer is no, do not deploy.
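The three questions can be written down as an explicit gate that a project must pass before release. The sketch below is one possible encoding, not a prescribed process. The class name `DeploymentCheck` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DeploymentCheck:
    green_zone: bool          # (1) Is this a Green Zone use case?
    tested_on_own_data: bool  # (2) Tested on our own data, not just benchmarks?
    has_verification: bool    # (3) Verification workflow where errors are costly?

    def failed(self) -> list[str]:
        """Return the questions that were answered 'no'."""
        checks = {
            "green zone use case": self.green_zone,
            "tested on own data": self.tested_on_own_data,
            "verification workflow": self.has_verification,
        }
        return [name for name, ok in checks.items() if not ok]

    def may_deploy(self) -> bool:
        # Any single 'no' blocks deployment.
        return not self.failed()


check = DeploymentCheck(green_zone=True, tested_on_own_data=False, has_verification=True)
print(check.may_deploy(), check.failed())
```

Recording which question failed, and not only the verdict, gives the team a concrete item to fix before the case is reviewed again.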

Memo prepared by Álvaro de Nicolás · June 2026. For board and executive use.