Introduction: relevance is as important as focus
The central, counterintuitive finding is this: we are focusing on the wrong things. We are impressed by what large language models (LLMs) do well in tasks we find hard, such as writing fluent essays and generating code, and blind to their failures in tasks we find easy: basic reasoning, simple arithmetic, and reliably using new information.
LLMs are not reasoning. They are Thinking Fast — using massive memorisation to predict the next word — when what enterprises need is Thinking Slow: the ability to identify structure, follow constraints, and reason consistently.
An LLM's impressive fluency is a distraction from its critical reliability failures. For enterprise use cases where accuracy is non-negotiable, a 60–70% reliability rate is catastrophic.
1. Fluency is not intelligence — the 10-million-book analogy
We are trained to equate linguistic coherence with intelligence. With AI, that instinct is a liability.
An avid human reader will get through perhaps 3,000 books in a lifetime. The LLMs you use today were trained on the equivalent of 10 to 100 million. The model is not "smart". It is closer to an index of a vast share of what humanity has written. It retrieves and re-patterns that information fluently, but it does not understand it.
| Dimension | Human expert (3,000 books) | LLM (10–100M books) |
|---|---|---|
| Knowledge breadth | Narrow but deep | Extraordinarily broad but shallow |
| Reasoning | Genuine causal reasoning | Pattern-matching mimicry |
| Reliability | High in domain of expertise | 60–70% across all domains |
| New information | Can learn and adapt | Cannot reliably override training (RAG failure) |
| Failure mode | Admits uncertainty | Confidently wrong (hallucination) |
Insight for CEOs. Do not be seduced by fluency. The question to ask of any AI output is not "does this sound right?" but "is this actually correct, and can I verify it?" Build verification into every AI-assisted process.
2. Myth one: AI can "reason" — the 70% calculator
The most dangerous myth is that LLMs perform reliable logical decision-making. They do not. They mimic reasoning they have seen in training data.
Imagine a calculator that gives you the right answer 70% of the time. That is the state of LLM reasoning today. 60–70% is impressive for summarisation. It is catastrophic for finance, logistics, or compliance, where the standard is 100%.
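To see why a per-step figure in this range matters, consider a workflow that chains several model steps. The sketch below assumes, purely for illustration, that every step succeeds independently and with the same probability; real pipelines will not behave this cleanly, and the function name `chained_reliability` is hypothetical.

```python
def chained_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumption (for illustration only): steps are independent and
    share the same per-step reliability.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return per_step ** steps


# Compare a single step with three-step and five-step chains at 70%.
for n in (1, 3, 5):
    print(f"{n} step(s) at 70%: {chained_reliability(0.7, n):.1%}")
```

Under these assumptions, a five-step chain at 70% per step completes correctly about 17% of the time, which is why a figure that looks tolerable for one answer becomes unusable for a multi-step process.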
The two-logics test
| Test | Question | LLM result | Why it fails |
|---|---|---|---|
| The Linda problem | Is Linda more likely to be (A) a bank teller, or (B) a bank teller and active in the feminist movement? | Often gets it "right" — by memory, not reasoning | The puzzle is in the training data. The model retrieves the canonical answer without understanding the underlying probability. |
| The -ING word problem | A novel variant with the same logical structure | Fails | The model cannot generalise the logic to a problem it has not seen. |
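The logic both puzzles share is the conjunction rule: the probability of "A and B" can never exceed the probability of "A" alone. A minimal sketch with invented counts (the numbers are assumptions for illustration, not survey data) makes the rule concrete.

```python
# Hypothetical group of 1,000 people who match Linda's description.
# All counts are illustrative assumptions.
population = 1000
bank_tellers = 50            # event A
feminist_bank_tellers = 40   # event "A and B", necessarily a subset of A

p_a = bank_tellers / population
p_a_and_b = feminist_bank_tellers / population

# Everyone counted in "A and B" is also counted in "A",
# so the conjunction cannot be the more probable option.
assert p_a_and_b <= p_a

print(f"P(bank teller) = {p_a:.3f}")
print(f"P(bank teller and feminist) = {p_a_and_b:.3f}")
```

Whatever counts are chosen, the second figure cannot exceed the first. A system that has learned this rule should apply it to any new wording of the puzzle, not only to the famous one.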
Insight for CEOs. LLMs can pass tests they have seen before. They cannot reliably solve novel problems. Any AI system used for decision-making must be tested on your specific, novel use cases — not on published benchmarks.
3. Myth two: AI can "find" your data — the R in RAG is broken
The industry is betting on Retrieval-Augmented Generation: point an LLM at your reports, tables and policies, and trust it to use them reliably. The research shows this is dangerously flawed.
Context vs. training conflict
LLMs struggle to prioritise new information (your data) over old training. Give the model a new fact in context — "Athens emerged as an economic power in the 4th century" — and it often reverts to what it learned in pre-training (the 6th century).
| Scenario | What the model was given | What the model said | Implication |
|---|---|---|---|
| No context | None | Correct (from training) | Default behaviour |
| Conflicting context | "Athens emerged in the 4th century" | Ignored the new fact | Cannot reliably override training with your data |
The data-silo failure
Your information lives across heterogeneous tables and documents. Ask: "What is the highest eligible free-lunch rate for K-12 schools in the most populated county in California?" A human walks four hops: county → districts → schools → rates. Standard RAG retrieves the bookends (the county text, the rates table) but misses the bridge tables that connect them.
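The chain of lookups can be sketched with toy tables. Every name and number below is invented for illustration; the point is only that the correct answer depends on the two bridge tables in the middle.

```python
from typing import Tuple

# Toy tables; all names and values are hypothetical.
county_population = {"County A": 9_800_000, "County B": 3_300_000}
districts_by_county = {
    "County A": ["District 1", "District 2"],
    "County B": ["District 3"],
}
schools_by_district = {
    "District 1": ["School X", "School Y"],
    "District 2": ["School Z"],
    "District 3": ["School W"],
}
free_lunch_rate = {
    "School X": 0.62,
    "School Y": 0.81,
    "School Z": 0.47,
    "School W": 0.93,
}


def highest_rate_in_largest_county() -> Tuple[str, float]:
    # Step 1: identify the most populated county.
    county = max(county_population, key=county_population.get)
    # Step 2: county -> districts (first bridge table).
    districts = districts_by_county[county]
    # Step 3: districts -> schools (second bridge table).
    schools = [s for d in districts for s in schools_by_district[d]]
    # Step 4: schools -> rates, then take the highest.
    best = max(schools, key=free_lunch_rate.get)
    return best, free_lunch_rate[best]


print(highest_rate_in_largest_county())
```

In this toy data, a system that consults only the rates table would return School W, which has the highest rate overall but sits in the smaller county. Only the bridge tables exclude it and lead to School Y.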
Insight for CEOs. Standard RAG is not a reliable enterprise data solution. Before deploying any AI that reads your data, test it on multi-hop queries that require joining information from multiple sources. Most will fail.
4. Myth three: AI will solve problems it has never seen
LLMs are fundamentally interpolators, not extrapolators.
| Problem type | LLM performance | What it tells you |
|---|---|---|
| From training distribution | 80–90% | Excels on familiar patterns |
| Slightly modified | 50–60% | Small changes break performance |
| Genuinely novel | 20–40% | Fails at true generalisation |
| Multi-constraint optimisation | 10–30% | Scheduling, routing, allocation — particularly bad |
Insight for CEOs. The use cases where LLMs are most reliable (summarisation, drafting, code generation for known patterns) are exactly the use cases where they are most valuable. By contrast, the use cases where you most need AI to be reliable (novel analysis, complex decisions, optimisation) are exactly where it fails.
5. A framework for safe deployment
| Zone | Characteristics | Risk | Approach |
|---|---|---|---|
| Green | Summarisation, drafting, code for known patterns, content creation | Low | Deploy with human review. Productivity gains are real and reliable. |
| Yellow | Extraction, classification, Q&A on a single document | Medium | Test rigorously on your data; mandatory human verification. |
| Red | Multi-document reasoning, optimisation, novel problem-solving, compliance decisions | High | Do not deploy standard LLMs. Use specialised architectures, extensive testing, or human-in-the-loop review instead. |
Insight for CEOs. The most dangerous deployments are Red Zone use cases misclassified as Green by teams seduced by a demo. Demand empirical testing on your specific cases before any production deployment.
6. Conclusion: the Thinking Slow imperative
The LLMs available today are extraordinary tools for Thinking Fast — pattern-matching, retrieval, fluent generation. They are fundamentally unreliable for Thinking Slow — logical reasoning, constraint satisfaction, the reliable use of new information.
This is not criticism. It is a precise description of what the technology is, and is not. The organisations that win in the AI era will not be those that deploy AI everywhere. They will be those that deploy it precisely — in the Green Zone where it is reliable, with robust oversight in the Yellow Zone, and with appropriate scepticism about the Red Zone until the technology matures.
| Thinking Fast (LLM strength) | Thinking Slow (human strength) |
|---|---|
| Pattern recognition and retrieval | Causal reasoning and logic |
| Fluent text generation | Novel problem-solving |
| Summarisation and synthesis | Constraint satisfaction |
| Code for known patterns | Reliable use of new information |
| Content creation and drafting | High-stakes decision-making |
Before deploying any LLM in production, ask three questions. (1) Is this a Green Zone use case? (2) Have we tested it on our own data, not just published benchmarks? (3) Do we have a verification workflow for outputs where errors would be costly? If any answer is no, do not deploy.
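For teams that want this gate in a review checklist or release script, the three questions can be encoded directly. This is a minimal sketch; the class and field names are hypothetical, and the answers still have to come from people who have done the testing.

```python
from dataclasses import dataclass


@dataclass
class DeploymentCheck:
    green_zone: bool             # (1) Is this a Green Zone use case?
    tested_on_own_data: bool     # (2) Tested on our own data, not only benchmarks?
    verification_workflow: bool  # (3) Verification workflow where errors are costly?

    def approved(self) -> bool:
        # The memo's rule: any "no" means do not deploy.
        return (
            self.green_zone
            and self.tested_on_own_data
            and self.verification_workflow
        )


check = DeploymentCheck(
    green_zone=True,
    tested_on_own_data=False,
    verification_workflow=True,
)
print("deploy" if check.approved() else "do not deploy")
```

The example prints "do not deploy" because the second question is answered no, even though the other two pass.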
Memo prepared by Álvaro de Nicolás · June 2026. For board and executive use.