87% Of Consultants Got Worse At Their Job With AI

Josef Holm · 7 min read · May 31, 2026

Key Takeaways

Harvard studied 244 BCG consultants, logging nearly 5,000 AI conversations and running 237 interviews. 87% were using AI in ways that made them measurably worse at the job while feeling like the tools were helping. They were training themselves out of the judgment that made them worth paying for.
Frontier models aggregate, they do not reason. HBR's test across 15,000 strategy scenarios found that the order in which options appeared on the page moved the recommendation 19%, more than rich context (11%), clearer prompts (2%), or telling the model to reason carefully (almost nothing). The confidence looks the same whether the answer came from your data or the model's bias.
METR's randomised trial on 16 experienced developers: predicted 24% faster, felt 20% faster, measured 19% slower. If the productivity number on your AI rollout came from a survey rather than a stopwatch, it is a feeling, not a fact. Duke and the Federal Reserve keep finding the same shape.
Citadel works because the domain is narrow, the data is proprietary, the answers are validated, and the humans around the system are experts. Pull any of those four legs and the productivity story falls over. Most mid-market firms are buying the opposite: marketing's credit card tool, HR's unreviewed screener, finance's nine-month-old pilot, legal's public-model paste.
The 14% who got better in the Harvard study did one thing the other 87% did not. They kept thinking. The first move before the next board meeting is the inventory most firms cannot produce in under a week.

Harvard sat down with 244 BCG consultants, logged nearly 5,000 AI conversations, and ran 237 hour-long interviews on top. The finding nobody is putting on a board slide: 87% of the consultants were using AI in ways that made them measurably worse at their actual job, while feeling like the tools were helping.

This is the operator question hiding inside every "AI productivity" pitch on your desk. Not whether AI works. Whether the way your team is using it is quietly eroding the expertise you pay them for.

What Harvard Actually Found

Three patterns emerged across the 244 consultants. Each consultant was given the same fictional retail strategy problem: three brands, financial data, interview notes, a CEO asking which brand to back for growth.

The Cyborgs (around 60%) fed everything into AI and iterated. They weren't naive about it. They asked it to double-check itself. But the AI was the source of truth at every step. They got better at AI. They got worse at understanding the business.

The Centaurs (around 14%) used AI for background tasks. Research on category trends. The right Excel formula for compound annual growth rate. The analysis and the argument stayed theirs. They got sharper as professionals. Their AI skills barely moved.

The Self-Automators (around 27%) pasted the entire problem into a single prompt, took the answer, moved on. One participant dumped every interview transcript and every financial table at once and asked for the recommendation, the rationale, and the memo in one shot. They gained nothing in either direction.

The easy read is "be a Centaur". The harder read is that 87% of the trained, expensive, intelligent people in this study were developing AI fluency that did not produce more accurate answers. They weren't training their replacements. They were training themselves out of the judgment that made them worth paying for in the first place.

Why The Tool Confidently Tells You The Wrong Thing

Frontier models don't reason. They aggregate. They pull together what is already known and present it with confidence. That is a different job from what your senior people do.

Harvard Business Review tested every major frontier model (Claude, ChatGPT, Gemini) across 15,000 strategy scenarios. The tools consistently recommended trendy strategies regardless of the actual context of the question. Prompting fixes barely moved the bias. Clearer prompts: 2% improvement. Rich context with unique details: 11%. Telling the model to reason more carefully: almost nothing. Flipping which option appeared first on the page: a 19% swing in the recommendation. The order on the page moved the answer more than the substance of the question.
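The order effect is something a firm can probe in its own workflows: ask the same question with the options in every possible order and count how often the pick changes. What follows is a minimal sketch, not the HBR method. It assumes a hypothetical `ask` callable that wraps whichever model the firm uses and returns the chosen option. The two stand-in "models" exist only to show the two extremes.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Sequence


def order_sensitivity(
    ask: Callable[[str, Sequence[str]], str],
    question: str,
    options: Sequence[str],
) -> float:
    """Share of option orderings whose pick differs from the most common pick.

    `ask` is a hypothetical wrapper around a model call: it receives the
    question and the options in a given order and returns the chosen option.
    0.0 means the order never changed the answer.
    """
    picks = [ask(question, order) for order in permutations(options)]
    _, top_count = Counter(picks).most_common(1)[0]
    return 1 - top_count / len(picks)


# Stand-in "model" that always picks whatever is listed first (worst case).
def first_option_bias(question: str, options: Sequence[str]) -> str:
    return options[0]


# Stand-in "model" that ignores order and picks on content alone.
def content_based(question: str, options: Sequence[str]) -> str:
    return min(options)  # alphabetical; any order-independent rule would do


brands = ["Brand A", "Brand B", "Brand C"]
biased_score = order_sensitivity(first_option_bias, "Which brand to back?", brands)
stable_score = order_sensitivity(content_based, "Which brand to back?", brands)
print(f"first-option bias: {biased_score:.2f}, content-based: {stable_score:.2f}")
```

A score well above zero on a real strategy question suggests the recommendation is tracking page layout rather than substance. With many options, sampling a handful of orderings is cheaper than running every permutation.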

A senior operator working inside unmanaged workflows has no easy way to verify whether the AI's answer came from the firm's data or from the model's bias toward whatever is trending. The output looks the same either way. That is the exposure surface most owner-operators are underwriting without naming it.

The Productivity Number You Are Probably Quoting Is Wrong

Research nonprofit METR ran a randomised controlled trial with 16 experienced developers, randomly assigning each task to be done with or without AI tools. The developers predicted AI would make them 24% faster. After the study they reported feeling 20% faster. Measurement showed they were 19% slower. Independent studies from Duke and the Federal Reserve Banks of Richmond and Atlanta keep finding the same shape. Workers feel faster than they actually are.

If you're running an executive offsite and someone quotes a productivity gain from internal AI rollout, the question to ask is whether that number came from a stopwatch or a survey. Most of the numbers in circulation right now came from surveys.
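The stopwatch-versus-survey question comes down to simple arithmetic once someone has timed comparable tasks with and without AI. The sketch below uses invented timings and a hypothetical survey figure, not METR's data, and the function name is illustrative only.

```python
from statistics import mean


def measured_time_change(minutes_without_ai, minutes_with_ai):
    """Relative change in mean completion time; positive means slower with AI."""
    baseline = mean(minutes_without_ai)
    return (mean(minutes_with_ai) - baseline) / baseline


# Hypothetical timings in minutes for comparable tasks (illustrative only).
without_ai = [50, 60, 70]
with_ai = [60, 72, 84]

survey_claim = -0.20  # team reports tasks taking 20% less time with AI
stopwatch = measured_time_change(without_ai, with_ai)
gap = stopwatch - survey_claim  # how far the feeling sits from the measurement

print(f"survey: {survey_claim:+.0%}, stopwatch: {stopwatch:+.0%}, gap: {gap:+.0%}")
```

The comparison only means something if the timed tasks are genuinely comparable and assigned without cherry-picking, which is the point of randomising them.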

Citadel Works. Why Most Firms Struggle To Copy It On Current Trajectory.

Ken Griffin at Davos in January called AI "all garbage when you dig below the surface" and framed the projected $500B in 2026 data centre spending as hype pushed to justify capex. At the Stanford Leadership Forum a few months later he reversed. He described a "step change function" in productivity. Work that used to take master's and PhD finance people weeks or months was being done by AI agents in hours or days.

The mainstream read is "even Griffin came around". The operator read is harder. Citadel didn't buy ChatGPT licences. They spent decades building one of the largest proprietary financial datasets in the world, then years more building infrastructure on top of it. The work happens inside a very narrow domain with trackable data points and clear win/loss states. AI is the interface to that data. The intelligence is the dataset, the validation, and the experts around the system.

A Stanford and MIT study of 5,179 customer support agents at a Fortune 500 company found the same shape from the other end. AI access lifted productivity 14% on average. Novices got a 34% boost. Experienced workers got almost nothing. The domain was narrow. The right answers were known. The AI surfaced them. It did not figure them out.

This is the operator pattern worth naming. AI as aggregator, not analyst. It works when the domain is narrow, the data is proprietary, the answers are validated, and the humans around the system are experts. Pull any of those four legs and the productivity story falls over.

What Most Mid-Market Firms Are Actually Buying

The mid-market private firm pattern looks nothing like Citadel. Marketing bought a generative tool on a credit card. HR is running a screening tool nobody in IT reviewed. Finance trialled a predictive analytics platform that is still sitting on the SaaS bill nine months later. Legal pasted draft mandate language into a public model last quarter. Nobody has a list. Nobody has reviewed the sub-processors. The DPO is, on paper, accountable for systems she does not know exist.

This is the missing decision layer. In Josef Holm's positioning shorthand it is F3, zombie pilots and Shadow AI. In plain operator terms it's the SaaS line your CFO can read and cannot defend. The question worth asking inside the firm this quarter is not "what should our AI strategy be". The question is "which of the four AI tools we are already paying for survives a Kill, Fix, Build verdict by someone qualified to make one".

The 2028 Compounding Problem

The Harvard finding compounds badly under the trajectory the UAE federal directive points to. On 23 April 2026, Federal Agentic AI Directive M1 set the target: 50% of federal sectors, services, and operations shifted to Agentic AI models by April 2028. Procurement, tax auditing, customer happiness, and technical support agents were named as the first deployments. Concurrent with the directive, a government services digital records policy was approved, legally establishing digital records as the single official source of core data the agents would read from.

Read that sequencing carefully. The federal government put the data policy in place before any agents went live. They acknowledged, in writing, that autonomous agents do not work on fragmented data.

The private sector did not get the same hint. On 4 May 2026, M2 extended the same 24-month clock to Dubai's entire private sector, with the Dubai Chamber of Commerce administering training tracks and tying future commercial licensing, council participation, and access to state-backed capital to demonstrated progress on Agentic AI adoption. Throughput and data sovereignty land on the same page. A meaningful share of B2G interactions by 2028 (corporate tax filings, procurement bids, customs declarations, compliance audits) is expected to be tied to algorithm-to-algorithm processing by federal AI agents rather than human bureaucrats. Firms whose internal AI is a stack of unmanaged tools without a sovereign data layer underneath are highly unlikely to possess a defensible AI strategy on that timeline. They have a SaaS bill.

What To Do Before The Next Board Meeting

The path forward is not better prompting. It is deeper expertise in a narrow domain, a clear-eyed view of what AI can and cannot do inside that domain, and proprietary data and judgment the model is asked to apply rather than generate. That is the Centaur position at firm level.

The first move is operational, not strategic. Inventory every AI tool already running in the firm. Map which department bought it, which data it touches, which sub-processor sits behind it, and whether anyone outside the originating team has ever reviewed the output for accuracy. Most firms can't produce that list in under a week. That gap is the latent AI Fragmentation across your pipeline, and it's the gap an autonomous agent runs into on day one.

The second move is to name a decision layer. Someone senior, with cross-functional authority, who owns the Kill, Fix, Build call on each tool and each workflow. Not a committee. Not the IT helpdesk. The person whose name goes next to the verdict.
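The inventory and the decision layer can live in something as plain as one record per tool. The sketch below is one possible shape, not a prescribed schema: the field names, the `Verdict` enum, the `open_gaps` helper, and the sample entries (including the vendor name) are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Verdict(Enum):
    KILL = "kill"
    FIX = "fix"
    BUILD = "build"


@dataclass
class AITool:
    name: str
    bought_by: str                 # department that purchased it
    data_touched: List[str]        # categories of data the tool reads or stores
    sub_processor: Optional[str]   # vendor behind the tool, None if unknown
    reviewed_outside_team: bool    # has anyone outside the buyer checked output accuracy?
    verdict: Optional[Verdict] = None
    verdict_owner: Optional[str] = None  # one named person, not a committee


def open_gaps(tool: AITool) -> List[str]:
    """List what is still missing before the tool counts as governed."""
    gaps = []
    if tool.sub_processor is None:
        gaps.append("sub-processor unknown")
    if not tool.reviewed_outside_team:
        gaps.append("output never reviewed outside originating team")
    if tool.verdict is None or tool.verdict_owner is None:
        gaps.append("no owned Kill/Fix/Build verdict")
    return gaps


# Hypothetical entries, loosely modelled on the mid-market examples above.
inventory = [
    AITool("generative copy tool", "Marketing", ["campaign drafts"], None, False),
    AITool("CV screener", "HR", ["candidate CVs"], "ExampleVendor Ltd", True,
           Verdict.FIX, "COO"),
]

for tool in inventory:
    print(tool.name, "->", open_gaps(tool) or "governed")
```

A spreadsheet with the same columns does the same job. What matters is that every row has a named owner and that an empty cell is visible as a gap.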

If the firm does not have that person internally and the 2028 clock is real on your trajectory, that is the gap an AI readiness note is built to close. Fixed scope, bounded, an Opportunity Map at the end of it that names what to kill, what to fix, what to build, and what the governance line around the survivors looks like.

The firms that get this right won't be the ones with the most AI tools. They'll be the ones who treated AI as leverage on top of expertise they already had. The 14% who got better in the Harvard study did one thing the other 87% did not. They kept thinking.

Frequently Asked Questions

What did the Harvard BCG study actually find about AI use among consultants?

Across 244 consultants, 5,000 AI conversations, and 237 interviews, 87% were using AI in ways that made them measurably worse at their job while feeling more productive. Only the 14% Centaurs (who used AI for background tasks and kept the analysis themselves) got sharper. The rest trained themselves out of the judgment they were being paid for.

Why does AI confidently give wrong strategic answers?

Frontier models aggregate, they do not reason. Harvard Business Review tested Claude, ChatGPT and Gemini across 15,000 strategy scenarios. The models recommended trendy strategies regardless of context. Better prompts moved accuracy 2%. Rich context moved it 11%. Flipping option order on the page moved the recommendation 19%. The order mattered more than the substance.

Are the AI productivity numbers being quoted at offsites real?

Usually not. METR's randomised controlled trial on 16 experienced developers found they felt 20% faster with AI and were measured 19% slower. Duke and Federal Reserve studies show the same gap. If a productivity number came from a survey rather than a stopwatch, treat it as feeling, not fact.

Why does AI work at Citadel but stall at most mid-market firms?

Citadel did not buy ChatGPT licences. They built one of the largest proprietary financial datasets in the world, then years of infrastructure on top of it, inside a narrow domain with trackable win/loss states. AI is the interface, the dataset is the intelligence. Pull any of those legs (narrow domain, proprietary data, validated answers, expert humans) and the productivity story collapses.

What should an owner-operator actually do this quarter?

Inventory every AI tool already running. Map who bought it, which data it touches, which sub-processor sits behind it, and whether anyone outside the team that bought it has reviewed the output for accuracy. Then name one senior person with cross-functional authority who owns the Kill, Fix, Build verdict. Not a committee. The person whose name goes next to the call.

How does the 2028 UAE Agentic AI mandate change this?

The federal directive of 23 April 2026 sets 50% of federal sectors on Agentic AI by April 2028. M2 extended the same 24-month clock to Dubai's private sector on 4 May 2026, tying licensing and council participation to adoption. A meaningful share of B2G interactions is expected to be algorithm-to-algorithm on that trajectory. Firms running a stack of unmanaged tools without a sovereign data layer are unlikely to have a defensible position by then.