Talk Like a Caveman: The Claude Code Skill That Cuts Tokens by 65%

Josef Holm · 5 min read · May 15, 2026

Key Takeaways

Every "Sure! I'd be happy to help!" your coding agent emits is billable. At scale, you're funding an AI politeness habit nobody asked for.
Caveman, a 19-year-old's open-source skill with more than 60,000 GitHub stars, strips fluff from Claude Code output and cuts tokens by 65% on average and by up to 87% on complex debugging, without touching code, file paths, or terminal errors.
Brevity doesn't make models dumber. A 2026 paper found large models jumped 26.3 points on hard logic benchmarks under brevity constraints. Verbose prompting masks capability; tight prompting exposes it.
Stop betting on bigger context windows. A 2,000-token compressed spec beats a 40,000-token codebase dump because the model can actually hold it in working attention.
Tokens are a resource worth designing around. Less surface area, more signal. Tighter context, sharper reasoning. Stop paying for the pleasantries.

If you're building with Claude Code, Cursor, or Codex, you're paying a tax you probably haven't audited. Every "Sure! I'd be happy to help you with that!" costs money. Every "the", "actually", and "basically" your model emits is billable. At scale, you're funding an AI politeness habit nobody asked for.

A 19-year-old developer named Julius Brussee noticed. His open-source project Caveman has crossed 60,000 stars on GitHub by doing something almost rude in its simplicity: force the agent to drop the fluff and talk like a caveman.

The savings aren't marginal. They're structural.

What does Caveman actually do?

Caveman is a compression skill that intercepts the model's output layer. It strips articles, filler words, hedging, and rhetorical padding. Fragments and short synonyms replace full sentences.

Average output: 65% fewer tokens on standard developer tasks. On complex debugging, it peaks at 87%.

Does it break code? No. File paths, terminal errors, code blocks, and URLs are preserved byte-for-byte. There's also an "Auto-Clarity" reflex that snaps back to standard English the moment a security warning or destructive database command shows up. Compression is aggressive where it can be and surgical where it has to be.

That distinction matters. Most "save tokens" tricks degrade reliability somewhere downstream. Caveman draws a line: compress prose, preserve anything the machine needs literal.
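
To make that line concrete, here is a minimal Python sketch of the idea: compress prose, copy anything literal through untouched, and back off entirely when something dangerous appears. It is a toy illustration, not Caveman's implementation. The word lists, the `DANGER` pattern, and the function names are all assumptions made up for this example.

```python
import re

# Hypothetical word lists; the real skill's rules are not shown in this article.
ARTICLES = {"a", "an", "the"}
FILLERS = {"actually", "basically", "really", "just", "simply"}

# Spans that must survive byte-for-byte: fenced code, inline code, URLs.
PROTECTED = re.compile(r"(```.*?```|`[^`\n]+`|https?://\S+)", re.DOTALL)

# Hypothetical stand-in for the "Auto-Clarity" trigger: destructive commands.
DANGER = re.compile(r"\b(DROP\s+TABLE|TRUNCATE|rm\s+-rf)\b", re.IGNORECASE)


def compress_prose(text: str) -> str:
    """Drop articles and filler words from a prose-only segment."""
    kept = []
    for word in text.split(" "):
        core = word.strip(".,!?;:").lower()
        if core in ARTICLES or core in FILLERS:
            continue
        kept.append(word)
    return " ".join(kept)


def compress(text: str) -> str:
    """Compress prose, pass protected spans through, skip risky messages."""
    if DANGER.search(text):
        return text  # leave warnings about destructive commands in full English
    parts = PROTECTED.split(text)
    # With one capture group, re.split puts protected spans at odd indexes.
    return "".join(
        part if i % 2 == 1 else compress_prose(part)
        for i, part in enumerate(parts)
    )


if __name__ == "__main__":
    print(compress("The bug is actually in `src/the app.py`."))
```

Note that the word "the" inside the backticked path is left alone, while the same word in the surrounding prose is dropped. That is the whole distinction in miniature.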

Doesn't brevity make the model dumber?

This is where most people get it wrong.

The intuition is that a large, highly parameterized model needs room to "think out loud" to perform well. So forcing it to speak in fragments should hurt reasoning. The data says the opposite.

A 2026 paper by MD Azizul Hakim, "Brevity Constraints Reverse Performance Hierarchies in Language Models," found that verbose prompting actively masks model capability. When forced to be brief, large models stop overthinking and stop accumulating low-probability errors along the way. Accuracy on complex logical benchmarks jumped 26.3 percentage points under brevity constraints.

Read that again. Brevity didn't just save tokens. It made the model measurably smarter on hard problems.

This lines up with something I've watched play out across thirty years of building software. Verbose systems hide their own bugs. Tight systems expose them. Same principle for language models: more words give the model more room to drift, hedge, and contradict itself. Less surface area, fewer places for errors to hide.

What if you stacked compression across the whole stack?

That's where things get interesting. Caveman started as a single skill. It's turning into an architecture.

Cavekit handles workflow. It centralizes a project into a single SPEC.md file written in Caveman syntax, using mathematical symbols like ∀ (for all) and → (leads to) to encode logic. Every test failure gets logged and permanently converted into a compressed invariant. The agent doesn't repeat the same mistake on the next run because the lesson is now part of the spec, encoded in a form that costs almost nothing to read.

Cavemem handles memory. Agents are amnesiacs. Every new session, developers burn thousands of tokens re-explaining the codebase. Cavemem stores session observations in a local SQLite database, but runs them through the Caveman compressor before indexing. The agent recalls deep historical context without flooding its active context window.
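
The compress-then-index pattern is easy to picture with Python's built-in `sqlite3` module. What follows is a rough sketch of the pattern only, not Cavemem's actual schema or API. The table layout, the function names, and the stand-in `compress_note` helper are all hypothetical.

```python
import sqlite3


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a local observation store. Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        " id INTEGER PRIMARY KEY,"
        " session TEXT NOT NULL,"
        " note TEXT NOT NULL)"
    )
    return conn


def compress_note(note: str) -> str:
    """Stand-in compressor that drops a few filler words. Not Caveman's rules."""
    drop = {"a", "an", "the", "basically", "actually"}
    return " ".join(w for w in note.split() if w.lower() not in drop)


def remember(conn: sqlite3.Connection, session: str, note: str) -> None:
    """Compress an observation first, then store it."""
    conn.execute(
        "INSERT INTO observations (session, note) VALUES (?, ?)",
        (session, compress_note(note)),
    )
    conn.commit()


def recall(conn: sqlite3.Connection, keyword: str, limit: int = 5) -> list[str]:
    """Return the most recent compressed notes mentioning a keyword."""
    rows = conn.execute(
        "SELECT note FROM observations WHERE note LIKE ? "
        "ORDER BY id DESC LIMIT ?",
        (f"%{keyword}%", limit),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = open_memory()
    remember(db, "session-1", "The auth module actually lives in src/auth")
    print(recall(db, "auth"))
```

Because the note is compressed before it is written, every later recall pays the reduced token cost, not just the first one.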

Caveman Code, shipping soon, is the flagship CLI. It stacks four compression layers:

L01 normalizes and shrinks the initial prompt.
L02 routes tool calls through a Reduced Token Kernel that compresses noisy terminal output (like npm install logs) by up to 90%.
L03 compresses agent output via the core Caveman skill.
L04 pre-compiles long-lived context files like CLAUDE.md into dense artifacts.

Four independent hops. Each one removes waste the previous layer left behind. The cumulative effect on a real coding workload isn't 65%. It's closer to an order of magnitude.
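
A back-of-the-envelope model shows how the layers add up across a session. In this sketch only the 90% (tool output) and 65% (agent output) figures come from the numbers above. The token mix and the reductions for the prompt and context-file layers are placeholders I made up for illustration, so treat the result as arithmetic, not a benchmark.

```python
# Hypothetical per-session token mix and per-layer reductions (percent).
# Only the 90% and 65% figures come from the article; 30% and 70% are
# placeholder assumptions, as are all of the token counts.
WORKLOAD = {
    "prompt":        {"tokens": 2_000,  "reduction_pct": 30},  # L01
    "tool_output":   {"tokens": 20_000, "reduction_pct": 90},  # L02
    "agent_output":  {"tokens": 6_000,  "reduction_pct": 65},  # L03
    "context_files": {"tokens": 12_000, "reduction_pct": 70},  # L04
}


def tokens_before(workload: dict) -> int:
    """Total tokens with no compression applied."""
    return sum(stream["tokens"] for stream in workload.values())


def tokens_after(workload: dict) -> int:
    """Total tokens once each stream is reduced by its own layer."""
    return sum(
        stream["tokens"] * (100 - stream["reduction_pct"]) // 100
        for stream in workload.values()
    )


def overall_reduction_pct(workload: dict) -> float:
    """Overall percentage of tokens removed across the whole session."""
    before = tokens_before(workload)
    return 100 * (before - tokens_after(workload)) / before


if __name__ == "__main__":
    print(tokens_before(WORKLOAD), tokens_after(WORKLOAD))
    print(overall_reduction_pct(WORKLOAD))
```

With these placeholder numbers, 40,000 tokens shrink to 9,100, roughly a 4.4x reduction. How close a real workload gets to an order of magnitude depends mostly on how much of it is noisy tool output, which is the stream with the steepest cut.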

Why does this matter beyond the cost line?

The obvious read: cheaper inference, longer sessions, smaller bills. True. But it's the boring part.

The interesting part is what this tells you about how to design agentic systems.

Most teams I talk to are betting on bigger context windows. The pitch is always the same: more context equals better reasoning, fewer hallucinations, smarter agents. So they pay for larger windows, dump more documentation in, and hope the model figures out what's relevant.

Caveman points the other direction. The answer to complex agentic workflows isn't a larger, more expensive context window. It's a smaller, sharper one. Density beats volume. A 2,000-token compressed spec beats a 40,000-token sprawling codebase dump, because the model can actually hold the compressed version in working attention without drift.

Same lesson we keep running into with AI integration at the company level. The teams that win aren't the ones who feed the most data into the model. They're the ones who feed the right data, in the right form, at the right time. That's an engineering problem, not a budget problem.

What should you actually do with this?

If you're running AI coding agents in production, a few things are worth doing this week.

Start with an audit of your token spend by category. How much of your monthly bill is going to model pleasantries, verbose tool outputs, and re-explaining the same codebase every session? Most teams have never run this number. The first time they do, it changes how they architect.
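
If your usage logs can be tagged by category, the audit itself is a few lines of Python. This sketch assumes a hypothetical record shape, a dict with `category` and `tokens` keys. Your provider's usage export will look different, so the field names here are illustrative only.

```python
from collections import defaultdict


def spend_by_category(records: list[dict]) -> dict[str, tuple[int, float]]:
    """Sum tokens per category and return (tokens, share), largest first."""
    totals: dict[str, int] = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += rec["tokens"]
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {
        category: (tokens, tokens / grand_total if grand_total else 0.0)
        for category, tokens in ranked
    }


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"category": "tool_output", "tokens": 5_000},
        {"category": "pleasantries", "tokens": 1_000},
        {"category": "re_explaining_codebase", "tokens": 2_000},
        {"category": "tool_output", "tokens": 2_000},
    ]
    for category, (tokens, share) in spend_by_category(sample).items():
        print(f"{category}: {tokens} tokens ({share:.0%})")
```

The hard part is the tagging, not the math. Once each request carries a category, the ranking tells you which layer to compress first.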

Next, try Caveman on one workflow. Pick a debugging task. Run it twice, once standard, once with Caveman enabled. Compare token cost, time to resolution, and accuracy. Let the data tell you whether the compression is worth it for your specific use case.
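
A small Python sketch for recording that comparison follows. The `Run` fields and the example numbers are hypothetical, and "tests passed" is just one possible proxy for accuracy. Substitute whatever your own workflow measures.

```python
from dataclasses import dataclass


@dataclass
class Run:
    label: str
    output_tokens: int
    seconds_to_resolution: float
    tests_passed: bool  # one possible accuracy proxy (an assumption)


def compare(baseline: Run, candidate: Run) -> dict:
    """Summarize token savings, time change, and any accuracy regression."""
    saved = baseline.output_tokens - candidate.output_tokens
    return {
        "token_savings_pct": 100 * saved / baseline.output_tokens,
        "time_delta_s": candidate.seconds_to_resolution
        - baseline.seconds_to_resolution,
        "accuracy_regressed": baseline.tests_passed
        and not candidate.tests_passed,
    }


if __name__ == "__main__":
    # Made-up measurements, for illustration only.
    standard = Run("standard", 4_000, 120.0, True)
    compressed = Run("caveman", 1_400, 95.0, True)
    print(compare(standard, compressed))
```

One pair of runs is an anecdote. Repeat the comparison across several tasks before you trust the percentage.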

Then stop assuming bigger context windows are the answer. They're an answer. They're rarely the cheapest or the most reliable one. The teams I see getting real use from AI agents are the ones treating tokens as a designed resource, not tap water.

This is exactly the kind of operational drift we run into during an AI Operating Audit. Companies are spending real money on AI tooling without ever asking whether the architecture underneath it is wasteful by default. Most of the time, it is. The fix isn't a bigger model. It's a sharper system around the model you already have.

The takeaway

Tokens are a resource worth designing around. They behave like compute, storage, and bandwidth: ignored at small scale, decisive at large scale. The companies that figure this out early will run AI workflows at a fraction of what their competitors pay, with better accuracy on top.

Caveman is one tactic. The lesson behind it is the real point. Less surface area, more signal. Tighter context, sharper reasoning. Stop paying for the pleasantries.

If you're trying to figure out where your AI stack is leaking money and capability, that's the kind of work we do every day at Holm Intelligence Partners. The good news: the fixes are usually closer than people think.

Get Caveman on GitHub: https://github.com/juliusbrussee/caveman

Frequently Asked Questions

What is Caveman and how does it cut token usage?

Caveman is an open-source Claude Code skill built by 19-year-old developer Julius Brussee. It intercepts the model's output layer and strips articles, filler words, hedging, and rhetorical padding, replacing full sentences with fragments and short synonyms. Average savings are 65% on standard developer tasks, peaking at 87% on complex debugging. File paths, terminal errors, code blocks, and URLs are preserved byte-for-byte.

Does forcing the model to be brief hurt its reasoning?

The opposite. A 2026 paper by MD Azizul Hakim found that verbose prompting actively masks model capability. When forced to be brief, large models stop overthinking and stop accumulating low-probability errors along the way. Accuracy on complex logical benchmarks jumped 26.3 percentage points under brevity constraints. Less surface area means fewer places for errors to hide.

What are Cavekit, Cavemem, and Caveman Code?

They're the architecture stacking up around the core Caveman skill. Cavekit centralizes a project into a single SPEC.md file in Caveman syntax. Cavemem stores compressed session observations in a local SQLite database so agents stop burning tokens re-explaining the codebase. Caveman Code is a CLI that stacks four compression layers across prompt, tool calls, agent output, and long-lived context files.

Should I keep paying for bigger context windows?

Probably not as a default. Most teams bet on bigger context windows and dump more documentation in, hoping the model figures out what's relevant. Caveman points the other direction: density beats volume. A 2,000-token compressed spec beats a 40,000-token sprawling codebase dump, because the model can actually hold the compressed version in working attention without drift.

What's the first thing to do if I'm running AI coding agents in production?

Audit your token spend by category. How much of your monthly bill is going to model pleasantries, verbose tool outputs, and re-explaining the same codebase every session? Most teams have never run this number. Then try Caveman on one debugging workflow, run it twice (standard versus compressed), and compare cost, time to resolution, and accuracy.

What's the broader lesson beyond saving money on inference?

Tokens behave like compute, storage, and bandwidth: ignored at small scale, decisive at large scale. The teams getting real use from AI agents treat tokens as a designed resource, not tap water. Same lesson at the company level: the winners aren't the ones feeding the most data into the model, they're the ones feeding the right data, in the right form, at the right time.