Your AI Agents Are Brilliant Goldfish

Josef Holm · 6 min read

Key Takeaways

  • Most AI agents in production are brilliant goldfish with the keys to the company. No memory, no scheduler, no identity, no audit trail. Each one a standalone script wrapped around an API call.
  • Five things break every time you skip the OS layer: memory loss, resource collisions, tool chaos, no identity, no observability. These are the default conditions of agent deployments right now, not edge cases.
  • The agent OS is a middle layer with six functions: scheduler, memory manager, tool manager (with sandbox), identity manager, observability, and guardrails. Each one solves a specific class of failure that kills agent projects.
  • Memory, identity, and observability are non-negotiable from day one. Until your governance layer is mature, a human in the loop on money, external comms, and production data IS your governance layer.
  • The model gets the headlines. The OS does the work. Decide which one you're investing in.

Right now, somewhere in the world, an AI agent is booking flights, writing code, and answering customer tickets. And it has no idea what it did five minutes ago.

That's the actual state of most agent deployments I see in the wild. Brilliant goldfish with the keys to the company. The fix isn't a smarter model. It's the boring infrastructure layer underneath: an operating system for agents.

Most teams are skipping it. That's the mistake.

Why does an AI agent need an operating system at all?

Think about your laptop. Click Spotify, and something invisible decides how the sound reaches your speakers. Run Chrome and Word at the same time, and something keeps them from fighting over memory. Plug in a USB drive, and something says "new device, here's how you use it."

That something is the OS. You never see it. Without it, the machine is a brick.

Now look at how most companies deploy AI agents today. No scheduler. No shared memory. No identity layer. No audit trail. Each agent is a standalone script wrapped around an API call, hoping nothing goes sideways.

It works. Until it really, really doesn't.

Here's what most people miss: the model is not the product. The model is one component. The product is the system that decides what the agent can do, when it can do it, what it remembers, and what it's never allowed to touch. That system is the OS.

What actually breaks when you deploy agents without one?

Five things, every time.

Memory loss. Every new conversation starts from zero. Your "intelligent" support agent has no idea this customer has emailed fourteen times this week. The user has to re-explain context they already gave you. That's not intelligence. That's amnesia at scale.

Resource collisions. Ten agents all want the model at the same time. A live customer chat is sitting behind a background job summarizing yesterday's tickets. Nobody is deciding what's urgent.

Tool chaos. An agent writes code and runs it. Where? Against what? With what permissions? In most setups, nobody can answer that cleanly.

No identity. When an agent takes an action on behalf of a user, who actually authorized it? If it spent money or sent an email or changed a record, can you prove the chain of authority? Usually no.

No observability. Something went wrong. A refund got approved that shouldn't have. Why? In most deployments, you can't replay the decision. You're guessing.

These aren't edge cases. These are the default conditions of agent deployments right now.

What does an agent operating system actually do?

Think of it as a three-layer cake.

At the top sit the agents themselves. The travel agent. The coding agent. The customer service agent. These are the workers.

At the bottom is infrastructure. Models, databases, APIs, compute.

The middle layer is where the work happens. That's the agent OS. Six functions live there, and each one solves a specific class of failure I've seen kill agent projects.

The scheduler

Also called the orchestrator. This is the calendar. When ten agents want resources, the scheduler decides who goes first. The live customer chat gets priority over the background summary job. Sounds obvious. Almost nobody implements it.
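
Here's a minimal sketch of that idea in Python, assuming a simple in-process priority queue. The `Scheduler` class, the `Job` type, and the two priority constants are hypothetical names for illustration, not part of any particular framework.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical priority levels: a lower number runs first.
PRIORITY_LIVE_CHAT = 0
PRIORITY_BACKGROUND = 10


@dataclass(order=True)
class Job:
    priority: int
    seq: int                          # tie-breaker: first in, first out within a priority
    name: str = field(compare=False)  # not used when ordering jobs


class Scheduler:
    """Decides which pending job gets the model next."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, name, priority):
        heapq.heappush(self._heap, Job(priority, next(self._counter), name))

    def next_job(self):
        """Return the name of the most urgent job, or None if nothing is waiting."""
        return heapq.heappop(self._heap).name if self._heap else None


scheduler = Scheduler()
scheduler.submit("summarize yesterday's tickets", PRIORITY_BACKGROUND)
scheduler.submit("live customer chat", PRIORITY_LIVE_CHAT)
order = [scheduler.next_job(), scheduler.next_job()]
```

The background job was submitted first, but the live chat comes out first. A production scheduler would also need rate limits, timeouts, and protection against starving low-priority work, none of which is shown here.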

The memory manager

This is the fix for the goldfish problem. Short-term memory for the active conversation. Long-term memory for what happened last week. Episodic memory for "the last time I tried this approach, it failed." Your HR agent should remember you asked about parental leave last month. Without a memory manager, it can't.
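
The three kinds of memory can be sketched as one small class. This is an in-memory illustration only, assuming a real system would back it with a database; `MemoryManager` and its method names are hypothetical.

```python
from collections import defaultdict, deque


class MemoryManager:
    def __init__(self, short_term_size=20):
        # Short-term: the last few turns of each active conversation.
        self._short_term = defaultdict(lambda: deque(maxlen=short_term_size))
        # Long-term: facts about a user that outlive any one conversation.
        self._long_term = defaultdict(list)
        # Episodic: what was tried before and whether it worked.
        self._episodes = []

    def remember_turn(self, conversation_id, message):
        self._short_term[conversation_id].append(message)

    def remember_fact(self, user_id, fact):
        self._long_term[user_id].append(fact)

    def record_episode(self, approach, succeeded):
        self._episodes.append((approach, succeeded))

    def context_for(self, user_id, conversation_id):
        """Assemble what the agent should know before it answers."""
        return {
            "recent_turns": list(self._short_term[conversation_id]),
            "known_facts": list(self._long_term[user_id]),
            "failed_approaches": [a for a, ok in self._episodes if not ok],
        }


memory = MemoryManager(short_term_size=2)
memory.remember_fact("user-1", "asked about parental leave last month")
memory.record_episode("retry with same query", succeeded=False)
for turn in ["hi", "I have a leave question", "how many weeks do I get?"]:
    memory.remember_turn("conv-9", turn)
context = memory.context_for("user-1", "conv-9")
```

With `short_term_size=2`, the oldest turn falls out of the short-term window while the long-term fact about parental leave survives. That split is the whole point.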

The tool manager

Agents need to do things. Send emails. Query databases. Call APIs. The tool manager is the toolbox. It knows what tools exist, who can use them, and runs them in a sandbox.

The sandbox matters. If an agent writes Python and runs it, you do not want that code reaching your production database by accident. The sandbox is the padded room. The agent can try things without burning the building down.
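
A minimal sketch of the toolbox part, assuming a plain registry with per-agent grants. `ToolManager`, `ToolNotAllowed`, and the agent and tool names are hypothetical. The sandbox itself is deliberately not shown: real isolation means containers, restricted credentials, or similar, and a few lines of Python would only pretend to provide it.

```python
class ToolNotAllowed(Exception):
    """Raised when an agent calls a tool it has not been granted."""


class ToolManager:
    def __init__(self):
        self._tools = {}    # tool name -> callable
        self._grants = {}   # agent name -> set of tool names it may use

    def register(self, name, func):
        self._tools[name] = func

    def grant(self, agent, tool_name):
        self._grants.setdefault(agent, set()).add(tool_name)

    def call(self, agent, tool_name, *args, **kwargs):
        if tool_name not in self._tools:
            raise KeyError(f"unknown tool: {tool_name}")
        if tool_name not in self._grants.get(agent, set()):
            raise ToolNotAllowed(f"{agent} may not use {tool_name}")
        return self._tools[tool_name](*args, **kwargs)


tools = ToolManager()
tools.register("send_email", lambda to, body: f"queued email to {to}")
tools.grant("support-agent", "send_email")

result = tools.call("support-agent", "send_email", "customer@example.com", "Hi")

try:
    tools.call("coding-agent", "send_email", "customer@example.com", "Hi")
    denied = False
except ToolNotAllowed:
    denied = True
```

The coding agent never reaches the email tool because nobody granted it. Deny by default is the design choice that matters here.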

The identity manager

Who is this agent and what is it allowed to do? Short-lived tokens. Scoped permissions. A clean chain showing this agent is acting on behalf of this user.

When your travel agent books a flight on your card, there should be an audit trail of exactly who authorized what. Today, in most deployments, there isn't.
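
Here's a rough sketch of short-lived, scoped tokens using only the Python standard library. It's hand-rolled for illustration: `issue_token`, `authorize`, the scope strings, and the hard-coded key are all assumptions, and a real deployment would more likely lean on an established standard such as OAuth 2.0 than on code like this.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b"demo-only-secret"  # assumption: in practice this comes from a secret store


def issue_token(agent, on_behalf_of, scopes, ttl_seconds=300, now=None):
    """Sign a claim that `agent` acts for `on_behalf_of` with limited scopes."""
    now = time.time() if now is None else now
    claims = {
        "agent": agent,
        "user": on_behalf_of,
        "scopes": sorted(scopes),
        "exp": now + ttl_seconds,
    }
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def authorize(token, required_scope, now=None):
    """Return the claims if the token is genuine, unexpired, and in scope; else None."""
    now = time.time() if now is None else now
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if now >= claims["exp"] or required_scope not in claims["scopes"]:
        return None
    return claims


token = issue_token("travel-agent", "user-42", {"flights:book"}, ttl_seconds=300, now=1000.0)
booking = authorize(token, "flights:book", now=1100.0)       # valid and in scope
refund = authorize(token, "payments:refund", now=1100.0)     # scope was never granted
expired = authorize(token, "flights:book", now=2000.0)       # past the 300-second lifetime
```

The claims that come back name both the agent and the user it acted for. Log them with every action and you have the start of a chain of authority.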

Observability

The security camera system. Every decision, every tool call, every response, logged and traceable. When the agent approves a refund it shouldn't have, you rewind the tape and see what happened.

This is the difference between a system you can fix and a system you have to pray about.
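
"Rewinding the tape" can be as simple as an append-only event log keyed by a trace ID. A minimal sketch, assuming an in-memory list stands in for durable storage; `TraceLog` and the event kinds are hypothetical.

```python
import json
import time
import uuid


class TraceLog:
    """Append-only record of everything an agent did, grouped by trace."""

    def __init__(self):
        self._events = []

    def new_trace(self):
        return uuid.uuid4().hex

    def record(self, trace_id, kind, **details):
        event = {"trace_id": trace_id, "ts": time.time(), "kind": kind, "details": details}
        # Serialize immediately so later changes to `details` cannot rewrite history.
        self._events.append(json.dumps(event))

    def replay(self, trace_id):
        """Return every event for one trace, in the order it happened."""
        events = [json.loads(line) for line in self._events]
        return [e for e in events if e["trace_id"] == trace_id]


log = TraceLog()
trace_id = log.new_trace()
log.record(trace_id, "input", text="I want a refund for order 1001")
log.record(trace_id, "tool_call", tool="lookup_order", order_id=1001)
log.record(trace_id, "decision", action="approve_refund", amount=80)

tape = log.replay(trace_id)
```

`tape` now holds the input, the tool call, and the decision in sequence. That's the minimum you need to answer "why did it approve that?"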

Guardrails and governance

Input guardrails check what's coming in. Is someone trying to inject a malicious prompt? Output guardrails check what's going out. Is the agent about to say something wrong, harmful, or off-policy?

Governance is the policy layer. Some actions need a human in the loop. Refunds under $50 go through. At $50 and above, a person approves. Some data is just off limits. Some decisions are too important to automate, and the OS is what enforces that line.
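
The refund rule can be written down as a policy function. This is a sketch under assumptions: `decide`, the three outcome strings, and the `payroll_db` resource are hypothetical, and a refund of exactly $50 is sent to a person, the more cautious reading of the rule.

```python
from decimal import Decimal

AUTO_APPROVE_LIMIT = Decimal("50")      # refunds strictly under this go through
OFF_LIMITS_RESOURCES = {"payroll_db"}   # hypothetical example of data no agent may touch


def decide(action, amount=None, resource=None):
    """Return "allow", "needs_human", or "deny" for a proposed agent action."""
    if resource in OFF_LIMITS_RESOURCES:
        return "deny"
    if action == "refund":
        if amount is None:
            return "needs_human"
        if Decimal(str(amount)) < AUTO_APPROVE_LIMIT:
            return "allow"
        return "needs_human"
    # Anything the policy does not recognize is escalated, never silently allowed.
    return "needs_human"


small_refund = decide("refund", amount=20)
large_refund = decide("refund", amount=80)
payroll_read = decide("read", resource="payroll_db")
unknown_action = decide("delete_account")
```

Note the last line of the function. When the policy doesn't know what to do, it asks a person. That default is the difference between governance and a vibe.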

Why is nobody talking about this?

Because it's not glamorous. It doesn't demo well. Nobody posts a viral video about their scheduler.

The market is obsessed with the agent layer. New frameworks every week. New model releases every month. Demos of agents doing impressive single-shot tasks in controlled environments.

Then the same teams try to put those agents in production and discover that a demo and a system are not the same thing. A demo runs once, in front of an audience, with a human babysitting. A system runs ten thousand times a day, unattended, with real money and real customers on the other end.

The gap between those two things is the OS layer. Most teams are crossing it by hand, with duct tape, hoping they don't get paged at 3 AM.

This is the same pattern I've watched play out in every infrastructure cycle for thirty years. Web hosting in the late 90s. Cloud in the late 2000s. Containers in the mid-2010s. The companies that built or adopted the operating layer early scaled cleanly. Everyone else spent two years rewriting their stack under pressure.

What should an operator actually do about this?

A few things, in order.

Stop deploying single agents in isolation. If you have one agent in production today, you'll have ten next year. Build the assumption of multiple agents into your architecture now, even if you only have one.

Pick your OS layer deliberately. You have three options. Build it yourself, which is expensive and slow but gives you full control. Adopt an emerging open-source agent framework that handles part of this. Or wrap commercial tooling around a thinner orchestration layer you own. There's no universally right answer. There is a wrong answer, which is "we'll figure it out later."

Treat memory, identity, and observability as non-negotiable from day one. You can ship without a fancy scheduler. You cannot ship a serious agent system without knowing what it remembers, who authorized it, and what it actually did. If your current setup can't answer those three questions, you don't have an agent system. You have a liability.

Put a human in the loop on anything that moves money, sends external communication, or touches production data. Until your governance layer is mature, the human is the governance layer. That's fine. What's not fine is pretending you have governance when you have a vibe.

This is the kind of thing we work through with operators in our AI Operating Audit. The audit isn't about whether your model is good. It's about whether the system around the model is something you can actually trust in production. Most of the time, it isn't yet. That's normal. What matters is knowing where the gaps are before they become incidents.

The real point

AI agents are not a future thing. They are handling real customer interactions, real money, and real decisions today. Most of them are doing it without the infrastructure to be reliable.

That's like running a city without traffic lights. It works until two cars meet at an intersection.

Teams that build the OS layer first will scale agent systems cleanly. Teams that don't will spend the next two years firefighting fragile deployments and quietly rolling back the ambitious ones.

The model gets the headlines. The OS does the work.

Decide which one you're investing in.

Frequently Asked Questions

What is an agent operating system?
It's the middle layer between your AI agents and the underlying infrastructure. It handles scheduling, memory, tools, identity, observability, and guardrails. Without it, each agent is a standalone script hoping nothing goes wrong.

Why can't I just deploy AI agents directly on top of a model?
Because the model is one component, not the product. Without an OS layer, you get memory loss, resource collisions, tool chaos, no identity, and no observability. It works in a demo. It breaks in production.

What are the six functions of an agent OS?
Scheduler, memory manager, tool manager, identity manager, observability, and guardrails plus governance. Each one solves a specific class of failure that kills agent projects in production.

What's the minimum I need before putting agents in production?
Memory, identity, and observability. You need to know what the agent remembers, who authorized it, and what it actually did. If you can't answer those three questions, you don't have an agent system. You have a liability.

Should I build the agent OS myself or buy it?
Three real options: build it, adopt an open-source agent framework, or wrap commercial tooling around a thin orchestration layer you own. There's no universally right answer. The wrong answer is "we'll figure it out later."

When do I need a human in the loop?
On anything that moves money, sends external communication, or touches production data. Until your governance layer is mature, the human is the governance layer. That's fine. Pretending you have governance when you don't is not.