Orchestrating containers is solved. Keeping a fleet of long-running agents alive, dense and versioned is the job nobody warns you about.
TL;DR
A fleet is many long-running agents run as one managed estate. Keeping it healthy means auto-recovery, smart placement and density, versioning and per-agent observability. Generic orchestration like Kubernetes moves containers around but does not know what an agent is, so the agent-level operations are still yours.
A fleet is what you have when individual agents stop being something you watch one by one and become an estate you manage as a whole. Dozens, hundreds or thousands of long-running OpenClaw agents, each doing real work, each able to fail in its own way, all expected to stay up at once.
Fleet management is the discipline of keeping that estate healthy: knowing the state of every agent, recovering the ones that fall over, placing them on capacity efficiently, versioning what they run, and seeing what each one is doing, without a human in the loop for every event.
Underneath "orchestration" sit five jobs that have to run continuously.
Recovery alone escalates through four stages:
01. In-pod restart: a daemon restarts OpenClaw the instant it dies.
02. Pod recreation: if the pod fails, it is recreated with state intact.
03. Known-good restore: config auto-repair and a versioned restore.
04. Critical alert: only if all else fails, with a full post-mortem.
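The four recovery stages above can be read as an escalation ladder: try the cheapest fix first, move up only when it fails, and alert a human last. Here is a minimal Python sketch of that idea. The names `recover` and `handle_crash`, and the step callables, are hypothetical and exist only for illustration; they are not part of OpenClaw or any platform's API.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical step type: a label plus an action that returns True
# once the agent is healthy again.
RecoveryStep = Tuple[str, Callable[[], bool]]


def recover(steps: List[RecoveryStep]) -> Tuple[Optional[str], List[str]]:
    """Run steps in order and stop at the first one that succeeds.

    Returns (name of the step that fixed the agent, or None,
             names of every step that was attempted).
    """
    attempted: List[str] = []
    for name, action in steps:
        attempted.append(name)
        try:
            if action():
                return name, attempted
        except Exception:
            # A step that raises counts as failed; escalate to the next one.
            pass
    return None, attempted


def handle_crash(
    steps: List[RecoveryStep],
    alert: Callable[[List[str]], None],
) -> Optional[str]:
    """Escalate through the ladder; alert only if every step failed."""
    fixed_by, attempted = recover(steps)
    if fixed_by is None:
        # The list of attempted steps is the raw material for a post-mortem.
        alert(attempted)
    return fixed_by
```

In use, the ladder would be a list such as `[("in-pod restart", restart_daemon), ("pod recreation", recreate_pod), ("known-good restore", restore_known_good)]`, where each callable is whatever your own infrastructure provides. Keeping the alert outside the list makes "only if all else fails" a structural property and not a convention.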
It is tempting to assume a container orchestrator solves this. It does not. Kubernetes is excellent at scheduling and restarting containers, but it has no idea what an OpenClaw agent is. It will not restart a daemon that died while the container kept running, repair a corrupted agent config, version an agent's filesystem, or stop one agent's memory spike from taking neighbours down on a shared node. You would build all of that agent-aware logic on top, which is to say you would build a control plane.
That is the real choice: not whether to use Kubernetes, but whether to build an agent-aware control plane yourself or run on one.
Building it is a serious, ongoing engineering investment: a desired-state store, health and recovery loops, a placement and density strategy, version resolution, secret handling and per-tenant observability, all hardened for production and maintained as OpenClaw evolves. For most teams that did not set out to become an infrastructure company, that is the wrong thing to own.
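To make the container-versus-agent gap concrete, here is a minimal sketch of one health check such a control plane would need. It assumes, purely for illustration, that the agent daemon records a heartbeat timestamp; the function name `diagnose`, the heartbeat mechanism and the 30-second threshold are all hypothetical.

```python
def diagnose(
    container_running: bool,
    last_heartbeat_s: float,
    now_s: float,
    stale_after_s: float = 30.0,  # illustrative threshold, not a recommendation
) -> str:
    """Classify an agent using both container state and agent heartbeat."""
    if not container_running:
        # The case a generic orchestrator already detects and restarts.
        return "container-down"
    if now_s - last_heartbeat_s > stale_after_s:
        # The container looks healthy, but the agent inside has gone quiet.
        return "daemon-dead"
    return "healthy"
```

The second branch is the one a container-level check never reaches: the container is up, so nothing restarts, while the agent inside does no work.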
If you want the full reasoning on where to run agents, the guide on the best platforms to host AI agents at scale lays out the options side by side.
Molted is a managed control plane for fleets of OpenClaw agents. It treats your desired state as the source of truth and continuously reconciles reality to it. Self-healing catches a crash in under 60 seconds and resolves about 90% of them before the client notices; safe over-provisioning packs roughly 3x more agents per machine without crashing nodes; new instances provision in under 18 seconds; and each agent gets a versioned filesystem, 1,000+ integrations, browser automation and its own mailbox and phone.
The same team operates molted.cloud, where this runs 11,000+ instances for 300+ clients, in the cloud or on-premise. You manage the fleet from one dashboard and API instead of writing the orchestration yourself.
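The desired-state reconciliation mentioned above follows a general pattern: compare what should exist with what does exist, and emit the actions that close the gap. The sketch below shows that pattern in Python. It is not Molted's implementation; `AgentSpec`, `plan` and the action names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical desired configuration for one agent."""
    version: str


def plan(
    desired: Dict[str, AgentSpec],
    actual: Dict[str, AgentSpec],
) -> List[Tuple[str, str]]:
    """Return the (action, agent_id) pairs needed to make actual match desired."""
    actions: List[Tuple[str, str]] = []
    for agent_id in sorted(desired):
        if agent_id not in actual:
            actions.append(("create", agent_id))
        elif actual[agent_id] != desired[agent_id]:
            actions.append(("update", agent_id))
    for agent_id in sorted(actual):
        if agent_id not in desired:
            actions.append(("delete", agent_id))
    return actions
```

A control loop would call `plan` repeatedly, apply the resulting actions, and observe again. Because the plan is computed from state and not from events, a missed event or a crashed controller does not leave the fleet permanently out of sync.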
Q.01
You can run OpenClaw on Kubernetes, but Kubernetes orchestrates containers, not agents. It will not restart a dead OpenClaw daemon inside a healthy container, repair a corrupted config, version agent state, or stop a memory spike from taking down neighbours. You would build that agent-aware layer on top yourself.
Q.02
Orchestration places and restarts workloads. A runtime understands the agent itself: it keeps the agent process alive, repairs its config, versions its files and protects shared capacity. Fleet management for OpenClaw needs the runtime layer, which is what a managed platform like Molted provides.
Q.03
Through safe over-provisioning plus active protection: the platform packs agents densely because most stay light, but monitors real usage and steps in before a spike starves the node. That is what lets Molted run roughly 3x more agents per machine than naive one-box-per-agent hosting.
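The two halves of that answer, dense placement and active protection, can be sketched as two small functions. This is an illustration under stated assumptions, not Molted's algorithm: the function names, the memory-only model, the 3.0 overcommit ratio and the 0.85 high-water mark are all hypothetical values chosen for the example.

```python
from typing import Dict, Optional


def can_place(
    reserved_mb: int,
    node_capacity_mb: int,
    request_mb: int,
    overcommit: float = 3.0,  # illustrative ratio
) -> bool:
    """Admit an agent while total requests stay within capacity * overcommit."""
    return reserved_mb + request_mb <= node_capacity_mb * overcommit


def pick_agent_to_relieve(
    usage_mb: Dict[str, int],
    node_capacity_mb: int,
    high_water: float = 0.85,  # illustrative threshold
) -> Optional[str]:
    """If real usage crosses the high-water mark, return the heaviest agent.

    Returns None while the node still has headroom.
    """
    if sum(usage_mb.values()) <= node_capacity_mb * high_water:
        return None
    return max(usage_mb, key=lambda agent_id: usage_mb[agent_id])
```

Placement decides on requested capacity, while protection decides on measured usage. Over-provisioning is only safe when both exist: the first gives density, the second acts before a single spike starves the node.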
Q.04
On Molted, self-healing detects a crash in under 60 seconds and resolves about 90% of them before the client ever notices, bringing the agent back with its state intact. Across a fleet, that automatic recovery is the difference between an estate that runs itself and a permanent on-call rotation.