
Orchestrating a fleet of AI agents

Junaid Siddiqi (Jay), Principal · 13 June 2026 · ~16 min read

Four-display engineering studio. A laptop is centered on the desk under the middle monitor, flanked by an ultrawide on the left, a wide on the right, and the center display above. All four screens show terminal windows with dense text. A condenser microphone stands to the right of the laptop. The room has acoustic-tile walls.
The studio. Four displays for the engineering work, music gear behind them for the rest of the week. One room, both halves of a normal life. Sixteen terminals at any given time, eighteen at peak.

I am a working software engineer with a day job. During evenings, weekends, and most spare hours, I am building a small portfolio of products using the kinds of tools that have become standard across software engineering over the last year. The studio in the photo above is real. The four displays in front are where the engineering work happens. The synthesizers and the music gear behind them are where the rest of my time goes, on the weekends and on the days when the code stops talking to me. I am not showing the instruments off. They are in the frame because that is what a normal week looks like for me. Engineer in the day, founder in the evenings, hobbyist on the weekends. The point is that the operator at the center of this fleet is not a unique creature.

The most unusual fact about the last two months is that on most evenings I have sixteen AI coding agents open in parallel across that desk, eighteen at peak. That is not a flex. The agents are not all working on the same product; they are split across our product lines. Clawless, WhisprDesk, Trading Agents Lab, iLoveMD, ClaudeLink, and Clawddesk each have a dedicated developer terminal, with another for the parent site itself. The largest concentration of agents over the last six weeks has been on Clawdemy, an educational curriculum site that shipped its first end-to-end production version earlier this week: a six-agent development team, a lead, and a review team. The only way to ship a body of work that size while keeping the other products moving in parallel was for the agents to work concurrently and for me to coordinate them rather than write the content myself.

This is the engineering retrospective on what coordinating that fleet actually taught me. None of it is about the models. The models are fine. What changed is the shape of the work I do as the human in the loop, and the discipline that survived contact with reality. The most important thing I will say in this piece is that the number of windows on the desk is not the bar. The bar is the discipline, and the bar is achievable by anyone with senior-engineer or senior-product experience and the appetite to learn by doing.

The bottleneck moved

The intuition I started with was that the constraint on AI-assisted engineering is model quality. Smarter model, more done per unit time. That intuition was wrong in a way I did not fully feel until the fleet was running.

The constraint, at this scale, is the human orchestrator. When sixteen agents are working in parallel and each one is producing usable work in roughly the same amount of time it takes a junior engineer to produce nothing, the rate-limiting step is no longer typing or design or even review. It is the messages between the work. Who is doing what. What is finished. What is blocked on what. Which assumptions in last hour’s dispatch are now stale. What changed in the shared environment that affects every in-flight terminal.

I did not see this coming. I had assumed I would spend my time reviewing output and writing the next prompt. In practice the review and the prompt are the easy parts. The hard part is keeping the message I send to terminal seven coherent with what terminal three did fifteen minutes ago and what terminal nine is about to do. Agent orchestration is the new programming.

Once that was clear, every discipline I learned in the first few weeks made sense in retrospect. They are all the same shape: make sure the message you are about to send is true in the current state of the world. Most of the bugs I caused were because it was not.

Verify the anchors in your dispatch, not the parallels

The first discipline, and the one I broke the most often before I locked it down, is anchor verification.

When you dispatch fix instructions to an agent, you typically anchor those instructions to something specific. A file path. A line number. A snippet of current content. A URL. A phase boundary. Those anchors are claims about the world. If any of them are wrong, the agent will follow your instructions precisely against a stale model of reality, and the result will be confidently broken.

I broke this three times in one afternoon. I had been amending a curriculum track in waves, and I dispatched a fix to a reviewer agent with three anchors: a line range that was correct two hours earlier, a current-content snippet I quoted from memory, and an assertion that two lessons used the same URL because they had used the same URL the last time I looked. The reviewer agent dutifully tried to apply my fix. Each of the three anchors had drifted. The line range had shifted because of an edit upstream. The current-content quote was two revisions stale. And the two lessons no longer shared a URL, because a sibling terminal had updated one of them between the moment I assumed the parallel and the moment the reviewer ran.

The lessons agent caught me cleanly on all three. The fix was not better quoting; it was the discipline of grepping each anchor against the actual file state at the moment of dispatch, not at the moment I formed the intent. Do not infer parallels (these two lessons both use that URL). Do not carry forward stale claims (phase one had this flag, so phase two probably does too). Do not trust memory of file content over a fresh read.

The rule I write to my own memory now: every anchor in dispatch text is a claim, every claim gets verified at the moment of dispatch, and the verification is not optional just because I am confident.
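A minimal sketch of that rule, assuming a dispatch carries its anchors as structured data rather than free text. The `Anchor` class and both function names are hypothetical, invented for illustration; the sketch covers only the line-range and snippet anchors, not URLs or cross-file parallels.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Anchor:
    """One claim a dispatch makes about a file (hypothetical structure)."""
    path: str        # file the dispatch refers to
    start_line: int  # 1-indexed first line of the claimed range
    end_line: int    # 1-indexed last line, inclusive
    snippet: str     # text the dispatch claims is inside that range


def verify_anchor(anchor: Anchor) -> list[str]:
    """Return a list of problems; an empty list means the anchor still holds."""
    file = Path(anchor.path)
    if not file.is_file():
        return [f"{anchor.path}: file does not exist"]

    # Fresh read from disk at the moment of dispatch, never from memory.
    lines = file.read_text(encoding="utf-8").splitlines()
    in_bounds = 1 <= anchor.start_line <= anchor.end_line <= len(lines)
    if not in_bounds:
        return [f"{anchor.path}: lines {anchor.start_line}-{anchor.end_line} "
                f"are out of bounds ({len(lines)} lines in file)"]

    claimed = "\n".join(lines[anchor.start_line - 1:anchor.end_line])
    if anchor.snippet not in claimed:
        # Report where the snippet lives now so the dispatch can be re-anchored.
        hits = [n for n, line in enumerate(lines, start=1) if anchor.snippet in line]
        return [f"{anchor.path}: snippet not in lines "
                f"{anchor.start_line}-{anchor.end_line}; found on lines {hits}"]
    return []


def verify_dispatch(anchors: list[Anchor]) -> list[str]:
    """Check every anchor; send the dispatch only if this returns []."""
    problems: list[str] = []
    for anchor in anchors:
        problems.extend(verify_anchor(anchor))
    return problems
```

The useful property is that a drifted anchor fails loudly before the agent sees the message, and the failure says where the quoted content moved to.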

The schema is not the runtime

The second discipline is the one that did the most damage in a single decision.

I was implementing a feature where the application would write a per-agent configuration file. The framework offered an RPC that, on the schema, accepted a file name and a payload. The type definition was name: NonEmptyString. I read the schema, satisfied myself that any non-empty string would work, and signed off on a design review that approved the implementation.

The implementation failed at runtime. The schema accepted any non-empty string. The runtime handler, one function call deeper, checked the file name against an allowlist of about nine names and rejected anything not on it. The file I needed to write was not on the allowlist. The error message was a polite unsupported file, returned through the same channel as a success would have come through, with no hint that the gate was an allowlist rather than a validation failure.

The fix was to bypass the RPC and write the file directly. But the lesson is the cost.

Reading a type schema is not the same as tracing the runtime handler’s decision points. The schema describes the shape of the request. The handler describes what actually happens. A pre-implementation review that substitutes the first for the second is approving a design against a partial specification. In a sequential, human-paced workflow you can get away with this most of the time because the runtime catches it cheaply and the cycle is short. In a fleet workflow, the design-review agent’s GO carries weight; the developer agent starts work based on it; by the time the runtime catches the mismatch you have stacked an hour of dependent work on a faulty foundation.

The rule I write now: a pre-implementation review may not substitute a type schema for tracing the runtime handler to its actual decision point. Either run a live RPC probe and observe the behavior, or grep the handler to the line where it accepts or rejects.

The deeper principle: any contract that lives in two places, the type and the handler, is a contract that will drift. Find the place that is load-bearing under real load, and verify there.
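A toy reconstruction of the gap, not the framework's actual code. The function names, the two allowlisted file names, and the response shape are all hypothetical; only the `NonEmptyString` type and the "unsupported file" message come from the incident described above.

```python
# Hypothetical allowlist: it lives in the handler and the schema never mentions it.
ALLOWED_FILES = {"settings.json", "profile.json"}


def schema_accepts(name: str) -> bool:
    """What the type promises: name is a NonEmptyString."""
    return isinstance(name, str) and len(name) > 0


def handle_write(name: str, payload: str) -> dict:
    """What the runtime does: a second gate, one call deeper than the schema."""
    if not schema_accepts(name):
        return {"ok": False, "error": "invalid request"}
    if name not in ALLOWED_FILES:
        return {"ok": False, "error": "unsupported file"}
    return {"ok": True}


def review_by_probe(name: str) -> str:
    """Sign off on observed runtime behavior, not on the schema alone."""
    result = handle_write(name, payload="{}")
    return "GO" if result["ok"] else f"NO-GO: {result['error']}"
```

A schema-only review of a name such as "agent-7.toml" passes, because `schema_accepts` returns True, while `review_by_probe` returns a NO-GO for the same name. The probe is the only check that exercises the line where the handler actually decides.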

Patches that report success but did not happen

The third discipline is one I did not believe at first.

I had a sub-agent that produced a list of fixes for a content set, and a separate sub-agent applying them. The applying agent used a tool that performs precise string replacements on files. The tool reports success per call. I had a workflow that ran a batch of about thirty edits and reported a clean run: every patch updated successfully.

A spot check found that five of the thirty had not actually changed the file. The replacement strings the agent had supplied were close to the file content but not exact, and the tool had returned an error on those calls that the wrapping flow had interpreted as success. The file was unchanged. The reporting was wrong.

I lost an evening to that. The defense I put in place fleet-wide afterward is a two-part discipline that has held since: read the file before each edit so the replacement string is current, and verify the change by grep after the commit so the report of success is checked against the actual file state. Neither half is optional. The first protects against stale replacement strings; the second protects against the tool reporting wrongly. Both are cheap. The cost of skipping them is a class of bug that is invisible at every layer of the workflow except the one that matters.

The general shape this fits: in a fleet workflow, did the work happen and did the report of work happen are two different questions, and the second one is not a substitute for the first.
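One way to sketch the two-part defense, assuming edits are exact string replacements on a single file. `apply_edit` and `apply_batch` are hypothetical helpers, not the tool the agents used.

```python
from pathlib import Path


def apply_edit(path: str, old: str, new: str) -> bool:
    """Replace exactly one occurrence of `old`; True only if the file on disk changed as intended."""
    file = Path(path)

    # Part 1: read before the edit, so the replacement is matched against current content.
    before = file.read_text(encoding="utf-8")
    if old == new or before.count(old) != 1:
        return False  # stale, inexact, or ambiguous match: report failure, change nothing

    expected = before.replace(old, new, 1)
    file.write_text(expected, encoding="utf-8")

    # Part 2: verify after the write by re-reading the file, not by trusting the write call.
    return file.read_text(encoding="utf-8") == expected


def apply_batch(path: str, edits: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Apply edits in order and return the ones that did not land."""
    return [(old, new) for old, new in edits if not apply_edit(path, old, new)]
```

The batch returns the failures rather than a success count, so a run with five silent misses cannot be summarized as clean.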

The phase boundary lives in the file, not in your head

The fourth discipline came from a single mis-dispatch that the lead agent caught before it propagated.

I was running a track-level audit cycle. The track had ten lessons. I had assumed, based on the way previous tracks were organized, that the phases split the track into three groups of three plus a final group of one, with the first boundary falling between lesson three and lesson four. I dispatched the second audit wave to lessons four through six on that assumption.

The lead agent grepped the phase frontmatter on each lesson before running. The actual split was three plus six plus one. Lessons four through nine were all phase two; my dispatch covered only three of those six and would have left lessons seven through nine out of the phase-two audit. The audit would have shipped the track on a partial review. The lead caught it at the gate and I re-dispatched against the verified boundary.

The lesson here is small but worth banking: structural facts that are encoded in the file are not facts to be remembered. Grep them at the moment of dispatch. Inferring structure from naming conventions and lesson numbers is the same class of mistake as inferring file content from memory. It is fast, it is usually right, and the cases where it is wrong cost more than the verification would have.
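A minimal sketch of reading the boundary from the files, assuming each lesson opens with a `---` delimited frontmatter block containing a `phase:` key. The key name, file layout, and function names are assumptions; the article only says the phase lives in frontmatter.

```python
from itertools import groupby
from pathlib import Path


def read_phase(path: Path) -> str:
    """Return the `phase` value from a file's leading frontmatter block."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError(f"{path}: no frontmatter block")
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "phase":
            return value.strip()
    raise ValueError(f"{path}: no phase key in frontmatter")


def phase_groups(lesson_paths: list[Path]) -> list[tuple[str, list[str]]]:
    """Group lessons, in the order given, into consecutive runs that share a phase."""
    tagged = [(read_phase(p), p.name) for p in lesson_paths]
    return [(phase, [name for _, name in run])
            for phase, run in groupby(tagged, key=lambda item: item[0])]
```

For a ten-lesson track tagged the way the one above was, the group sizes come back as three, six, and one, and the dispatch can be written against that list instead of against a remembered convention.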

The audit cycle is sequential, not parallel

The fifth discipline is the one I had to be talked out of breaking, because the failure mode was tempting.

A review cycle on a piece of work returns a list of defects, classified by severity. The temptation, when the deadline is short and the parallel capacity is wide, is to fix the blockers, re-dispatch a fresh audit, and in parallel start moving on to the next subject of review. It feels efficient. The capacity is there. The blockers are addressed. Why hold up the rest of the work for a re-review of the same subject?

The reason, which I learned by trying it once, is that the audit is not a regression test. It is a gate. Each cycle is a stop-fix-rereview loop that runs until the ship criteria are met. Zero blockers, zero majors, acceptable minors. Only when the current subject closes does the review team take the next subject. Treating it like a regression test, where you batch defects and ship anyway, drops the bar of the entire review function. The review team starts behaving as a defect-logger instead of a gate, and the work that ships is the work that survived a logging exercise, not a quality bar.

In practice this means the parallel capacity I thought I had was illusory. The reviewer is a single seat, and that seat is occupied by the current subject until it closes. The right move is to use the parallel capacity for the next subject’s drafting, not the next subject’s review.

The rule I lock now: review cycles are sequential to ship criteria, never parallel. Parallelism is for drafting; sequencing is for review.
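The gate can be sketched as a loop, under some assumptions: `audit` and `fix` are placeholder callables standing in for the review team and the developer agents, and the `max_minors` threshold is an invented stand-in for "acceptable minors".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AuditResult:
    blockers: int
    majors: int
    minors: int


def meets_ship_criteria(result: AuditResult, max_minors: int = 3) -> bool:
    """Zero blockers, zero majors, and minors within an assumed threshold."""
    return result.blockers == 0 and result.majors == 0 and result.minors <= max_minors


def review_queue(
    subjects: list[str],
    audit: Callable[[str], AuditResult],
    fix: Callable[[str, AuditResult], None],
    max_cycles: int = 5,
) -> list[str]:
    """Close each subject against the ship criteria before the next one is audited."""
    closed: list[str] = []
    for subject in subjects:
        for _ in range(max_cycles):
            result = audit(subject)
            if meets_ship_criteria(result):
                closed.append(subject)
                break
            fix(subject, result)  # stop, fix, then re-review the same subject
        else:
            # The cycle budget ran out: escalate instead of moving on with open defects.
            raise RuntimeError(f"{subject} did not meet ship criteria in {max_cycles} cycles")
    return closed
```

Nothing in this loop is concurrent on purpose. Drafting for the next subject can run elsewhere, but the reviewer seat never holds two subjects at once.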

Build is the first gate, not the last

The sixth discipline is the one that turned my static checks from theater into infrastructure.

The curriculum project ships content as MDX files. The build that produces the site is the authoritative check that those files parse correctly. For a long time, my pre-commit gates were precision grep patterns that looked for known classes of failure: bare braces in prose, unclosed code fences, YAML colon-space errors. These were necessary. They were not sufficient.

A recent audit cycle, run by the lead agent against a staged build, found roughly three hundred latent breakers across a single track that all of my static grep patterns had passed clean. Most of them were classes of failure I had not added a grep pattern for yet, because I had not seen them in the wild. The build saw all of them on the first try, because the build does not enumerate failure classes. It either parses or it does not.

The discipline that came out of that audit: the build runs before the work is pushed, not after. The executor agent runs the actual production build as part of its own work, before handing off to the lead. The lead runs a second build at its gate. The review team runs the third. The build is no longer the last thing before promotion; it is the first thing after drafting.

The static checks did not get deleted. They got demoted to a heuristic. The build is the only check that knows the full grammar of the thing being shipped.

The general principle: static analysis cannot enumerate the parse classes of the system you are trying to ship. The parser can. In any workflow where you are generating content for a strict consumer, the consumer is the only authoritative check. Put it at the front of the pipeline, not the end.
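A sketch of a build-first gate, assuming the build is a command that exits non-zero when any file fails to parse. The article does not name the build tool, so the command is passed in by the caller; `build_gate` is a hypothetical helper.

```python
import subprocess
from typing import Optional


def build_gate(command: list[str], cwd: Optional[str] = None) -> tuple[bool, str]:
    """Run the real build; return (passed, last lines of combined output)."""
    proc = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    # Keep only the tail: build errors usually surface at the end of the log.
    tail = "\n".join((proc.stdout + proc.stderr).splitlines()[-20:])
    return proc.returncode == 0, tail
```

An executor would call this with the project's own build command before handing off, and treat a False result as "not drafted yet" rather than "ready for review". The lead and the review team can call the same function at their gates.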

The human in the middle is the unlock

The disciplines above are the visible half of the work. The invisible half is who is applying them. That is worth saying explicitly, because the question I get most often when I describe this setup is some version of how is one person handling sixteen agents. The answer is not that I am a special kind of person. The answer is that the role at the center of a fleet of AI coding agents is the same role a senior engineer or senior product owner has been playing for years on human teams, only the team members are now producing faster than any human team ever has.

I started this stretch the way many people are starting it now: scared. Claude and ChatGPT were producing code faster than I could read it, and I was not sure how much of what they produced would survive a real production hand-off. The first few weeks I treated the AI tools the way a junior engineer treats a senior: a lot of accept-without-checking. The output came fast. Most of it was good. The parts that were not good were the parts I had not checked, because the tool had reported success.

The shift came when I stopped treating the model as a smarter version of me and started treating it as a very fast junior on my team. A very fast junior still needs the person reviewing their work to know what real failure looks like, where the domain bodies are buried, where the production constraints live, what hand-offs between systems are actually fragile. That review function is what makes the output land instead of pile up. None of the AI tools take that role. They produce. The human reads the production and decides what is shippable, what is close, and what is unsalvageable.

That review function is not work just any user can do. It is work a senior software engineer can do, or a senior product owner who has watched enough release cycles to recognize the shape of trouble, or anyone who has held the big picture of a product end-to-end and feels in their gut what an integration boundary should and should not promise. The shorthand I keep landing on is experienced. The person at the center does not have to be famous. They have to have seen enough failed dispatches in a previous life to know what one feels like on the way out.

That is the part of this story I want to be clearest about. AI alone does not ship product. An experienced person alone runs out of fingers. Both together, sustained, is what produces a working portfolio of small products in two months instead of two years. The bottleneck moved from the model to the orchestrator because the orchestrator is the only one in the room who can tell when the work is right and when it is wrong in a way the test suite will not catch. That is a normal job. It has always been a normal job. The new part is that the team you are leading happens to be made of agents.

There is a name for this now: loop engineering

Loop engineering is replacing yourself as the person who prompts the agent: you design the system that prompts it instead.

A few days ago I read an essay by Addy Osmani, a Google engineer whose work many of us follow, that named the activity this article has been describing. He calls it loop engineering, and the term is precise enough that everything I have written above lines up under it cleanly. Osmani built on framings from Peter Steinberger and from Boris Cherny, the lead of Claude Code at Anthropic. The term is new. The activity is not. Many of us have been doing it for months without a word for it.

The contrast Osmani draws is this. Prompt engineering is the older skill. You write a careful prompt. The model answers. You read the answer and write the next prompt. You are in the loop, manually, on every turn. Loop engineering is what you do when you stop writing each prompt yourself and start designing the system that prompts the agent on your behalf. The agent runs a loop, plan and act and observe and update, across many steps and often many sessions. Your job is to make that loop reliable: define the goal, the tools the agent can reach, the stopping conditions, the verification gates, the memory it carries between steps, and the maker/checker split where one agent proposes work and a different one verifies it before the loop accepts it.
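The shape of such a loop can be sketched in a few lines. This is an illustration of the maker/checker split with two stopping conditions and a carried memory, not Osmani's formulation or any particular agent framework; every name below is hypothetical.

```python
from typing import Callable, Optional


def run_loop(
    goal_met: Callable[[list[str]], bool],
    maker: Callable[[list[str]], str],
    checker: Callable[[str], Optional[str]],
    max_steps: int = 10,
) -> tuple[list[str], list[str]]:
    """Run maker/checker until the goal is met or the step budget runs out.

    Returns (accepted work, memory of rejections).
    """
    accepted: list[str] = []
    memory: list[str] = []                 # notes carried between steps
    for _ in range(max_steps):             # stopping condition: step budget
        if goal_met(accepted):             # stopping condition: goal reached
            break
        proposal = maker(memory)           # the maker proposes, seeing past rejections
        objection = checker(proposal)      # a separate checker verifies
        if objection is None:
            accepted.append(proposal)      # only verified work enters the result
        else:
            memory.append(f"rejected {proposal!r}: {objection}")
    return accepted, memory
```

The design choice worth noticing is that the maker never writes to `accepted` directly. Work reaches the result only through the checker, and rejections feed back into what the maker sees on the next step.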

Every discipline in the sections above is a loop-engineering primitive. Anchor verification is an invariant the loop must satisfy before it acts. Tracing the runtime handler instead of trusting the schema is a verification gate that catches what the type system promised but the handler did not. Read-before-edit and verify-by-grep is a maker/checker split applied to a single tool call. The phase-frontmatter check is a state-read that grounds the loop in the actual file before the loop assumes a structure. The sequential audit cycle is a termination condition: the loop does not advance to the next subject until the current one closes against the ship criteria. The build-first-not-last rule is the loop’s authoritative pass/fail gate moved to the front of the pipeline, where it earns its keep. None of these are prompts. They are the structure of the loop itself.

The infrastructure I have built over the last two months is loop infrastructure. The CLAUDE.md files at each project root are how each loop knows its rules and conventions without me re-prompting on every session. The handoff.md files are how the loop carries context between sessions when one window closes and another opens. The instructions.md files and the per-project skill files are how the loop knows the specific procedures it should follow for repeating tasks. ClaudeLink, the open-source MCP server I built earlier this year, is how multiple loops talk to each other: a message bus, an inbox, a bulletin board, an auto-nudge scheduler, a recovery watcher for the day the upstream model provider has a bad hour. None of this is prompt-craft. All of it is loop infrastructure.

What is distinct about the work I am describing in this article is that it is loop engineering at a scale most of the published examples have not yet reached. The pieces on loop engineering in circulation today describe how to engineer one agent’s loop reliably. I am running sixteen loops in parallel, with a coordination mesh between them, with role separation across agents (developer, lead, reviewer team), and with one human orchestrating across the whole fleet. The single-agent disciplines all still apply per loop. The new layer on top is fleet-level: deciding which loop owns which subject, when one loop is done so another can take its work, when a loop has stalled and needs recovery, how to keep the messages between loops grounded in the real state of the project. That fleet-level layer does not have a settled name yet. This article calls it orchestration because that is the right word for what sits on top of many loops.

I want to say one thing plainly because it is worth saying plainly. I am not a vibe coder. I am not a prompt engineer. I have been doing loop engineering on a working portfolio of products, with a day job running alongside it, and the proof is on disk in open-source commit histories that anyone can read. The reason this is possible is that loop engineering is leverage that an experienced engineer or product owner can apply directly, without three years of catching up first. The concepts come on the way. The leverage is real now.

The operating model, and how to start

The thing that surprised me most about running this fleet is not how much it produces. It is how much of the work moves out of writing and into dispatching. The agents are good at the production. They will produce as much work as you can coherently describe. The constraint is the coherence of the description, and the coherence of the description is what verification protects.

Every discipline above is the same shape. Make sure the message you send is true in the current state of the world. Verify the anchors. Trace the runtime, not the schema. Read before each edit and verify after each commit. Grep the structural facts. Treat the review as a gate, not a logger. Run the build first, not last.

None of these are model problems. All of them are problems that the human orchestrator introduces by being optimistic about the state of the world at the moment of dispatch. The cost of pessimism is a few seconds of grep per claim. The cost of optimism, when sixteen terminals build on your dispatched message, is dependent work that has to be unwound.

If sixteen terminals feel like a different world, they do not have to be your setup. You do not need terminals to run a fleet. The Claude desktop app has a Projects feature that does the same thing in a friendlier shape. Each project points at a folder, you can have several projects open, and you can talk to each one independently. Point one project at your main repo, another at a sub-agent worktree, another at a docs folder, another at a research notebook, and you will have most of the mesh I am running here without ever opening a terminal. The point is not the windows. The point is the coordination, and the coordination is the same whether the agents live in a terminal or in the Claude desktop sidebar. Sixteen is a side effect of the work I happen to be in. The bar is the discipline, not the gear.

The way I learned this was by doing it. I could have spent three years studying every concept in a curriculum first. I spent two months building a small portfolio of products instead, with AI tools in the loop the whole way, and the concepts I needed showed up in the order I needed them. The proof is on disk: most of these products are open source, and the commit histories are a day-by-day record of what an experienced person plus a fleet of AI agents actually produces. If you are reading this and trying to decide whether to pivot into building with AI, my one piece of advice is to start. Pick a small product. Open a project window. Begin. The concepts will find you on the way.

I write more in a week now than I did in a month before. The work I am proud of is not the production; it is the dispatches that did not have to be re-issued, and the quiet evening realization that an experienced person at the center of a fleet of AI tools is not, after all, a thing that only a small number of people can be. Welcome.