Building ClaudeLink.

An engineering retrospective on ClaudeLink, the open-source MCP server that lets multiple AI coding agents work together as a team.

Written by Junaid "Jay" Siddiqi, Principal of RBJ Global. Published the twenty-eighth of May, two thousand and twenty-six.

ClaudeLink is an open-source MCP server that lets multiple AI coding agents work together as a team across terminals on one machine. I built it solo, in the strictest sense: one person making every architectural call and accepting every line of code, with the typing distributed across a coordinated team of AI coding agents. The repo has sixty-two commits across about two and a half months, every substantive one co-authored, with a visible model handoff partway through.

This is the engineering retrospective. What worked, what did not, the bugs I missed in testing and what found them, and the moment the platform's safety model showed me my design was wrong.

The Bottleneck Was Me.

For a stretch in early 2026 I was running nine to twelve AI coding terminals in parallel across a few displays. Each one was useful on its own. Together they were not a team. They were a dozen amnesiac strangers, and the thing that made them coherent was me, copying context out of one terminal and pasting it into another, summarizing what one had decided so the next could pick it up.

This was not a model problem. The models were fine. What I needed was infrastructure: a way for the agents to talk to each other directly, without me as the message bus. A shared place to look things up, an inbox for direct messages, a board for announcements. The kind of plumbing every team has, retrofitted for a team made of agents instead of people.

That is what ClaudeLink is. The work was not in making it clever. It was in making it boring, reliable enough that I could trust it and stop being the relay.

A Name I Did Not Get to Keep.

The first version was called AgentNexus. I shipped the initial release on the tenth of March, two thousand and twenty-six. A day later I searched for it and found the namespace saturated: multiple repositories carried the exact AgentNexus name, and there was no way to compete in it. Two days in, I renamed the project ClaudeLink, reset the version to one point zero point zero, and rebuilt the published package.

The name was descriptively accurate at the time. One hundred percent of the development happening across my terminals was Claude Code, the only AI coding subscription I was paying for. A couple of months later, Codex, Gemini, and Goose were on the same bus. The MCP server did not care which model was on the other side of the stdio pipe; it only cared that the agent could call its tools.

I considered renaming again and did not, for two practical reasons. Renaming a published npm package destroys the discoverability you spent months building, and Claude Code was the platform that made ClaudeLink possible in the first place. The product is model-agnostic; the brand is path-dependent.

Three Fixes for One Problem.

I spent the day after the initial release trying to get the package to install. Three fixes for the bin-resolution problem alone.

The MCP server ships as an npm package with bin entries that point to its executables. npm version eleven, the version I was publishing under, strips bin entries that resolve into the dist directory on publish. The published package did not contain the binary the user was supposed to run. The errors were unhelpful: claudelink-server, command not found, after a clean install that reported success.

The first attempt tried to bypass bin resolution with a require statement from inside a wrapper. It did not work; npm eleven still rewrote the entry. The second attempt moved the bin targets into bin/server.js and bin/cli.js, thin wrappers that require the compiled output from the dist directory, so the bin entries pointed at files npm would not strip. That worked, but only when published with npm ten. I committed the wrappers, downgraded to npm ten for the publish step, and shipped.

The lesson I keep coming back to: AI compresses the time it takes to write software. It does not compress the time it takes to ship it. Those are different problems.

A Default That Flipped Under Me.

I left the project alone for seven weeks. When I came back on the fourth of May, the first thing I hit was a hard one.

ClaudeLink had been working for me and for early users. Now every register call and every get-agents call was returning a SQLite foreign-key constraint error, fully blocking the server. The behavior was sudden and absolute: a working install, restarted, refused to do its job.

The cause was a default I had not set. My prune-dead-agents function was deleting from the agents table directly, on the assumption that any orphaned messages and bulletin rows would be tolerated. That had been silently fine with the older better-sqlite3 version I wrote against, which left foreign keys off by default. The newer better-sqlite3 version twelve turned foreign keys on by default. Any user with stale rows in the messages table referencing a dead agent now hit a foreign-key constraint violation every time they registered.

The fix was a cascade delete inside a single transaction: delete messages first, then bulletin entries, then the agents themselves, all wrapped so a failure rolls back cleanly. Children first, parent last. The code change was small. The lesson was not. Defaults I have not personally set can flip under me between a dependency version I built against and the one users install today. Pinning a major version does not protect against this; the install path takes whatever is current. Make the application correct under the strictest defaults, not the loose ones.
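The children-first ordering can be sketched as below. This is a minimal illustration, not ClaudeLink's code: it uses Python's standard-library sqlite3 module rather than better-sqlite3, and the schema (the sender, author, and alive columns) is assumed for the example. Foreign keys are switched on explicitly so the sketch runs under the strict default.

```python
import sqlite3

# Hypothetical schema: ClaudeLink's real tables and columns may differ.
SCHEMA = """
CREATE TABLE agents (id TEXT PRIMARY KEY, alive INTEGER NOT NULL);
CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    sender TEXT NOT NULL REFERENCES agents(id),
    body TEXT NOT NULL
);
CREATE TABLE bulletin (
    id INTEGER PRIMARY KEY,
    author TEXT NOT NULL REFERENCES agents(id),
    body TEXT NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Set the strict behavior explicitly instead of inheriting a library default.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def prune_dead_agents(conn: sqlite3.Connection) -> int:
    """Delete dead agents and their rows; return how many agents were removed."""
    dead = "SELECT id FROM agents WHERE alive = 0"
    # The connection context manager commits on success and rolls back if any
    # statement raises, so a partial prune is never left behind.
    with conn:
        conn.execute(f"DELETE FROM messages WHERE sender IN ({dead})")  # child
        conn.execute(f"DELETE FROM bulletin WHERE author IN ({dead})")  # child
        removed = conn.execute("DELETE FROM agents WHERE alive = 0").rowcount  # parent
    return removed
```

With foreign keys enabled, deleting from agents first raises an integrity error as soon as a dead agent still has a message or bulletin row, which is the failure described above.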

Building with the Safety Model, Not Against It.

This is the part of the project where I changed how I thought about the design.

The next thing I wanted was autonomous reply: when an agent received a direct message, it should handle it without me typing "check for updates" in the terminal. Claude Code's hook system gave me a way to do this. A Stop hook fires when the agent finishes a turn. I could use it to look up the agent's inbox, pull any new messages, and add them to the next continuation, telling Claude what was there and asking it to reply.

The first design did exactly that. The Stop hook called read-inbox for the registered agent, took any new messages, embedded the bodies into a continuation payload, and instructed the agent to call send with the reply. The shape: "you have a message from X saying Y, reply to it."

Claude refused. Often. Not consistently, which made it confusing: sometimes the agent would do partial work, sometimes flat-decline, sometimes acknowledge the message but not send. I treated it as a hook-configuration problem and spent the better part of a day reshaping the payload, trying different shapes, framings, decision values, and field orderings. I was debugging the symptom.

The realization came from reading one of Claude's refusal explanations carefully. The pattern was the textbook prompt-injection shape: external content, the inbox messages, steering an outbound tool call, send. The prompt-injection defense in Claude Code is trained to flag exactly this. The model was not being uncooperative. It was doing its job.

That was the moment the design changed. The mistake was not in the hook code; it was in trying to scaffold the agent's intent from the outside. The right architecture routes the trigger through the channel the model already trusts, user input, and lets the model retain full agency over the resulting tool calls.

The redesign came in three pieces.

The auto-nudge scheduler became the primary autonomous-reply mechanism. It types "check for updates" into the receiving terminal the way I would, via AppleScript for iTerm 2 and tmux send-keys for tmux. The model sees a user prompt, opens its inbox via its own read-inbox call, and decides what, if anything, to do. The user opts into the scheduler, owns the cadence (default five minutes, user-tunable in the Command Center), and can disable any individual agent's autonomy. Operationally, this is user-driven work: the input arrives as a user prompt, not as injected context.

The Stop hook did not get deleted. Its role shrank. It became an opt-in low-latency supplement for the case where a turn just ended and there is a message waiting, saving the scheduler interval. It is directive-only now, emits a simple instruction to call read-inbox, and never injects message content. Three runaway-loop guards live alongside it: a hard cap, a cooldown, and a chain-depth cap, all environment-tunable.
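A directive-only hook with the three guards might look roughly like this. It is a sketch under stated assumptions: the names LoopGuards and on_stop, the default limits, and the directive wording are all hypothetical, and the returned string stands in for whatever payload the real hook emits. The point it illustrates is that the hook only ever sees an unread count, never a message body.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical wording: an instruction only, with no message content.
DIRECTIVE = "You have unread messages. Call read-inbox to see them."


@dataclass
class LoopGuards:
    max_fires: int = 20        # hard cap on hook-driven continuations (assumed)
    cooldown_s: float = 30.0   # minimum seconds between continuations (assumed)
    max_chain_depth: int = 3   # consecutive hook-driven turns allowed (assumed)
    fires: int = 0
    chain_depth: int = 0
    last_fire: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.fires >= self.max_fires:
            return False
        if self.chain_depth >= self.max_chain_depth:
            return False
        if self.last_fire is not None and now - self.last_fire < self.cooldown_s:
            return False
        return True

    def record_fire(self, now: float) -> None:
        self.fires += 1
        self.chain_depth += 1
        self.last_fire = now

    def record_user_turn(self) -> None:
        # A real user prompt breaks the chain of hook-driven turns.
        self.chain_depth = 0


def on_stop(unread_count: int, guards: LoopGuards, now: float) -> Optional[str]:
    """Return the directive if the agent should take another turn, else None."""
    if unread_count == 0 or not guards.allow(now):
        return None
    guards.record_fire(now)
    return DIRECTIVE
```

In a real deployment the three limits would be read from environment variables, matching the environment-tunable guards described above.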

A new peek-inbox call was split out from read-inbox, so the hook can check whether anything is waiting without consuming the messages, while the model's own read-inbox call atomically marks them read.
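The split between the two calls can be illustrated with an in-memory inbox. This is a sketch only: the Inbox class and its method names are hypothetical, and a lock stands in for whatever atomicity the real storage layer provides.

```python
import threading
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    sender: str
    body: str
    read: bool = False


class Inbox:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._messages: List[Message] = []

    def deliver(self, sender: str, body: str) -> None:
        with self._lock:
            self._messages.append(Message(sender, body))

    def peek(self) -> int:
        """Count unread messages without consuming them (safe for the hook)."""
        with self._lock:
            return sum(1 for m in self._messages if not m.read)

    def read(self) -> List[Message]:
        """Return unread messages and mark them read in one locked step."""
        with self._lock:
            unread = [m for m in self._messages if not m.read]
            for m in unread:
                m.read = True
            return unread
```

Calling peek any number of times leaves the unread count unchanged; only read, the call the model makes itself, consumes messages.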

The one-liner I kept arriving at: the scheduler is the default automatic path; the Stop hook is an opt-in latency optimization for Claude Code agents only.

The deeper lesson, plainly: when a design is fighting safety guardrails, the design is wrong, not the guardrails. The right design aligns with the boundary instead of working around it. That is the centerpiece of what I learned.

Bugs Only Real Use Could Find.

The model-agnostic work, bringing Codex, Gemini, and Goose onto the bus, surfaced two bugs my own testing had not caught. Both pointed at the gap between integration tests and production reality.

The first was iTerm 2's bracketed-paste mode. When AppleScript writes multi-character text to a terminal, iTerm 2 routes it through the paste path. My dispatch had been emitting the prompt and a carriage return together in a single write, on the assumption that the receiving CLI would treat the carriage return as an Enter key event. Most CLIs were lenient enough to do exactly that, so the prompt submitted and the work continued. Codex's TUI was strict about bracketed-paste semantics, which is the spec-correct behavior: characters inside a pasted block are pasted content, not key events. The prompt sat in Codex's input field, unsubmitted, until the next real keystroke. The bug had been silently present in version one point three point zero the entire time, masked by Claude Code's and Gemini's leniency. Codex's strictness was what surfaced it. The fix was a two-write dispatch: write the text, then write a standalone carriage-return byte as a separate event. The Enter now arrives as a real key event in every CLI, lenient or strict. Shipped as version one point three point one on the ninth of May.
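The difference between the two dispatch shapes can be shown against a toy terminal. This is a simplified model, not iTerm 2's or Codex's actual behavior: the StrictTerminal class is hypothetical and assumes that any write other than a lone carriage return is treated as pasted content.

```python
from typing import Callable, List

CR = "\r"


class StrictTerminal:
    """Toy TUI that treats every write except a lone CR as pasted content."""

    def __init__(self) -> None:
        self.input_field = ""
        self.submitted: List[str] = []

    def write(self, data: str) -> None:
        if data == CR:
            # A standalone carriage return is a real Enter key event.
            self.submitted.append(self.input_field)
            self.input_field = ""
        else:
            # Pasted content: an embedded CR is text, not a key event.
            self.input_field += data


def dispatch_single_write(write: Callable[[str], None], prompt: str) -> None:
    # Original shape: prompt and Enter in one write, so Enter gets pasted.
    write(prompt + CR)


def dispatch_two_writes(write: Callable[[str], None], prompt: str) -> None:
    # Fixed shape: the text first, then Enter as its own event.
    write(prompt)
    write(CR)
```

Against this model, the single write leaves the prompt sitting in the input field, while the two-write version submits it.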

The second was Codex stripping environment variables when it spawned MCP child processes, a reasonable sandboxing posture that happened to collide with an assumption I had built in. ClaudeLink uses environment-based detection to identify which terminal app spawned the server, so the scheduler can dispatch to the right one. With env vars gone, my detection returned null, the agent's row had a null terminal-app field, and the scheduler silently skipped it at the SQL filter. Every Codex relaunch reproduced the same bug. The interim workaround was literally running a hand-typed SQL update statement after each restart, setting the terminal-app column for the affected role. Hand-patching a database is a sign you are deep in a workaround, not a fix. The real fix used an AppleScript fallback to identify the controlling terminal's owner application directly, no env vars required. Shipped as version one point three point two on the sixteenth of May.
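The detection-with-fallback shape can be sketched as follows. The specific environment checks are assumptions for illustration (tmux sets TMUX inside a session and iTerm 2 sets TERM_PROGRAM to iTerm.app, but the variables ClaudeLink actually inspects are not stated here), and the fallback callable is a hypothetical stand-in for the AppleScript lookup.

```python
from typing import Callable, Mapping, Optional


def detect_from_env(env: Mapping[str, str]) -> Optional[str]:
    """Guess the terminal app from environment variables, if any survive."""
    if env.get("TMUX"):
        return "tmux"
    if env.get("TERM_PROGRAM") == "iTerm.app":
        return "iterm2"
    return None


def detect_terminal_app(
    env: Mapping[str, str],
    fallback: Callable[[], Optional[str]],
) -> Optional[str]:
    """Prefer the cheap env check; ask the fallback only when env is empty."""
    return detect_from_env(env) or fallback()
```

With a stripped environment, for example `detect_terminal_app({}, lookup)`, the result comes entirely from the fallback, so the terminal-app field is populated without a manual database update.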

The pattern across both: my test suite was complete for the version of the world I had built against. It did not know about iTerm 2's paste path, because the agent I tested with masked it. It did not know about Codex's env-strip behavior, because Claude Code does not strip env. Integration reality lives outside the test suite, in the combinations the user actually runs.

A Stall with No Way Back.

The hardest outage to design for was not one of my bugs. It was upstream.

When a team of agents is running, they all lean on the same model provider's API. On the days that API rate-limited or returned server errors, and every hosted API has those days, every terminal pointed at it hit the same wall at the same moment. One agent stalling is an inconvenience. A whole coordinated team stalling, with nothing able to restart it, is the pipeline stopping. And it stopped quietly: a terminal would freeze mid-turn on a failed call, the work would not advance, and the only thing that moved it again was me, noticing and typing into each terminal by hand, often at two in the morning. I had removed myself as the message bus and quietly reappeared as the manual retry button.

This was a gap I had not designed for. I built the connective tissue for the case where the agents are healthy and only need to talk to each other. I had not built it for the case where the model they all depend on is having a bad day. Message-passing assumes the participants are running. The thing you need when they are not is recovery, and ClaudeLink did not have it.

Building the watcher taught me that detecting "something is wrong" is the easy twenty percent. The first version asked a naive question, "does the terminal's recent output contain an error," and it false-fired constantly, nudging healthy agents that were only talking about a rate-limit rather than stuck on one. So I added a position filter (the error has to be near the bottom of the screen to count) and over-corrected so hard the watcher went silent: real errors with normal prompt-chrome printed beneath them now sat just far enough from the bottom to be ignored. I had gone from over-firing to under-firing by tightening one number. I tuned that number a third time before admitting the number, by itself, was never going to be the abstraction.

The threshold was the wrong shape because the real question is not "is there an error" but "is there a live, unhandled error right now." That decomposes into three things the naive version did not have. Freshness: an error near the top of a long scrollback has usually already been recovered from, so the watcher should match the one closest to the end, not the first one it finds. Identity: the same logical error re-renders with volatile bytes (timestamps and request IDs), so a plain string compare keeps seeing a new error that is really the same one, which means the watcher needs a canonical signature that strips the volatile parts before comparing. And re-entrancy: a polling pass that runs long can overlap the next one, so the watcher needs a guard against starting a fresh pass while one is still in flight.
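The three properties can be sketched in a few functions. Everything specific here is an assumption for illustration: the error markers, the timestamp and request-ID patterns, the fifteen-line tail, and the Watcher class are hypothetical, not ClaudeLink's actual rules.

```python
import re
import threading
from typing import List, Optional

# Hypothetical failure markers and volatile-field patterns.
ERROR_RE = re.compile(r"rate limit|overloaded|API error", re.IGNORECASE)
VOLATILE = [
    re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"),  # timestamps
    re.compile(r"req_[A-Za-z0-9]+"),                                    # request IDs
]


def canonical_signature(line: str) -> str:
    """Identity: strip volatile parts so re-renders of one error compare equal."""
    for pattern in VOLATILE:
        line = pattern.sub("<volatile>", line)
    return " ".join(line.split()).lower()


def freshest_error(lines: List[str], tail: int = 15) -> Optional[str]:
    """Freshness: scan from the end and return the error closest to the bottom."""
    for line in reversed(lines[-tail:]):
        if ERROR_RE.search(line):
            return canonical_signature(line)
    return None


class Watcher:
    def __init__(self) -> None:
        self._in_flight = threading.Lock()
        self.last_signature: Optional[str] = None

    def poll(self, lines: List[str]) -> bool:
        """Return True only when a not-yet-seen error is live in the buffer."""
        # Re-entrancy: skip this pass if the previous one is still running.
        if not self._in_flight.acquire(blocking=False):
            return False
        try:
            signature = freshest_error(lines)
            is_new = signature is not None and signature != self.last_signature
            self.last_signature = signature
            return is_new
        finally:
            self._in_flight.release()
```

Two renders of the same failure that differ only in timestamp and request ID produce one signature, so the second poll reports nothing new.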

In its settled form the watcher polls each agent's visible terminal output about once a minute, matches the signature of a known API failure near the bottom of the buffer, and on a fresh one types a short nudge that gets the agent to take another turn, which naturally re-attempts the call. If the same failure keeps recurring it backs off (a few-minute cooldown between repeat nudges), and after a few unsuccessful attempts it stops nudging and sends me a desktop notification instead, because at that point the provider is genuinely down and hammering it does not help. The operator no longer has to be in the room for the system to come back, and the system knows the difference between a blip it can clear and an outage it should escalate. Shipped in version one point four point three.
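The nudge, back-off, and escalate decision can be written as a small state machine. This is a sketch with assumed numbers: the RecoveryPolicy name, the 180-second cooldown, and the three-nudge limit are hypothetical values chosen to match "a few minutes" and "a few attempts", not the shipped configuration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    NUDGE = "nudge"        # type a short prompt so the agent retries
    WAIT = "wait"          # do nothing this pass
    ESCALATE = "escalate"  # stop nudging and notify the operator


@dataclass
class RecoveryPolicy:
    cooldown_s: float = 180.0  # assumed "few-minute" cooldown
    max_nudges: int = 3        # assumed "a few" attempts before escalating
    signature: Optional[str] = None
    nudges: int = 0
    last_nudge: Optional[float] = None
    escalated: bool = False

    def _reset(self, signature: Optional[str]) -> None:
        self.signature = signature
        self.nudges = 0
        self.last_nudge = None
        self.escalated = False

    def decide(self, signature: Optional[str], now: float) -> Action:
        if signature is None:
            self._reset(None)       # no live error: the agent recovered
            return Action.WAIT
        if signature != self.signature:
            self._reset(signature)  # a different failure starts a fresh count
        if self.last_nudge is not None and now - self.last_nudge < self.cooldown_s:
            return Action.WAIT      # give the previous nudge time to work
        if self.nudges >= self.max_nudges:
            if self.escalated:
                return Action.WAIT  # already notified; do not spam
            self.escalated = True
            return Action.ESCALATE
        self.nudges += 1
        self.last_nudge = now
        return Action.NUDGE
```

The caller would feed decide the signature from each polling pass and act on the result: type the nudge, do nothing, or send the desktop notification once.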

Two lessons, one small and one large. The small one: a monitor is only as good as its definition of "new," and tuning a threshold three times is the system telling you the threshold was never the right abstraction. The large one, the reason this beat exists at all: reliability is not a property of the happy path. ClaudeLink is not only an inbox and a bulletin board. The part that earns trust is the part that handles the moment the thing you depend on fails. A message bus that cannot survive its upstream is a demo. One that recovers without you, and knows when to stop and ask for help, is infrastructure.

What I Did Not Build, and Why.

Multi-machine support, meaning agents on different computers coordinating across a local area network, is fully designed and unshipped. There are two independent reasons for that, either of which would justify the pause on its own.

The first is hardware. I am on a single laptop right now, and that machine has a ceiling. The point of multi-machine, the one I keep coming back to, is that my second machine joins the team. If I am working on it for a different project, the agents on it are still on the ClaudeLink bus, and I can route a question from one machine to another the way I would route a question to a colleague. The mental model is closer to a shared drive on a home network than to a distributed system. To test that for real I need the second machine, on my desk, on my local network. I do not want to build it against a simulation.

The second is engineering discipline. Each release line has surfaced real-use bugs that my own testing did not: the iTerm 2 paste issue and the Codex env-strip issue in version one point three, the three-round tuning of the recovery watcher in version one point four. Stacking a multi-machine refactor on top of a base that is still settling would make any new failure ambiguous: is it the new layer, or the one underneath it? Soak the existing version first, then add architectural surface.

A note on the security model, because it matters for honest framing: the version one point five multi-machine design is for a trusted home or office network. Bearer-token authentication, no TLS. It is not designed for hostile networks, and I will not describe it that way. The right way to use it is on a network you already trust, between machines you control. If that ever needs to change, it becomes a different design, and a different version number.

The Operating Model.

I built this solo, in the sense established at the top: one person making every architectural call and accepting every line of code, while the typing was distributed across the agents. The commit history is a clean record of that, with a visible model handoff from one generation to the next partway through. ClaudeLink's own development uses ClaudeLink. The agents that built it coordinate through it.

The pattern that emerged across this project, and across the others I have built this way, is the one worth taking away. The next layer of leverage is not a smarter single model. It is coordination plus judgment that has been seasoned on real systems. The model writes the code. Decisions about what to build, what to test, what defaults to set, and where to draw the safety boundary are still human work. Take either half away and the work stops being credible.

I stopped being the bottleneck. The connective tissue handles the message-passing. My time goes to the calls that matter.