
Preface

This book exists because in late April 2026 the CTO of a 20-person hardware company asked one question that has no off-the-shelf answer:

We have a working prototype agent. It was built by a non-developer. We do not yet trust it in production. What is the minimum viable professional stack we should evolve it onto, and what is that decision actually a decision about?

The question masquerades as a tooling shootout. It is not. It is a question about which architectural axes are independent, which are coupled, where the industry’s vendor-selection energy is concentrated today, where the durable substrate will be in 2028, and how a CTO with a 20-person team and a sales-agent-shaped problem can pick a 2026 answer that does not look foolish in 2027.

This book treats the question seriously. It starts from the conceptual frame, walks the component stack one axis at a time, catalogues every credible vendor in each category with pricing, offers decision frameworks, then issues a Hestiia-specific recommendation with the migration sequence. The bibliography — twenty parallel research subagents plus a reviewer pass — is in the appendix.

The pricing claims are dated. Treat any number more than six months old as stale; the agent-tooling market reprices itself two to three times a year, and category leaders are acquired or pivot on a roughly annual cadence.

How to Read This Book

There are three reading paths.

The Linear Path. If you have not yet decided to build an agent farm, read straight through. Part I gives you the conceptual scaffolding; Part II the component vocabulary; Part III the vendor map; Part IV the decision lenses; Part V the worked example. Estimated reading time: three to four hours.

The Decision Path. If you have already decided to build and just need the framework call, skip Part I, skim §2.1, read Part III for the candidates on your shortlist, then jump to Part IV (decision frameworks) and Part V (the Hestiia recommendation as a worked template). Ninety minutes.

The Audit Path. If you have a stack already and want to pressure-test it, read Part IV — especially the disaster catalog and early-warning signs — and Appendix A (vendor pricing index). Use the disaster scenarios as a quarterly audit. Thirty minutes.



Part I — Theory and Foundations

1.1 What is an Agent Farm

The phrase agent farm has earned a specific, narrow meaning in 2026. It is not a chatbot. It is not a single hosted assistant. It is a fleet of long-running, mostly-autonomous LLM workflows that consume events from your business systems — a Slack message, a Pipedrive deal stage change, an inbound email, a webhook from your IoT estate — and produce side effects: a Slack reply, a CRM field update, a calendar invite, a database row, an outbound email. Each agent has a domain (the sales-followup agent, the meeting-prep agent, the RFP-drafting agent) and they share infrastructure: the inference billing relationship, the integration layer, the observability pipeline, the evaluation harness, the durability runtime.

The “farm” framing matters because it forces the right questions. If the question is “what framework should we use to build one agent,” the answer is whichever framework your most senior engineer is fastest in. Frameworks are choices for individual agents. The farm is the shared substrate, and the shared substrate is what determines whether the second agent costs as much to ship as the first or whether it costs a tenth as much. The economics of an agent farm are not the economics of any individual agent.

In Hestiia’s case, the existing CLAWD-SALES-AGENT is a single agent that already exhibits farm-shaped properties: a four-step pipeline (triage, context bootstrap, orchestrator, insight update), sub-agent dispatch, persistent state, multiple inbound event sources (Slack, Pipedrive webhooks), and a persistent insight database. It is, architecturally, a farm of one. The next two agents — the CCTP analyzer, the executive briefing agent — will share its substrate. The decision in front of Hestiia is not “what do we build the next agent on” but “what substrate do we commit to for the next three years.”

1.2 The Agent Loop

Every agent, regardless of framework, regardless of language, regardless of vendor, runs the same loop:

  1. Perceive. Read inbound state — the event that triggered this run, the conversation history, the relevant data from external systems.
  2. Reason. Call a model. Get back either a text response, a tool-call request, or a sub-task delegation.
  3. Act. If a tool was requested, execute it. If a sub-task was requested, spawn a sub-agent. If a text response was produced, persist it and emit it.
  4. Observe. Append the action’s result to state. Persist a checkpoint. Emit a trace span.
  5. Loop or terminate. Decide whether to call the model again or stop.

This loop is so simple that the entire agent-frameworks industry is, in its first-order function, a collection of opinions about how to spell it. Mastra spells it as a typed Workflow with Step primitives. LangGraph spells it as a StateGraph with nodes and conditional edges. PydanticAI spells it as Agent.run_stream. The Anthropic Agent SDK spells it as an Agent with tools and an internal harness. The OpenAI Agents SDK spells it as Runner.run over an Agent config. Strands spells it as a model-driven loop with hooks. Cloudflare Agents spells it as a Durable Object method.
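
For concreteness, here is the bare loop the frameworks are re-spelling, as a minimal Python sketch against the Anthropic SDK. The tool, its dispatch table, and the model string are placeholders, and persistence and tracing (the full forms of steps 4 and 5) are elided:

```python
# A minimal sketch of the bare agent loop. The lookup_deal tool and the
# model string are hypothetical placeholders; no checkpointing or tracing.
import anthropic

client = anthropic.Anthropic()

def run_tool(name: str, args: dict) -> str:
    # Hypothetical dispatch table; real tools call CRMs, Slack, databases.
    tools = {"lookup_deal": lambda a: f"deal {a['deal_id']}: stage=negotiation"}
    return tools[name](args)

def agent_loop(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]   # 1. Perceive
    while True:
        response = client.messages.create(                   # 2. Reason
            model="claude-sonnet-4-5",  # placeholder model id
            max_tokens=1024,
            tools=[{
                "name": "lookup_deal",
                "description": "Fetch a CRM deal by id.",
                "input_schema": {
                    "type": "object",
                    "properties": {"deal_id": {"type": "string"}},
                    "required": ["deal_id"],
                },
            }],
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":               # 5. Terminate
            return "".join(b.text for b in response.content if b.type == "text")
        results = []                                         # 3. Act
        for block in response.content:
            if block.type == "tool_use":
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": run_tool(block.name, block.input),
                })
        messages.append({"role": "user", "content": results})  # 4. Observe
```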

A working heuristic: if a framework’s spelling of the loop is more than fifty lines of code thicker than the model SDK’s bare loop, you are paying for opinions that will turn into liabilities. The best frameworks have negative abstractions — they remove code you would have written rather than adding code you would not have. Mastra and PydanticAI sit on the right side of this line in 2026. CrewAI’s role-based abstraction sits on the wrong side for any agent that is not literally a panel of role-playing experts.

The second-order function — what frameworks actually compete on, once they have spelled the loop — is everything around the loop: durability, sub-agent spawning, state persistence, tool dispatch, observability instrumentation, eval hooks, prompt management, deploy targets. The first-order spelling matters for week two. The second-order plumbing matters for month twelve.

1.3 Statefulness, Durability, and the Hard Problem

The hard problem in production agent infrastructure is not the model call. The model call is a stateless RPC: send a request, get a response, bill in tokens, log a trace. The hard problem is the orchestration around the model calls.

Three properties make orchestration hard.

State is mandatory. An agent without state is a chat completion. An agent with state can resume after a tool call that took forty-seven seconds, branch on a multi-step plan, learn from prior runs in the same conversation, and survive a process restart in the middle of a twelve-step workflow. State is what separates agents from completions. State is also expensive: every step needs to be persisted, the persistence layer needs to be queryable for debugging, and the schema needs to evolve as the agent does.

Tool calls fail. Network errors, rate limits, stale credentials, schema changes in the upstream API, inflight deployments, IAM permission drifts, regional outages. A naive agent loop that does not handle tool-call failure with first-class durability will, at production volume, lose 0.5–2% of runs to causes that have nothing to do with the model. At 50,000 runs per day, that is 250–1,000 silently lost runs daily, and the losses concentrate in your most expensive agents because expensive agents make more tool calls.

Long horizons are the norm. A meaningful sales-followup agent runs for hours, not seconds. It waits for a reply to a Slack thread, queues a follow-up for next Tuesday, fires only when the CRM stage changes. A meaningful diagnostic agent waits for a field technician’s confirmation. A meaningful RFP agent waits for a human in the loop to approve the draft. The naive “run a Python coroutine that calls the model in a loop” pattern dies the moment the agent needs to suspend for more than a process lifetime.

The engineering response to these three properties is durable execution: a runtime that lets you write the agent loop as if it were a single coroutine but persists every step to a backend so that a process crash, a deploy, or a multi-day suspend resumes the agent on a different worker exactly where it left off. Temporal, Restate, DBOS, Inngest, Trigger.dev, AWS Step Functions, Cloudflare Workflows, and the durable-object pattern in Cloudflare Agents are all flavors of the same idea: the runtime is the product. Part II §2.3 and Part III §3.10 dive deeper.

A 2026 production agent farm without a durable execution layer is an outage waiting to be scheduled. Hestiia’s current CLAWD-SALES-AGENT relies on FastAPI plus SQLite plus a custom Python debounce. That works at ten runs per day. It will not work at one thousand.

1.4 The Five Architectural Axes

Reduce the entire vendor map to five questions.

  1. Runtime. What spells the agent loop? Mastra, LangGraph, PydanticAI, Strands, the Anthropic Agent SDK, the OpenAI Agents SDK, Cloudflare Agents — these are runtime choices.
  2. Durability. What persists state across crashes, restarts, and multi-day suspensions? Temporal, Inngest, Trigger.dev, DBOS, Restate, Step Functions, Cloudflare Workflows, or “a state machine we built on Postgres ourselves.”
  3. Observability and eval. What captures traces, metrics, and runs evals? Langfuse, LangSmith, Braintrust, Phoenix, Helicone, Logfire, Inspect AI, Promptfoo, or a self-hosted OpenTelemetry pipeline.
  4. Model gateway. How do you route requests to models, manage keys, fall back across providers, and report costs? LiteLLM, OpenRouter, Portkey, Cloudflare AI Gateway, Vercel AI SDK, Kong AI Gateway, or “we call the SDK directly.”
  5. Integration. How do agents reach the outside world? MCP servers (self-hosted via FastMCP, managed via Composio, Arcade, Pipedream Connect), direct API SDKs, or — the last resort — Anthropic Computer Use and Browserbase for systems that have no API.

These five axes are conceptually independent. You can run the OpenAI Agents SDK on Temporal with Langfuse traces and Composio MCP integrations and no gateway. You can run Mastra on Cloudflare Workflows with Braintrust evals and direct SDK calls. You can run PydanticAI on DBOS with Logfire and FastMCP. The combinatorial space is large because the axes do not couple.

But they couple operationally. Some bundles are easier to run (the Anthropic stack bundles runtime and integration tightly through Skills and MCP, and is opinionated about the rest). Some are harder (Mastra bundles runtime, durability via Inngest, and observability through their own product, which raises lock-in concerns). The mature CTO move is to choose each axis with eyes open about which choices are bundled and which are escapable.

A useful exercise: for any framework you are evaluating, ask “if we needed to leave this framework in nine months, what would we keep and what would we throw away?” The answer tells you which axes the framework actually owns versus which it merely happens to ship with. Frameworks that own only the runtime axis are cheap to leave. Frameworks that own runtime, durability, integration, and observability are platforms — and platforms have a different risk profile.

1.5 IP vs Plumbing — What You Are Actually Buying

Every agent farm has two kinds of assets, and they have inverse half-lives.

The IP is the prompts, the agent decomposition, the orchestration shape, the eval harness, and the operational learnings — what tool-call sequence works for which inbound event, which sub-agent to dispatch when, which model handles the triage step well, which guardrails matter, what failure modes have been seen and patched. IP compounds. A team that has run a sales agent in production for a year has IP that no framework gives them.

The plumbing is the framework code, the durable runtime, the trace exporter, the gateway, the deploy pipeline. Plumbing depreciates. The plumbing that was state-of-the-art in early 2024 — LangChain chains, custom retry loops, ad-hoc Postgres state — is largely worthless in 2026. The plumbing of 2026 will be largely worthless in 2028. Bet that your IP outlasts your plumbing by at least 3x.

The implication for CTO decisions is harsh: the question “what framework should we pick” is mostly a plumbing question. If you pick Mastra and it dies in eighteen months, your prompts and agent decompositions survive — they migrate to whatever replaces Mastra. If you pick LangGraph and it pivots, same thing. The frameworks that try hardest to bundle IP into their abstractions — CrewAI’s role-based personas, AG2’s group-chat metaphor — are the ones whose death takes the most IP with them.

Hestiia’s CLAWD-SALES-AGENT illustrates this. Its plumbing (FastAPI, SQLite, custom debounce, the claude -p subprocess) is replaceable in two to three weeks. Its IP — the four-step pipeline shape, the Haiku-then-Sonnet routing decision, the insight database schema, the change_source anti-loop discipline, the Slack Block Kit reporting style, the 668 tests that encode “what good looks like” — is what the team has actually built. The reimplementation question is a plumbing-replacement question. The IP transfers.

This is also why “vibe-coded by a non-developer” is a manageable provenance, not a fatal one. They built valuable IP and disposable plumbing. A professional rewrite preserves the IP and replaces the plumbing. The risk would be a rewrite that also throws away the IP — by changing the prompt structure, the agent decomposition, or the routing decisions because the new framework “expresses things differently.” That is the rewrite to refuse.

1.6 The Hestiia Frame

The book treats Hestiia as the worked example because the company has a shape that recurs across the mid-market: a 20-person team, polyglot stack (Rust embedded, NestJS cloud, Python tooling, React Native), AWS-committed infrastructure, an existing prototype agent in production, an existing MCP integration discipline (pipedrive-managed), and a CTO who needs the answer to be defensible to both a skeptical IT lead and a board that has heard about agent farms but not paid attention to the second-order details.

The decision Hestiia faces is not greenfield. It is evolution: take CLAWD-SALES-AGENT’s IP, rebuild the plumbing on a substrate that is operationally sound, scales horizontally, ships observability natively, executes long-horizon flows durably, and survives the next two years of vendor churn. The recommendation in Part V is concrete enough to argue with: a primary stack, an alternative if the primary’s premise breaks, and a migration sequence that preserves the IP while replacing the plumbing.

The same frame works for any reader whose situation is “we have a prototype, we want a professional substrate, we want it to age well.” The framework choices change. The axes do not.


Part II — The Component Stack

2.1 Anatomy of an Agent Farm

By the time you finish Part I you have a working definition of “the agent loop” and a working definition of “a farm.” This chapter introduces the second mental model the rest of the book will keep returning to: the five-layer stack that every production agent farm decomposes into, regardless of vendor.

The five layers are runtime, durability, observability, gateway, and integration. Picture them stacked vertically. The runtime is where the loop runs. Durability is the substrate that keeps the loop alive across crashes, deploys, and multi-day suspensions. Observability watches everything the runtime does and turns those observations into traces, metrics, and evals. The gateway brokers calls to model providers. The integration layer — increasingly synonymous with MCP — is how the agent reaches into the rest of the world.

The single most important property of this decomposition is that decisions on each axis are independent. You can run Anthropic’s Agent SDK as your runtime, DBOS as your durability, Langfuse as your observability, no gateway at all, and FastMCP-built servers for integration. Or you can pick LangGraph, Temporal, LangSmith, LiteLLM, and Composio. Or any of several thousand other coherent combinations. None of these tools demands the others; the layers communicate through narrow, mostly-standardised contracts (OTel spans, MCP tool calls, OpenAI-shaped HTTP endpoints).

This is precisely why the temptation to pick a “framework” that bundles them is dangerous. The vendors most aggressively pitching the all-in-one story — LangGraph Platform, Mastra Cloud, OpenAI’s Agent Builder, AWS AgentCore — are doing so because bundling is how they monetise. Each individual layer in their bundle is rarely the best in its category, and the combined lock-in is multiplicative. A framework decision should bind you on at most one layer at a time.

The pattern that has held up across the teams I’ve seen ship and the teams I’ve seen rewrite is this: pick observability first, durability second, integration third, gateway fourth (or never), runtime last. Runtime decisions look the most consequential and turn out to be the most reversible; observability decisions look the most cosmetic and turn out to be the hardest to undo, because the lock-in is in the data you accumulate, not the vendor that holds it. Durability sits second because workflow code embeds business semantics that do not lift-and-shift between engines once a year of state has accumulated.

Why the order looks backwards. It inverts the visible cost gradient. Runtime decisions are loud — they show up in code reviews, hiring debates, and the organisation’s vocabulary; “we’re a LangGraph shop” is a sentence people say. Observability decisions are quiet — they show up once, in a diagram, and then disappear into an environment variable. Permanence runs in the opposite direction of visibility because permanence is decided by what compounds, not by what is talked about.

Observability compounds with data, and the data is the moat. A trace platform’s real lock-in is not the platform — it is the three to nine months of production traces, the eval dataset built from real failures, and the cost-per-conversation history that lets you say “this prompt change saved four thousand dollars a month.” Replacing one trace backend with another is a couple of engineer-weeks. Replacing the dataset is impossible — you cannot regenerate last quarter’s customer interactions, and the human labels on the spans you do have cost real money to produce. The escape valve is to instrument in a vendor-neutral way from day one so you can dual-write or swap backends without abandoning the corpus; the trap is to instrument straight into a vendor’s bespoke shape. That choice is made on day one and is what decides whether the lock-in is real later.

Durability compounds with workflow code that embeds business semantics. A durable workflow is not “Python that retries.” It encodes which boundaries are durable, which steps are idempotent, what counts as one logical operation, and how the engine replays the function on restart. The major engines look isomorphic on a slide — step, activity, workflow — but the details that decide whether a port is mechanical (signal semantics, replay determinism, versioning, draining in-flight workflows during cutover) differ enough that a year-old Temporal codebase does not lift-and-shift to DBOS. A team Hestiia’s size accumulates twenty to sixty durable workflows over two years, plus whatever state is mid-execution at the moment of the swap. That is an engineer-month port at the bad end and a multi-month migration if in-flight state is large. Painful and not catastrophic — durability is second, not first — but the rewrite is real, and that is why a CTO who has run Temporal in production will tell you to pick the engine carefully.

Integration compounds with surface area, and the surface area is portable only if the abstraction is clean. Each tool an agent learns to use ships a contract — a tool definition, the prompt patterns that train the agent to invoke it, the eval cases that gate the behaviour, the customer expectations that depend on the result. Twenty integrations after two years is a typical small-team footprint, twenty engineer-weeks of accumulated work. An MCP-first discipline keeps that surface area portable: swapping a managed integration vendor for a self-hosted server is a configuration change because both expose the same protocol. Embedding a third-party SDK directly in agent code is the discipline that destroys portability. The abstraction is what survives a runtime swap. The vendor inside the abstraction is mechanical.

Gateway is fourth-or-never because most teams do not need one. A small company on a single provider gains almost nothing from a proxy and pays a real latency tax. If you skipped the gateway, there is nothing to undo. If you adopted one and want to swap, change the base URL and redeploy. The rare exception is a year-two team that has built deep on a specific gateway feature — spend tracking, guardrails — but by then the more permanent layers are already settled.

Runtime is last because the loop is small. Prompts are portable. Tool definitions are portable, especially through MCP. Sub-agent topology is portable across every runtime that lives inside your codebase, because they all express it through similar primitives. A single agent ports from one such runtime to another in one to four engineer-weeks. The exception is the runtimes that push the loop out of your process and into a vendor’s control plane — those are not reversible at the same cost, and that distinction is the subject of §2.2. The real warning is this: the runtime decision is the least permanent unless you pick a runtime that makes the layers below it harder to swap. The all-in-one platforms — LangGraph Platform, Mastra Cloud, Managed Agents — monetise by inverting this ordering. They sell you the most-reversible layer first and use it as the wedge to lock the layers below.

The operator’s instruction. Year one: pick observability with vendor-neutral instrumentation, commit to MCP-first integration, defer the rest. Year two: add durability when the first multi-day workflow becomes a real product requirement; pick the engine carefully because it is the second-stickiest decision in the stack. Year three: revisit gateway when a second provider, more than five deployed agents, or a compliance audit makes virtual keys non-negotiable. Pick runtime last and lightly, with the awareness that any framework that bundles the layers below it is selling you reversibility you will later try to claw back.

2.2 The Agent Runtime

The runtime is the thing that executes the loop. As we established in §1.3, the loop itself is small: model call, parse tool requests, run tools, feed results back, repeat until the model stops asking. A “runtime” is the scaffolding around that loop — tool dispatch, message-history management, sub-agent spawning, hook points for policy enforcement, and the API surface your application code talks to.

Every runtime in the market today falls cleanly into one of two camps. The split is more important than which framework wins.

Camp one: agent-code-runs-anywhere. You import a library. The library defines an Agent class (or Workflow, or Graph); you instantiate it; you call .run() from inside whatever process is convenient — a FastAPI handler, a Lambda, a Cloudflare Worker, a Celery task, a desktop CLI. The library does not care where it’s executing. State lives wherever you tell it to live. Mastra, LangGraph, PydanticAI, the OpenAI Agents SDK, Anthropic’s Agent SDK, and Strands all sit in this camp. The contract is the agent is a Python or TypeScript object you own.

Camp two: compute-provider-runs-the-agent. You define the agent declaratively (a config blob, an API call, a console form), hand it to a managed service, and the service owns the process. Anthropic’s Managed Agents (April 2026 beta) is the cleanest example: you create an Agent, an Environment, and a Session, and the session runs on Anthropic’s infrastructure with Anthropic-managed sandboxing, checkpointing, and credential storage, billed at $0.08/session-hour on top of token costs. Cloudflare Agents (Durable Objects as agent sessions, $5/mo Workers Paid plan as the floor), AWS Bedrock AgentCore, OpenAI’s Responses API in its agent mode, and Vercel’s hosted agent surface all sit here. The contract is the agent is an entity in the vendor’s control plane.

The difference is not “self-hosted vs SaaS.” Both camps have SaaS and self-hosted variants. The difference is who owns the process boundary. Camp one keeps the process inside your VPC, your CI, your laptop, your codebase. Camp two pushes the process out and exposes it through a session API.

This matters for three reasons that compound:

First, observability semantics differ fundamentally. In camp one, your existing OTel pipeline ingests agent spans like any other span; the runtime is just code. In camp two, your traces live in the vendor’s console first and only escape via export hooks (which may or may not exist, may or may not be real-time, and almost always lose detail).

Second, debugging looks different. In camp one, you set a breakpoint. In camp two, you read the vendor’s event stream and replay a session in their UI.

Third, the lock-in profile is asymmetric. Camp-one runtimes keep the agent definition (prompts, tool list, sub-agent topology) as code in your repo; porting it to another camp-one runtime is mechanical. Camp-two runtimes encode that definition into proprietary control-plane primitives (Sessions, Environments, Workspaces, Threads); porting requires a control-plane rewrite. Anthropic’s Agent SDK is roughly 70% portable across harnesses; Anthropic’s Managed Agents is roughly 10% portable.

Camp two is genuinely better for some workloads: hour-plus session durations, server-managed credential isolation, sandbox environments your team would otherwise have to operate. Camp one is better for almost everything else, and especially better for organisations whose CI, deploy, and observability discipline is already strong on conventional services.

For Hestiia specifically, the existing CLAWD-SALES-AGENT runs claude -p as a subprocess inside a FastAPI handler — that is squarely camp one, and the natural professional evolution is the Anthropic Agent SDK on API-key billing (still camp one), not Managed Agents. The session-hour line item makes Managed Agents uneconomical for minute-scale sales-agent events, and the loss of dev/prod symmetry — what runs in my terminal runs in CI — is a real regression for a small engineering team.

The runtime decision deserves the least religious commitment of any decision in this book. A working camp-one agent ports to another camp-one framework in one to four engineer-weeks; the prompts, the tool definitions, and the sub-agent topology are genuinely portable. The reason to be careful is not that runtime choice is permanent — it is not — but that camp-two runtimes silently make the other layers harder to swap.

2.3 Durable Execution

Durability is the layer most teams discover the hard way, around month four of running agents in production. The first symptom is always the same: an agent run crashed mid-flight, the user got a half-finished result, the engineer who has to debug it stares at a stack trace that ends inside a tool call to a third-party API, and nobody can tell whether the side effect happened.

Durable execution is the discipline of making that question answerable by construction. The promise is small but specific: every step of an agent’s work is recorded to durable storage before it runs, every result is recorded after it runs, and on crash the runtime resumes from the last recorded point with exactly-once semantics for tool calls. That is what separates “durability” from “retry with exponential backoff.” A retry policy says “try again if the request fails.” A durable execution engine says “the workflow state is in Postgres; the process can die at any instruction; on restart, we replay the journal, skip the steps we already finished, and continue.”

The four properties that come with this guarantee are the ones that matter for agents:

  • Resumable workflows. The agent stops and starts on a different machine, possibly days later, without losing context.
  • Mid-flight crash recovery. A pod restart in the middle of a tool call does not double-charge the user’s credit card or double-send the Slack message.
  • Multi-day suspensions. A “wait 24 hours then check if the deal advanced” step is a first-class primitive, not a cron hack.
  • Exactly-once tool calls. Idempotency is enforced by the engine, not by every tool author hand-rolling dedupe keys.

Ad-hoc retry-and-exponential-backoff buys you none of these. It buys you “the failed HTTP call will be retried.” It does not buy you “the agent that was halfway through a five-step Pipedrive update will resume coherently after the pod was OOMkilled.”

Four architectural patterns dominate the 2026 market.

Temporal-style external orchestrator. A separate cluster (Frontend / History / Matching services backed by Cassandra, MySQL, or Postgres) owns the workflow journal. Your code becomes a “worker” that long-polls task queues over gRPC. Workflows are written as plain code; the engine records every external call as an activity and replays the workflow deterministically on restart. Temporal itself is the category leader (MIT-licensed server, Cloud from $100/mo Essentials with 1M actions included, $50/M overage tapering to $25/M at scale). Restate (BUSL 1.1, single Rust binary, $75/mo Starter for 5M actions) is the modern entrant — same guarantees, dramatically simpler ops because everything runs in one process backed by embedded RocksDB. Both are appropriate when workflows run for days, when you have multi-language services, and when you have an engineer to own a separate cluster.

Postgres-native, in-process library. DBOS (founded by Mike Stonebraker and Matei Zaharia, MIT-licensed library, Conductor control plane $99/mo Pro) takes the opposite stance: no separate cluster, no broker, no new infrastructure. You decorate Python or TypeScript functions with @DBOS.workflow and @DBOS.step, and DBOS records each step into a dbos schema in your existing Postgres, in the same transaction as your business writes. The killer property is that workflow state and business state can be written atomically in one transaction. For any team already on Postgres — which describes almost every backend in 2026 — DBOS is the lowest-blast-radius way to add durability.
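
A minimal sketch of the DBOS pattern, assuming the dbos package and a dbos-config.yaml pointing at the existing Postgres; the CRM and Slack steps are hypothetical placeholders:

```python
# A hedged sketch of DBOS durable execution; verify against current dbos docs.
from dbos import DBOS

DBOS()  # wires the library into your Postgres; no separate cluster

@DBOS.step()  # each step's result is journaled; completed steps
def update_pipedrive_stage(deal_id: str) -> None:  # are skipped on replay
    ...  # call the CRM API here (hypothetical placeholder)

@DBOS.step()
def send_slack_followup(deal_id: str) -> None:
    ...  # post the follow-up message (hypothetical placeholder)

@DBOS.workflow()
def followup(deal_id: str) -> None:
    update_pipedrive_stage(deal_id)
    DBOS.sleep(24 * 60 * 60)  # durable sleep: survives restarts and deploys
    send_slack_followup(deal_id)  # runs once, even if the pod died mid-sleep

DBOS.launch()  # start workers; crashed workflows resume from the journal
```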

Event-driven, serverless-shaped. Inngest and Trigger.dev both ship a developer experience built around “your workflow is just async functions; we run them as durable steps with retries, schedules, fan-out, and a UI.” Inngest has a Python SDK with deep AI primitives (their AgentKit ships TS-only but the durable-step engine is language-agnostic via HTTP). Trigger.dev v4 is TypeScript-first with its own AI tasks layer. Both have generous free tiers (Inngest free under roughly 50k function runs/mo, paid from $20/mo) and aim at the “small Python team with HTTP-shaped workloads” sweet spot. The trade-off is that the SaaS retry budget and the per-step ceiling (Inngest’s serverless steps must complete in under five minutes by default) eventually force a migration to Temporal-shaped infra at large scale.

Hyperscaler workflows. AWS Step Functions ($25/M state transitions for Standard, runs up to a year), Cloudflare Workflows + Durable Objects (folded into the $5/mo Workers Paid plan with 10M requests included), Google Cloud Workflows. These are appropriate when you are already deeply committed to one cloud and want the workflow engine on the same bill as everything else. They are weaker than Temporal/Restate at developer ergonomics and weaker than DBOS at “workflow state next to business state,” but they win on procurement.

The trap to avoid is conflating durability with the agent runtime. The new wave of agent frameworks (LangGraph Platform, Mastra Cloud, OpenAI Agent Builder) bundle a thin durable execution layer into their runtime, and the new wave of durable execution vendors (Inngest, Trigger.dev, DBOS, Restate) bundle thin “agent kits” into theirs. Both are inferior to the dedicated tool in the other category. The pragmatic architecture for any team taking durability seriously is to pick a durable execution engine first as multi-year infrastructure, then plug whichever agent framework wins this quarter inside its workflows. Temporal, Restate, and DBOS keep this separation surgically clean. Inngest and Trigger.dev blur it deliberately, which is fine if you accept that the bundling decision is the lock-in.

For Hestiia at current volume — webhook-driven sales agent, single-language Python team, Postgres already in production via the NestJS backend — DBOS is the boring-correct answer. It imports into the existing FastAPI app with no new deployment unit. Plan a migration option to Restate or Temporal at the 10M-action/month threshold or the moment a workflow that genuinely runs for days becomes a real product requirement.

2.4 Observability and Eval

The single line a CTO needs to internalise about observability is this: you can change agent frameworks every six months, but you cannot change observability that often, because the value compounds with the data you have already collected. Three months of production traces, a curated eval dataset built from real failures, and historical cost-per-conversation tracking are the asset. The framework that produced them is replaceable.

Most teams underweight this because, on day one, observability looks like a thin “log the prompt and response” layer that any junior can write in an afternoon. That works until you need three things simultaneously, and they pull in different directions:

  1. Traces — multi-step agent debugging. Span trees, tool-call inputs and outputs, timing, replay-from-any-step.
  2. Evals — turning real production traces into regression datasets, running LLM-as-judge or human-scored experiments, gating prompt and model changes in CI.
  3. Cost attribution — per-agent, per-customer, per-feature dollar accounting, with token-level granularity tied back to org and user.

Tools that do (1) brilliantly often treat (2) as a bolt-on. Pure eval shops treat (3) as nice-to-have. A FinOps tool does (3) but cannot replay an agent. The decision is which of the three is the dominant pain — and whether you accept being locked to one vendor for all three or compose them.

The OpenTelemetry GenAI convergence is the portability hedge

The most important development of the last eighteen months is that the OpenTelemetry community now publishes GenAI semantic conventions — a standardised way to describe an LLM call as an OTel span (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens), span events for prompt and completion content, plus separate conventions for agent spans (gen_ai.agent.*) and tool-call spans. As of April 2026 the conventions are still officially “experimental” but stabilising fast, with explicit conventions for OpenAI, Anthropic, Azure AI Inference, AWS Bedrock, and the Model Context Protocol.

The adoption inflection point has arrived. Datadog began native GenAI semconv ingestion in OTel collector v1.37. New Relic launched OTel-native AI Monitoring in February 2026. Honeycomb is OTel-native by design. Logfire is built on OTel from day one. Phoenix’s OpenInference instrumentation libraries emit spans that comply with the semconv. Langfuse added a native OTLP endpoint in 2025; Braintrust accepts OTel; LangSmith now ingests OTLP.

The hedge is: instrument once, export anywhere. If your code emits OTel-compliant spans (via OpenLLMetry, OpenInference, or framework-native OTel like PydanticAI’s), you can dual-write to Langfuse and Datadog tomorrow, swap to Phoenix the day after, and keep your traces portable. The cost is one extra abstraction layer; the payoff is escape velocity from any single vendor. Lock-in becomes an instrumentation choice, not a vendor choice. For any company that holds infrastructure decisions for a decade, this matters more than any feature comparison.
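
A minimal sketch of what “instrument once, export anywhere” means in practice: hand-emitting one semconv-shaped span with the OpenTelemetry Python SDK. Real deployments would use OpenLLMetry or OpenInference auto-instrumentation rather than manual attributes, and the gen_ai.* names are, as noted, still experimental:

```python
# Vendor-neutral instrumentation sketch: only the OTLP endpoint changes
# when you swap Langfuse for Datadog, Phoenix, or Logfire.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-farm")

with tracer.start_as_current_span("chat claude-sonnet") as span:
    span.set_attribute("gen_ai.system", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude-sonnet-4-5")  # placeholder
    # ... make the model call, then record usage from the response:
    span.set_attribute("gen_ai.usage.input_tokens", 1234)
    span.set_attribute("gen_ai.usage.output_tokens", 256)
```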

Three sub-categories, three buying decisions

The market has cleanly split into three product shapes. A serious agent farm will end up using one from each category, or one platform that genuinely covers all three.

Trace platforms. Langfuse (MIT-licensed core, ClickHouse-acquired, $0/$29/$199/$2,499 per month tiers, free self-host that is genuinely free with no open-core trick), LangSmith (LangChain Inc., $39/seat/mo Plus, tightest LangGraph integration, weakest outside it), Phoenix (the trace UI is ELv2-licensed — source-available, free for internal use, restricts only redistribution as a managed service — paired with OpenInference which is Apache-2.0 and is the OTel-aligned instrumentation library; AX Pro at $50/mo for the managed instance), Helicone (Apache-2.0 proxy-based observability — in maintenance mode following Mintlify’s 2026-03-03 acquisition; not a 2026 procurement option, the slot is now LiteLLM), Logfire (Pydantic, OTel-pure, Personal tier free for 10M spans/mo and $49/mo Team for the next bracket).

Eval platforms. Braintrust (closed-source, premium, $249/mo Pro plus $3/GB after 5 GB, the best eval-velocity loop on the market with their “Loop” autonomous prompt-edit agent), LangSmith Datasets (only worth it if you are committed to LangChain), Inspect AI (UK AI Safety Institute, MIT, the de facto standard for safety evals — Apollo Research and METR use it, free), Promptfoo (OSS, GitHub-Action-friendly, free). The OSS pair — Inspect AI plus Promptfoo — is the missing default for any team running self-hosted Langfuse.

Cost / FinOps. Generic FinOps tools like Vantage with LLM-cost connectors, plus the cost dashboards built into Langfuse, Logfire, and Braintrust. Cost attribution is increasingly table-stakes inside the trace platforms; standalone FinOps for LLMs has not coalesced into a category leader. LiteLLM’s spend tracking is the simple OSS option if the gateway is already on the path.

Why eval-as-code matters by 2028

A specific prediction: by 2028, prompt and model changes that ship without a CI-gated eval run will be treated the way deploys without tests are treated today — as a process failure, not a stylistic choice. The teams that win are the teams whose eval datasets are versioned in git alongside the prompts they evaluate, whose CI pipelines block merges when scoring regresses, and whose production traffic is sampled into the eval dataset automatically. This is what Braintrust productises and what Inspect AI gives you for free.

The reason this matters now, not in 2028, is that the data is the moat. Curating an eval dataset takes three to nine months of accumulated production traces and human-judged outcomes. A team that starts in 2026 has a usable dataset by 2027; a team that starts in 2027 spends 2028 catching up. Pick observability now, with eval-as-code in mind, even if you ship the first eval six months later.
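
A sketch of the eval-as-code shape using Inspect AI; the task, sample, and scorer choice are illustrative, and the API should be verified against the current inspect_ai release:

```python
# Eval-as-code sketch: dataset and scorer live in git next to the prompt.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate

@task
def followup_regression() -> Task:
    return Task(
        dataset=[
            Sample(
                input="Deal moved to 'negotiation'. Draft the follow-up.",
                target="Mentions the updated stage and proposes a next step.",
            ),
        ],
        solver=generate(),
        scorer=model_graded_fact(),
    )

# CI gate (e.g. a GitHub Actions step):
#   inspect eval evals/followup.py --model anthropic/claude-sonnet-4-5
# Fail the job when the score regresses against the stored baseline.
```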

For a Hestiia-shaped team — Python-first, Postgres-native, infra-conservative, roughly 3M traces a year at most — the right composition is Langfuse self-hosted as the primary trace backend (~$1,200/yr infra), OpenInference instrumentation so Phoenix is always a free escape hatch, Inspect AI for offline regression evals, and Helicone-as-proxy skipped given the maintenance-mode uncertainty. Total cost ceiling: roughly $1,500/yr plus an engineer-week of setup. Anything in the Datadog-LLM-Obs or LangSmith-Plus range is paying for ecosystem fit, not capability.

2.5 The Model Gateway

A model gateway is a network hop — or, in its lightest form, a library — between your code and the LLM provider. Three real reasons exist for deploying one:

First, decouple model code from provider. Your application calls one OpenAI-compatible endpoint; the gateway handles whatever Anthropic, Bedrock, Vertex, or local-vLLM is on the other side. Second, cache, route, and fail over at the infrastructure layer: prompt caching, semantic caching, automatic retries, model fallbacks, load balancing across regions and providers. Third, cost, policy, audit: per-team virtual keys with budgets, PII redaction, prompt-injection guardrails, request logs that satisfy SOC2 / ISO 27001 evidence requirements.

The honest answer for most teams reading this book is that you do not need a gateway today. A 20-person company on a single provider, calling Claude or GPT from a handful of internal tools, gains very little from inserting a proxy and pays a real latency tax for the privilege. The trap is symmetric: not “we picked the wrong gateway” but “we built our app so coupled to @anthropic-ai/sdk that adding a gateway later is a two-week refactor.”

The right move is the cheap abstraction at the code layer so the swap stays cheap when it eventually happens. Three patterns split the market.

SDK-level abstractions. The Vercel AI SDK (TypeScript, the ai npm package) gives you generateText, streamText, structured output, and tool use with a unified API across providers. LiteLLM as a library (Python) does the same — drop-in replacement for the openai SDK, supports 100+ providers, no separate process to operate. This is the “one wrapper file” answer. For Hestiia today this is the correct stopping point: standardise on PydanticAI’s provider abstraction or wrap Anthropic calls behind a thin internal interface and stop.
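
The “one wrapper file” pattern, sketched with LiteLLM as a library; the model strings are placeholders:

```python
# Thin internal chat boundary: agents import chat() instead of a provider
# SDK, so adding a gateway later means editing this file, not the call sites.
from litellm import completion

def chat(messages: list[dict], model: str = "anthropic/claude-sonnet-4-5") -> str:
    response = completion(model=model, messages=messages)
    return response.choices[0].message.content

# Swapping providers is a string change at this one boundary:
# chat(msgs, model="openai/gpt-4.1"), or a self-hosted vLLM endpoint
# behind the same OpenAI-compatible shape.
```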

Proxy gateways. LiteLLM in proxy mode (MIT, ~240M Docker pulls, free self-host with roughly $500–3,000/mo real cost when you count Postgres, the container host, and DevOps time; Enterprise tier custom for SSO and audit logs). Portkey (MIT-licensed open core, hosted Production tier $49/mo for 100K logs, then $9 per additional 100K up to 3M; Enterprise custom). Cloudflare AI Gateway (free on every Cloudflare plan, with the gotcha being log quotas: 100K/mo on Workers Free, 10M per gateway on Workers Paid, then Logpush at $0.05/M). The proxy adds a network hop and a piece of infrastructure to operate; the payoff is virtual keys, spend tracking, fallback routing, and per-team budgets without touching application code.

Enterprise API gateways. Kong AI Gateway, Apigee with AI plugins. Best fit is companies already running Kong or Apigee for their REST APIs and looking to extend the same governance to LLM traffic. For greenfield AI-only deployments LiteLLM and Portkey are lighter and cheaper.

OpenRouter deserves a mention as the hosted-aggregator option — point your client at openrouter.ai/api/v1, top up credits, call any of ~300 models behind one OpenAI-compatible API, pay a 5.5% fee on credit purchases. Right answer for prototyping and hobby projects, wrong answer for a 20-person company that already has direct provider contracts.

The decision rule is straightforward. Re-evaluate gateway when any of the following lands: a second provider becomes part of the production stack, more than five deployed agents need shared budgets and keys, or compliance asks for centralised audit logs. At that point default to LiteLLM proxy (if you want OSS and self-host) or Cloudflare AI Gateway (if you are already on Cloudflare). Until then, one wrapper file is enough.

2.6 The MCP and Integration Layer

The Model Context Protocol won the integration war by being the only credible standard at the moment the agent-tool problem became acute. Every major framework ships an MCP client; every major SaaS company is shipping MCP servers; the legacy “OpenAPI plugin” approach is dead in everything except niche corners of Microsoft’s ecosystem.

The reason MCP won is structural rather than political. An agent’s tool list is fundamentally a runtime concern: which tools are available depends on the user’s permissions, the current task, the agent’s sub-agent topology, and the model’s context budget. A static OpenAPI spec describes a service; MCP describes a runtime tool inventory — which tools exist, what their descriptions are, what schemas they accept, what resources they expose. That distinction is small on the page and load-bearing in production.

What the 2025-11-25 spec actually changed

The latest spec is 2025-11-25, released November 25, 2025. The political event was December 2025: Anthropic donated MCP to the Linux Foundation (under a directed fund commonly referenced as the Agentic AI Foundation; the source material is inconsistent on the exact entity name, so verify before quoting in any board document). MCP is no longer a single-vendor standard — it has Core Maintainers, a contributor ladder, and four 2026 priority areas (transport scalability, agent communication, governance maturation, enterprise readiness).

Three substantive additions matter for production:

OAuth 2.1 with PKCE is now mandatory for public remote servers. Dynamic Client Registration is replaced by Client ID Metadata Documents (CIMD): clients publish a JSON file at an HTTPS URL and that URL is the client ID. RFC 8707 Resource Indicators are now required, so tokens are bound to a specific MCP server and cannot be replayed against another. CIMD has been in the wild for roughly five months; expect rough edges in implementations through 2026.

Tasks primitive (experimental). A first-class abstraction for long-running operations, designed for agent workflows where a tool call legitimately takes minutes or hours (rendering, batch processing, human-approval steps). Until Tasks lands GA, the workaround is to make tools return a job ID and require the agent to poll — clunky but workable.

Transports. Two are blessed: stdio (local processes) and Streamable HTTP (with SSE for server-initiated messages). The legacy “HTTP+SSE” two-endpoint transport from the 2024 spec is deprecated. Anything you build today should be Streamable HTTP for remote and stdio for local.

Three product shapes

The MCP ecosystem has split into three distinct buying decisions, and conflating them is how teams over-spend.

Self-hosted MCP servers. You write the server yourself. The dominant Python framework is FastMCP (prefecthq/fastmcp, roughly 70% of Python MCP servers, ~1M daily downloads; FastMCP 3.0 shipped January 2026 with component versioning, granular authorization, OpenTelemetry instrumentation, and multi-provider OAuth). The dominant Rust option is rmcp. The dominant TypeScript option is the official @modelcontextprotocol/sdk. Self-hosted is the right answer for any internal tool that talks to your own systems — your existing CRM wrapper, your internal databases, your bespoke business logic. Hestiia’s pipedrive-managed and the internal MCP Manager fall in this bucket.
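
A sketch of a self-hosted server on FastMCP; the tools are hypothetical stand-ins for Hestiia-shaped integrations, and the long-running tool shows the job-ID-plus-poll workaround described above for the not-yet-GA Tasks primitive:

```python
# Hedged FastMCP sketch; tool bodies are stubs where real API calls go.
from fastmcp import FastMCP

mcp = FastMCP("hestiia-crm")

@mcp.tool
def get_deal(deal_id: str) -> dict:
    """Fetch a Pipedrive deal by id (stubbed here)."""
    return {"deal_id": deal_id, "stage": "negotiation"}  # call Pipedrive REST here

@mcp.tool
def start_cctp_analysis(document_url: str) -> str:
    """Start a long-running analysis; returns a job id the agent polls
    via check_job() -- the pre-Tasks workaround."""
    return "job-0001"  # enqueue the real work here

@mcp.tool
def check_job(job_id: str) -> dict:
    """Poll a job started by start_cctp_analysis."""
    return {"job_id": job_id, "status": "running"}  # look up real status here

if __name__ == "__main__":
    mcp.run()  # stdio for local use; Streamable HTTP for remote deployment
```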

Managed MCP runtimes. Composio (~250+ apps, $29/mo Hobby for 200K tool calls, $229/mo Business for 2M calls, Enterprise custom for SOC-2 and VPC), Arcade.dev (Composio’s most direct competitor, $25/mo Growth tier with hosted MCP servers at $0.05/hr, startup program for sub-100-employee companies), Pipedream Connect (~3,000 APIs and 10,000+ tools, the breadth play), Smithery (the registry/marketplace, free to list and free to install, hosted execution at usage-based pricing). These earn their keep when you need dozens of SaaS integrations with end-user OAuth — multi-tenant agent products where each end-user has to authorise their own Google Calendar, Gmail, HubSpot, Notion, Linear, Salesforce. Building per-user OAuth at that breadth in-house is a six-month project; Composio or Arcade is one API call.

MCP gateways (emerging category). The gateway pattern from the model layer (rate-limit, fail-over, virtual keys, observability) is starting to appear for MCP. Speakeasy’s Gram, Docker’s MCP gateway proposal, and TrueFoundry are early entrants. The category is not yet mature enough to have a clear leader, and the value proposition only crystallises at large enterprise scale (hundreds of MCP servers, multi-tenant routing, central audit). For most teams in 2026 this is a watch-do-not-buy category.

Code Mode is the architectural pattern that will outlast the spec details

Cloudflare introduced Code Mode in early 2026, and it is the most important MCP design pattern of the year regardless of who hosts your servers. The naive MCP integration dumps every tool definition into the model’s context: a 2,500-endpoint API can chew through 1.17M input tokens before the agent has done any actual work. Code Mode inverts the contract. The server exposes two tools — search() and execute() — and a tool-spec resource. The agent queries the spec when it needs to understand what is available, then writes small TypeScript snippets that the Worker runs in a sandboxed isolate. Cloudflare’s measured result on their own ~2,500-endpoint API: input tokens dropped from 1.17M to ~1K, a 99.9% reduction.

The pattern generalises far beyond Cloudflare. Any MCP server with more than ~30 tools should consider exposing a Code Mode interface alongside (or instead of) the per-tool list. The cost of the sandboxed isolate is real (this is what §2.8 is about) but the context savings dwarf it once tool counts go past triple digits.
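
A sketch that transposes the Code Mode contract into the Python/FastMCP vocabulary used elsewhere in this book; Cloudflare’s real implementation runs TypeScript in Worker isolates, and the bare exec() here stands in for the sandbox purely to keep the sketch self-contained:

```python
# Code Mode shape: two tools instead of thousands of per-endpoint tools.
from fastmcp import FastMCP

mcp = FastMCP("code-mode-gateway")

SPECS = {  # toy stand-in for a generated tool-spec catalog
    "deals.get": "deals_get(deal_id: str) -> dict -- fetch one CRM deal",
    "deals.update_stage": "deals_update_stage(deal_id: str, stage: str) -> None",
}

@mcp.tool
def search(query: str) -> list[str]:
    """Return only the spec lines matching the query, instead of shipping
    the full catalog into the model's context."""
    return [spec for name, spec in SPECS.items() if query.lower() in name]

@mcp.tool
def execute(code: str) -> str:
    """Run a small agent-written snippet against bound API helpers."""
    bindings = {"deals_get": lambda deal_id: {"deal_id": deal_id, "stage": "won"}}
    local_vars: dict = {}
    exec(code, bindings, local_vars)  # UNSAFE outside a real sandbox/isolate
    return repr(local_vars.get("result"))
```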

The integration discipline as a high-confidence bet

The MCP-first integration discipline — every external tool is an MCP server, no agent talks to a third-party API directly — is the highest-confidence architectural bet a CTO can make in 2026. Frameworks come and go; MCP is now Linux Foundation infrastructure with multi-vendor SDK commitment. A tool defined as an MCP server today works against Anthropic’s SDK, OpenAI’s Agents SDK, mcp-use, LangChain, LangGraph, Mastra, PydanticAI, and Cursor. The integration layer is the most portable layer of the stack, and that property is durable.

For Hestiia: keep the internal MCP Manager and pipedrive-managed architecture. Build new internal servers on FastMCP 3 (Python) or rmcp (Rust) with OTel from day one. Reach for Composio or Arcade only when an end-customer-facing agent ships that needs per-user OAuth at SaaS-breadth scale.

2.7 Agent Memory

Memory is the layer most teams discover by accident, usually when an agent that worked beautifully in week one starts behaving like an amnesiac in week eight. The cause is invariably the same: every relevant fact about the user, the deal, or the prior conversation has been crammed into the system prompt or the most recent few messages, the prompt is now 80,000 tokens long, the model is dropping detail, and “just stuff more context” stops scaling.

Memory is the discipline of storing and retrieving the things an agent should know across runs, beyond what fits in the active context window. It splits into three meaningfully different shapes:

Session state — the working memory of a single conversation: messages, scratch notes, intermediate tool results. Lives for the duration of a thread. Every agent framework gives you this for free, usually backed by the same database that durability uses.

Long-term episodic memory — the agent’s record of past interactions. “We talked about pricing last Tuesday; the customer pushed back on the BET margin.” Episodic memory is the thing that makes a sales agent feel like it remembers the relationship. The hard problem is not storage; the hard problem is retrieval — pulling the right three episodes out of three thousand without re-running an LLM over the whole archive.

Semantic memory — distilled facts about entities, independent of the episodes that produced them. “This MoA prefers Sonnet-quality answers in French.” Semantic memory looks like a CRM with structured fields, except the fields are written by the agent in natural language. This is where prompt-stuffing fails earliest and most visibly.

Two product categories address this problem.

Memory baked into agent frameworks. Mastra Memory (working memory plus semantic recall plus threads plus resources, with pluggable LibSQL/Postgres/Upstash/MongoDB backends), LangGraph’s checkpointer, the OpenAI Agents SDK’s thread storage, PydanticAI’s memory hooks. These are good defaults for the framework you are already using and acceptable for most teams. The risk is benchmark-tuning: Mastra’s memory has been criticised for being optimised against LongMemEval rather than robust to adversarial production traffic. Trust framework memory for the obvious cases; reach for a dedicated tool when memory becomes a load-bearing product feature.

Dedicated memory products. Three names matter.

  • Mem0 (YC, OSS-core, Pro $19/mo, free tier with self-host path) is the developer-ergonomic option — REST API, simple SDK, “just call mem0.add() and mem0.search()” (see the sketch after this list).
  • Letta (formerly MemGPT, academic-rooted, OSS) implements the MemGPT paper’s notion of OS-style memory hierarchies, with first-class concepts of “core memory” (always-in-context), “archival memory” (vector-searchable), and “recall memory” (full-text-searchable). The conceptual model is clearer than Mem0’s; the developer experience is rougher.
  • Zep (Cloud Starter $39/mo, OSS Community Edition for self-host) is the most production-mature, with temporal knowledge graphs as the differentiator — facts have validity windows, so “the customer’s preferred BET was Acme until March 2026, then switched to Bravo” is a first-class query rather than a prompt-engineering exercise. For relationship-heavy agents (sales, customer success, account management) Zep’s temporal-graph model is genuinely the right shape.
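
A sketch of the Mem0 shape from the list above, using the OSS client; default backends and exact return shapes vary by release, so treat this as the pattern rather than a pinned API:

```python
# Hedged Mem0 sketch; the user_id and stored facts are hypothetical.
from mem0 import Memory

memory = Memory()

# After a sales call, persist what the agent learned about the account:
memory.add(
    "Customer pushed back on the BET margin; prefers answers in French.",
    user_id="acme-moa",
)

# On the next run, retrieve only the relevant facts instead of stuffing
# the whole history into the prompt:
hits = memory.search("pricing objections", user_id="acme-moa")
for hit in hits["results"]:
    print(hit["memory"])
```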

The decision rule: start with framework-baked memory; graduate to a dedicated product when memory becomes a discriminating product feature, not a debugging aid. For Hestiia’s CLAWD-SALES-AGENT, the existing insight DB schema is already a hand-rolled semantic-memory store; the natural professional evolution is either Mastra’s working memory pattern (if porting to TypeScript) or Zep self-hosted (if staying in Python and treating memory as the load-bearing capability of the agent). Both are credible; Zep is the one to pick if the deal-history question — “what did we already learn about this MoA?” — is the question that defines the product.

2.8 Code Sandboxes and Browser Automation

The moment an agent needs to execute code it just generated, scrape a page that has no API, or run an arbitrary tool result in isolation, the runtime needs a sandbox. Building this in-house — Firecracker microVMs, container hardening, network egress policies, filesystem snapshotting — is weeks to months of fragile work that nobody wants to maintain. The market has split this cleanly into two product categories.

Code sandboxes. E2B (the category leader, hosted Firecracker VMs, Python and JS runtimes, Pro $150/mo for 8 hours of concurrent compute, pay-as-you-go ~$0.000014/CPU-sec) is the standard answer to “my agent needs to run code it generated.” Boot a VM in under a second, stream stdout, snapshot, kill. Modal (originally a serverless-compute platform, sandboxes are a feature; pay-per-second, generous free tier) is the right answer if you are already on Modal for ML workloads. Daytona (open-source dev-env-as-a-service, sandbox API as a feature) is the OSS self-host option. Pyodide and Deno in the runtime itself — Cloudflare Workers’ isolates, Vercel’s Edge Runtime — are the right answer when sandboxing requirements are modest (bounded CPU, no network, no filesystem) and the cold-start cost of a real VM is unacceptable.

Browser-for-agents. Browserbase ($40M Series B, the dominant managed headless-browser-for-agents, free tier, ~$0.05/min per session, team plans $99–$499/mo) handles the boring-hard problem: a fleet of pre-warmed headless Chromium instances with proxy rotation, captcha handling, session recording, and anti-bot evasion. Stagehand is Browserbase’s TypeScript framework for writing browser-driving agents. Browser Use (MIT, Python, ~50k stars) is the OSS alternative — same shape, you operate the headless Chrome yourself.

The build-vs-buy decision here is starker than in any other layer of the stack. A code sandbox built in-house is a multi-month project even with Firecracker primitives; a browser fleet built in-house is a permanent on-call burden. Buy E2B and Browserbase unless your scale or compliance posture genuinely demands self-hosting. For Hestiia today, neither is a current need — the sales agent does not generate code that needs running, and Pipedrive’s API covers what would otherwise be browser automation. But the day a CCTP-analyser needs to run user-supplied Python over Pipedrive exports, or a sales agent needs to log into a French BET admin portal that has no API, E2B and Browserbase are the right “buy” defaults.

The architectural pattern that matters: treat the sandbox as a tool exposed via MCP, not as a runtime concern of the agent framework. An MCP server fronting E2B (run_python(code)) and another fronting Browserbase (navigate(url), extract(selector)) keeps the integration layer clean and lets you swap providers without rewriting the agent.
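A minimal sketch of that pattern, assuming FastMCP and the E2B Python SDK (the run_python name mirrors the example above; both SDKs drift between versions, so verify signatures against your pinned releases):

```python
# sandbox_mcp.py - an MCP server fronting E2B, sketched with FastMCP.
# Assumes `pip install fastmcp e2b-code-interpreter` and E2B_API_KEY in the env.
from fastmcp import FastMCP
from e2b_code_interpreter import Sandbox

mcp = FastMCP("sandbox")

@mcp.tool()
def run_python(code: str) -> str:
    """Execute untrusted Python in a fresh E2B microVM and return stdout."""
    with Sandbox() as sandbox:  # boots a fresh Firecracker VM in about a second
        execution = sandbox.run_code(code)
    return "".join(execution.logs.stdout)

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; any MCP client can now call run_python
```

Swapping E2B for Modal or Daytona is then a change to the tool body; the agents never see it.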


2.9 Computer Use and “When There’s No API”

Computer Use is the integration of last resort. When MCP does not exist, when the SaaS has no public API, when the legacy admin portal has not been touched since 2014 and is held together by jQuery and a session cookie, an agent’s last option is to drive the screen the way a human does — see pixels, move the mouse, type characters, read what comes back.

Anthropic Computer Use (a beta tool surface on Claude, billed at standard token rates plus the cost of screenshot tokens — image input meters at the standard input rate, and a single 1280×800 screenshot is roughly 1,500–2,000 tokens) is the production answer in 2026. The OpenAI equivalent (Computer Use Agent in the Responses API) lags slightly on success rate but is functionally similar. Both are token-expensive: a multi-step screen task can easily burn 50k–100k tokens of screenshot input alone, putting cost-per-task in the $0.50–$2 range for Sonnet and several dollars for Opus.

The trade-offs are unambiguous. Computer Use is slow (multi-second per action because every step requires a screenshot round-trip), fragile (a CSS change in the target site breaks the run), and expensive (token-heavy). It is also the only thing that works when there is genuinely no API. The right framing for a CTO is: budget Computer Use for the long tail, never the happy path.

For a sales-agent farm specifically, the calculus matters. Pipedrive MCP plus REST handles 95% of CRM actions. Most B2B SaaS now ships an API. But legacy installer portals, French RE2020 admin tools, supplier extranets — these regularly have no API and no plans to ship one. A sales agent that needs to “log into the supplier portal and download the PO PDF” is a textbook Computer Use task. Browserbase plus Stagehand is the cheaper, faster alternative when the target is a structured web page; Computer Use wins when the target is a Citrix-rendered desktop application, a Flash-era legacy interface, or an OS-level workflow.

Architecturally, Computer Use should be exposed to the agent as one MCP tool, behind a strict allowlist. Treat it the way you would treat raw shell access: powerful, last resort, audited at every invocation. A CTO masterclass that omits this category gives the false impression that MCP plus browser automation covers the integration surface. They cover the typical case. Computer Use covers the case that defines whether a sales-agent farm can touch the bottom 5% of integrations that block real revenue.
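What "behind a strict allowlist, audited at every invocation" looks like in code: a hedged sketch with a hypothetical supplier portal as the target and the actual screen-driving loop stubbed out.

```python
# computer_use_gate.py - allowlist-plus-audit discipline for a Computer Use
# tool exposed over MCP. The gate and the audit trail are the point; the
# drive_screen() stub stands in for whichever Computer Use loop you run.
import json
import time
from urllib.parse import urlparse

from fastmcp import FastMCP

ALLOWED_HOSTS = {"portal.example-supplier.fr"}  # hypothetical allowlist

mcp = FastMCP("computer-use")

def audit(entry: dict) -> None:
    """Append-only audit trail: every invocation, allowed or refused."""
    with open("computer_use_audit.jsonl", "a") as f:
        f.write(json.dumps({"ts": time.time(), **entry}) + "\n")

def drive_screen(url: str, instruction: str) -> str:
    raise NotImplementedError("plug in your Computer Use loop here")

@mcp.tool()
def computer_use_task(url: str, instruction: str) -> str:
    """Drive a screen session against an allowlisted host. Last resort."""
    host = urlparse(url).hostname or ""
    allowed = host in ALLOWED_HOSTS
    audit({"url": url, "instruction": instruction, "allowed": allowed})
    if not allowed:
        raise ValueError(f"{host} is not on the Computer Use allowlist")
    return drive_screen(url, instruction)

if __name__ == "__main__":
    mcp.run()
```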


2.10 Voice Agents

Voice agents are a sibling category to the agent farm, structured identically except the I/O substrate is audio rather than text. The architecture is STT → LLM → TTS + telephony, with sub-500ms end-to-end latency as the table-stakes target. Three vendors have crystallised:

Vapi (~$0.05/min plus model and provider passthrough costs, the developer-platform leader), Retell ($0.07–$0.10/min, slightly more polished out of the box), and Bland AI (the most aggressive on outbound calling, opinionated stack, harder to integrate with custom logic). All three handle the boring-hard parts: SIP trunking, real-time speech-to-text streaming, interruption handling, voice cloning, call recording, latency budgeting. Each is “agent farm but for phone calls.”

The reason a CTO masterclass must name them even when the recommendation is “skip for now” is that AI SDR-as-a-service is the single fastest-growing agent product category of 2026, and any executive reader will assume you covered it. The market signal is clear; the technical pattern is clear; the only question is whether the specific company’s go-to-market includes outbound voice. For a hardware company selling through B2B promoter and BET channels, the answer is “not yet, but probably within twenty-four months when the inbound support call volume crosses the threshold where deflection pays for itself.”

For Hestiia, skip in 2026, revisit in 2027 when the device fleet generates enough support calls to make Vapi or Retell a buy-not-build decision. The architectural prep is trivial: any tool exposed via MCP for the text agent works equally for the voice agent, so the integration layer carries forward unchanged.


2.11 The Decision Tree

Eleven sections in, the temptation is to read this as eleven independent decisions and feel paralysed. The actual decision tree is shorter than it looks, and the order matters more than the choices.

First, decide observability. This is the most permanent layer; the data compounds; the OTel instrumentation choice survives every other change. For most teams the right answer in 2026 is Langfuse self-hosted with OpenInference instrumentation, Inspect AI for offline evals, Phoenix as the always-free escape hatch. This is a one-week setup and a five-year asset.

Second, decide the integration discipline. The rule is one sentence: every external tool is an MCP server, no agent talks to a third-party API directly. Build internal servers on FastMCP 3 (Python) or rmcp (Rust), reach for Composio or Arcade only when end-user OAuth at SaaS-breadth becomes a real product requirement. This decision is essentially free to defer; the discipline is what matters.

Third, decide durability. If you are already on Postgres, the answer is DBOS for the lowest-blast-radius path. If you are not on Postgres, the answer is Temporal (production scale) or Inngest/Trigger.dev (small Python team, HTTP-shaped workloads). This decision can be deferred until the first real “agent crashed mid-flight, what happened?” incident, but the team that defers past two such incidents is choosing pain.

Fourth, decide the runtime. This is the most reversible decision in the stack, yet it attracts a disproportionate share of the attention. For Python-first teams already on Claude, the Anthropic Agent SDK on API-key billing is the lowest-friction path. For TypeScript shops with a need for a non-engineer prompt-editing UI, Mastra is the strongest option. For Python-first teams that want pure OTel and provider-agnostic typing, PydanticAI plus Logfire is the boring-Python answer. Whichever you pick, the prompts and tool definitions port in days, not weeks.

Fifth, defer the gateway. Until you have a second provider, more than five deployed agents needing shared budgets, or a compliance audit asking for centralised logs, one wrapper file is the right answer (a sketch of that file follows this list). Plan to introduce LiteLLM or Cloudflare AI Gateway at the inflection point; do not build it now.

Sixth, watch memory, sandboxes, browser automation, computer use, and voice as separable buy-not-build decisions. Each becomes urgent at a specific product threshold (memory when deal-history retrieval becomes load-bearing; sandboxes when the agent generates code; voice when inbound support volume justifies deflection). None should be over-architected today. Each has a clear default vendor (Zep, E2B, Browserbase, Anthropic Computer Use, Vapi) when the threshold arrives.
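The promised wrapper-file sketch from the fifth decision: one choke point for every model call, with an illustrative in-memory budget guard and assumed Sonnet-class pricing constants (a real deployment would persist spend somewhere durable).

```python
# llm.py - the entire "gateway" until the inflection point arrives.
# One wrapper for all model calls: logging, a spend guard, one retry.
import logging
import time

import anthropic

log = logging.getLogger("llm")
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

BUDGET_USD = 500.0  # illustrative monthly cap
PRICE_IN, PRICE_OUT = 3.00 / 1e6, 15.00 / 1e6  # assumed Sonnet-class rates
_spend_usd = 0.0  # in-memory only; persist this in anything real

def complete(system: str, user: str, model: str = "claude-sonnet-4-5") -> str:
    """Single entry point every agent uses; pin model strings explicitly."""
    global _spend_usd
    if _spend_usd >= BUDGET_USD:
        raise RuntimeError("monthly LLM budget exhausted")
    for attempt in (1, 2):  # one retry on transient API errors
        try:
            resp = client.messages.create(
                model=model,
                max_tokens=1024,
                system=system,
                messages=[{"role": "user", "content": user}],
            )
            break
        except anthropic.APIStatusError:
            if attempt == 2:
                raise
            time.sleep(2)
    _spend_usd += (resp.usage.input_tokens * PRICE_IN
                   + resp.usage.output_tokens * PRICE_OUT)
    log.info("model=%s in=%d out=%d spend=$%.2f", model,
             resp.usage.input_tokens, resp.usage.output_tokens, _spend_usd)
    return resp.content[0].text
```

When LiteLLM or AI Gateway arrives, this file is the only thing that changes.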

The pattern across all five primary decisions is the same: pick the layer that maximises portability and minimises operational surface, accept that runtime choice is reversible, and protect the layers (observability, integration) where the data compounds and the lock-in is real. Part III walks the specific vendors in the depth this chapter deliberately avoided.


Part III — The Vendor Landscape

Section A — Agent Frameworks and Hosted Runtimes

The frameworks in this section are the ones a CTO actually picks as the primary runtime for an agent farm — the load-bearing decision that determines how agents are defined, how they call tools, where their state lives, and what shape the production stack takes around them. Adjacent stacks that show up under or next to a primary runtime — durable execution standalones (Temporal, DBOS, Restate, Hatchet), observability-only platforms (Langfuse, Braintrust, Arize), MCP infrastructure, browser/sandbox products, voice and memory specialists — live in Section B. The split is deliberate: this section is the orchestration choice; Section B is everything that orchestration choice forces you to also pick. Each chapter follows the same scaffold — what it is, vendor health, killer feature, pricing, fit for Hestiia, the month-12 disaster, and a one-line verdict — and the chapters are sized to importance: the three real candidates (Mastra, LangGraph, Anthropic) get the full treatment; the rest get enough to defend the recommendation against an internal challenger.

Section B — Adjacent Stacks

Section A walked through the primary agent runtimes — the libraries and platforms a team actually imports to define an agent, hand it tools, and hit run(). That is the most visible decision, but it is not the most expensive one. Around the runtime sits a second ring of choices that compound differently: durability, observability, model gateways, MCP infrastructure, low-code adjuncts, and the categories most CTOs forget on the first pass — agent memory, code sandboxes, browser automation, computer use, voice, coding-agent reference products, hosted research, and prompt management. These pieces are easy to swap individually and hard to swap collectively. Section A picks the engine; Section B picks the fuel system, the dashboard, the wiring loom, and the seatbelts. The chapters below walk that ring in order — durable execution first, prompt management last — with pricing and a Hestiia-shaped verdict for each.


3.1 Mastra

What it is

Mastra is an open-source TypeScript framework — Apache-2.0 core, plus an Enterprise license that gates only platform-side modules — built around four primitives: a typed Agent (system prompt, model string, tools, memory), an XState-backed Workflow with .then() / .parallel() / .branch() composition and durable suspend/resume, a layered Memory (working memory, semantic recall, threads, resources) with pluggable storage adapters, and Tools that are Zod-typed function-callables, MCP servers, or other agents in disguise. It sits one layer above the Vercel AI SDK (which it uses for the actual model calls) and is positioned as the TypeScript answer to LangGraph: more opinionated than LangGraph, more batteries-included than the Vercel SDK alone, more provider-agnostic than the OpenAI Agents SDK.

Vendor health

Founded October 2024 by three Gatsby.js veterans — Sam Bhagwat, Abhi Aiyer, Shane Thomas — through Y Combinator W25. The funding stack is one of the strongest in the category: $13M YC-led seed in March 2025 (with Paul Graham, Tristan Handy, David Cramer, Guillermo Rauch, Amjad Masad, and Balaji Srinivasan on the cap table), then a $22M Series A led by Spark Capital in April 2026, for $35M total. The repo sits at 23.4k stars, 300+ contributors, ~1.8M monthly npm downloads at v1.0 in January 2026, with @mastra/core at v1.28 by late April 2026. The named production roster is the strongest piece of due diligence in the JS agent space: Replit (Agent 3 is built on Mastra), PayPal, Adobe, Sanity, SoftBank, Marsh McLennan (rolled out to 75k employees), Brex, Docker, WorkOS, MongoDB, Workday, Salesforce, Indeed. Bus factor is healthy — three founders, Gatsby alumni hiring back the band, ~24–36 months of runway at conservative burn. The open-issue count is high (211 issues, 190 open PRs), but they are recent and triaged.

The killer feature

XState-backed durable workflows with a clean Inngest graduation path. Mastra is the one TypeScript framework where suspend/resume is a first-class primitive in the playground, snapshots are persisted to whichever storage adapter you pick, and you can promote the same workflow to run as Inngest functions — gaining step memoization, automatic retries, and event-driven resume — without rewriting steps. Convex’s Ian Macartney spent a week reimplementing this and publicly regretted it; that is unusually credible third-party validation of a moat. Studio (the local UI booted with npx mastra studio) is a close second: trace replay, prompt versioning, interactive workflow suspend/resume, and a “non-engineers can edit prompts” workflow that LangGraph Studio still does not match.

Pricing

License: Apache-2.0 for the core, Mastra Enterprise License for some platform/cloud-only modules — the framework itself is genuinely usable self-hosted with no commercial gate. Mastra Cloud sells two metered products. Platform (Studio plus server hosting) starts at a $0 Starter tier with 100k observability events, 24h CPU uptime, and 10 GB egress; Teams is $250/team/month adding SSO and SOC 2 docs; Enterprise is custom. Add-ons that bite: $10 per 100k observability events, $0.00008/sec CPU, $10/GB egress, and $100/project for a 24/7 persistent server. Memory Gateway is a parallel product priced separately ($250/team/month for 1M memory tokens). Token economics: BYOK on Teams+ has no markup; routing LLM calls through Mastra’s gateway adds a 5.5% markup on top of provider rates.

For a “Real” Hestiia-shape workload (5 agents, 1k events/day, 2 devs), Cloud lands around $3,000–3,300/yr the moment you need SSO; self-hosted with Postgres + Langfuse + a small Inngest tier comes in at roughly the same $3,360/yr but with portability. At “Stretch” scale (20 agents, 10k events/day) the egress line dominates — 1.2 TB/yr × $10/GB after the free 100 GB pushes Cloud to $15–20k/yr all-in. Egress is the line you pre-negotiate; at Enterprise contracts it almost always gets re-cut.

Where it fits Hestiia

Viable, but third best. Mastra is the strongest TypeScript option on the market in 2026 and a real candidate if Hestiia were greenfield-TS. It is not the right call for CLAWD-SALES-AGENT specifically. The honest port estimate is 4–6 engineer-weeks for two engineers — prompts, sub-agent dispatch as the supervisor pattern, and the Pipedrive webhook → 4-step pipeline as a Mastra Workflow are all clean wins (~2–3 days each). The cost lives in two places: replacing claude -p with native Anthropic API calls (≈$15–25k/yr in tokens you do not pay today, plus reimplementing the Skill auto-discovery and Agent-tool subprocess model) and writing a Slack Block Kit interactive-button bridge that Mastra has no adapter for. The first cost is structural — you give up Team-plan economics. The second is real but bounded; the workflow suspend() primitive is actually the right shape for a “human approves in Slack” gate. For a Python-first shop with an Anthropic-native loyalty, none of this justifies fighting the stated stack preference.

Where it breaks at month 12

Two failure modes show up in the field. First, opinionation cost: when your shape does not fit the Agent + Workflow + Memory + Tool box, the fluent .then().branch() chain gets unwieldy — multiple HN reports of teams reaching for plain code or rules engines for the non-LLM portions of pipelines. Second, the breaking-change velocity is real: v1.0 in January 2026 renamed RuntimeContext to RequestContext, moved telemetry to a separate package, restructured imports, changed createTool’s signature, and disabled semantic recall by default. The disaster scenario is mid-2026 storage-adapter deprecation: workflows resume twice after a Node restart because of a tracker bug whose owner left, the OTel tracing pipeline starts dropping spans past 200 concurrent runs, and your firmware-update tool’s timeout interacts badly with suspend(). Migration off is 6–8 weeks because workflow definitions are tightly coupled to Mastra primitives.

Verdict

The right TypeScript agent framework in 2026 — wrong language for Hestiia today.


3.2 LangGraph + LangSmith

What it is

LangGraph is a Python and TypeScript orchestration framework for stateful, durably checkpointed LLM agents, MIT-licensed. The core abstraction is a StateGraph of typed channels (each with an optional reducer) and nodes (plain functions returning partial state updates), executed by a Pregel-style runtime that proceeds in super-steps with concurrent fan-out via Send, conditional routing via Command(goto=...), and durable pauses via interrupt(). Checkpointers (in-memory, SQLite, Postgres, Redis, MongoDB) snapshot state after every super-step, giving you crash-resume and time-travel. LangChain 1.0 (October 2025) is now built on top of LangGraph’s runtime — langchain.agents.create_agent is a thin wrapper over a StateGraph. LangSmith is the proprietary observability and prompt-management SaaS in the same family; LangGraph Platform (rebranded LangSmith Deployment) is the hosted runtime. The trace-and-eval bundle is the headline.

Vendor health

LangChain Inc., founded by Harrison Chase in 2022. Funding: ~$10M seed (Benchmark, 2023), $25M Series A at $200M led by Sequoia (Feb 2024), and a $125M Series B at $1.25B led by IVP in October 2025 — Sequoia, Benchmark, Amplify, CapitalG, Sapphire, plus strategic checks from ServiceNow, Workday, Cisco, Datadog, and Databricks. ~$160M total raised, ~$16M ARR, ~165–260 staff trending up, ~1k paying customers per Latka. The repo is 30.6k stars, MIT, ~270 open issues; PyPI is at ~43M monthly downloads for langgraph alone. LangGraph 1.0 shipped October 2025 with a hard “no breaking changes until 2.0” commitment; latest is 1.1.10 in late April 2026. Named production adopters with public architecture talks: Klarna (85M users, 2.5M conversations, ~700-FTE-equivalent), LinkedIn (Hiring Assistant, hierarchical supervisor saving 4h/role and cutting profile reviews 62%), Uber (developer-platform agents for unit-test generation in code migrations), Replit, Elastic, AppFolio. JP Morgan, BlackRock, Cisco listed as enterprise users. ~35% of Fortune 500 reportedly touch some LangChain product. Bus-factor risk is low — Harrison plus a deep team, plus strategic-investor distribution.

The killer feature

interrupt() plus Postgres checkpointer plus Send-based reducer fan-out — three primitives no other framework matches together. Pause an agent for a 3-day human approval mid-tool-call, kill the worker, resume on a fresh container; do parallel research over 50 leads with reducers merging the results into a single ranked channel; rewind to any checkpoint, fork a thread, re-run with modified state. PydanticAI has no built-in checkpointer; Mastra’s suspend/resume is younger and TS-only; the Anthropic stack offers Managed Agents but no time-travel. For a sales agent where humans are reviewing in Slack, durable interrupt is the single feature that changes the architecture.
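A compressed sketch of the durable-interrupt pattern (node and state names are illustrative; the interrupt/Command/checkpointer calls follow the documented LangGraph API, with MemorySaver standing in for the Postgres checkpointer you would run in production):

```python
# approval_gate.py - durable human-in-the-loop with langgraph interrupt().
from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver  # PostgresSaver in prod
from langgraph.types import interrupt, Command

class State(TypedDict):
    draft: str
    approved: bool

def draft_reply(state: State) -> dict:
    return {"draft": "Proposed follow-up email ..."}  # an LLM call in real life

def human_gate(state: State) -> dict:
    # Pauses here: the worker can die, days can pass; resume restores the frame.
    decision = interrupt({"draft": state["draft"], "question": "Approve?"})
    return {"approved": bool(decision)}

builder = StateGraph(State)
builder.add_node("draft_reply", draft_reply)
builder.add_node("human_gate", human_gate)
builder.add_edge(START, "draft_reply")
builder.add_edge("draft_reply", "human_gate")
builder.add_edge("human_gate", END)
graph = builder.compile(checkpointer=MemorySaver())

cfg = {"configurable": {"thread_id": "deal-4217"}}   # one thread per deal
graph.invoke({"draft": "", "approved": False}, cfg)  # runs until interrupt()
# ...three days later, the Slack button resolves:
graph.invoke(Command(resume=True), cfg)              # resumes mid-node
```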

Pricing

LangGraph OSS is MIT, free forever, no usage caps. LangSmith Cloud bills per seat plus per-trace: Developer is $0/seat with one seat and 5k base traces/mo; Plus is $39/seat/mo unlimited seats with 10k included traces/mo and one free dev-sized deployment; Enterprise is custom (peer reports of $2k–$5k/mo floors). Trace overage: $2.50 per 1k base traces (14-day retention) or $5/1k extended (400-day). LangGraph Platform on top adds $0.001 per node executed, $0.0007/min dev standby ($30/mo always-on), $0.0036/min prod standby ($155/mo), $0.005/run beyond included, and forces Enterprise past ~10 users.

The disaster math is the trace volume curve: 5 agents × 1,000 events/day × ~6 traces per event = 900k traces/mo, $39×2 seats + 890k overage at $2.50/k = ~$2,300/mo or ~$28k/yr — but this estimate is sensitive to how a “trace” is counted (one user turn, one full agent run, one span) and to whether the included-trace base scales per-seat or in aggregate. Self-host LangGraph OSS with Langfuse self-hosted alongside hits ~$3–8k/yr at Real and ~$15–20k/yr at Stretch. The Cloud option is cheaper at Tiny, comparable at Real, and demands hard negotiation at Stretch. Load-test for a week before signing — the headline $39/seat number is a fraction of the realistic monthly bill for any serious workload, and Langfuse self-host at $1,200/year of infrastructure remains the OSS-equivalent escape hatch. The Plus tier’s standby-cost-even-when-idle ($155/mo for a continuously-on production deployment) is the line that bites people who did not read the pricing page carefully.

Where it fits Hestiia

The conservative number-two — pick this if your buyers are CTOs at Fortune 500s and you want named-customer cover (Klarna, LinkedIn) when defending architecture. The Python primary fit is correct. The CLAWD-SALES-AGENT port shape is clean: four nodes with linear edges, a conditional edge after the orchestrator for skill dispatch, sub-skills as Send-fanned-out worker nodes returning via Command(goto="orchestrator"), the SQLite insight DB kept outside the checkpointer as system-of-record, the Pipedrive webhook pointing at graph.ainvoke({...}, config={"thread_id": deal_id}), and interrupt() as the Slack human-approval gate (a clean upgrade over the current hand-rolled debounce/anti-loop). Effort: 1–1.5 weeks skeleton port, 1–2 weeks reimplementing Claude Code skills as either system-prompt fragments or MCP tools (this is the unknown), 0.5–1 week test migration, ~4–6 engineer-weeks total for a senior. The friction point specific to Hestiia is the same as Mastra’s: claude -p Team-plan economics evaporate; you move Sonnet/Haiku spend onto API-key billing. That is a procurement decision, not a technical one.

Where it breaks at month 12

The bill, the abstraction leak, and the ecosystem reputation tax. Real failure pattern: priced LangSmith at $2k/mo assuming “seats plus traces,” reached month 12 at $14k/mo because every agent run generates 40+ spans and 50k runs/day blew through the ingestion tier; self-host requires the Enterprise contract starting at ~$80k/yr. The abstraction leak: three weeks fighting BaseMessage serialization issues persisting state to Postgres because LangChain’s message types have custom serialization that breaks on new content blocks Anthropic shipped before langchain-anthropic caught up — last gap was 6 weeks. Stack-trace depth is 15–40 frames of Pregel internals at 2am. The Octomind “we removed LangChain” piece and the AWS “Why LangChain Apps Break in Production” thinkpiece (February 2026) mean half your senior hires arrive with a prior. Migration off is 8–12 weeks because LangGraph is sticky.

Verdict

The enterprise-credible pick if your problem genuinely needs durable interrupt and time-travel — overkill if your pipeline is four linear steps and a Slack post.


3.3 The Anthropic Stack

What it is

Four products that compose into a runtime: the Claude Agent SDK (Python and TypeScript at near feature parity, GA, current Python is 0.1.69 as of late April 2026), Claude Code CLI / claude -p (mature, headless mode is the SDK’s twin), Agent Skills (the SKILL.md open standard at agentskills.io, filesystem-loaded in CLI/SDK and uploaded over API on the platform), and Claude Managed Agents (public beta, launched April 8, 2026, header managed-agents-2026-04-01 required, no GA pricing committed). Sub-agent dispatch is via the Agent tool consuming AgentDefinition specs; in-process MCP via @tool and create_sdk_mcp_server removes subprocess overhead; HookMatcher provides deterministic policy enforcement (PreToolUse, PostToolUse, Stop, SessionStart). Auth via ANTHROPIC_API_KEY, Bedrock, Vertex, or Microsoft Foundry. The SDK explicitly forbids using Pro/Team rate limits for shipped products — production needs API-key billing.

Vendor health

Anthropic itself is not the bus-factor question — the funding rounds and model trajectory are public. The product-level question is whether these specific products survive at twelve months. Agent SDK and Skills are GA with multi-quarter track records (Claude Code shipped 2025, the SDK is the same loop). Managed Agents is beta and launched only three weeks before this writing — no SLA, no ZDR eligibility, named alpha customers (Notion, Asana, Rakuten, Sentry, Atlassian, Vibecode, General Legal, Blockit) are real but small in number, and Anthropic has previously deprecated APIs (the original Claude completions endpoint) on roughly 12-month notice. Treat SDK + Skills + MCP as bedrock; treat Managed Agents as aspirational.

The killer feature

Dev/prod artifact symmetry. The same SKILL.md files, the same .claude/agents/*.md sub-agent definitions, the same .mcp.json, the same hooks — they run identically under claude -p in your terminal, in CI, and in a Python service using the SDK. No other framework gets this right. Pair that with in-process MCP via @tool (your Pipedrive wrapper as a 50-line decorated Python function instead of a stdio subprocess) and the Hooks model for deterministic guardrails (the right place to enforce “never let the agent edit prod.yaml”), and the Anthropic stack is uniquely well-shaped for a Claude-loyal Python shop.
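A sketch of that in-process shape. The Pipedrive body is a stand-in; the tool, create_sdk_mcp_server, and ClaudeAgentOptions names follow the claude-agent-sdk docs as of this writing, but verify against your pinned version:

```python
# pipedrive_tool.py - in-process MCP with the Claude Agent SDK (Python).
import anyio
from claude_agent_sdk import (
    ClaudeAgentOptions, create_sdk_mcp_server, query, tool,
)

@tool("get_deal", "Fetch a Pipedrive deal by id", {"deal_id": int})
async def get_deal(args: dict) -> dict:
    deal = {"id": args["deal_id"], "stage": "negotiation"}  # stand-in REST call
    return {"content": [{"type": "text", "text": str(deal)}]}

pipedrive = create_sdk_mcp_server(name="pipedrive", tools=[get_deal])

options = ClaudeAgentOptions(
    mcp_servers={"pipedrive": pipedrive},     # in-process: no subprocess
    allowed_tools=["mcp__pipedrive__get_deal"],
    setting_sources=["project", "user"],      # loads .claude/skills and .mcp.json verbatim
)

async def main() -> None:
    async for message in query(prompt="Summarise deal 4217", options=options):
        print(message)

anyio.run(main)
```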

Pricing

Three pricing surfaces stack. API tokens (per million, April 2026): Opus 4.7 $5 / $25, Sonnet 4.6 $3 / $15, Haiku 4.5 $1 / $5. Cache reads at 10% of input; 1h cache write breaks even after two reads, 5m after one. Batch API is 50% off (not available with Managed Agents). Web search is $10 per 1,000. Team plan seats: $20/mo annual Standard, $100/mo annual Premium, the latter with ~6.25× Pro session usage and weekly caps (all-models and Sonnet-only). Critically, Anthropic does not publish absolute token counts; limits are session-based. The implicit “$0/event for prompt-cached repeated runs” property of claude -p on Team plan exists because billing is by session quotas, not tokens — but it breaks the moment you exceed caps or migrate to API-key billing, which the SDK requires for production. Managed Agents: standard token rates plus $0.08 per session-hour, metered to the millisecond, accruing only while running. Anthropic’s worked example: 1 hr Opus session, 50k in / 15k out = $0.705 ($0.525 with 80% cache). The session-hour line is the one to model — negligible at small scale, dominates inference cost past ~5k events/day.

For Hestiia at “Real” volumes (5 agents × 1k events/day, 70% Haiku / 30% Sonnet, 70% cache hit), API-key billing comes in around $15k/yr in tokens; Team-plan billing inside caps is roughly $6k/yr for seats with overage; Managed Agents adds another ~$7k/yr in session hours on top. The honest billing-only delta of switching from claude -p (Team) to Agent SDK (API) is break-even to +$4k/yr at current volume — and you gain typed I/O, in-process tools, structured outputs without --json-schema argv gymnastics, and a clean path past Team-plan caps.

Where it fits Hestiia

Hestiia is accidentally on the golden path. The current shape — claude -p subprocess + Skills + Agent tool + Pipedrive MCP + Team-plan billing — is exactly what Anthropic is selling. The recommended migration is Agent SDK (Python) on API-key billing, keeping Skills and MCP as-is, replacing the Pipedrive subprocess MCP with an in-process @tool wrapper (50 lines, zero subprocess overhead, your own rate-limit handling), and deferring Managed Agents to 2027 or whenever a multi-hour use case materializes. Estimate ~1.5–2 engineer-weeks. The SDK loads .claude/skills/*/SKILL.md and .mcp.json verbatim if you set setting_sources=["project", "user"]. Sub-agent AgentDefinition is a one-line conversion from .claude/agents/*.md. HookMatcher replaces the current bash-hook discipline. Keep Team-plan seats for engineers’ interactive Claude Code; move CLAWD-SALES-AGENT production to API-key (Team plan ToS forbids using its quotas to power external products anyway, so production-on-Team was always quietly out-of-bounds).

The two things to verify in your test harness: that the model: SKILL.md frontmatter actually wins over the --model CLI flag for that turn (log the system/init model field per turn under --output-format stream-json), and that the Pipedrive MCP path you keep — Composio, the open-source Wirasm/pipedrive-mcp, or an in-process wrapper — is one you control the rate-limit handling on.
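The first of those checks fits in a ten-line harness. This sketch assumes the CLI's NDJSON stream shape (one JSON object per line, a system/init event carrying the resolved model, as described above) and that stream-json output in print mode requires the verbose flag; confirm both against your CLI version:

```python
# which_model.py - log the model the CLI actually resolved for a turn.
import json
import subprocess

proc = subprocess.run(
    ["claude", "-p", "ping", "--output-format", "stream-json", "--verbose"],
    capture_output=True, text=True, check=True,
)
for line in proc.stdout.splitlines():
    event = json.loads(line)
    if event.get("type") == "system" and event.get("subtype") == "init":
        print("resolved model:", event.get("model"))
```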

Where it breaks at month 12

Two failure modes. The pricing-surprise variant: Managed Agents bills both inference and orchestration time, so an agent idle for a tool response is still on the meter; month-eight bill comes in 2.3× the prototype’s projection. The lock-in variant: a competitor model is 40% cheaper for tool-routing in October but you cannot swap without rewriting because the Skills, hooks, and sub-agent semantics are Anthropic-shaped. The acute variant is a region-wide Anthropic outage with no documented fallback because the SDK assumed hosted execution. Honest mitigations: pin the model string explicitly (do not use claude-sonnet-latest), keep the Skills format clean (it is portable to Cursor and VS Code Copilot), set up a Bedrock fallback config flag in advance. SDK + Skills + MCP is ~70% portable; Managed Agents is ~10%.

Verdict

The right answer for Hestiia today: SDK on API keys, Skills and MCP unchanged, Managed Agents not yet.


3.4 OpenAI Agents SDK

What it is

OpenAI’s lightweight Python (and TypeScript) framework, MIT-licensed, free, GA since March 11, 2025. Six primitives: Agents, Handoffs, Sessions (with optional Redis), Tools (function-tools, MCP, hosted), Guardrails (input/output/tool, parallel to the agent loop), and Tracing (built-in, free, sent to OpenAI’s traces dashboard). Sits on top of the Responses API by default (the Assistants API successor — Assistants sunsets August 26, 2026). April 2026 added a harness system shared with Codex plus Sandbox Agents (beta, persistent isolated workspaces with manifests, snapshots, resume). Latest is v0.14.7. ~25.5k stars Python, 3.9k forks. Multi-provider support (Anthropic, Llama, Cohere) is via LitellmModel or AnyLLM, both labeled “best-effort, beta” in official docs.
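For calibration, the primitives in a dozen lines of standard openai-agents usage (the tool body is an illustrative stub):

```python
# triage.py - the OpenAI Agents SDK shape: Agent, function_tool, Runner.
from agents import Agent, Runner, function_tool

@function_tool
def lookup_order(order_id: str) -> str:
    """Return order status (illustrative stub)."""
    return f"order {order_id}: shipped"

support = Agent(
    name="Support",
    instructions="Answer order questions using the tools.",
    tools=[lookup_order],
)

result = Runner.run_sync(support, "Where is order 1042?")
print(result.final_output)
```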

Vendor health

OpenAI directly maintains the SDK; bus factor is whatever you assess for OpenAI itself. Production references: Coinbase shipped launch-day support, Stripe’s Agentic Commerce Protocol (powering ChatGPT Instant Checkout to 900M weekly users) builds on it, the April 2026 harness is what powers Codex internally. The rough edges are all about which provider you are on: every flagship feature lands OpenAI-first, the Anthropic adapter is officially second-class, and breaking-ish changes still ship monthly.

The killer feature

Hosted tools — web search, file search, code interpreter, computer use — that run on OpenAI infrastructure with no servers to manage and no scraping fragility, plus the Tracing dashboard which is free, included, zero-setup, and genuinely best-in-class for OpenAI customers. The voice-agent stack (gpt-realtime-1.5) is the strongest in the market, and Sandbox Agents is the direct competitor to Anthropic’s “give the agent a computer” model.

Pricing

SDK and tracing dashboard: $0. Models (per 1M tokens, April 2026): GPT-5.5 $5 / $30, GPT-5.4 $2.50 / $15, GPT-5 $0.625 / $5, Pro variants 6× the flagship rate, Batch API 50% off. Hosted tool fees stack separately: web search $10/1k calls, file search $2.50/1k calls plus $0.10/GB/day storage past the first GB, code interpreter $0.03/container session, computer use $3 / $12 per million. Anthropic via LiteLLM works at standard Anthropic rates but with degraded prompt-caching effectiveness (the SDK’s prompt structure is not optimized for Anthropic cache breakpoints — expect 15–30% more spend than native Claude Agent SDK at scale).

Where it fits Hestiia

No. Three reasons. First, the killer features evaporate without OpenAI models — hosted search, computer use, structured outputs, tracing-tied-to-evals are all OpenAI-gated. Adopting the SDK for the bare primitives (Agents/Handoffs/Sessions/Guardrails) discards 60% of what is differentiated. Second, the Anthropic path is officially beta — running Hestiia production on the explicitly-non-supported path of a competitor’s SDK is malpractice. Third, the lock-in pressure ratchets: each new feature ships OpenAI-native first, the gap widens monthly. Worth a 2-day spike only if Hestiia is contemplating multi-provider posture and wants to benchmark voice agents against gpt-realtime — otherwise skip.

Where it breaks at month 12

The hosted-tool lock-in surprise. Your IP — diagnostic playbooks, fault trees, technician transcripts — quietly accumulates inside OpenAI’s file_search vector store because that was the easy default in week three. In September your Anthropic-curious engineer benchmarks Sonnet against GPT for the diagnostic-reasoning task; Sonnet wins by a real margin. You can swap models in the SDK, but the moment you do, you lose file_search, the hosted code interpreter, and the Realtime API integration. At ~500 concurrent runs you hit undocumented org-level rate limits; hosted-tool latency p99 spikes during an OpenAI incident; file_search over a 10 GB store costs $1.5k/mo you did not budget. Migration off is 6–10 weeks.

Verdict

Excellent SDK; adopt only if you are an OpenAI shop, which Hestiia is not.


3.5 Inngest + AgentKit

What it is

Inngest is an event-driven durable-execution platform with full-feature SDKs in TypeScript and Python (plus Go and Kotlin); functions are step.run-checkpointed sequences with declarative debounce-with-key-and-timeout, idempotency keys, concurrency keys, fan-out, batching, throttling, rate-limiting, cron, and step.waitForEvent for human-in-the-loop. Inngest server is source-available under SSPL with delayed Apache-2.0; SDKs are pure Apache. AgentKit (@inngest/agent-kit, Apache-2.0) is a separate TypeScript-only library on top — typed agents, MCP tool-calling, networks of agents with shared NetworkState, and deterministic state-based routers. AgentKit agents naturally compose into Inngest functions, so each agent run inherits durability and traces.

Vendor health

Founded 2021 by Tony Holdstock-Brown (CEO) and Dan Farrelly (CTO), both ex-Buffer. Funding: $3M seed (2023, Notable lead), $6.1M extension (a16z, Jan 2024), and $21M Series A in September 2025 led by Altimeter with a16z, Notable, Afore, and Guillermo Rauch. ~$30M raised, ~20 staff. Inngest server at 5.3k stars; AgentKit at ~850, current v0.13.2 (November 2025) — slower release cadence than the core. Public scale claim: 100k+ executions/sec, billions of workflows/month. Named adopters: SoundCloud, Replit, Cohere, Resend, ElevenLabs, TripAdvisor, GitBook.

The killer feature

Debounce-with-key-and-timeout, plus concurrency keys. Pipedrive sends 12 webhooks for one deal in 30 seconds; Inngest merges them and runs the pipeline once with the latest payload — 4 lines of declarative config replacing whatever hand-rolled queue logic Hestiia has today. concurrency: { key: "event.data.deal_id", limit: 1 } guarantees only one pipeline per deal runs at a time, which is the anti-loop property in primitive form. Step memoization on retries means a flaky 529 in step 3 of 4 does not re-burn the first 3 (and their token cost). For Hestiia’s actual pain points, this is the cleanest match in the entire durable-execution space.

Pricing

Hobby is $0/mo with 50k executions, 100k events, 5 concurrent steps, 24h trace retention. Pro is $75/mo base with 1M executions included (extendable to 20M), 5M events, 100+ concurrent steps, 7d trace retention. Overage tiers: $0.000050/exec from 1M–5M, dropping to $0.000015 above 50M. Enterprise (90-day retention, SAML, RBAC, audit trails, SLA) is custom — peer reports of $2k–$5k/mo+. The billing nuance that ambushes people: an “execution” is a single durable function run or step execution, so a 4-step CLAWD pipeline is 4 executions per webhook, not 1. Self-host (SSPL) is real — single binary, official Helm chart, Postgres + Redis dependencies — plan ~$150–400/mo on AWS-equivalent infra plus 0.05–0.15 FTE SRE; official line is no direct support without a paid contract.

For Hestiia at “Real” volume (5 agents × 1k events/day × 4 steps × 1.05 retry multiplier ≈ 630k executions/mo), Cloud Pro at $75/mo = $900/yr comfortably inside the 1M cap. Self-host at the same scale comes in around $7k/yr including SRE time. Cloud wins at every scale up to Stretch.

Where it fits Hestiia

Yes for the durability layer; no for AgentKit. Adopt the Inngest Python SDK to wrap the existing 4-step CLAWD pipeline — @inngest_client.create_function with debounce={"key": "event.data.deal_id", "period": "60s", "timeout": "5m"} and a per-deal concurrency limit, each step as a step.run for free retries and trace visibility, Pipedrive webhooks publishing via inngest.send(). The 668 Python tests survive — step.run accepts plain async functions, mocking is unchanged. Estimate one engineer-week for the migration plus one week to delete the hand-rolled debounce and queue code. AgentKit itself is a no — it is TypeScript-only, no Python port on the public roadmap, and its value proposition (typed networks, shared state, deterministic routing) is solvable in Python with a 200-line state class. Skip until a Python AgentKit ships; revisit in six months.
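A sketch of that wrapping, with stub step bodies. The fragment above uses dict-style kwargs; recent inngest-py versions use typed config objects as shown here, and the SDK has moved before, so check your pinned version (a debounce timeout cap is also supported per the Inngest docs):

```python
# clawd_pipeline.py - the 4-step pipeline under the Inngest Python SDK.
import datetime

import inngest

client = inngest.Inngest(app_id="clawd-sales-agent")

# Stub step bodies: stand-ins for the real pipeline stages.
def enrich(deal_id: int) -> dict: return {"deal_id": deal_id}
def draft_reply(insights: dict) -> str: return "draft..."
def request_review(draft: str) -> str: return "approved"
def post_to_slack(review: str) -> None: print(review)

@client.create_function(
    fn_id="deal-pipeline",
    trigger=inngest.TriggerEvent(event="pipedrive/deal.updated"),
    # 12 webhooks in 30 seconds collapse into one run with the latest payload:
    debounce=inngest.Debounce(
        key="event.data.deal_id", period=datetime.timedelta(seconds=60)
    ),
    # the anti-loop property: at most one pipeline per deal at a time
    concurrency=[inngest.Concurrency(key="event.data.deal_id", limit=1)],
)
async def deal_pipeline(ctx: inngest.Context, step: inngest.Step) -> None:
    deal_id = ctx.event.data["deal_id"]
    # Each step is checkpointed: a flaky 529 in step 3 re-runs step 3 only.
    insights = await step.run("enrich", lambda: enrich(deal_id))
    draft = await step.run("draft", lambda: draft_reply(insights))
    review = await step.run("review", lambda: request_review(draft))
    await step.run("post", lambda: post_to_slack(review))
    # Served from FastAPI via inngest.fast_api.serve(app, client, [deal_pipeline]).
```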

Where it breaks at month 12

The execution-count math. 50k runs/day × ~12 step-executions = 18M monthly, past Pro’s reasonable use band, with Enterprise quotes starting around $4k/mo. Trace retention beyond 7 days requires Enterprise; SAML, RBAC, premium support, all Enterprise. The deeper risk specific to AgentKit (which Hestiia would not be running): one maintainer, then zero, with the OSS code intact but PRs rotting; teams become de-facto maintainers of the Anthropic adapter for an abstraction (networks of agents with handoffs) that did not match the actual workload of “one agent doing tool calls.” Inngest the platform itself is solid; AgentKit-the-side-bet is the failure mode.

Verdict

Adopt the Python SDK as the durable layer under whatever agent runtime you pick; ignore AgentKit until it speaks Python.


3.6 Trigger.dev

What it is

A TypeScript-first durable-execution platform (Apache-2.0 across the board) that runs your compute on its own machines, using CRIU process-snapshot/restore so a task can wait for hours without burning compute. Tasks, subtasks, batches, retries, schedules; Waitpoints for human-in-the-loop and HTTP-callback resume; Realtime with useRealtimeRun React hooks for streaming LLM tokens to the frontend; AI primitives via ai.tool(myTask) converting a schemaTask into a Vercel AI SDK-compatible tool. v4 GA in August 2025 with a new Run Engine (warm starts 100–300 ms). Python is not a first-class SDK — @trigger.dev/python is a build extension that lets a TS task shell out to a Python script.

Vendor health

Founded by Eric Allam (CTO) and Matt Aitken (CEO), out of Y Combinator. $16M Series A in December 2025 led by Standard Capital, with YC, Liquid 2, Wayfinder, Pioneer, Rebel, plus Michael Grinich and CTO Fund. ~$19M total raised, 15–25 staff, 14.7k GitHub stars, 7,100+ commits, 616 releases, v4.4.4 in April 2026 — very active weekly cadence. Customer references at meaningful scale: Midday (bank-sync for 11,500+ customers), Papermark (~6k docs/month). Marketing claim of 30,000+ developers.

The killer feature

Long-running compute with no step boundaries. You write a normal async function that runs for hours; the platform handles it, snapshots state when it waits, restores into a fresh container when the wait resolves. Inngest forces you to slice work into steps inside serverless time limits; Trigger runs whole functions on its own compute. Combined with the React realtime hooks for streaming agent progress to a UI, the demo is genuinely best-in-class for AI agent products with a frontend. For Hestiia specifically — no UI on CLAWD-SALES-AGENT, no need for hours-long agent runs — the killer features are wasted.

Pricing

Free at $0/mo with $5 monthly credit and 20 concurrent runs. Hobby $10/mo ($10 included, 50 concurrent), Pro $50/mo ($50 included, 200+ concurrent runs at $10 per +50, 30-day retention, $20/seat past 25). Compute meters per second by machine size — Small1x at $0.0000338/s, Medium1x at $0.000085/s, Large2x at $0.00068/s — plus $0.000025/run invocation fee. Self-host is free under Apache-2.0 with Docker Compose and Helm, but warm starts, autoscaling, and CRIU checkpoints are Cloud-only — self-host is a downgrade. For CLAWD-shape workloads (Small1x, ~30s avg), Tiny is ~$0/yr (free tier), Real ~$120/yr (Hobby), Stretch ~$720–840/yr (Pro). Same order of magnitude as Inngest Pro at the same scale.

Where it fits Hestiia

No. The language fit is wrong. Hestiia’s stated stack ranks Python > TS > Rust, the prototype is FastAPI, and Trigger’s Python story is a script-runner, not an SDK — you write a TS task that shells out to Python, losing typed payloads, native realtime streams, and proper observability. That is a permanent ergonomic tax on every future agent, paid for the privilege of features (CRIU long-running, realtime UI streams) that CLAWD does not need. The honest framing: evaluate Trigger as the workflow layer for the cloud TS stack (myeko-app, myeko-admin), not for CLAWD. Full Python SDK is on the public roadmap (featurebase) but not shipped — revisit if it lands.

Where it breaks at month 12

You have reinvented half of LangGraph in your own task code: state machine for the agent loop, custom retry for tool calls, your own eval harness — because Trigger is not an agent framework, it is a background-jobs platform you bent into shape. v4-to-v5 (or v3-to-v4) deprecation timer eats three engineer-weeks of refactoring across task definitions; long-tail agent runs at 90 seconds of extended reasoning dominate the bill on per-second compute pricing; a worker dies mid-checkpoint and dead-worker detection takes 15 minutes to react. Migration off is 6–8 weeks.

Verdict

Best-in-class TS durable execution; wrong language axis for Hestiia.


3.7 PydanticAI + Logfire

What it is

PydanticAI is a Python-first agent framework, MIT-licensed, from the Pydantic team. An Agent is a typed object with three things: a system prompt, a set of @agent.tool-decorated Python functions (Pydantic auto-generates schemas from type hints), and an output_type Pydantic model. Dependencies are injected via a typed deps_type and surfaced inside tools through RunContext[Deps] — the FastAPI pattern. Multi-agent comes in two shapes: delegation (agent A calls agent B from inside a tool, ctx.usage forwarded for unified token accounting) and handoff (an output_function transfers control). PydanticAI explicitly does not provide durability — it delegates to four officially-supported partners: Temporal, DBOS, Prefect, Restate. Inngest is not yet officially supported (open issue #3180). Logfire is the paired observability product: OTel-native (spans + logs + metrics, Apache Parquet + DataFusion under the hood), Pydantic-aware rendering, one-line instrumentation via logfire.instrument_pydantic_ai().

Vendor health

Built by Samuel Colvin and the Pydantic team — Pydantic the library underlies the OpenAI SDK, Anthropic SDK, and Google ADK, with ~300M monthly downloads. PydanticAI hit v1.0 in September 2025 (semver stability commitment, no breaking changes until v2), v1.87 by April 2026, ~16.7k stars. Pydantic Inc. raised $17M from Sequoia for the full stack. The soft signal of risk: few public reference logos. Mixam (printing platform via Vstorm) is the most-cited public case. Trust in Pydantic-the-library mostly carries over to PydanticAI — but the agent-framework space moves faster than data validation, and mindshare is well behind LangGraph.

The killer feature

Type safety end-to-end. Agent[Deps, Output] is generic. Your IDE knows the deps type inside every tool and the output type at every callsite. Combined with structured-output validation that auto-retries on ValidationError (the validation message fed back to the model for correction), you stop hand-rolling the JSON-fix-up loop. No other major Python framework gets this right.
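A minimal sketch of that typing story, following the documented v1 API (the model string and tool body are illustrative):

```python
# qualify.py - end-to-end typing with PydanticAI.
from dataclasses import dataclass

from pydantic import BaseModel
from pydantic_ai import Agent, RunContext

@dataclass
class Deps:
    crm_base_url: str  # injected; visible to every tool via RunContext

class Qualification(BaseModel):
    score: int   # model output is validated against this schema; a
    reason: str  # ValidationError is fed back to the model for auto-retry

agent = Agent(
    "anthropic:claude-sonnet-4-5",  # illustrative model string; pin your own
    deps_type=Deps,
    output_type=Qualification,
    system_prompt="Qualify the inbound lead.",
)

@agent.tool
def fetch_deal(ctx: RunContext[Deps], deal_id: int) -> str:
    """Look up a deal (stub; ctx.deps is fully typed in the IDE)."""
    return f"{ctx.deps.crm_base_url}/deals/{deal_id}: stage=negotiation"

result = agent.run_sync(
    "Qualify deal 4217", deps=Deps(crm_base_url="https://api.example")
)
print(result.output.score, result.output.reason)  # typed as Qualification
```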

Pricing

PydanticAI is MIT, free. Logfire (effective January 2026): Personal free with 10M spans/mo, 30-day retention, hard stop (no overage). Team $49/mo, 5 seats with extras at $25, 5 projects, 10M spans, $2/M overage. Growth $249/mo, unlimited seats and projects, 10M included spans, $2/M overage, up to 90-day retention. Enterprise custom with self-host (Helm), SSO, HIPAA BAA, custom retention. The flat $2/M-spans pricing with no payload-size or “billable unit” multipliers is structurally cleaner than LangSmith’s per-seat-and-per-trace stacking, Langfuse’s opaque “units,” or Arize’s per-GB. Pydantic’s own published comparison shows Logfire $129 / Arize $999 / Langfuse $3,451 / LangSmith $5,170 at 5 users / 50M spans — vendor-published, but the structural argument holds.

For Hestiia: Tiny is $0, Real is $588/yr (Team plus a small overage allowance), Stretch is $780/yr (Team with 8M overage at $2/M) up to $2,988/yr if you need Growth’s 90-day retention.

Where it fits Hestiia

The cleanest fit. CLAWD-SALES-AGENT (Pipedrive webhook → 4 steps → Slack) is a textbook PydanticAI shape: each step a small Agent with a Pydantic output schema, the pipeline is plain async Python, FastAPI receives the webhook unchanged, Logfire instruments the whole thing in two lines (logfire.instrument_fastapi() and logfire.instrument_pydantic_ai()). Estimated port effort: 2–4 days for one engineer. Day 1 replaces bare LLM calls with PydanticAI agents and defines output models; Day 2 wires Logfire and migrates logs; Day 3 adds the durability layer (DBOS is the most native choice given first-class support and Postgres-only dependency, but Inngest Python SDK works through the open path); Day 4 is tests and cutover. The lock-in profile is the lowest in this section: PydanticAI is plain Python, Logfire is OTel — you can repoint exporters at Honeycomb, Grafana Tempo, or Datadog without code changes. The UI is the lock-in, not the data path.

Where it breaks at month 12

PydanticAI is a library, not a platform — that is the disaster, and it is a soft one. The team loves the framework but has now built the platform around it: Postgres for state, an Inngest layer for async, custom retry policy, a bespoke distributed system that works but is bespoke. When you need agent-to-agent handoffs, multi-day workflow with branching, or graph-shaped control flow, you build it. The Logfire dependency is a startup dependency — Pydantic raised a Series A and monetization pressure will eventually shrink the free tier. The 3am page is “the agent run completed but the structured-output validation failed because the model emitted a number where a string was expected” — and the fix is in your code. Migration cost is 4–6 weeks for PydanticAI itself; the surrounding orchestration is what costs more.

Verdict

The “boring Python” pick — best fit for Hestiia if Anthropic-stack lock-in worries you more than building a little of your own platform.


3.8 The Hyperscaler Stack

What it is

Three products from three clouds, plus OpenAI’s Responses API as a fourth-place pseudo-hyperscaler. AWS Bedrock AgentCore + Strands is the most coherent: Strands is an Apache-2.0 Python and TypeScript SDK (~14M downloads, model-agnostic, MCP-native, Bedrock or Anthropic-direct or anything via LiteLLM); AgentCore is the decomposable infrastructure beneath it (Runtime, Memory, Gateway, Identity, Browser, Code Interpreter, Observability, Evaluations, Policy). The pattern AWS is selling is “write in Strands, deploy on AgentCore.” Azure AI Foundry Agent Service (rebranded Microsoft Foundry at Ignite 2025/2026) is the consolidated Microsoft stack: Foundry Agent Service for the runtime, Foundry Models for the catalog (Azure OpenAI plus 1,800+ partner models), Foundry Tools for the metered tool catalog (Bing grounding, Logic Apps, Azure AI Search), Foundry IQ for connectors. Google Vertex AI Agent Builder / Agentspace (rebranded Gemini Enterprise Agent Platform at Cloud Next 2026): ADK is the open SDK, Agent Engine is the managed runtime with vCPU+RAM billing, Vertex AI Search provides RAG, Memory Bank gives long-term memory.

Vendor health

None of the three clouds is going anywhere. Specific products, however, do — Amazon Q for Builders has shifted scope twice, Bedrock Agents was relaunched in 2024 with breaking changes, and Azure agent products have been rebranded twice. Strands itself is the safest of the three SDKs because it is Apache-licensed and runnable anywhere; ADK is similarly portable; Foundry Agents is the most cloud-locked.

The killer feature

Per platform: AgentCore Runtime’s I/O-wait-free billing (you do not pay for compute while a tool call is in flight) plus AWS-native IAM, VPC, and observability inheritance for procurement-bottlenecked shops. Foundry’s Bing grounding (the only first-party Bing search-with-citations any agent can call from a single API). Vertex’s free tier on Agent Engine (50 vCPU-hours and 100 GiB-hours/month) plus the cheapest flagship model in market (Gemini 2.5 Pro at $1.25 / $10).

Pricing

AgentCore: Runtime $0.0895/vCPU-hour + $0.00945/GB-hour (per second, 1s minimum, I/O wait free); Gateway $0.005/1k invocations and $0.025/1k searches; Memory $0.25/1k events short-term, $0.75/1k records/month long-term; Identity $0.010/1k token requests; Evaluations $0.0024 / $0.012 per 1k tokens. Bedrock Sonnet 4.6 tokens are $3/$15 — same as direct from Anthropic. One cache-read caveat: older comparisons show $0.30/M on Anthropic versus $0.60/M on Bedrock for the legacy 3.5 family, but Sonnet 4.6 cache reads on Bedrock match Anthropic’s $0.30/M as of Q1 2026 — verify the current AWS pricing page before quoting “Bedrock cache is more expensive” in any stakeholder doc, because as of this writing it is not.

Foundry: “no additional charge to use Foundry Agent Service” — you pay the underlying meters only. Tokens (GPT-5.5 $5/$30, o3-deep-research $10/$40), Bing grounding $14 per 1,000 transactions (this is the line that bites web-search-heavy agents fast), Azure AI Search standard tiers $75–$1,000+/mo for the index.

Vertex Agent Engine: Runtime $0.0864/vCPU-hour, $0.009/GiB-hour, with 50 vCPU-hours plus 100 GiB-hours/month free; sessions $0.25/1k events; Memory Bank $0.25/1k memories/month plus $0.50/1k retrievals (1k free/month); Vertex AI Search $1.50–$6.00/1k queries with 10k/month free; Gemini 2.5 Pro $1.25/$10.

For Hestiia at “Real” volume, all three converge on roughly $1.5–4.5k/mo, dominated by tokens. The infrastructure layer is rounding error against Sonnet spend until you reach Stretch — this is the right shape and AWS in particular is not gouging on orchestration.

Where it fits Hestiia

Strands on ECS, Anthropic direct, is a real candidate — and the only hyperscaler option worth Hestiia’s evaluation cycles. You already run NestJS on ECS, RDS Postgres, Timestream, AWS IoT Core; you have the AWS muscle. Path: write the agent in the Strands TypeScript SDK, deploy as an ECS service alongside myEko-backend-api, persist sessions in RDS, observe via OpenTelemetry → CloudWatch. Zero new platform spend, zero lock-in beyond AWS-which-you-already-have. ~2–4 weeks for an MVP, +1 week if you go AgentCore Runtime. If/when durable agent state matters, swap the model client to Bedrock and adopt AgentCore Memory plus Runtime — adds maybe $200–500/mo at Real volume but saves a week of in-house Postgres-for-sessions plumbing. Avoid classic Bedrock Agents (the console-driven 2023 product) — the AWS direction is unambiguously AgentCore + Strands; the console-Agents path is being eclipsed.
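For flavour, the same shape in Python (this book's example language; the TS SDK mirrors it). The AnthropicModel import path and constructor follow the Strands docs as of this writing — treat them as assumptions and check your pinned version:

```python
# sales_agent_strands.py - minimal Strands agent, Anthropic direct.
from strands import Agent, tool
from strands.models.anthropic import AnthropicModel  # assumed import path

@tool
def get_deal(deal_id: int) -> str:
    """Fetch a Pipedrive deal (illustrative stub)."""
    return f"deal {deal_id}: stage=negotiation"

model = AnthropicModel(model_id="claude-sonnet-4-5", max_tokens=1024)
agent = Agent(
    model=model,
    tools=[get_deal],
    system_prompt="You are the sales follow-up agent.",
)

print(agent("Summarise deal 4217 and propose the next step"))
```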

Azure Foundry: skip. Hestiia has zero Azure footprint, the Anthropic catalog on Foundry lags Bedrock, and the migration tax dwarfs any platform benefit. Vertex Agent Builder: skip for the same reason — no GCP footprint, Anthropic-first model preference, and even Gemini Pro’s 2× cheaper input does not pencil at Hestiia’s volumes once you add dual-cloud ops cost.

Where it breaks at month 12

For AgentCore specifically: per-region quotas on concurrent agent invocations (50 by default in us-east-1) require a support ticket to raise, and the first time you need it raised you wait 11 days. Anthropic ships a new feature; it appears in Anthropic API on day zero, in Bedrock 4–8 weeks later, in Bedrock Agents 4–8 weeks after that. If you took the Bedrock-Knowledge-Bases-for-RAG path, you are now paying $350/month minimum on the OpenSearch Serverless 2-OCU floor even for a query-twice-a-day workload. The action-groups tool abstraction requires an OpenAPI spec plus a Lambda — 200 lines of CDK for a one-line tool. Migration off classic Bedrock Agents is 8–12 weeks. The clean exit strategy throughout: write Strands, deploy on AgentCore for now, port to your own runtime later.

Verdict

Strands on ECS with Anthropic direct is a real second option for Hestiia behind the Anthropic stack; the other two hyperscalers are not on the table.


3.9 Cloudflare Agents and Project Think

What it is

A genuinely different architectural shape: a TypeScript SDK (agents-sdk) where every agent is a Durable Object — a globally-addressable, single-threaded actor with built-in SQLite, KV state, hibernation, and WebSocket support. State is per-instance and survives restarts and deploys; agents idle at zero cost because they hibernate. Models come from Workers AI (Cloudflare’s own catalog of open-source models) or any external provider via AI Gateway. The 2026 evolution is Project Think: durable execution with fibers (crash recovery, checkpointing), sub-agents with isolated SQLite plus typed RPC, persistent sessions, sandboxed code execution, a managed Agent Memory service, voice pipelines, email agents, and a unified inference layer fronting 14+ providers. License is open for the SDK; the Durable Objects runtime is Cloudflare-only.

Vendor health

Cloudflare itself is public, profitable, and keeps shipping. The Agents product has the same trajectory and bus-factor profile as Workers and Durable Objects — high. Adoption signal is harder to read than the Anthropic or LangChain ecosystems because Cloudflare’s customer disclosures lean enterprise-vague.

The killer feature

Durability is the default, and idle is free. Every agent has its own SQL database that survives across deploys; you can spin up millions of agents (one per user, per session, per device) and they cost zero when not in use because they hibernate. For a fleet-of-heaters use case this is structurally interesting in a way no other framework matches — one Durable Object per heater, owning the agent state, idle 99% of the time. The voice and email pipelines plus the AI Gateway in front of multi-provider inference make this a coherent edge stack.

Pricing

Workers Paid plan minimum $5/mo. Durable Objects compute: 1M requests plus 400k GB-s included, then $0.15 per million requests and $12.50 per million GB-s. DO storage (SQLite-backed): 5 GB plus 25B row reads and 50M row writes included, then $0.20/GB/month, $0.001 per million reads, $1 per million writes. Workers AI: $0.011 per 1k Neurons (Llama 3.1 8B at $0.15 / $0.29 per million tokens, Llama 3.1 70B at $0.29 / $2.25, embeddings from $0.012/M). AI Gateway for proxying Anthropic or OpenAI: free tier, then per-request fees. Cloudflare does not host Claude — you call Anthropic direct through AI Gateway, paying Anthropic prices.

For Hestiia’s CLAWD-SALES-AGENT shape: Tiny $5/mo plan plus ~$50/mo Anthropic = ~$55/mo. Real $5 plus ~$3k tokens plus ~$50 DO compute = ~$3k/mo. If a future “agent per heater” feature gives 100k devices each a long-lived agent DO, storage runs $20–100/mo before any compute.

Where it fits Hestiia

Two answers. For CLAWD-SALES-AGENT today: no — the backend agent lives next to NestJS on ECS in AWS, and straddling AWS and Cloudflare buys nothing. For a hypothetical future “agent per heater” product feature where each MyEko Pro device has a small persistent reasoning loop: yes, structurally this is the best-shaped runtime in the market for that pattern. One Durable Object per heater (r-XXXXXXXX thing-id keyed), idle most of the time, persistent SQLite for device-specific reasoning state, WebSocket for the local-shadow-bridge or for streaming events back to mobile clients. Worth keeping on the radar specifically for that use case; not the right fit for backend agents this year.

Where it breaks at month 12

The lock-in is structural: Durable Objects are Cloudflare-only, and the value proposition (per-device cheap idle agents) rebuilds on essentially nothing else. Migrating off means swapping DOs for Inngest plus Postgres or an actor framework like Akka — a from-scratch rewrite of the agent state model. The other failure mode is the “we ended up needing a feature only AWS or GCP has” trap — if Hestiia’s per-heater agent eventually needs Bedrock-native compliance certifications or AWS IoT Core integration that does not pass cleanly through AI Gateway, you are now operating two clouds for one product.

Verdict

The right shape for a future per-heater agent; the wrong cloud for backend agents this year.



3.10 Durable Execution Platforms

What it is

A durable execution engine answers exactly one question: when the host process crashes between steps three and four, what happens to step four. The engine owns the queue, the retry policy, the timer, the schedule, the idempotency boundary, and the durable state of every in-flight workflow. It does not care whether step three calls Claude, calls Pipedrive, or runs a regex. The agent framework is the opposite — it cares deeply about prompts, tools, models, and traces, and not at all whether step three was retried zero times or seventeen times.

The reason this is its own chapter is that the new wave of agent vendors (LangGraph Platform, OpenAI Agent Builder, Mastra Cloud) bundle a thin durability layer into the runtime, and the new wave of durable execution vendors (Inngest, Trigger.dev, Restate, Hatchet, DBOS) bundle a thin agent kit into theirs. The pragmatic architecture for Hestiia is to pick a durable execution engine first, treat it as multi-year infrastructure, and then plug whichever agent framework wins this quarter inside its workflows. Temporal, Restate, and DBOS keep this separation clean. Inngest and Trigger.dev (covered in Section A as agent-adjacent runtimes) blur it deliberately.

Leading vendors

Temporal is the category-defining engine, born inside Uber as Cadence and forked in 2019 by its original authors. Workflows are written as plain code in Go, Java, TypeScript, Python, .NET, PHP, or Ruby; every external call goes through activity functions; the engine records every input and output to a history log and replays the workflow on crash to reconstruct local state. Snowflake, Stripe, HashiCorp, Coinbase, and Box run it in production. The server is MIT-licensed and free to self-host. Temporal Cloud bills per “Action” — the Essentials tier is $100/month minimum with 1M actions and 1 GB active storage included; Business is $500/month minimum for 2.5M actions and 2.5 GB; overage rates tier from $50 per million actions (first 5M) down to $25 per million (past 200M); active storage runs $0.042/GB-hour. There is a $1,000 starter credit. The Temporal pitch is unambiguous: this is the boring proven thing for workflows that run for days, weeks, or a year, and for polyglot teams.

Restate is the 2024 entrant from ex-Apache Flink committers, $7M seed. It ships as a single Rust binary that gives you durable execution, durable state, durable communication, durable scheduling, and virtual actor objects out of one process backed by an embedded RocksDB. The architectural elegance is that your code stays a normal HTTP handler and Restate replays it on crash by pretending external calls already happened. First-class virtual objects give you keyed stateful actors with single-writer guarantees — a uniquely useful primitive when each agent run is the durable owner of a long-lived deal or household. Source is BUSL 1.1 (converts to Apache 2.0 after four years). Cloud pricing: Free at 50K actions/month, Starter at $75/month for 5M actions, Business at $300/month for 20M, Premium at $1,000/month for 50M, and Enterprise custom up to 100K/sec. Overages run $25 per million on Starter/Business and $10 per million on Premium.

DBOS is durable execution as a library, not a service. Founded in 2024 by Mike Stonebraker (Postgres, Vertica, Turing Award) and Matei Zaharia (Spark). Decorate Python or TypeScript functions with @DBOS.workflow and @DBOS.step; DBOS records every step into Postgres tables alongside your business data; on crash it replays from the same Postgres, in the same transaction. No broker, no separate cluster, no new deployment unit. The Transact library is MIT-licensed. The optional Conductor control plane is the paid product: Pro at $99/month (1M checkpoints, 5 apps, 3 seats), Teams at $499/month (10M, 10 apps, 10 seats, Slack support), Enterprise custom (self-hosted, air-gapped). Overages are $50 per million checkpoints. DBOS ships first-class wrappers for the OpenAI Agents SDK, PydanticAI, and LangChain.
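To make the library-not-a-service point concrete, here is a minimal sketch of the decorator pattern, assuming the documented DBOS Python surface; the FastAPI wiring, function names, and Slack/Pipedrive steps are illustrative, not Hestiia’s actual code:

```python
from fastapi import FastAPI
from dbos import DBOS

app = FastAPI()
DBOS(fastapi=app)  # checkpoints land in the Postgres named in dbos-config.yaml

@DBOS.step()
def draft_followup(deal_id: str) -> str:
    # an external call (Claude, Pipedrive); its output is recorded,
    # so a replay after a crash skips re-executing it
    return f"draft for {deal_id}"

@DBOS.step()
def post_to_slack(draft: str) -> None:
    ...  # side effect, checkpointed so it is not repeated on replay

@DBOS.workflow()
def followup(deal_id: str) -> None:
    # if the process dies between the two steps, recovery resumes
    # here and jumps straight to post_to_slack
    post_to_slack(draft_followup(deal_id))
```

No broker, no cluster: the workflow tables live next to the business tables, which is the whole pitch.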

Hatchet (YC W24, Postgres-backed, MIT) combines task queue, DAG orchestrator, durable execution, rate-limiting, and scheduling in one binary. Cloud pricing: free dev tier with 100K runs/month and $10/M overage; Team at $500/month (10 users, 5 tenants, 500 RPS); Scale at $1,000/month (HIPAA, audit logs, 7-day retention); Enterprise custom with bring-your-own-cloud. Prefect plus its agent kit Marvin (the successor to the now-archived ControlFlow) is the Python-data-orchestration veteran, ten years old, Apache 2.0, with cloud pricing scaling by seats and workspaces (Hobby free, Team historically around $400/month, Pro custom).

Windmill is the AGPL-3.0 outlier worth naming on the durability axis as well as the low-code axis (covered in §3.14). Windmill is most often pitched as an open-source Retool replacement, but underneath the UI it is a durable script runner — flows are durable across worker restarts, retries are first-class, every script and every flow can be exposed as an MCP server with one config change. Rust-and-Svelte, ~16K stars, single-binary self-host plus Postgres. For a Python team that wants Temporal-equivalent durability without operating a Temporal cluster, Windmill is a credible third option behind DBOS and Hatchet — the catch is the AGPL-3.0 license, which triggers copyleft on modifications to Windmill itself but does not touch user-authored flows (those remain plain Python or TypeScript files, not derivative works of Windmill). Cloud pricing on Windmill EE starts around $120/month for the team tier; the OSS path is genuinely free at any scale. Hestiia’s centre of gravity (FastAPI plus already-running Postgres) lands DBOS first regardless, but if the durability decision is reopened in 2027 because the agent farm has grown into something flow-orchestration-shaped rather than webhook-orchestration-shaped, Windmill is the dark horse to evaluate.

AWS Step Functions is the default AWS answer — visual state machine plus JSON ASL definition, Standard workflows at $25 per million state transitions running up to one year, Express at $1 per million requests plus duration, with a 4K transition free tier. Cloudflare Workflows rides the Workers Paid plan ($5/month base) with 10M requests, 30M CPU-ms, and 1 GB storage included; overages run $0.30 per million requests, $0.02 per million CPU-ms, and $0.20 per GB-month. TypeScript-only.

Airflow and Dagster belong in the orientation tier rather than the contender tier. Airflow 3.0 GA’d in 2025, 3.2 in April 2026 — architecturally batch-and-scheduler-first, weak at sub-second event-driven flows, and not a real durable execution engine in the Temporal/Restate sense. Dagster+ Cloud (May 2026): Solo at $10/month plus $0.04/credit, Starter at $100/month plus $0.035/credit, Pro custom; serverless compute $0.01/minute. Both are excellent at what they do — data pipelines, asset graphs, scheduled batch — but they are the wrong abstraction for crash-resumable agent loops.

Where it fits in a Hestiia-shaped stack

Hestiia’s profile is event-driven webhooks, multi-step agent pipelines, ~20 people, AWS-resident, FastAPI today on a hand-rolled SQLite queue, and Postgres already in production via the NestJS backend. The honest ranking is DBOS first, Restate second, Temporal third. DBOS is the boring-correct answer because Postgres is already on the bill, the library imports into the existing FastAPI app with no new deployment unit, and the Conductor UI ($99–499/month) gives versioning, alerting, and recovery without standing up a cluster. Restate is the architecturally most elegant answer specifically for the AI-agent framing — virtual objects are the right primitive for “this deal owns a durable agent forever.” Temporal is overkill for today’s volume but wins at the three-year horizon if the MyEko fleet hits 100K+ devices and the orchestration surface grows beyond CRM actions.

Step Functions is a reasonable default if Hestiia were already deeply invested in AWS-native orchestration; the team is not, and the lock-in cost outweighs the convenience. Cloudflare Workflows is TypeScript-only and irrelevant given the Python centre of gravity.

Verdict

Start with DBOS for the agent farm — lowest blast radius, lowest TCO, already-Postgres — and plan Restate or Temporal as the migration target if action volume crosses ~10M/month or workflows grow to multi-day durations.


3.11 Observability and Eval Platforms

What it is

You can change agent frameworks every six months without losing very much. You cannot change observability that often, because the value compounds with the data you have already collected. Three months of production traces, a curated eval dataset built from real failures, and historical cost-per-conversation tracking are the asset; the framework that produced them is replaceable. Most teams underweight this because, on day one, observability looks like a thin “log the prompt and response” layer that any junior can write in an afternoon — exactly what Hestiia did with the SQLite-plus-Svelte dashboard. That works until you need three things simultaneously: traces (multi-step debugging, span trees, replay-from-step), evals (turning real production traces into regression datasets, running LLM-as-judge experiments, gating prompt and model changes), and cost attribution (per-agent, per-customer, per-feature dollar accounting tied back to org and user). Tools that do traces brilliantly often treat evals as bolt-on. Pure eval shops treat cost as nice-to-have. A FinOps tool does cost but cannot replay an agent. The decision is which is your dominant pain — and whether you accept being locked to one vendor for all three or compose them.

The single most important development of the last 18 months is that the OpenTelemetry community now publishes GenAI semantic conventions — gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, plus a separate spec for agent spans (gen_ai.agent.*) and tool-call spans, with explicit conventions for OpenAI, Anthropic, Azure AI Inference, AWS Bedrock, and MCP. As of April 2026 the conventions are still officially experimental but stabilising, and adoption is the inflection point: Datadog began native semconv ingestion in OTel collector v1.37; New Relic launched OTel-native AI Monitoring in February 2026; Honeycomb is OTel-native by design; Logfire is built on OTel from day one; OpenInference (Phoenix’s instrumentation library) emits compliant spans; Langfuse added a native OTLP endpoint in 2025; Braintrust accepts OTel; LangSmith now ingests OTLP. The hedge writes itself: instrument once, export anywhere. If your code emits OTel-compliant spans, you can dual-write to Langfuse and Datadog tomorrow, swap to Phoenix the day after, and keep your traces portable. For a 20-person hardware company that historically holds infrastructure decisions for a decade, this matters more than any feature comparison below.
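In practice the hedge is a handful of attribute names. A minimal hand-instrumented sketch using the semconv keys above (exporter configuration omitted; values illustrative — in real code an instrumentation library such as OpenInference or OpenLLMetry sets these for you):

```python
from opentelemetry import trace

tracer = trace.get_tracer("hestiia.agents")

with tracer.start_as_current_span("chat claude-sonnet-4") as span:
    span.set_attribute("gen_ai.system", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude-sonnet-4")
    # ... the actual model call happens here ...
    span.set_attribute("gen_ai.usage.input_tokens", 1234)
    span.set_attribute("gen_ai.usage.output_tokens", 256)
```

Any OTLP-speaking backend (Langfuse, Datadog, Phoenix, Logfire) can consume these spans unchanged; that is the portability argument in five lines.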

Leading vendors

Langfuse is the category-leading open-source LLM engineering platform — tracing, prompt management, eval datasets, LLM-as-judge, human annotation queues, playground. ClickHouse-backed; acquired by ClickHouse Inc. in January 2026 with a $400M Series D, but pricing and licensing were unchanged. MIT-licensed core; cloud SaaS in EU/US; self-host via Docker Compose or Helm. Hobby is $0 (50K units/month, 30-day retention, 2 users). Core is $29/month (100K included, $8 per 100K overage, 90-day retention, unlimited users). Pro is $199/month (100K included, same overage, 3-year retention). Enterprise is $2,499/month with SSO, RBAC, and a dedicated support engineer. Self-hosted Enterprise licence (support, SSO, audit logging) typically lands at $20–50K/year per community reports. The dataset → experiment → prompt-version loop is genuinely best-in-class.

LangSmith is LangChain Inc.’s commercial observability — tightest integration with LangChain/LangGraph, weaker outside that ecosystem. Developer is free (5K traces, 1 seat). Plus is $39/seat/month with a 10K base trace allotment per seat included, $2.50 per 1K trace overage, and $5 per 1K for 400-day extended retention. Enterprise is custom with a self-host option. The TCO calculation here is non-trivial and the public maths in the source material may understate or overstate the real bill depending on how the included-trace base scales with seats. At five seats the included base is roughly 600K traces/year; whether that covers 3M/year of real fleet traffic depends on how a “trace” is counted (one user turn, one full agent run, one span). The honest advice is to load-test for a week before signing — the headline $39/seat number is a fraction of the realistic monthly bill for any serious workload, and Langfuse self-host at $1,200/year of infrastructure remains the OSS-equivalent escape hatch.

Braintrust is the premium evals-first platform — the experiment loop (prompt change → run on dataset → diff scores → ship) with first-class production tracing layered on top. Brainstore is a custom OLAP for trace queries. Series B of $80M from ICONIQ in February 2026 at $800M valuation. Closed-source, cloud SaaS or Enterprise on-prem. Starter is $0 with 1 GB processed data included then $4/GB, 10K scores then $2.50/1K, 14-day retention. Pro is $249/month with 5 GB included then $3/GB, 50K scores then $1.50/1K, 30-day retention, S3 export. Enterprise custom. The eval workflow is the reason to pay — CI integration for prompt regressions, side-by-side diffing of N versions across a dataset, and “Loop” (their built-in agent that proposes prompt edits and runs evals autonomously). At Hestiia’s volume Pro lands somewhere in the $3–6K/year band.

Phoenix / Arize AX is two products from one company: Phoenix is the source-available observability and eval platform (ELv2-licensed, not Apache-2.0 as is sometimes reported — runs in a notebook, Docker, K8s, or pip install), Arize AX is the commercial managed offering with the enterprise ML-observability heritage layered on (drift detection, embedding analysis, structured ML metrics on top of LLM tracing). OpenInference is Arize’s OTel-compatible instrumentation library and is Apache-2.0 — the licensing distinction matters because ELv2 is fine for internal use but constrains anyone wanting to resell a managed agent product whose observability layer is Phoenix. For Hestiia internal use, ELv2 is no constraint. Phoenix is $0. AX Free is 25K spans/month, 1 GB ingest, 15-day retention. AX Pro is $50/month for 50K spans, 10 GB ingest, 30-day retention. AX Enterprise is custom with SOC2/HIPAA and self-host. The pitch is the OpenTelemetry hedge — even if you choose Langfuse for daily use, instrumenting with OpenInference makes Phoenix a free local fallback you can boot in five minutes for ad-hoc debugging.

Helicone was a proxy-based gateway with observability bolted on — change base_url and you get logging, caching, rate-limiting, A/B routing, and cost dashboards with no SDK changes. Apache-2.0, YC W23. Status as of April 2026: maintenance mode following Mintlify’s acquisition of Helicone on 2026-03-03, confirmed in Helicone’s own blog post and Mintlify’s acquisition announcement. The Helicone team explicitly recommends migration to LiteLLM or Portkey for new deployments. Treat Helicone as historical context, not a 2026 procurement option; the slot in the recommended stack is filled by LiteLLM (covered in §3.13).

Logfire (Pydantic) is built on a pure OpenTelemetry base by the Pydantic team, with first-class Python ergonomics. Personal is $0 (10M spans/month, 30-day retention, hard cap, no overage). Team is $49/month with 10M included, $2/M overage, 30-day retention. Growth is $249/month with up to 90-day retention. Enterprise is custom with $1/M overage and a self-host Helm option (announced 2025, GA 2026). At Hestiia’s 3M spans/year you do not even leave the free tier; Personal covers it forty times over. Native OTel means every span is portable.

Lunary is OSS observability with a strong RAG/chatbot focus, Apache-2.0, EU-hosted, GDPR-native. Free (10K events, 30-day retention, 3 projects). Team $20/user/month (50K events, $10 per 50K overage, 1-year retention). Enterprise custom. Right answer if European data residency is the dominant constraint.

PromptLayer is prompt-management-first, observability second — UI built for non-technical PMs to edit and version prompts. Free (5K requests, 7-day retention). Pro $50/seat/month (unlimited retention, transaction-based overage). Enterprise custom with SOC2 Type 2, HIPAA, GDPR. Right answer if your sales team wants to edit the sales-assistant prompts without engineering involvement.

Datadog LLM Observability is a bolt-on to an existing Datadog deployment, with a new pricing structure effective May 2026 — it bills per LLM span, with reports of an automatic ~$120/day premium activated on detection of LLM spans (no opt-out). Right answer only if you already pay for Datadog and want one pane of glass; otherwise expensive, with widely reported bill increases of 40–200% when added.

Honeycomb is general-purpose OTel observability, exceptional at high-cardinality queries, not LLM-specialised. Free 20M events/month, Pro from $130/month (1.5B events/month), Enterprise custom. If Honeycomb is already your backend tracing, GenAI semconv spans drop in alongside everything else with no extra tooling — but no prompt management and no built-in evals, so pair with Phoenix or Braintrust.

Open-source eval frameworks

The roundup above covers commercial platforms exclusively — zero open-source eval frameworks — which is a real gap. Two free options deserve naming.

Inspect AI is the UK AI Safety Institute’s eval framework — MIT-licensed, the de facto standard for safety evaluations (Apollo Research, METR, and the AISIs themselves use it), with a clean Python API for defining tasks, solvers, and scorers and a built-in viewer for inspecting outputs. For a Python shop on Langfuse self-hosted, Inspect AI is the natural pairing for offline regression evals: write the eval once as code, run it in CI on every prompt or model change, archive the results. Free.
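A sketch of what “write the eval once as code” looks like, assuming Inspect AI’s task/solver/scorer API; the sample and scorer choice are illustrative:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def smoke_set():
    # regression cases harvested from real production traces
    return Task(
        dataset=[
            Sample(
                input="Which product line heats a 25 m2 room?",
                target="MyEko",
            ),
        ],
        solver=[generate()],
        scorer=includes(),  # passes if the target appears in the output
    )

# in CI: inspect eval evals.py --model anthropic/claude-sonnet-4
```

The eval is a versioned Python file in the repo, so a prompt or model change that regresses it fails the build like any other test.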

Promptfoo is the OSS prompt-and-model testing tool, MIT, with strong GitHub Action integration. You define test cases in YAML, point at a prompt or a model, and run side-by-side comparisons with assertions. The CI ergonomics are the killer feature — you can gate a PR on “no regression on the smoke set.” Free; paid Promptfoo Cloud exists but the OSS path covers the canonical use case.

Both are complementary to a tracing platform, not substitutes. The picture for a frugal Python team is: Langfuse self-hosted for runtime traces, Inspect AI in CI for offline regression, Promptfoo as the lighter-weight prompt-diff harness. Total cost: $1,500/year of infrastructure plus an engineer-week.

Where it fits in a Hestiia-shaped stack

The stack we would build for Hestiia is Langfuse self-hosted as the primary backend (~$1,200/year of infrastructure on a small EC2 + ClickHouse + Postgres footprint), instrumented via OpenInference — or, for a strictly vendor-neutral posture, OpenLLMetry, Traceloop’s Apache-2.0 OTel-aligned instrumentation library (~6.9K stars, auto-instrumentation across 40+ provider SDKs, and a proposed donation path into OpenTelemetry itself). Either way Phoenix remains a free local debug option and the whole stack stays portable to Datadog or Logfire if priorities ever change. Inspect AI lives in CI for offline regression; Promptfoo handles prompt diffs. A second internal Helicone-style proxy slot used to be appealing for the simpler agents that hit OpenAI or Anthropic directly, but with Helicone in maintenance mode the answer is now unambiguously LiteLLM proxy (covered in §3.13). For teams whose dominant pain is agent-specific rather than general-LLM observability, Laminar (lmnr-ai/lmnr, Apache-2.0, ~2.8K stars, YC S24) is the smaller third OSS option behind Langfuse and Phoenix — trace-replay UI, eval datasets, RAG-aware, actively developed; worth a bake-off if Langfuse’s agent-specific UI ever proves the bottleneck.

Skip Datadog LLM Obs (too expensive without an existing Datadog footprint), skip LangSmith unless LangGraph wins Section A’s framework decision, and keep Braintrust on the radar specifically for the CCTP-analyser eval loop if Langfuse’s eval tooling proves inadequate. The two coexist cleanly via OTel.

Verdict

Langfuse self-hosted with OpenInference instrumentation, paired with Inspect AI and Promptfoo in CI; budget $1,500/year and an engineer-week of setup.


3.12 Multi-Agent Frameworks

What it is

The phrase “multi-agent” has become elastic to the point of marketing-speak. Five distinct topologies are conflated under it: sequential pipelines (agent A’s output feeds B — just a function chain with LLM calls), supervisor-and-workers (a planner dispatches subtasks — the ChatGPT-with-tools pattern wearing a costume), peer networks (agents talk freely — looks impressive, fails in production because no one owns global state), hierarchical trees (same as networks plus extra latency), and role-playing debate (Critic versus Researcher — research-paper territory). The “Why Do Multi-Agent Systems Fail?” study (1,600 traces across seven frameworks) found 41.8% of failures were specification or design issues and 36.9% were inter-agent misalignment. The multi-agent topology itself is the bug. By 2026 the practitioner consensus has converged on: single-agent plus tools wins for over 70% of real workloads.

Leading vendors

CrewAI is the most-marketed of the lot — Python framework for “role-playing autonomous agents” with Agent (role/goal/backstory), Tools, Crew, Task, and Process (sequential or hierarchical). Founded by João Moura (ex-Clearbit AI) October 2023; $18M Series A from Insight Partners October 2024; 50.2K GitHub stars April 2026; MIT. Public customers: PwC, IBM, Capgemini, NVIDIA. Self-reported 450M agent executions/month. Pricing on the official page is two-tier — Free (50 executions/month) and Enterprise (custom) — but third-party reporting cites Basic $99/month, Standard $500/month, Pro $1,000/month, Enterprise typically $30–60K/year, Ultra (500K executions, dedicated VPC) ~$120K/year, with a $0.50/execution overage. For Hestiia: $36–60K/year to replace one Python file calling Claude with tools is value-destroying.

AutoGen / AG2 / Microsoft Agent Framework: AutoGen was Microsoft Research’s 2023 multi-agent conversation framework. After Chi Wang and Qingyun Wu left Microsoft, the codebase was forked under ag2ai/ag2 in November 2024. AutoGen has effectively been discontinued as an independent product — in October 2025 Microsoft announced it would merge AutoGen with Semantic Kernel into the new Microsoft Agent Framework (MAF), which shipped 1.0 GA on April 3, 2026. AG2 (4.5K stars, v0.12.1 April 2026) continues under volunteer maintainers from Meta, IBM, and academia. All three are free and open source. MAF is the right answer for .NET and Azure shops — neither of which describes Hestiia.

Agno (formerly Phidata, rebranded January 2025) ships three layers: a Python SDK for agents/teams/workflows, a stateless FastAPI AgentOS runtime, and a Control Plane UI. 39.7K GitHub stars, Apache 2.0, v2.6.3. Architecturally it is serious — ~2µs agent instantiation, ~3.75 KiB per agent, horizontally scalable. Free OSS. Pro $150/month (1 live connection, 4 seats, +$30/seat, +$95/connection). Enterprise custom (likely $15–40K/year band). Notably, AgentOS can run agents built in other frameworks (Claude Agent SDK, LangGraph, DSPy) — which makes it more interesting as a runtime than as a framework.

LlamaIndex Workflows + AgentWorkflow is the RAG-anchored multi-agent answer. LlamaIndex started as the leading RAG and indexing library and bolted on event-driven async workflow graphs with multi-agent orchestration on top. 49K GitHub stars, MIT, ~$27M raised. OSS is free; LlamaCloud (managed parsing and indexing) is consumption-priced. The narrow exception worth a half-day spike at Hestiia is exactly this: a document-heavy workflow over CCTPs, RE2020 dossiers, or tender PDFs. LlamaIndex Workflows is the natural home for that.

Where it fits in a Hestiia-shaped stack

For an internal CRM-action agent farm — “look up this deal, draft a Slack message, create a Pipedrive task” — multi-agent is almost always the wrong abstraction. Ninety per cent of “agents” are deterministic functions wearing an LLM hat (they need an API call and a structured output, not a personality). Nine per cent are genuinely agentic (tool-using, multi-turn) and a single capable model with a good toolset wins. Less than one per cent are tasks where parallel decomposition genuinely helps, and even those are usually asyncio.gather over single agents. Claude Code’s “single agent plus sub-skill dispatch” pattern is structurally better than CrewAI’s role-play model for any deterministic workflow.
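The point about parallel decomposition is worth one concrete sketch. The sub-one-per-cent case usually reduces to this, with no framework involved (the agent call is a hypothetical placeholder):

```python
import asyncio

async def run_agent(subtask: str) -> str:
    # stand-in for one capable single agent: one model, one toolset
    await asyncio.sleep(0.1)
    return f"result for {subtask!r}"

async def fan_out(subtasks: list[str]) -> list[str]:
    # "multi-agent parallelism" without a multi-agent framework
    return await asyncio.gather(*(run_agent(s) for s in subtasks))

results = asyncio.run(fan_out(["summarise deal 42", "draft Slack reply"]))
```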

Where multi-agent frameworks would fit, hypothetically: long-form research-and-synthesis tasks where planner-versus-critic genuinely improves output quality, and document-heavy agent pipelines where retrieval is the dominant bottleneck (LlamaIndex). Not Hestiia today.

Verdict

Skip the category for the current use case. Reserve a half-day to prototype LlamaIndex Workflows the day Hestiia builds a CCTP-parsing or tender-analysis agent.


3.13 Model Gateways

What it is

A model gateway is a network hop or library between your code and the LLM provider. Three real reasons to deploy one. First, decouple model code from provider — your application calls one OpenAI-compatible endpoint and the gateway routes to Anthropic, Bedrock, Vertex, or local-vLLM. Second, cache, route, and fail over at the infrastructure layer — prompt caching, semantic caching, automatic retries, model fallbacks, load balancing. Third, cost, policy, and audit — per-team virtual keys with budgets, PII redaction, prompt-injection guardrails, full request logs for SOC2 evidence. When a gateway is not needed: single-provider shops with single-digit deployed agents and no compliance ask.

Leading vendors

LiteLLM is the de facto OSS standard — Python SDK plus a stand-alone proxy server. The proxy is where the gateway value lives: virtual keys per team or user, Postgres-backed spend tracking, RPM/TPM limits, guardrails, Langfuse/Arize/LangSmith logging hooks, automatic fallbacks, load-balanced routing. Drop-in replacement for the openai SDK — set OPENAI_BASE_URL=http://litellm:4000 and existing code talks to Claude, Gemini, Bedrock, or vLLM with no library changes. OSS is MIT and free. Enterprise tier (custom pricing) adds JWT auth, SSO, audit logs, SLAs. Self-hosting cost is ~$500–3,000/month once you count Postgres, the container host, and DevOps time. Reports 240M+ Docker pulls.
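The drop-in claim is literal. A sketch of existing openai-SDK code pointed at a LiteLLM proxy (the proxy URL, virtual key, and model alias are assumptions about your deployment):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://litellm:4000",   # the LiteLLM proxy
    api_key="sk-team-sales-agent",    # a LiteLLM virtual key, not a provider key
)

# "claude-sonnet-4" is a proxy-side alias; the proxy config decides
# whether it routes to Anthropic direct, Bedrock, or a fallback chain
resp = client.chat.completions.create(
    model="claude-sonnet-4",
    messages=[{"role": "user", "content": "Draft the follow-up email."}],
)
print(resp.choices[0].message.content)
```

Spend tracking, rate limits, and fallbacks all happen proxy-side; the application code never learns which provider actually answered.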

OpenRouter is a hosted gateway: point at openrouter.ai/api/v1, top up credits with a card, and call any of ~300 models behind one OpenAI-compatible API. They claim no markup on token rates; costs are a 5.5% Stripe fee on credit purchases (5% on crypto) and a BYOK fee around 5% of equivalent OpenRouter pricing. Best fit: prototyping across many models. Wrong fit for a 20-person company that already has direct provider contracts.

Portkey is a hosted gateway plus observability with an OSS self-hostable core (Portkey Gateway, MIT). Routes to 250+ models; automatic retries, fallbacks, conditional routing, semantic caching, guardrails, prompt management as first-class features. Self-hosted OSS free. Hosted Developer free (10K logs/month, 3-day retention). Production $49/month (100K logs, then $9 per additional 100K up to 3M, 30-day retention). Enterprise custom (10M+ logs, SSO, SOC2). The most production-credible hosted alternative now that Helicone is in maintenance mode.

Cloudflare AI Gateway is the cheapest serious option — sits on Cloudflare’s edge, core features free on every Cloudflare plan. Logs are the gotcha: Workers Free is 100K logs/month total across all gateways, Workers Paid is 10M logs per gateway, Logpush at $0.05 per million beyond. Best fit: anyone already on Cloudflare.

Vercel AI SDK is a different category — not an infrastructure gateway but a TypeScript library (ai package) giving generateText, streamText, tool use, and structured output with a unified API across providers. The cheap abstraction that buys most of the “I want to swap providers later” benefit without standing up infra. Apache 2.0, free. Covered also in 3.16 as an agent framework.

Kong AI Gateway is Kong’s existing API gateway with AI plugins layered on. Best fit: companies already running Kong for REST APIs. Greenfield AI-only deployments should pick LiteLLM or Portkey.

Helicone (gateway mode) is YC W23, Apache-2.0. The same status as §3.11 applies: Helicone entered maintenance mode after the Mintlify acquisition of March 3, 2026, and the team itself recommends LiteLLM or Portkey for new deployments. Treat the gateway mode as historical context; route new gateway deployments to LiteLLM or Portkey.

Where it fits in a Hestiia-shaped stack

Hestiia is a 20-person hardware company on Anthropic Team plan, calling Claude from internal tools at 100–1,000 events/day per agent. You do not need a gateway today. The right move is the cheap abstraction at the code layer so the swap stays cheap when needed. For TypeScript code, standardize on Vercel AI SDK or the agent framework’s provider abstraction. For Python and Rust, wrap Anthropic calls behind a thin internal interface — one file, three functions. Re-evaluate the gateway question when any of the following lands: a second model provider, more than five deployed agents needing shared budgets and keys, or a compliance ask for centralised audit logs. At that point, default to LiteLLM proxy (self-hosted, MIT, ~$500/month of infrastructure) or Cloudflare AI Gateway (free if already on Cloudflare).

The trap is the symmetric one. Not “we picked the wrong gateway” but “we built the app so coupled to @anthropic-ai/sdk that adding a gateway later is a two-week refactor.” Solve that with one wrapper file today, then stop.
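What that wrapper file might look like — a hedged sketch of one of the three functions, using the public Anthropic SDK (the model id and function name are placeholders for whatever the team standardises on):

```python
# llm.py — the only file allowed to import the provider SDK
from anthropic import Anthropic

_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def complete(prompt: str, model: str = "claude-sonnet-4-20250514") -> str:
    """Single-turn completion. Adding a gateway later means changing
    this function's transport, not every call site in the codebase."""
    msg = _client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text
```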

Verdict

No gateway today; one wrapper file at the code layer. Default to LiteLLM proxy when the gateway question becomes real.


3.14 Low-Code and No-Code Platforms

What it is

Hestiia’s CTO has already decided this is a developer-only agent farm, so why spend any time on n8n or Zapier? Three reasons. Completeness — a CTO who does not know the boundary will be ambushed when sales ops asks for something. Calibration — the ceiling of what a low-code platform can do in 2026 is genuinely higher than it was in 2024, with most platforms now shipping AI Agent nodes and MCP integration. And the carve-out — even a code-first org has a long tail of one-off integrations where a hosted iPaaS is rationally cheaper than a custom service.

Leading vendors

n8n is the self-hostable fair-code workflow tool. Since 2024 the LangChain-based AI Agent node has made it the default OSS answer to “I want a Zapier I can run myself with LLMs in the loop.” Community Edition (self-hosted) is free under a fair-code Sustainable Use License with unlimited executions and workflows. Cloud Starter €20/month annual (2,500 executions, 5 concurrent, 50 AI credits). Cloud Pro €50/month annual (20 concurrent, 150 AI credits). Cloud Business €667/month annual (40K executions, SSO/SAML/LDAP, Git versioning, environments, self-host option). Enterprise custom.

Activepieces is a cleaner AI-first OSS competitor to n8n, MIT-licensed core. Cloud free tier, Plus $25/month (10 active flows, unlimited tasks, AI Agents, internal data tables), Business $150/month, Embed $30K/year. Self-hosted Community is genuinely free with unlimited tasks. The 2026 pitch is 400+ MCP servers baked in — not a small claim if you actually need that integration breadth.

Pipedream is the most developer-friendly of the lot — a code-first workflow platform where every step can be Node.js, Python, Go, or Bash with full npm/pip access, but with hosted triggers, hosted auth (OAuth-as-a-service for 2,500+ APIs), and hosted execution. The 2026 headline is the Pipedream MCP Server, exposing 10,000+ tools across 3,000+ APIs. Free 100 credits/day (3 active workflows, 3 connected accounts, 2M AI tokens). Basic $29/month (2,000 credits/day). Advanced $79/month (10,000 credits/day). Business custom. A credit is roughly 30 seconds of compute at 256 MB.

Windmill (windmill.dev, AGPLv3, YC) is the only one in this category built for a code-first taste — scripts are the primitive, with a Rust worker engine executing TypeScript, Python, Go, or Bash scripts as the unit of work. AI Agent steps inside flows, every major provider, structured outputs, streaming, multimodal — and it can both consume MCP servers as tools and expose its own scripts and flows as an MCP server. Self-hosted is free under AGPLv3; Cloud has free and premium tiers; operators (execute-only users) cost half a developer seat.

Make (formerly Integromat, Celonis-owned) uses credit-based pricing — Free 1,000 ops/month, Core $9/month, Pro $16/month, Teams $29/month (all 10K credits, annual). 3,000+ connectors, Make AI Agents, Maia AI builder. SOC 2 Type II, GDPR.

Zapier is the default — 7,000+ integrations, conversational builder, mature SOC 2, 99.99% SLA. Zapier Agents is now a separate billable line: Free 400 activities/month, Pro 1,500/month, plus add-ons. Stacking standard Zaps + Agents Pro + a Chatbot routinely lands at $150–200/month per user.

Lindy.ai is the agents-as-a-product play — natural-language-authored agents for non-technical operators, with computer-use, voice phone agents (Gaia), and 5,000+ integrations powered by Pipedream Connect under the hood. Free 400 credits/month. Starter $19.99 (2K credits). Pro $49.99 (5K credits). Business $299/month (30K credits + 100 phone calls). Enterprise custom. Lindy is the consumer of an agent platform, not the agent platform itself.

Tray.io / Workato are the enterprise iPaaS heavyweights. Workato Standard starts ~$10K/year, Business $60–120K/year, Enterprise $84–180K/year, Workato One (agentic) $144–216K/year. Tray.ai is sales-led; list pricing runs ~$99/month Pro through $599/month Enterprise, but realistic deployments start ~$1K/month. Both are out of scope on cost alone.

Where it fits in a Hestiia-shaped stack

The narrow window where a low-code platform actually beats code has four axes — linear low-branch workflow, non-engineer authors, established SaaS integrations, and LLM-as-feature-not-core-logic. Hestiia is on the wrong side of every axis. The defensible carve-outs are three narrow scenarios. First, sales-ops glue (Pipedrive ↔ Slack ↔ Gmail ↔ calendar) — give jens-ptz Pipedream or Make at €20/month and let him author the simple flows without engineering. Second, CEO/EA personal assistants — Lindy Free or Pro per individual, no infrastructure burden. Third, an internal scripts-and-ops portal if Hestiia ever needs a Retool replacement — Windmill self-hosted is the only one of the eight that respects the code-first instincts of the engineering team.

Verdict

Not the agent farm. Pipedream or Make for sales-ops at €20/month, Lindy Pro per individual for personal assistants, Windmill self-hosted if a Retool replacement is ever needed. Everything else is built in code in the existing Rust + NestJS + Python stack.


3.15 The MCP Ecosystem

What it is

The Model Context Protocol turns 16 months old in April 2026 and the ecosystem has settled into something recognisable: a stable spec, a governance body, a clear set of transports, and a maturing OAuth story. The latest spec is 2025-11-25, and the biggest political event was December 2025: Anthropic donated MCP to the Linux Foundation. (One internal source describes the recipient as “the Agentic AI Foundation, a directed fund under the Linux Foundation”; public reporting at the time referenced the Linux Foundation directly. The substance is unchanged — MCP is no longer a single-vendor standard.) MCP now has Core Maintainers, a contributor ladder, and four 2026 priority areas: transport scalability, agent communication and async Tasks, governance maturation, and enterprise readiness.

Transports stabilised on stdio (local processes) and Streamable HTTP with SSE for server-initiated messages; the legacy “HTTP+SSE” two-endpoint transport is deprecated. OAuth was the big 2025-11-25 story — the spec mandates OAuth 2.1 with PKCE for public remote servers, replaces Dynamic Client Registration with Client ID Metadata Documents (CIMD) (clients publish a JSON file at an HTTPS URL and that URL is the client ID), and requires RFC 8707 Resource Indicators so tokens are bound to a specific MCP server. Other 2025-11-25 additions: an experimental Tasks primitive for long-running operations, a formal Extensions framework, standardised OAuth scopes, and explicit user consent on local-server install.

The macro trend is MCP-as-a-service — Cloudflare, Composio, Arcade, Pipedream, Smithery, and others now host MCP servers as a managed product. The honest April-2026 maturity caveat is that the spec is stable but implementations are not uniform; Tasks are still experimental; CIMD has only been in the wild ~5 months and the rough edges are real.

Leading vendors

Smithery (smithery.ai) is the leading MCP registry, marketplace, and hosting layer — the GitHub-of-MCP-servers, listing thousands of public servers as of April 2026. Browsing and installing is free; listing your server is free. Hosted execution is sold in Hobby / Pro / Custom tiers — usage-based, no published hard dollar amounts. There is no creator monetisation. The pitch is discoverability for public MCP servers (Cursor, Claude Desktop, etc.); for internal MCPs, Smithery adds nothing.

Composio (composio.dev) is a managed integration layer — started as Zapier-for-agents and pivoted hard into MCP in 2025. ~250+ apps with managed OAuth, per-end-user auth, action-level RBAC, and observability. Free $0/month (20K tool calls). Hobby $29/month (200K tool calls, $0.299 per 1K overage). Business $229/month (2M tool calls, $0.249 per 1K overage). Enterprise custom (SOC-2, dedicated SLA, VPC/on-prem). Premium tools (semantic search, code execution) bill at 3× the standard rate. Best fit: teams that need dozens of SaaS integrations with end-user OAuth.

Arcade.dev is Composio’s most direct competitor — same pitch (managed MCP runtime + per-user OAuth + governance + 100+ pre-built integrations) with a more developer-platform flavour. Free $0/month (100 user challenges, 1K standard tool executions, 50 pro executions, 1 hosted MCP server). Growth $25/month plus overages (600 user challenges, 2K standard, 100 pro, $0.05/hour for hosted MCP servers). Enterprise custom (dedicated tenant isolation, RBAC, SSO/SAML, SOC-2/HIPAA pathways). Notable: Arcade has a startup program for sub-100-employee companies — relevant to Hestiia’s size.

Cloudflare MCP / Workers MCP has become arguably the most important MCP infrastructure provider because it solves the boring-but-load-bearing problem: where does a remote MCP server actually run. Workers MCP gives you a CLI and library to expose any Cloudflare Worker as an MCP server, with cloudflare/agents providing higher-level “Durable Object as MCP session” patterns. The April 2026 announcement worth knowing is Code Mode: instead of dumping every MCP tool definition into the model’s context, the server exposes search() and execute() — agents query the OpenAPI/tool spec, then write small TypeScript snippets the Worker runs in a sandboxed isolate. For Cloudflare’s own ~2,500-endpoint API this cut input tokens from 1.17M to ~1K (a 99.9% reduction). This is the most important architectural pattern in MCP this year.

FastMCP (gofastmcp.com, prefecthq/fastmcp) is the dominant Python framework for building MCP servers. The original v1 was upstreamed into the official MCP Python SDK; the v2/v3 line continued with richer features. FastMCP 3.0 shipped January 19, 2026 with component versioning, granular authorization, OpenTelemetry instrumentation, and multi-provider OAuth. Some version of FastMCP powers ~70% of Python MCP servers, with roughly a million daily downloads.
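For orientation, the minimal shape of a FastMCP server — a hedged sketch using the documented decorator surface; the server name and tool are illustrative, not Hestiia’s pipedrive-managed:

```python
from fastmcp import FastMCP

mcp = FastMCP("deal-tools")

@mcp.tool
def get_deal_summary(deal_id: int) -> str:
    """One-paragraph summary of a CRM deal for agent consumption."""
    # a real implementation would call the Pipedrive API here
    return f"Deal {deal_id}: stage, owner, last activity."

if __name__ == "__main__":
    mcp.run()  # stdio by default; pass transport="http" for a remote server
```

The type hints and docstring become the tool schema the client sees, which is why FastMCP servers stay this short.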

Pipedream Connect ships a hosted MCP server fronting ~3,000 APIs and 10,000+ tools with managed end-user OAuth — pricing is Pipedream’s standard tier (covered in 3.14). Best fit: breadth above all (a sales agent touching fifty long-tail SaaS apps you do not want to write SDKs for).

mcp.run (Dylibso) is a different beast — a registry of WebAssembly servlets rather than process-based MCP servers. Servlets are sandboxed Wasm modules with capability-restricted access; the security model is “do not trust the tool, run it in a Wasm sandbox.” mcp-use is a fullstack Python and TypeScript framework — client + server + agent runtime in one. Think LangChain but MCP-native.

Where it fits in a Hestiia-shaped stack

Hestiia is already in a healthy place — the internal MCP Manager (a 30k-LOC local-first MCP gateway with permission, audit, and approval middleware in front of ten typed connectors and three OAuth-2.1 remote proxies; the local analogue of Composio for the developer-agent use case, treated in detail in §5.1) plus first-party pipedrive-managed is the right architecture for the current scale, and migrating to Composio or Arcade would buy almost nothing today. Where managed MCP would earn its keep, in priority order: first, an end-customer-facing agent — the day MyEko Pro ships an in-app assistant that touches a household’s Google Calendar or Home Assistant, you need per-end-user OAuth at scale, and that should be built on Composio or Arcade, not rolled in-house. Second, a sales or marketing agent touching Pipedrive + Gmail + HubSpot + Linear + Notion — the canonical Composio / Pipedream Connect use case. Third, remote hosting for any MCP server that must be reachable from outside the VPC — Cloudflare Workers + Durable Objects + Access is the path of least resistance.

For the framework decision in Section A, pick one whose MCP client implementation tracks the 2025-11-25 spec — especially CIMD, Resource Indicators, and Tasks. Today that means Anthropic SDK, OpenAI Agents SDK, mcp-use, and LangChain. Build new internal servers on FastMCP 3 (Python) or rmcp (Rust), expose OTel from day one.

The MCP-first thesis

The conviction worth ending this chapter on is that of all the categories in Section B, MCP is the highest-confidence bet. Frameworks will rise and fall over the next 24 months — Mastra and LangGraph and OpenAI Agents SDK are all credible, all evolving, none guaranteed. Durable execution will consolidate. Observability will commoditise around OTel. But the integration surface — how an agent reaches a Pipedrive deal, an IoT shadow, a household calendar — converges on MCP regardless of which runtime wins. Hestiia’s bet should be MCP-first as architectural discipline: every internal capability that an agent might call is exposed as an MCP server with stable semantics, OTel instrumentation, and 2025-11-25-compliant auth. Whatever framework wins this year inherits those servers for free, and whatever framework wins in 2027 inherits them too.

Verdict

Keep the current internal-MCP-Manager + first-party pipedrive-managed architecture. Build new internal servers on FastMCP 3 (Python) or rmcp (Rust). Reserve Composio or Arcade for the end-user-OAuth case; Cloudflare Workers MCP for any externally-reachable server. Treat MCP-first integration as the highest-confidence architectural commitment in the entire stack.


3.16 Minor and Adjacent Frameworks

What it is

A second-tier of agent frameworks worth understanding before committing — not the average startup’s first pick, but each represents a distinct philosophy. Three reasons to know them: foundational context (LangChain remains the substrate that LangGraph and parts of Haystack build on); real alternatives (DSPy and Atomic Agents reject the agent-loop abstraction entirely); and niche fits (Burr for state machines, Haystack for retrieval-heavy workloads).

Leading vendors

Smolagents (Hugging Face) is a ~1,000-line Python library where the agent thinks in code: emits Python, executes it in a sandbox (E2B, Modal, Docker, or Pyodide+Deno), observes the result, loops. Apache 2.0, free. The killer feature is code-as-actions — letting the LLM emit Python instead of structured tool calls is empirically more efficient. Deliberately minimal and research-flavoured; no production telemetry, no durable execution. Not Hestiia’s pick but worth knowing as the canonical “code-as-actions” reference.
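Code-as-actions in miniature, assuming the current Smolagents API (class and model-wrapper names per its documentation; the task is illustrative):

```python
from smolagents import CodeAgent, InferenceClientModel

# the model writes Python, the sandbox executes it, and the agent
# loops on the observation — no structured tool-call schema involved
agent = CodeAgent(tools=[], model=InferenceClientModel())
agent.run("How many days are there between 2026-01-15 and 2026-04-30?")
```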

DSPy (Stanford NLP) is a framework for “programming, not prompting” — write Python signatures, DSPy compiles them into prompts, optimisers like MIPROv2 and GEPA tune them against a metric. Apache 2.0, free, ~250 contributors. The killer feature is the optimisers; 10–40% gains over hand-tuned prompts on structured tasks are documented. Honest dark-horse fit at Hestiia: classifying device failure modes from telemetry, where there is labelled data and a metric. Not for the sales agent farm.
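The “programming, not prompting” pitch in a few lines — a sketch of the dark-horse telemetry-classification fit, assuming DSPy’s string-signature API (the model id and field names are illustrative):

```python
import dspy

dspy.configure(lm=dspy.LM("anthropic/claude-sonnet-4-20250514"))

# DSPy compiles this signature into a prompt; an optimiser such as
# MIPROv2 can later tune it against labelled failure modes
classify = dspy.Predict("telemetry_snippet -> failure_mode")

pred = classify(
    telemetry_snippet="loop delta-T fell to 0 at 14:02, pump current nominal"
)
print(pred.failure_mode)
```

The optimiser step is the point: with a few hundred labelled examples and a metric, the prompt stops being hand-maintained.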

Strands SDK (AWS) — covered in detail in Section A as the SDK paired with AWS Bedrock AgentCore. Briefly: Apache 2.0 SDK, free; provider-agnostic by design (works with Bedrock, Anthropic direct, OpenAI, Llama, Ollama). Right answer for AWS-committed shops; less interesting if you are already on Anthropic Team plan.

Burr (Apache, ex-DAGWorks) is a Python state-machine framework — define actions and conditional transitions, Burr handles persistence, telemetry, and a built-in debugging UI. Donated to the Apache Software Foundation, currently incubating; commercial Apache Burr Cloud listed as “Coming Soon” on their pricing page. Right answer when your agent genuinely is a state machine. Hestiia has no urgent FSM use case.

Haystack (deepset) is the production LLM framework with deep retrieval roots. Apache 2.0 OSS; Haystack Studio (managed) free for 1 workspace, 1 user, 100 pipeline hours; Haystack Enterprise Platform custom-quoted. Best-in-class retrieval. Not Hestiia’s centre of gravity — the likely agent use cases are not retrieval-heavy.

Microsoft Agent Framework (microsoft/agent-framework, MIT, Python and .NET, ~9.8K stars) reached v1.0 GA on April 3, 2026 and now sits as the supported successor to both AutoGen and Semantic Kernel — both projects have officially moved to maintenance, and Microsoft’s recommendation is that any new project starts on MAF. MAF unifies the two heritage frameworks into a single typed API for agents, workflows, and threads, ships with first-class hooks for Azure AI Foundry Agent Service (the hosted runtime equivalent), and has full OTel instrumentation. The interesting strategic detail is that MAF v1.0 is the product the book’s §4.4 casualty list predicted — AutoGen’s roadmap quietly closed in late 2025, and the consolidation into MAF is the corporate version of the same outcome. For a Hestiia stack with zero .NET surface and a Python centre of gravity, MAF remains a skip — the SDK is more interesting on the .NET side than the Python side, and the value-add over PydanticAI or the Anthropic Agent SDK is mostly the Foundry tie-in that Hestiia is not buying. Worth naming once for completeness; not a procurement candidate.

Vercel AI SDK — covered in 3.13 above as a code-layer abstraction. AI SDK 6 (2026) added a clean Agent interface with ToolLoopAgent as the default; provider-agnostic; MCP-native. Apache 2.0, free. Right answer for any new TypeScript web product (an internal admin tool with embedded agents). Hestiia has no Next.js footprint, so the case is not urgent today.

Raw LangChain is the original building-blocks library — prompts, models, parsers, document loaders, vector store wrappers, chains. MIT, free. In 2026, almost never the right pick over LangGraph. Use LangGraph for anything stateful, looped, or multi-step; raw LangChain only makes sense for strictly linear pipelines, which is a category that barely exists in agent farms.

Atomic Agents (BrainBlend AI) is the anti-framework framework — a tiny Python library built on Pydantic and Instructor. MIT, free. The killer feature is predictability: stack traces are normal Python, schema-driven I/O means everything is typed and testable. The dark-horse fit at Hestiia: the engineering culture is Rust embedded with schema-driven message buses (protobuf via aroma, Pydantic-equivalent rigour), and Atomic Agents’ philosophy of typed I/O, no magic, just composition maps cleanly onto that taste. Downside is community size — small independent shop, not a YC/VC-backed company.

OpenHands (formerly OpenDevin, ~40K stars) is the leading OSS autonomous-coding agent — the open-source reference implementation of the Devin pattern. Apache 2.0, free. Belongs in the same mental bucket as Smolagents but oriented at coding rather than general action. Not Hestiia’s stack to build on; worth knowing as the OSS comparable for the commercial Devin (3.21).

CAMEL-AI is the most-cited multi-agent research framework — academic-rooted, role-playing-heavy, foundational for the subsequent multi-agent literature. Apache 2.0, free. Reference and citation set, not infrastructure to deploy.

Where it fits in a Hestiia-shaped stack

Zero of these are top picks for Hestiia, but two are real dark horses. Atomic Agents is the most likely to feel native if a backend engineer needs to add an agent loop somewhere — the schema-driven taste matches the embedded culture. DSPy is the right answer for one specific job: auto-classifying device failure modes from telemetry. Both are free; both can be prototyped in a day; neither needs to displace the primary framework choice from Section A.

Verdict

Skip as primary picks. Keep Atomic Agents in mind for any new typed-Python agent work; reach for DSPy the day someone needs to build a classification or extraction pipeline against labelled data.


3.17 Agent Memory

What it is

The treatment of memory across the agent landscape has been incoherent. Mastra calls it Memory; LangGraph calls it the checkpointer; Anthropic Skills treat it as a file system; OpenAI’s stateful Responses API treats it as a server-side conversation; vector DBs treat it as RAG. The result is that “memory” gets discussed as a sub-feature of every framework but rarely as a category in its own right. The category is real: a dedicated long-term memory service for agents, sitting alongside the framework runtime, owning the user-and-session state that needs to persist beyond the lifetime of any one prompt or any one workflow.

The distinction from RAG matters. RAG retrieves from a static or slow-changing corpus; agent memory accumulates from the agent’s own interactions over time, with notions of recency, relevance, contradiction, and consolidation. “What did we already learn about this MoA?” — across a 200-deal Pipedrive pipeline, six months of Slack conversations, and three calls — is a memory problem, not a RAG problem.

Leading vendors

Mem0 (YC-backed, OSS-core plus cloud) is the most-marketed of the three dedicated memory services, with a Python SDK that abstracts memory operations behind add, get_all, search, update, delete — pluggable into any agent framework. Free tier on cloud; $19/month Pro; Enterprise custom. The OSS self-host path is real (the core engine is open).
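The surface area really is that small. A sketch against the OSS engine’s documented API (default local config assumed; scoping memories by deal id is an illustrative convention, not a Mem0 requirement):

```python
from mem0 import Memory

m = Memory()  # default config; production would point at a real vector store

m.add(
    "MoA prefers email over phone; budget approved in March",
    user_id="deal-4711",
)
hits = m.search(
    "how does this MoA prefer to be contacted?",
    user_id="deal-4711",
)
```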

Letta (formerly MemGPT) is the academic-rooted memory framework — comes out of the original MemGPT paper from Berkeley, treats memory as a tiered architecture (in-context vs. archival, with an LLM-managed paging layer between them). OSS-first; cloud product available; pricing varies by tier. The intellectual depth is the strongest of the three.

Zep is the most production-mature, with temporal knowledge graphs as its differentiating primitive — relations between entities are timestamped, so the agent can reason about “what was true in January but is not true now.” Zep Cloud Starter $39/month; OSS self-host option (Zep Community Edition) exists but is materially less differentiated than the cloud product because the temporal-knowledge-graph engine sits in a separate library. That library is Graphiti (getzep/graphiti, Apache-2.0, ~25.5K stars), the standalone OSS graph-construction engine that is the property the cloud Zep is sold on — pip install graphiti-core plus a Neo4j or FalkorDB backend gives you the temporal-knowledge-graph primitive without Zep Cloud at all. For a Hestiia “what do we know about this MoA, and what was true last quarter that may not be true now” use case, Graphiti standalone is the OSS-purist self-host answer; Zep Cloud is the managed shortcut.

Pinecone Assistant is a wrapper over Pinecone vector DB for chatbots — managed RAG-plus-memory in one product. Pricing follows Pinecone’s standard usage tiers. Best fit if you are already a Pinecone shop. Hestiia is not, and pgvector on RDS is cheaper and already in stack.

Where it fits in a Hestiia-shaped stack

The current CLAWD-SALES-AGENT presumably reasons over deal history, prior calls, and prior Slack context — and today that is stuffed into prompt context or pulled from Pipedrive/Slack on demand. As deal count grows past a threshold (a few hundred active deals, or once a single deal accumulates dozens of interactions), the prompt-context approach starts to break — too many tokens, too much noise, the model hallucinates that something was discussed when it was not. At that point, dedicated memory becomes worth a real evaluation.

The right Hestiia move today is to defer the decision. The agent farm is not yet at the scale where memory-as-a-product earns its keep, and pgvector + a hand-rolled summary loop covers the current need at near-zero cost. When the threshold is hit, Zep is the strongest commercial pick (temporal knowledge graphs are uniquely useful for sales intelligence over time), with self-hosted Letta as the OSS-purist alternative. Mem0 is fine but the differentiation is weaker. Pinecone Assistant is a category the team should explicitly skip — the value lives upstream in whichever knowledge graph is doing the actual reasoning.

Verdict

Defer for six months. When deal volume or Slack-context volume forces the question, prototype Zep Cloud Starter ($39/month) for the sales agent’s “what do we know about this MoA?” memory; keep Letta self-hosted as the OSS escape hatch.


3.18 Code Sandboxes and Browser Automation

What it is

These are two distinct categories that tend to get conflated because they answer the same meta-question — “how does my agent take action when there’s no API?” — but they answer it differently. Code sandboxes give the agent a Linux VM in which to run Python, JS, or Bash that the agent itself just wrote. Browser automation gives the agent a headless Chrome and a structured way to click, type, and read. A serious agent farm needs an answer to both questions, and they are the categories most often missing from CTO-level treatments of the stack.

Code sandboxes

E2B is the standard answer. Hosted code-execution sandboxes designed specifically for agents — boot a Firecracker VM in under a second, run untrusted Python or JavaScript, return results, snapshot state, destroy. The Smolagents pattern of “let the LLM emit Python” relies on something like E2B as the execution substrate; Anthropic Skills with Bash similarly need a sandbox; any data-analysis agent that ingests a Pipedrive export and runs pandas needs a sandbox. Pricing: Pro $150/month for 8 hours of concurrent compute; pay-as-you-go ~$0.000014 per CPU-second beyond. Free tier exists for prototyping. The pitch over rolling your own is that E2B is one API call versus roughly two engineer-weeks of fragile work to make Docker-in-Docker-with-resource-limits behave on AWS.
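The one-API-call claim, sketched against the e2b-code-interpreter SDK (a real tender-analysis agent would first upload the export with the SDK’s file API; the inline DataFrame here is a stand-in):

```python
from e2b_code_interpreter import Sandbox

sandbox = Sandbox()  # boots a fresh Firecracker VM, typically under a second
execution = sandbox.run_code(
    "import pandas as pd\n"
    "df = pd.DataFrame({'lot': ['A', 'B'], 'amount': [120, 80]})\n"
    "print(df.groupby('lot').sum())"
)
print(execution.logs.stdout)
sandbox.kill()  # destroy the VM; snapshotting state is also available
```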

Modal sandboxes are a feature inside Modal’s broader “serverless GPU” platform — modal.Sandbox spins up an isolated container running on Modal’s infra, pricing folds into Modal’s per-second compute model (CPU sandboxes are cheap, GPU sandboxes track Modal’s GPU rates). Best fit if Modal is already part of the stack for ML inference; otherwise E2B is more focused.

Daytona is the “dev-environment-as-a-service” play that pivoted into agent sandboxes. Open-source core, cloud product. Less mature than E2B for the agent use case but a real alternative if you want self-host as the default posture.

Pyodide + Deno is the in-browser / WASM path — run untrusted Python via Pyodide in a Web Worker, or untrusted JS via Deno’s permission-based runtime. Free, no infrastructure, but limited (no native libraries beyond what Pyodide ships). The right answer for bounded code execution where the input universe is known.

Browser automation

Browserbase is the dominant managed headless-browser-for-agents — Series B, ~$40M raised. Their pitch is that running headless Chrome at scale on AWS for an agent workload is a solved problem you should not solve again: stealth, captcha handling, session persistence, debug recordings. Pricing: free tier for prototyping; ~$0.05/minute session pricing on usage tiers; team plans $99–$499/month with concurrency and retention upgrades. Stagehand is Browserbase’s TypeScript framework on top — declarative page.act("click the login button") and page.extract({...}) primitives that let an LLM drive a page without you having to write CSS selectors that break on the next deploy. The combination is the polished answer for any agent that has to do “log in, navigate, fill a form, download a PDF” against a site with no API.

Browser Use is the OSS Python equivalent — MIT-licensed, free, the model writes structured browser actions and Browser Use executes them. Less polished than Stagehand for production but excellent for prototyping and self-host. ~91K GitHub stars as of April 2026 (the project nearly doubled in late 2025), very fast-moving. The most-credible OSS alternative for forms-and-workflow automation specifically is Skyvern (Skyvern-AI/skyvern, AGPL-3.0, ~21K stars), with a more workflow-shaped abstraction than Browser Use’s lower-level browser-action primitive — the AGPL is the trade-off for a Hestiia internal-only deployment but does not block it. The MCP-native answer is Microsoft Playwright MCP (microsoft/playwright-mcp, Apache-2.0, ~32K stars), the official Microsoft-maintained Playwright-as-MCP-server — for any agent that needs browser automation through MCP rather than a separate SDK, this is the OSS default and slots directly into the MCP-first discipline.
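On the OSS side the loop is similarly small. A sketch in Browser Use's README shape; the task string and model are illustrative, and the LLM-wrapper import has moved between releases, so treat the exact imports as version-dependent:

```python
# Browser Use sketch: the model plans structured browser actions, the library executes.
# Imports are version-dependent; this is the LangChain-wrapper shape.
import asyncio
from langchain_openai import ChatOpenAI
from browser_use import Agent

async def main():
    agent = Agent(
        task="Log into the supplier portal and download this month's PO PDF",
        llm=ChatOpenAI(model="gpt-4o"),  # any capable vision-and-tools model
    )
    await agent.run()  # plans, clicks, types, retries until done or stuck

asyncio.run(main())
```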

The Browserbase-versus-Browser Use call mirrors the E2B-versus-Daytona call: hosted-and-polished versus OSS-and-self-host, with the same trade-off (engineer-weeks saved versus monthly bill incurred).

Where it fits in a Hestiia-shaped stack

The honest assessment is that Hestiia has a real but narrow need on both. On code sandboxes: the current sales agent does not need one (it is structured Pipedrive actions), but the day someone builds a CCTP-analysis agent that runs pandas over an exported tender spreadsheet, E2B is the right answer. On browser automation: a non-trivial fraction of Hestiia’s commercial workflow touches French administrative portals (RE2020 dossier sites, BET-portal logins, supplier portals) that have no API. Today this is handled by humans; an agent that can navigate those portals is a real productivity unlock for jens-ptz and the sales-ops function. Browserbase + Stagehand at $99–$499/month for that specific workflow is rational spend the day someone has the bandwidth to build it.

The right architecture is to treat both as MCP servers: a Hestiia-internal code-execution-managed MCP server fronting E2B (so any agent can call it through the same MCP discipline as Pipedrive), and a browser-managed MCP server fronting Browserbase. This keeps the integration surface uniform — the framework choice in Section A does not change anything — and means switching from Browserbase to self-hosted Browser Use later is a one-file change in the MCP server.
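What such an internal server looks like is small. A sketch of the code-execution-managed server using FastMCP from the official MCP Python SDK; the E2B-backed tool body is the one file's worth of code that changes if the backend is later swapped:

```python
# Sketch: the internal code-execution-managed MCP server fronting E2B.
# Any agent speaking MCP calls run_python exactly as it calls a Pipedrive tool.
from mcp.server.fastmcp import FastMCP
from e2b_code_interpreter import Sandbox

mcp = FastMCP("code-execution-managed")

@mcp.tool()
def run_python(code: str) -> str:
    """Execute untrusted Python in a disposable sandbox; return stdout."""
    with Sandbox() as sandbox:
        execution = sandbox.run_code(code)
        return "".join(execution.logs.stdout)

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```

The browser-managed server is the same skeleton with a Browserbase-backed tool behind it; the uniform surface is the point.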

Verdict

E2B for sandboxes the day a code-emitting agent ships; Browserbase + Stagehand for browser automation when the French-admin-portal workflow gets prioritised. Wrap both as internal MCP servers so the framework layer stays clean. Budget: $0 today, ~$250/month combined when both come online.

Computer Use

3.19 Computer Use

What it is

Computer Use is the “last-resort integration.” When there is no API, no MCP, no headless-browser shortcut, and no scraping path that survives a redesign, the model gets a screenshot and a virtual mouse and keyboard. This is the most general possible integration surface — anything a human can do at a desk, the agent can theoretically do — and the most expensive, the most fragile, and the slowest. The category exists because real enterprises run on real software that does not have APIs, and a CTO doc that pretends otherwise is dishonest.

Leading vendors

Anthropic Computer Use is the canonical implementation. Distinct API surface from the Agent SDK, accessed via a beta header (anthropic-beta: computer-use-2024-10-22 historically, with versioned successors). The model receives screenshots as image inputs and emits computer tool calls — screenshot, mouse_move, left_click, type, key, scroll — that you execute against a virtual desktop you provide (typically a containerised Linux desktop running Xvfb). Billing is standard Claude token rates; the cost driver is image input tokens for screenshots (a 1280×800 PNG runs ~1,200 input tokens at typical compression, so a 30-step task can easily hit 50K input tokens just on screenshots before any reasoning). Anthropic ships a reference container image that bundles Xvfb, Firefox, a desktop environment, and the Computer Use bridge — boot it, point Claude at it, watch the demo.
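Mechanically the loop is short. A sketch against the 2024-10-22 beta surface (versioned successors exist); `take_screenshot` and `execute_action` are hypothetical bridges into the virtual desktop you provide, and the model id is illustrative:

```python
# Computer Use loop sketch: screenshots in, mouse/keyboard actions out.
# take_screenshot/execute_action are hypothetical stubs for your Xvfb desktop.
import anthropic

def take_screenshot() -> str: ...              # base64 PNG of the display (stub)
def execute_action(action: dict) -> None: ...  # drives virtual mouse/keyboard (stub)

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Upload the quarterly dossier to the portal."}]

while True:
    response = client.beta.messages.create(
        model="claude-sonnet-4-5",             # illustrative model id
        max_tokens=1024,
        tools=[{"type": "computer_20241022", "name": "computer",
                "display_width_px": 1280, "display_height_px": 800}],
        messages=messages,
        betas=["computer-use-2024-10-22"],
    )
    if response.stop_reason != "tool_use":
        break                                  # the model considers the task done
    messages.append({"role": "assistant", "content": response.content})
    results = []
    for block in response.content:
        if block.type == "tool_use":
            execute_action(block.input)        # click, type, scroll...
            results.append({"type": "tool_result", "tool_use_id": block.id,
                            "content": [{"type": "image", "source": {
                                "type": "base64", "media_type": "image/png",
                                "data": take_screenshot()}}]})  # ~1,200 tokens/frame
    messages.append({"role": "user", "content": results})
```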

OpenAI’s Computer Use equivalent shipped in 2025 alongside the Operator product line — same architectural pattern (screenshot in, mouse/keyboard out), bundled into the Responses API with tools: [{type: "computer_use"}]. Pricing is GPT-4o image-input rates, which run higher per token than Claude on equivalent screenshots. The two are essentially at feature parity for the canonical use cases; the choice tracks the rest of your provider commitment.

OS-level alternatives are a rougher set of OSS projects — pyautogui plus a vision model (any LLM with image input), Microsoft’s UFO project, Skyvern (browser-only but in this neighbourhood), the various academic “OS-World” benchmark implementations. None are production-ready in the way Anthropic’s offering is; all are roughly two engineer-weeks of work to make reliable. They exist to remind you that Computer Use is not magic — it is a screenshot loop, and you can build one yourself if vendor lock-in is the dominant concern.

Where it fits in a Hestiia-shaped stack

The narrow-but-real Hestiia need is exactly what Computer Use solves and Browserbase does not quite reach: legacy desktop software (some BET-portal client applications, certain French regulatory tools, supplier-facing software that exists only as a Windows application). For every such workflow, the order of preference is: (1) is there an API or MCP — yes, use it; (2) is it a browser-only app — yes, Browserbase + Stagehand; (3) is it a desktop app — only then, Computer Use.

The cost discipline matters. A 30-step Computer Use task at Claude Sonnet rates can run $0.10–$0.50 per execution in screenshot tokens alone. That is fine for a once-a-day task; ruinous for a once-a-minute task. Computer Use is the correct tool for low-frequency, high-friction workflows — the supplier portal you check twice a week, the regulatory site that requires a quarterly upload — and the wrong tool for any high-volume integration where the right answer is to invest in an actual API.

A second discipline: Computer Use tasks should run inside a virtual desktop that the agent owns end-to-end (boot fresh, run task, snapshot, destroy), not against the agent operator’s actual machine. The reference container is a reasonable starting point; in production you want this on E2B or a Modal sandbox or a dedicated EC2 spot fleet, with no shared filesystem to the host.

A pricing caveat: in current public Anthropic and Bedrock pricing, Sonnet 4.6 cache-read on Bedrock is at parity with Anthropic’s $0.30/M as of Q1 2026, per AWS pricing updates. The historical “Bedrock cache-read is more expensive than Anthropic” observation was true for the legacy 3.5 family but should not be repeated as a current claim. For Computer Use specifically, prompt caching of the Anthropic-provided system prompt and tool definitions is a real cost lever — it cuts the per-turn input cost by an order of magnitude on long sessions — and it works equivalently on both providers now.

Where Computer Use is the right call versus the wrong call

Right calls. Quarterly French regulatory uploads where the website is a Drupal-era portal with no API. A supplier portal that emits a monthly PO PDF. Onboarding workflows where the BET assistant has to log into a tool the engineering team has no contract with. Any task where the alternative is “ask jens-ptz to spend an hour clicking buttons.”

Wrong calls. Anything where the volume exceeds 100 runs per day. Anything where latency under 30 seconds is a requirement. Anything where the underlying app changes weekly (Computer Use is fragile to UI changes; the model’s ability to find a button is roughly as good as a human’s, but it has no way to learn that the button moved). Anything where the security boundary matters and you cannot afford to give an LLM a logged-in session.

Verdict

Reserve Computer Use as the explicit fallback for low-frequency, no-API workflows. Build the integration as an MCP server (computer-use-managed) so the agent framework does not see the difference between a Pipedrive call and a portal-click sequence. Default to Anthropic Computer Use given the existing Anthropic Team plan; switch to OpenAI’s equivalent only if a specific workflow benefits from GPT-4o’s vision quirks. Budget: a few dollars a day of token cost per active workflow, plus E2B or equivalent for the desktop substrate.

Voice Agents

3.20 Voice Agents

What it is

Real-time voice agent platforms glue speech-to-text, an LLM, text-to-speech, and telephony together with sub-500ms latency targets. The category exists because building this stack from scratch (Twilio + Deepgram + GPT-4o + ElevenLabs + a careful WebRTC pipeline) is two engineering quarters of work, and the failure modes (turn-taking, interruption, hallucinated phone numbers) are subtle. By 2026, “AI SDR that calls a lead” is the single fastest-growing agent product category, which is why a CTO doc that omits voice looks 12 months stale even when the recommendation is “skip for now.”

Leading vendors

Vapi is the developer-platform pick. Sub-500ms target latency, BYO LLM, Twilio passthrough or their own SIP, recording, post-call analytics, function calling. Pricing: $0.05 per minute on the Vapi side plus provider passthrough (LLM tokens, STT, TTS billed at the underlying providers’ rates). Real all-in cost lands around $0.10–$0.15/minute for a well-tuned agent.

Retell is Vapi’s most direct competitor with a slightly more product-oriented framing — pre-built agent templates, knowledge-base integration, voice-cloning, native CRM hooks. Pricing: $0.07–$0.10/minute Retell-side plus provider passthrough.

Bland AI is the third in the Big Three, positioned as the fastest to set up — natural-language agent definition, an internal LLM tuned for phone latency, North-American number provisioning. Pricing is on the same per-minute basis ($0.09–$0.12/minute typical). Bland’s pitch is “we own the stack end-to-end, you just describe the agent.”

There are also adjacent products that matter as reference points: ElevenLabs Conversational AI (the TTS leader’s own voice-agent product), OpenAI Realtime API (the lower-level GPT-4o-mini-realtime endpoint that all of these vendors increasingly ride on top of), and Lindy’s Gaia (3.14 above — voice agents bundled into the Lindy platform).

The book has so far named only commercial vendors in this category, and that is a real omission, because the OSS voice-agent framework set is the substrate underneath. LiveKit Agents (livekit/agents, Apache-2.0, Python, ~10.3K stars) is the single most-credible OSS voice-agent framework in 2026. LiveKit itself runs the WebRTC infrastructure as OSS; LiveKit Agents is the framework that lets you plug an LLM (any provider) into a real-time room with sub-500ms latency. Self-host is genuinely turnkey via the LiveKit Cloud OSS stack or a self-managed deployment. Pipecat (pipecat-ai/pipecat, BSD-2-Clause, Python, ~11.6K stars) is Daily.co’s open-source voice-agent framework — same shape as LiveKit Agents (STT → LLM → TTS pipeline, real-time orchestration), less mature operationally but actively developed and more permissively licensed. Together these are the OSS substrate that makes the Vapi-versus-self-host decision a real one rather than a captive market: at scale, “buy Vapi at $0.05/minute” versus “self-host LiveKit Agents and pay only the underlying STT + LLM + TTS costs” lands at roughly the same total cost per minute for a well-tuned agent, with the LiveKit path landing at ~$0.04/minute in third-party costs (Deepgram Nova-3 plus Anthropic Sonnet plus ElevenLabs at standard rates) plus the LiveKit-operations burden. For a 20-person company at the prototype stage the buy decision still wins; the day voice volume justifies removing the per-minute platform tax, the OSS path is real and named.
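The buy-versus-self-host arithmetic in that comparison, written out; every rate below is this chapter's illustrative number, not a quoted price, and the ops figure is an explicit assumption:

```python
# Back-of-envelope for Vapi-buy versus LiveKit-Agents-self-host.
# Rates are the chapter's illustrative numbers, not quotes.
THIRD_PARTY = 0.04   # $/min: Deepgram Nova-3 + Sonnet + ElevenLabs, well tuned
VAPI_TAX    = 0.05   # $/min: Vapi platform fee on top of provider passthrough
OPS_BURDEN  = 400.0  # $/mo: assumed LiveKit-operations cost (a guess, not a quote)

def monthly_cost(minutes: int, self_host: bool) -> float:
    per_minute = THIRD_PARTY if self_host else THIRD_PARTY + VAPI_TAX
    return minutes * per_minute + (OPS_BURDEN if self_host else 0.0)

for minutes in (1_000, 10_000, 100_000):
    print(f"{minutes:>7} min/mo:  buy ${monthly_cost(minutes, False):>8,.0f}"
          f"   self-host ${monthly_cost(minutes, True):>8,.0f}")
```

At 1,000 minutes a month the platform tax is $50 and the ops burden dominates; at 100,000 minutes the tax is $5,000 a month and the OSS path pays for itself.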

Where it fits in a Hestiia-shaped stack

The Hestiia call here is borderline. The current sales agent is text/Slack and there is no urgent voice requirement. But the commercial pipeline includes BET callbacks, installer phone outreach, and lead-qualification calls that today consume human time. A voice agent that handles the first-touch “are you the right contact, when can we schedule a deeper conversation” call could save the sales team meaningful hours per week. The risk is brand — a poorly-tuned AI SDR is worse than no SDR, and Hestiia’s premium positioning (a French heater company selling to professional MoA/MoE/BET buyers) makes the brand cost of a bad call meaningful.

The right move is to defer until a specific use case is identified, then prototype on Vapi for two weeks before committing. The full stack (Vapi + Claude Sonnet 4.6 + Deepgram Nova-3 + ElevenLabs voice) lands around $0.12–$0.18/minute and a couple of engineer-days of integration. If the prototype works, scale within Vapi; if not, the sunk cost is bounded.

Verdict

Defer; budget for a Vapi prototype if and when an inbound-callback or first-touch-outbound use case gets prioritised. Skip Lindy Gaia (covered in 3.14) — same capability, less control.

Coding-Agent Reference Products

3.21 Coding-Agent Reference Products

What it is

This is not infrastructure Hestiia would buy for the sales-agent farm. It is the reference set every executive reader of this document already has opinions about — Devin, Cursor Agent, Replit Agent, Lovable, Copilot Workspace — because they are the products that defined what “an agent” looks like in the public consciousness in 2025 and 2026. A CTO doc that does not name them loses credibility instantly. The right framing is: “what does a polished, end-user-facing agent product look like, what does it cost, and what infrastructure does it run on?”

Alongside the commercial set sits a parallel open-source reference set — opencode, Aider, OpenAI’s Codex CLI, Cline, Continue, Goose, Plandex, OpenHands — which is what the engineering team is actually running on its own laptops at zero seat cost, and which is the honest baseline against which any $20-to-$500-a-month commercial agent has to justify itself. A masterclass that names only the commercial half misrepresents the price ceiling.

Leading vendors

Devin (Cognition, ~$2B valuation as of 2025) is the canonical autonomous-coding agent — chat with it like a junior developer, it spins up its own VM, plans the task, writes the code, runs the tests, opens the PR. Pricing: $500/month per team plus ACU (Agent Compute Unit) usage on top. The pricing is the news — Cognition’s positioning is explicitly “this replaces a junior engineer at one-tenth the cost,” and the $500 floor is meant to make that comparison legible.

Cursor Agent is the IDE-native counterpart — Cursor (the AI-first IDE) ships an Agent mode that does the same task pattern (plan, execute, iterate) but inside the editor where the developer is already working. Pricing: Cursor Pro $20/month; Cursor Business $40/seat/month; Agent usage is bundled into the standard plan with usage-based overage on the more expensive models.

Replit Agent is the in-browser-IDE counterpart, pitched at the “I want to build a full-stack app from a prompt” audience. Pricing: Replit Core $25/month; Agent tasks consume “checkpoints” out of the plan’s allotment. Notable architectural detail: Replit Agent runs on Mastra under the hood — which is itself an argument for Mastra’s production credibility (covered in Section A).

Lovable is the no-code product builder — describe a SaaS app in natural language, Lovable generates the React/Tailwind/Supabase scaffold, deploys it. Pricing is usage-tiered around generated-app complexity, with paid plans starting around $20/month.

GitHub Copilot Workspace is GitHub’s answer to the same brief — a planning-and-execution surface that lives next to a GitHub repo, drives PRs end-to-end. Pricing rolls into Copilot Business ($19/seat/month) and Enterprise tiers, with Workspace bundled rather than separately metered.

Open-source coding-agent CLIs (the engineering team’s actual default)

The commercial products above set the price ceiling and the public reference shape; the open-source CLIs set the floor. Every one of these is free at the seat level, runs against whatever model API key you bring, and is what individual engineers reach for when nobody is paying per chair. Listing them in the order an executive should know them:

opencode (anomalyco/opencode, MIT, TypeScript-and-Go terminal UI, ~151K stars on GitHub as of April 2026) is the most-starred OSS coding-agent product full stop — a provider-agnostic terminal-native CLI that speaks to 75+ model providers including local Ollama, and is the closest open-source analog to Anthropic’s closed-source Claude Code. The pitch is exactly that decoupling: the same agent loop, the same tool surface, but the model is an environment variable rather than a vendor commitment.

Aider (Aider-AI/aider, Apache-2.0, Python, ~44.1K stars) is the original — the OSS coding-agent CLI that predates the category having a name, still actively developed, still the reference for “Git-aware diff-and-commit agent in a terminal.” Aider is the proof point that the pattern works without a hosted runtime; it runs locally, edits files in place, makes commits, and treats your repo as the source of truth.

OpenAI Codex CLI (openai/codex, Apache-2.0, Rust, ~78.5K stars) is OpenAI’s official open-source twin to its hosted Codex product — the same agent loop, written by the vendor, shipped under a permissive license. The strategic read is that OpenAI itself does not believe the closed-source CLI is a defensible moat; the moat is the model.

Cline (cline/cline, Apache-2.0, TypeScript, ~61.1K stars) is the leading VS Code-resident agent — an extension rather than a standalone CLI, running the agent loop inside the editor a developer is already in. This is the closest OSS analog to Cursor Agent’s editor-native shape, minus the $20/seat.

Continue (continuedev/continue, Apache-2.0, TypeScript, ~32.9K stars) is the other major VS Code agent extension, with a slightly more “pluggable autocompletion plus chat” framing than Cline’s pure-agent posture. The two coexist; engineers tend to pick one and stay.

Goose (block/goose, Apache-2.0, Rust, ~43.5K stars) is Block (Square) Inc.’s open-source agent — the enterprise-leaning entry in the set, with SOC-2-aware framing and a corporate sponsor whose own production infra runs on it. Goose is the right answer when the procurement question is “is this thing safe to run inside a regulated company,” because Block already had to answer it for themselves.

Plandex (plandex-ai/plandex, MIT, Go, ~15.3K stars) is the plan-first OSS agent for multi-file long-horizon tasks — the open-source attempt at the Devin shape, where the agent decomposes a brief into a plan and works the plan across many files. Worth flagging that activity is slowing — the last push was 2025-10-03 — and the trajectory looks more like maintenance than momentum; treat it as a reference design rather than a production bet.

OpenHands (formerly OpenDevin, OpenHands/OpenHands, MIT, Python, ~72.3K stars) is the open-source Devin reference — the agent runs inside a sandboxed VM with full screen access, which makes it simultaneously a coding agent and a computer-use harness. This is the only entry on this list whose architectural shape matches Devin’s autonomous-VM model rather than the local-CLI model, which is precisely why it is the one to study if the question is “what does an OSS Devin actually look like.”

The verdict, dated April 2026: at zero seat cost, this set is the engineering team’s actual default — what individual engineers run on their own machines whether or not the company has bought anything. The existence of this set is the comparison anchor for the question “is Cursor at $20/seat or Devin at $500/month actually worth it,” and the honest answer is that it is, but only for the specific workflows where commercial polish (Cursor’s IDE integration) or autonomous long-horizon execution (Devin’s VM-and-PR loop) earns the per-seat cost. For Hestiia specifically, individual engineers should adopt opencode or Cline as their default — opencode for terminal-native, provider-agnostic work, Cline for editor-resident — and reserve Cursor and Devin spend for the workflows that demonstrably justify it. Claude Code is the closed-source counterpart on the Anthropic side; opencode is its closest provider-agnostic open-source analog, and the right way to think about the pair is that one is the polished single-vendor experience and the other is the portable insurance policy.

Where it fits in a Hestiia-shaped stack

This category is reference, not procurement. The Hestiia-internal angle: Cursor Agent at $20/seat for engineering productivity is rational individual-license spend; Devin at $500/month for the team is harder to justify until a specific use case (long-running maintenance tasks, dependency upgrades, test backfills) is identified; Copilot Workspace is reasonable if engineering is already on Copilot. None of these is the agent farm — they are tools the engineering team uses, not the runtime the sales agent runs on.

The architectural takeaway worth keeping is that Replit Agent runs on Mastra, which provides one external data point on Mastra’s production-credibility for agent runtimes; and Devin’s existence sets the price ceiling for “autonomous coding agent” at $500/month-plus, which is the comparison number every executive has internalised.

Verdict

Not infrastructure to buy for the agent farm. Cursor Pro at $20/seat for individual engineering productivity is fine; defer the rest. Use the category as the reference set when explaining to non-technical stakeholders what an agent is.

Research-Agent-as-a-Service

3.22 Research-Agent-as-a-Service

What it is

A specific product category that emerged in 2025–2026: hosted multi-step research agents that take a question, browse the web, synthesise sources, and return cited answers. These are the “buy not build” path for any feature where the agent’s job is “go find out about X.” Hestiia’s current sales agent absolutely does competitive research and lead enrichment — today this is hand-rolled web search; the question is whether one of the hosted research APIs is a better answer than building it.

Leading vendors

OpenAI Deep Research API is the API-ified version of ChatGPT’s Deep Research feature — submit a query, the agent runs a multi-turn web-research session, returns a structured report with citations. Pricing is on the order of $10 per 1K tool calls at the API level, with per-token charges for the underlying o1/o3 model on top. The pitch is depth — these are 5-to-30-minute sessions that produce report-quality output, not 5-second search snippets.

Perplexity Sonar API is the lower-cost, faster alternative — Perplexity-trained models with built-in web search, returning answers with inline citations. Pricing is roughly $5 per 1K queries for Sonar, with Sonar Pro and Sonar Reasoning at higher tiers. Latency is in the seconds range, not minutes; output is concise rather than report-length.

You.com API is the legacy player in the AI-search space — a research API with citation-first output, similar shape to Perplexity. Pricing is comparable. Reasonable backup if Perplexity availability or pricing shifts.

Where it fits in a Hestiia-shaped stack

For Hestiia’s lead-enrichment use case — “tell me about this MoA, what projects have they done, who’s the technical decision-maker, what is their RE2020 posture” — Sonar at $5 per 1K queries is almost certainly cheaper than maintaining a custom search-and-summarise loop. The integration is one HTTP call; the maintenance burden is zero. Wrap it as an MCP server (research-managed) and the sales agent gets enrichment for free.
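The one-HTTP-call claim, concretely. Perplexity's API is OpenAI-compatible, so the standard client pointed at their base URL is the whole integration; the prompt is illustrative:

```python
# Lead enrichment via Perplexity Sonar: one OpenAI-compatible call.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["PERPLEXITY_API_KEY"],
                base_url="https://api.perplexity.ai")

resp = client.chat.completions.create(
    model="sonar",
    messages=[{"role": "user",
               "content": "Who is the technical decision-maker at this MoA, "
                          "and what RE2020-relevant projects have they delivered?"}],
)
print(resp.choices[0].message.content)  # concise, citation-backed answer
```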

OpenAI Deep Research is overkill for inline enrichment but the right answer for a once-a-week “deep dive on this strategic account” workflow — the depth justifies the cost when the output goes into a sales meeting.

Verdict

Adopt Perplexity Sonar at $5 per 1K queries as the default lead-enrichment backend; reserve OpenAI Deep Research for high-stakes account briefings. Wrap both as MCP servers.

Prompt Management

3.23 Prompt Management

What it is

The category of tools whose first job is “let non-engineers edit prompts safely” — not tracing, not eval, not gateway, but the editor-and-versioning surface for the prompt as a first-class artifact. The reason this matters at Hestiia specifically: the sales-agent prompts encode commercial knowledge that lives in the heads of jens-ptz and the commercial team, not in engineering. Forcing every prompt change through a Git PR is friction; allowing direct edits to a prompts.py file in production is reckless. A prompt management tool sits between those two failure modes.

Leading vendors

PromptLayer is the prompt-management-first tool — UI built for non-technical PMs to edit, version, and roll back prompts, with eval and tracing layered on. Pricing: Free 5K requests, 7-day retention; Pro $50/seat/month (unlimited retention, transaction-based overage on agent runs and eval cells); Enterprise custom with SOC2 Type 2, HIPAA, GDPR, CCPA. The sweet-spot use case is exactly Hestiia’s: the sales team owns the prompts, engineering owns the runtime, and PromptLayer is the boundary.

Humanloop is the same category at the enterprise tier — joined Anthropic in 2025 and is now Anthropic-owned, with the platform repositioned as the prompt-management-and-evals layer for Claude-centric enterprises. Pricing: free trial (10K logs/month, 50 eval runs, 2 members); production tier is Enterprise custom only — VPC deployment, SAML, SLA, no mid-market plan published. Right answer for regulated enterprises; over-spec for Hestiia’s current needs, but the Anthropic ownership is worth flagging — if Hestiia goes deep on Anthropic, a Humanloop conversation eventually lands on the table.

Promptfoo (covered also in 3.11) is the OSS prompt-testing harness — not a prompt-editing UI but the CI-friendly counterpart. The natural pairing is “PromptLayer for editing, Promptfoo for testing in PR.”

The other tools in this neighbourhood — Langfuse’s prompt-management module, Braintrust’s prompt UI, the prompt-management features inside any framework — are real but secondary to the dedicated tools above.

Where it fits in a Hestiia-shaped stack

The honest answer is that today, a prompts/ directory in the agent repo with PR-based edits is fine. The number of prompts is small, the rate of change is moderate, and forcing a Git PR is a feature rather than a bug — it gives engineering visibility into what the sales team is changing. The day this breaks is the day jens-ptz is iterating on the sales pitch three times a week and the engineering team becomes the bottleneck. That is the trigger for adopting PromptLayer or Langfuse’s prompt module as the editing surface, with Promptfoo in CI as the safety net.
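When that trigger fires, the shape of the Langfuse path is that the runtime fetches the prompt by name at call time while the sales team edits and rolls back versions in the Langfuse UI. A sketch assuming the Langfuse Python SDK's prompt-management surface; the prompt name and variable are invented for illustration:

```python
# Sketch: prompt fetched from Langfuse at runtime instead of a prompts/ constant.
# "sales-followup" and deal_stage are illustrative names.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment

prompt = langfuse.get_prompt("sales-followup")          # current published version
system_prompt = prompt.compile(deal_stage="proposal")   # fills {{deal_stage}}

# The sales team edits and rolls back in the Langfuse UI; engineering keeps
# Promptfoo in CI as the safety net on every change.
```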

Verdict

Defer; revisit when prompt edit velocity exceeds engineering’s capacity to PR-review. Default to Langfuse’s built-in prompt management (already in stack from 3.11) before paying for PromptLayer. Reserve Humanloop for the day Hestiia goes Anthropic-Enterprise-and-regulated.

Personal-Agent Runtimes

3.24 Personal-Agent Runtimes

What it is

Every other chapter in Section B has assumed a single shape for “an agent”: a server-side process, fronted by webhooks or a queue, ephemeral per execution, with the deal or the ticket or the user-id as the addressable unit, writing back through structured APIs. CLAWD-SALES-AGENT is exactly that shape — a webhook fires from Pipedrive, an agent run boots, it reads the deal, drafts a structured update, exits. The agent has no continuous identity; it has 1,000 deals a day, five events per deal per day, and a change_source anti-loop guard so it does not echo its own writes.
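For contrast with what follows, that server-side shape as code: a hypothetical FastAPI webhook handler in which change_source is the guard named above; the field names and the run_agent helper are invented for illustration:

```python
# Hypothetical sketch of the server-side shape: ephemeral per-webhook runs,
# deal-keyed, with the change_source anti-loop guard. Field names illustrative.
from fastapi import FastAPI, Request

app = FastAPI()
AGENT_SOURCE = "clawd-sales-agent"

async def run_agent(deal_id: int, event: dict) -> None: ...  # boots, acts, exits (stub)

@app.post("/webhooks/pipedrive")
async def on_deal_event(request: Request) -> dict:
    event = await request.json()
    if event.get("change_source") == AGENT_SOURCE:
        return {"status": "ignored"}      # never echo the agent's own writes
    await run_agent(deal_id=event["deal_id"], event=event)  # the deal is the unit
    return {"status": "processed"}
```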

There is a second shape that the book has so far ignored, and as of 2026 it is the most-starred shape on GitHub. A personal-agent runtime is a long-lived daemon process running on a single user’s machine (or a single user’s container), holding a stable identity across sessions, fronted by chat-style messaging interfaces rather than webhooks, with skills loaded from a local workspace. The mental model is the assistant that lives on your laptop and answers your DMs — one user, one identity, persistent context, conversational front door. It is not a server-side agent farm wearing a different hat; it is a structurally different system, with different concurrency assumptions, different durability assumptions, and different addressable units. The category matters at CTO level for two reasons: it is where the bulk of the open-source momentum has gone in late 2025 and 2026, and it is the right shape for any future MyEko Pro end-user assistant — the in-app household concierge the book bookmarks in §3.15 as the eventual trigger for Composio adoption. It is the wrong shape, and importantly so, for the sales-agent farm.

Leading vendors

OpenClaw (openclaw/openclaw, MIT, TypeScript, 365.8K stars and 75.0K forks as of 2026-04-28) is the category-defining project and, as of writing, the most-starred AI-agent project on GitHub by a wide margin — past AutoGPT, past OpenHands, past every commercial CLI’s open-source twin. It was released in November 2025 by Peter Steinberger, the ex-PSPDFKit founder, under the original name “Clawdbot.” Anthropic filed a trademark complaint over the “Clawd” string and on 2026-01-27 the project was renamed to “Moltbot”; three days later it became “OpenClaw” because Steinberger said in his rename announcement that “Moltbot never quite rolled off the tongue.” On 2026-02-14 Steinberger announced he was joining OpenAI, and stewardship passed to a non-profit foundation that now governs the project. (Verified against Wikipedia, the GitHub API, CNBC, TrendingTopics, and Laravel News; the rename chronology is the most-checked sequence in the entire chapter.)

What OpenClaw actually is, mechanically, is a daemon process for macOS, Linux, and Windows-via-WSL2. Install is a single line — npm install -g openclaw && openclaw onboard --install-daemon — and once installed it runs as a background service that hosts a stable identity and answers messages. The front door is not a web UI; it is whatever chat product the user already lives in. Out of the box it speaks Signal, Telegram, Discord, WhatsApp, and Slack. Everything else about the runtime follows from that: there is no session, only a continuous conversation; there is no per-request authentication, only the user’s existing chat-app identity; there is no UI to maintain, because the UI is somebody else’s app. The skill system is directories containing a SKILL.md file plus whatever scripts and assets the skill needs, and skills can be bundled with the daemon, globally installed at ~/.openclaw/skills/, or workspace-local. Critically, the skill format is explicitly compatible with Anthropic’s Agent Skills standard from agentskills.io — same SKILL.md shape, same frontmatter conventions, deliberate cross-compatibility, so a skill written for Claude Code drops into OpenClaw and vice versa. On every wake — every incoming message — the daemon reads three Markdown files into the system prompt: AGENTS.md for operational instructions, SOUL.md for identity, personality, and style, and TOOLS.md for the tool inventory. The LLM is supplied externally via API key; OpenClaw ships no inference, has no built-in model, and is loudly model-agnostic in its docs (Anthropic, OpenAI, local Ollama, all routed through a thin provider layer). It is an orchestration shell, not a model — which is exactly the right framing, because it means OpenClaw does not have to win the model war to win the runtime war.
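The three-file wake mechanic is simple enough to show. An illustrative reconstruction of the convention, not OpenClaw's actual source; call_llm is a stub for the provider layer:

```python
# Illustrative reconstruction of the personal-agent wake convention:
# on every incoming message, three Markdown files become the system prompt.
from pathlib import Path

WORKSPACE = Path.home() / ".openclaw"

def call_llm(system: str, user: str) -> str: ...  # provider behind an env var (stub)

def build_system_prompt() -> str:
    parts = []
    for name in ("AGENTS.md", "SOUL.md", "TOOLS.md"):
        f = WORKSPACE / name
        if f.exists():
            parts.append(f.read_text())   # operational rules + identity + tools
    return "\n\n".join(parts)

def on_message(text: str) -> str:
    return call_llm(system=build_system_prompt(), user=text)  # re-read every wake
```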

Claworc (gluk-w/claworc, Apache-2.0, Go, 222 stars, 29 forks, created 2026-02-06) is the community-built multi-instance manager around OpenClaw. It is a single-maintainer project, modest by star count, but it is the production-shaped answer to “I want my whole household, or my whole small team, to each have their own OpenClaw without each person manually managing a daemon.” Claworc wraps each OpenClaw instance in its own Docker container, exposes a web UI for create/stop/logs, and proxies traffic through a single authenticated entrypoint with SSH-over-ED25519 between the control plane and the per-user instances. Production deployment is Kubernetes-shaped; local-or-single-server is plain Docker. The multi-tenancy is structurally important to understand: Claworc gives you many users each running one OpenClaw, not one OpenClaw handling many concurrent business events. A six-person family with a Claworc box in the basement gives every member their own private personal agent; a sales team of fifty with a Claworc cluster gives every salesperson their own assistant. It does not give you a single shared agent processing 1,000 deals/day on behalf of the company. The shape stays the same — one user, one identity, persistent context — and Claworc just makes it easier to run N of them.

NemoClaw (NVIDIA/NemoClaw, NVIDIA-organization-owned, 19.9K stars, created 2026-03-15, alpha) is NVIDIA’s official enterprise-readiness layer over OpenClaw. It was announced at GTC on 2026-03-16, where Jensen Huang described OpenClaw as “the fastest-growing open-source project in history.” NemoClaw runs OpenClaw inside NVIDIA’s OpenShell sandbox, supplies NVIDIA Nemotron as the local-model option for environments that need to keep inference on-prem, ships NeMo Guardrails as the safety layer, and includes a privacy router that lets cloud frontier-model calls go to Anthropic or OpenAI for the cases where local models are not strong enough. The pitch is that NemoClaw is what an enterprise IT team can actually deploy: bounded, sandboxed, with a defensible answer to “what stops this thing from leaking.” It is alpha, NVIDIA-owned, and likely to stay close to NVIDIA’s hardware orbit; whether it becomes the dominant enterprise distribution or remains a reference design is a 2026-second-half question.

AutoGPT (Significant-Gravitas/AutoGPT, MIT/Polyform Shield split, Python, 184K stars) is the original autonomous-agent project — the one that defined the public idea of “an agent” in 2023 and held the most-starred-AI-agent crown until OpenClaw passed it in early 2026. It is still alive, but pivoted: the canonical AutoGPT today is the AutoGPT Platform, available as both a hosted cloud product and a self-hosted Docker stack. The personal-agent shape AutoGPT helped invent has fragmented across the projects that succeeded it, and the modern AutoGPT looks more like a low-code agent builder than a daemon. Worth naming for category completeness; not Hestiia’s pick.

Open Interpreter (OpenInterpreter/open-interpreter, AGPL-3.0, Python, 63.3K stars) is the desktop-agent reference. The project’s calling card is the --os flag, which is its computer-use mode — same idea as §3.19’s Anthropic Computer Use, but bolted onto a local-first agent runtime that has been iterating on the surface since late 2023. As of 2026 it is the most-mature open-source computer-use surface on the personal-agent side, and the right benchmark to compare any local computer-use story against. Like OpenClaw, it lives in the personal-agent shape — one user, local daemon — not the server-side shape, and the AGPL licence is a real constraint for any Hestiia-internal redistribution path.

Aeon (the Claude-Code-on-cron pattern, a project by Aaron Mars) is the engineer’s personal-agent runtime — no daemon, no UI, no messaging interface. The pattern is claude -p running on GitHub Actions on a cron schedule, with skills and prompts checked into a repo, and outputs landing as commits, issues, or Slack notifications. It is the OSS background-intelligence shape for engineers who already live in claude -p: the agent does not wait for a message, it wakes up on schedule, reads the current state of the world from whatever sources you give it, decides whether to do anything, and goes back to sleep. Aeon is real but smaller than the projects above and the public footprint is thinner; the architectural shape is what matters more than the specific repo. It belongs in the same conceptual category as OpenClaw — personal, identity-rooted, skill-driven — but with a fundamentally different front door (cron, not chat) and a fundamentally different state model (Git, not a daemon’s in-memory context plus SOUL.md).

The Anthropic Skills + claude -p pattern itself is, structurally, Anthropic’s own personal-agent runtime. It is just not packaged as a product — Anthropic distributes it as a developer tool. CLAWD-SALES-AGENT is built on top of it, which is precisely why this chapter has to be careful about the shape distinction: the building blocks OpenClaw and CLAWD-SALES-AGENT use are the same (Anthropic Skills, the SKILL.md shape, model calls through the Anthropic API), but the runtime topology is different. claude -p running ephemerally inside a webhook handler is the server-side shape; claude -p running inside a long-lived daemon fronted by Signal is the personal-agent shape. The same primitive, two incompatible deployments.

SOUL.md (aaronjmars/soul.md, MIT, ~380 stars) is the community standard for the identity, personality, and style file that personal-agent runtimes load on every wake. It is cross-compatible with Claude Code, OpenClaw, and Aeon — the same Markdown file describing who the agent is, what its tone is, what it cares about, what it refuses to do, and what its long-term memory of the user looks like. SOUL.md is small as a project (a spec and some example files, not a runtime) but it is the missing piece that lets the personal-agent shape work at all: without a stable identity that survives restarts, a long-lived daemon is just a chat client with extra steps. Worth naming because SOUL.md plus AGENTS.md plus TOOLS.md is the de facto three-file convention emerging across this entire category.

Where it fits in a Hestiia-shaped stack

The single most important thing for a Hestiia CTO to internalise about this chapter is that personal-agent runtimes and server-side agent farms are different shapes, and OpenClaw and CLAWD-SALES-AGENT are siblings sharing Anthropic Skills DNA, not parent and child. They overlap on three things and only three things: both have a skill system shaped like Anthropic’s SKILL.md, both are Claude-loyal in practice, and both treat identity as a Markdown file. Everything else is incompatible.

OpenClaw assumes one user, persistent identity, messaging-fronted, local-first daemon, skills loaded from a workspace. CLAWD-SALES-AGENT assumes zero users in the loop, ephemeral per-webhook execution, server-side, deal-centric (the deal is the addressable unit, not a person), and structured Pipedrive writes rather than chat replies. The concurrency pattern is “1,000 deals/day, 5 events/deal/day, anti-loop on change_source.” OpenClaw has no anti-loop because it has no notion of writing back to the same surface that fired the event; it has no notion of deal_id as an addressable concurrency key, because its addressable unit is the user; and it has no durable workflow primitive (the daemon dies, state is lost beyond what is persisted in SOUL.md and the chat history of whatever messaging app fronts it). Trying to retrofit OpenClaw onto the CLAWD-SALES-AGENT use case would mean rewriting the OpenClaw daemon to be webhook-fronted, durable across restarts, deal-keyed for concurrency, and capable of structured writes back to Pipedrive without echo. That is not a fork; it is a different system that happens to read the same SKILL.md files.

The right Hestiia move, today, is to skip OpenClaw for the sales-agent farm. CLAWD-SALES-AGENT is correctly the server-side shape, correctly built on claude -p as a webhook-driven runtime, correctly paired with the durable execution and observability picks from §3.10 and §3.11, and correctly addressable by deal_id rather than by user. None of the mechanisms personal-agent runtimes optimise for — persistent identity, conversational front door, long-lived daemon — earn their keep in that workload.

However, the moment Hestiia ships a personal-agent product to MyEko Pro end-users — the in-app household assistant that §3.15 already bookmarks as the future trigger for Composio adoption, the “voice agent for the heater” use case where every household has its own agent that knows its own people, its own routines, its own thermal preferences — OpenClaw becomes a real candidate runtime. The shape match is exact: one identity per household, persistent across sessions, fronted by whatever messaging surface the customer prefers (the MyEko app’s chat, or Telegram, or WhatsApp), with skills loaded per household and Composio handling the per-end-user OAuth into Google Calendar, Home Assistant, and the broader smart-home universe. NemoClaw becomes interesting at that point as the sandboxed-and-guardrailed distribution; Claworc becomes interesting as the multi-tenant manager for “many households, each with one OpenClaw” if Hestiia wants to host these centrally rather than running one per device.

The honest read on Aeon is that for engineers who already live in claude -p it is the lowest-friction personal-agent pattern available — schedule a job, point it at a repo of skills, let it wake up on cron and act. For Hestiia internally, that pattern would land naturally as the substrate for individual engineering productivity agents (the “every morning, scan yesterday’s PRs and summarise blockers in #engineering” kind of workflow). Worth a half-day spike per engineer who wants one; not a stack-level commitment.

Verdict

OpenClaw is the most-starred OSS agent project on the planet and the category-defining personal-agent runtime, and it deserves naming on those grounds alone. It is also not the right runtime for CLAWD-SALES-AGENT, because the shape does not fit — sales-agent farms are server-side, deal-keyed, ephemeral, and webhook-fronted, and personal-agent runtimes are none of those things. Skip OpenClaw for the sales-agent farm; bookmark it, with NemoClaw and Claworc, as the recommended runtime stack for the MyEko Pro household-assistant product when that becomes a roadmap line item. Aeon is the right pattern for individual engineer-productivity agents the day someone has the bandwidth to build one. SOUL.md belongs in the team’s vocabulary regardless — once Hestiia ships any agent that needs a stable identity (the household assistant, a customer-support agent, an internal personal assistant for the CEO), the three-file convention of AGENTS.md plus SOUL.md plus TOOLS.md is the lowest-friction path from prototype to production.


Part IV — Decision Frameworks

Parts II and III are descriptive: they tell you what exists and how the categories relate. Part IV is prescriptive: it tells you how to decide. The four chapters here translate the architectural axes into decision questions, catalogue the most expensive ways production teams have been wrong in 2024–2026, list the early-warning signs that a stack choice will not survive a year, and project where the market is heading by 2028 so that today’s decisions are robust to tomorrow’s evidence.

Restating the Five Axes as Decision Questions

4.1 Restating the Five Axes as Decision Questions

The five-axis model in §1.4 is a vocabulary, not a procedure. Translated to decisions, the axes look like this.

1. Runtime: what code do we own that the framework cannot give us back?

Every runtime makes you write some code. Mastra wants you to write Workflow and Step definitions. LangGraph wants StateGraph nodes and conditional edges. PydanticAI wants Agent classes and tool functions. The Anthropic Agent SDK wants Skills and tool handlers. The decision question is not “which syntax do we like” but “if this framework dies in nine months, how much of what we wrote walks away with it?”

The honest answer for most frameworks in 2026 is most of the workflow code, none of the prompts, none of the agent decomposition. That is the right ratio. If a framework wants you to write more than that — to embed business logic in its DSL, to express tools through its abstraction, to bind eval criteria to its primitives — you are buying a platform under a library’s marketing. Decide that with eyes open. The watch-out is the framework that quietly graduates from “library” to “platform” between major versions; LangChain did exactly this in 2023–2024, and CrewAI is mid-transition in 2026.

2. Durability: where does our state live, and what breaks if that store has a 30-minute outage?

The framework you pick will have opinions about durability. Mastra defaults to its own storage adapters but supports Postgres. LangGraph offers a Postgres checkpointer with a default JSON serializer. PydanticAI has no opinion — you wire your own. Cloudflare Agents bakes state into Durable Objects with embedded SQLite. The Anthropic Managed Agents service holds state on Anthropic’s infrastructure and charges by session-hour.

The decision question reduces to two: where does the state physically live, and who owns the SLA on that store? Self-hosted Postgres is the most defensible answer because every operator on the team understands Postgres. Vendor-managed durability (Managed Agents, AgentCore, Cloudflare Durable Objects) trades operational simplicity for vendor SLA exposure — a regional outage at the vendor is your outage too, and your runbook cannot fix it. Hybrid options like DBOS (durability as a Postgres extension) and Inngest (durability as an event log over your existing infra) split the difference. The watch-out: discover the durability layer’s failure modes during the bake-off, not in production. Pull the network plug on the durability backend and watch what the agent does.

3. Observability and eval: can on-call answer ‘what did this agent do at 03:14 last Tuesday?’ in under 90 seconds?

This is the single most under-budgeted axis in 2026 production agent farms. Teams ship the agent, ship the logs, decide observability is “logs plus the LangSmith dashboard” and discover at month four that triaging a customer complaint requires correlating IDs across three systems by hand.

The decision question is operational. Sit a senior engineer down with a synthetic incident — an agent run that produced the wrong Pipedrive note — and time how long it takes them to find the run, see every model call, see every tool result, and cite the root-cause prompt or tool failure. If the answer is more than ninety seconds, the observability stack is wrong. The watch-out: vendor pricing on traces escalates non-linearly. LangSmith’s published price is roughly $39 per seat per month plus per-trace overage; the bill at 50,000 runs per day with 40 spans per run is materially different from the bill at 5,000 runs per day with 5 spans per run. Load-test the trace pipeline at 5x your expected volume before signing.

4. Gateway: do we need to swap models tomorrow without redeploying?

For most internal agent farms in 2026, the answer is no. You picked Claude (or GPT-5, or Gemini 2.5) because that model is best at the routing decision your agent makes. Swapping is a quarterly decision, not a daily one. A model gateway adds a real network hop, a real new failure mode, and real new spend. If you have one team, one model, one provider, you do not need a gateway today.

You will need a gateway later if any of the following become true: you must support multiple models per agent for cost reasons, you have multiple teams with separate billing, you need cross-provider fallback for SLA, or you need to A/B-test model choices against an eval harness. At that point LiteLLM (free OSS, $200/month managed) or Portkey (Pro $99 per user per month) are the obvious candidates. The watch-out: a gateway is not free in latency terms. The first minute you spend debugging a 200ms tail-latency increase that the gateway introduced is the moment the gateway has to start earning its keep. (A LiteLLM sketch follows the fifth axis below.)

5. Integration: which of our integrations has an MCP server, and how do we cover the rest?

Integrations are no longer a framework concern in 2026. They are an MCP concern. The decision question is purely inventory: list the systems your agent must talk to (Pipedrive, Slack, Gmail, Google Calendar, the company wiki, the customer database). For each, identify whether there is a first-party MCP server, a community MCP server, a managed MCP server (Composio, Arcade, Pipedream Connect host these for you at $29–$229/month), or no MCP server at all. The systems with no MCP server fall into one of three buckets: write your own with FastMCP (a few hours of work for an internal API), expose them via a managed runtime (Composio for breadth), or — the genuine last resort — Computer Use or a headless browser via Browserbase.

The watch-out: every framework today claims to be MCP-native. Most are MCP-compatible-with-caveats, lagging the spec by one to three months. Pin yourself to a framework whose MCP client implementation tracks the 2025-11-25 spec, especially CIMD, Resource Indicators, and Tasks. Today that means the Anthropic SDK, the OpenAI Agents SDK, mcp-use, and current LangChain.
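The swap-without-redeploy property from the fourth axis, concretely: with LiteLLM the provider-qualified model name is configuration rather than code, so it can live in an environment variable. The model ids here are illustrative:

```python
# LiteLLM sketch: one call shape across providers; the model id is config.
import os
from litellm import completion

MODEL = os.environ.get("AGENT_MODEL", "anthropic/claude-sonnet-4-5")  # illustrative

resp = completion(
    model=MODEL,  # swap to e.g. "openai/gpt-5" via the env var, not a redeploy
    messages=[{"role": "user", "content": "Summarise the last call on deal 4711."}],
)
print(resp.choices[0].message.content)
```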

The Disaster Scenario Catalog

4.2 The Disaster Scenario Catalog

Across the twenty parallel research dives that informed this book, six failure modes appeared with enough regularity that they constitute a pattern catalogue. Every framework you evaluate will surface at least one of these by month twelve. Knowing which one is half the work.

Pattern 1: Framework EOL or pivot. A framework that was perfect at adoption decides — usually around the Series B fundraising cycle — that the OSS-friendly path is no longer the business priority. The “Mastra Cloud” pattern: a managed offering ships, OSS storage adapters quietly deprecate, your self-hosted path requires maintaining a fork. The CrewAI pattern: the framework’s centre of gravity drifts from individual developers to enterprise-procurement deals, and the OSS roadmap slows. The LangChain pattern: the company pivots from “framework for everything” to “we are really an agent platform now,” twice. The mitigation is to pick frameworks whose abstractions are negative — frameworks you can leave by deleting code, not by reimplementing a translation layer.

Pattern 2: Pricing surprise. Trace overages are the most common (LangSmith at $2.50 per thousand traces over the included tier; spans-per-run is the multiplier most teams underestimate). Session-hour billing is the second most common (Anthropic Managed Agents at $0.08 per session-hour; idle agents waiting on tool responses are billable). Per-step-execution pricing is the third (Inngest’s pricing model multiplies cleanly with deep agent loops). Bedrock’s OpenSearch Serverless minimum (2 OCU at $0.24 per OCU per hour, around $350 per month) bites teams that thought they were “AWS-native and free.” The mitigation: at week eight, multiply your three-month bill by your projected scale-up factor and stare at the number.

Pattern 3: Vendor lock-in via hosted features. The OpenAI Agents SDK with file_search ties your IP — your prompts, your reference data, your structured outputs — to OpenAI’s vector store. Bedrock Agents’ action groups bind your tool definitions to OpenAPI-plus-Lambda. The Anthropic Managed Agents runtime assumes Anthropic-hosted execution. None of these are wrong choices in isolation; they are wrong choices if you adopted them without understanding that the migration cost on month twelve is six to ten weeks. The mitigation: every quarter, ask “if our model provider became unacceptable for any reason, how long would it take to leave?” If the answer is more than two weeks, the lock-in is real.

Pattern 4: Abstraction leak. The LangGraph BaseMessage serialization breakage that arrives the week Anthropic ships a new content-block type. The Mastra workflow that resumes twice after a Node restart because the suspend mechanism has a subtle interaction with the retry policy. The Inngest step-execution that fails after max retries with an error buried three layers deep in their dashboard. Every framework leaks somewhere. The mitigation: read the framework’s GitHub issues before committing — the open bugs that have been open for three months tell you exactly where the abstraction leaks live.

Pattern 5: Performance cliff at 100–500 concurrent runs. Several frameworks share a remarkably consistent ceiling. LangGraph’s Postgres checkpointer becomes the bottleneck around 100 concurrent graph executions with default JSON serialization. Mastra’s eval/tracing pipeline drops spans around 200 concurrent runs. Bedrock Agents defaults to 50 concurrent invocations per region. The OpenAI Agents SDK hits undocumented org-level rate limits around 500 concurrent runs with hosted tools. The mitigation: load-test at 5x your expected production volume during the bake-off, not after launch (a minimal harness follows this catalogue).

Pattern 6: Maintainer departure. A framework with one or two load-bearing maintainers is one resignation away from unmaintained. Atomic Agents, Burr, and several minor frameworks fall into this risk class. Even larger frameworks have specific subsystems with bus factor of one — the runtime person at Mastra, the durability person at Inngest. The mitigation: read commit history before adopting. Frameworks with thirty contributors in the last quarter survive a single departure. Frameworks with three do not.
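Pattern 5's mitigation does not need tooling. A minimal asyncio harness for the 5x load test; the endpoint and payload are invented for illustration:

```python
# Minimal bake-off load test: fire N concurrent agent runs, report failures
# and tail latency. Endpoint and payload are illustrative.
import asyncio
import time
import httpx

CONCURRENCY = 500  # 5x an expected 100 concurrent production runs

async def one_run(client: httpx.AsyncClient, i: int) -> float:
    start = time.perf_counter()
    r = await client.post("https://agents.internal/run",
                          json={"deal_id": i}, timeout=120)
    r.raise_for_status()
    return time.perf_counter() - start

async def main() -> None:
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(one_run(client, i) for i in range(CONCURRENCY)),
                                       return_exceptions=True)
    errors = [r for r in results if isinstance(r, Exception)]
    times = sorted(r for r in results if isinstance(r, float))
    p95 = times[int(len(times) * 0.95)] if times else float("nan")
    print(f"{len(errors)} failures / {CONCURRENCY} runs, p95 latency {p95:.1f}s")

asyncio.run(main())
```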

Early Warning Signs (Days 1–90)

4.3 Early Warning Signs (Days 1–90)

Regardless of which framework you picked, the following signals in the first ninety days tell you the bet was wrong. Watch all of them. Two or more is a red flag. Four or more is an exit signal.

Week 2–4: the “fighting the framework” count. Track the times your senior engineer says “we would just write this in raw code if we were not using the framework.” More than once a week means the abstraction is fighting your problem.

Week 4–6: documentation drift. Open the framework’s docs and try to do something the docs claim is supported. If you find three or more places where the docs are out of date, the framework is shipping faster than its surface area can handle. This always gets worse, never better.

Week 4–8: the escape hatch test. Ask your team: “if we had to leave this framework in six months, what would it cost?” If the answer is “we would basically rewrite,” you bought a platform, not a library.

Week 6–8: the observability gap. Can on-call, given a customer complaint timestamp, find the agent run, see every tool call, see every model response, and determine root cause in under ten minutes? If the answer requires opening three dashboards or correlating IDs by hand, the observability is not ready. Ten minutes is the ceiling; the ninety-second target from §4.1 is where production needs to land.

Week 8–10: the rate-limit ambush. Run a load test at 5x your expected production volume. The actual limits live in your inference provider’s account dashboard, your framework vendor’s pricing page, and undocumented per-org caps you discover by hitting them. Discovering them at 5x production is cheap. Discovering them at 1x production is an outage.

Week 8–12: the model-swap drill. Pick the agent that costs you the most in inference. Try to swap it from Claude to GPT-5 (or vice versa). If this takes more than two days, you are locked in. If it cannot be done at all, you bought a hosted-tool dependency, not a framework.

Week 10–12: framework velocity vs your velocity. Look at the framework’s GitHub. How many commits this week? How many maintainers? Is the lead maintainer still committing? If the bus factor is one, your framework is one resignation away from unmaintained.

Week 12: the bill. Take the month-three invoice (inference, framework, observability, supporting infra) and multiply by your projected scale-up factor. If the answer is “we cannot afford this at scale,” your unit economics are wrong now and they will not get less wrong with volume. The fix is structural — a different model, a different runtime, a different observability tier — not a tighter prompt.

The meta-signal. Ask each engineer privately, “would you choose this framework again?” If by week eight the answer from your two best engineers is “no, we would choose X” — believe them. They have built the most code and they know where the bodies are. The cost of switching at week eight is two weeks. At month twelve, it is a quarter. The decision to switch is almost always correct; the decision to wait is almost always expensive.

4.4 The 2027–2028 Trajectory

Decisions made in April 2026 should be robust to where the market is going, not just where it is. Four convergence theses are competing for the future of agent infrastructure.

Thesis (a): a few dominant frameworks win. LangGraph plus Mastra plus maybe CrewAI consolidate; everyone else fades.

Thesis (b): hyperscaler-hosted standardization wins. Bedrock AgentCore, Azure Foundry Agent Service, and Vertex AI Agent Builder eat the orchestration layer.

Thesis (c): MCP becomes the protocol layer and frameworks become irrelevant. The integration layer commoditizes; frameworks become thin wrappers.

Thesis (d): durable execution wins and agent code shrinks. The runtime is the product; “agent code” reduces to a few hundred lines on top of a durable substrate.

The evidence in 2026 strongly favours (c) and (d) compounding, with (a) and (b) winning specific segments rather than the whole market. The MCP signals are conclusive: monthly SDK downloads went from 100K at launch in late 2024 to roughly 97M by March 2026, with over 10,000 active public servers; Anthropic donated MCP to a Linux Foundation directed fund in December 2025 with Block, OpenAI, Google, Microsoft, AWS, and Cloudflare as supporters. When the four hyperscalers and the two leading model providers all back the same protocol, that protocol stops being optional.

Durable execution evidence is similarly conclusive. Temporal raised $300M at a $5B valuation in February 2026 with 1.86 trillion AI-native workflow actions on its cloud. Inngest shipped Temporal-compatible workflows in February 2026; Trigger.dev raised $16M Series A explicitly positioning as durable agents; Cloudflare’s Project Think added durable execution with fibers to the Agents SDK in April 2026. The runtime is genuinely becoming the product.

Three concrete shifts follow.

Shift 1: MCP changes the integration economics. By 2027, every major B2B SaaS will ship a first-party MCP server. The “we have 200 tool integrations” value proposition that frameworks like LangChain and CrewAI built their first two years on disappears. The remaining framework value is orchestration semantics, durable state, and eval integration. Frameworks without those three become wrappers around MCP and get squeezed.

Shift 2: The compute-provider-runs-the-agent vs agent-code-runs-anywhere split. Cloudflare Agents, Anthropic Managed Agents, Bedrock AgentCore, and OpenAI Agents-as-a-Service represent one architecture: the provider owns the runtime. Mastra, LangGraph, PydanticAI, CrewAI, and Strands represent the other: you own the runtime. Both win, on different segments. The losing position is the middle — frameworks that require you to run their infra but do not run it for you.

Shift 3: Eval-as-code becomes table stakes. In 2026, eval is a “best practice.” By 2028, it is mandatory in every serious agent deployment. Reliable agents need three eval layers: unit evals on discrete steps, LLM-as-judge regression suites, and continuous production trace sampling. By 2028, a “no-eval-pipeline” agent shop will be considered as professionally negligent as a “no-CI” software shop is today.

The 2026–2028 framework casualty list. AutoGen is effectively abandoned by Microsoft for the Foundry Agent Service. Standalone vector-DB-centric “RAG frameworks” are extinct. Pure prompt-orchestration frameworks with no durable runtime story do not survive. Most of the 120+ named agent frameworks in 2026 are gone or absorbed by 2028. LangChain itself — the original library, not LangGraph — is being quietly rewritten out of production stacks.

The 2026–2028 rise list. Durable execution runtimes (Temporal, Restate, Inngest, Trigger, Cloudflare Durable Objects) as the substrate every framework runs on. Eval platforms (Braintrust, Langfuse, LangSmith) as table-stakes CI infrastructure. MCP gateways and registries as the “API gateway of the agent era,” a $1B+ market by 2028. Vertical agent products that ship a complete workflow rather than a framework. Compute-provider-runs-the-agent platforms (Cloudflare, Vercel) for greenfield apps.

What this means for 2026 decisions. The robust 2026 stack does not bet on framework dominance. It bets on the substrate beneath the framework. Three commitments survive all four 2028 worlds:

  1. MCP as the integration layer, not framework-specific tool adapters. This decouples the integration surface from any framework choice.
  2. Durable execution as the runtime substrate, regardless of which framework drives the loop.
  3. Eval-as-code from day one, on a credible platform (Braintrust, Langfuse, or LangSmith).

Frameworks change every eighteen months. Substrate decisions made in 2026 should still be defensible in 2028. The decision frameworks in this chapter are designed to keep that distinction front of mind.


Part V — The Hestiia Recommendation

The previous four parts described the world. This part is opinionated and specific. It addresses Hestiia, in April 2026, with the existing CLAWD-SALES-AGENT in production, an Anthropic Team plan as the current billing relationship, an MCP-first integration discipline already established (pipedrive-managed, internal MCP Manager), Postgres in production via the NestJS backend, and a CTO who needs the answer to be defensible to both engineering and the board.

There is a recommended stack, an alternative stack for the case where the recommended one’s premise breaks, a migration sequence, a residual-risk register, and a 90-day plan. Each is concrete enough to argue with.

5.1 The CLAWD-SALES-AGENT Inheritance

Before recommending anything, the inheritance has to be named. The existing CLAWD-SALES-AGENT is not a failed prototype to be replaced. It is a successful prototype whose plumbing is wrong and whose IP is right, and the IP is what survives.

What works in CLAWD-SALES-AGENT, and must be preserved. The four-step pipeline shape (triage with Haiku, context bootstrap, orchestrator with Sonnet, insight update) is the right decomposition for the sales-followup problem. The Haiku-then-Sonnet routing decision, where the cheap model decides whether the expensive model should be called at all, cuts inference cost by roughly 80% versus running Sonnet on every event and is a pattern worth treating as architectural. The insight database is the proto-version of long-term agent memory done as a hand-rolled Postgres-shaped store; the schema is right even if the implementation is naïve. The change_source anti-loop discipline — every Pipedrive write tagged with a source identifier so the agent does not react to its own writes — is non-obvious and load-bearing. The 668 tests are the encoding of “what good looks like” in an executable form, and any rewrite that abandons them resets the clock on production confidence by months. The Slack Block Kit reporting style, the debounce-with-merge logic, the operational learnings about Pipedrive’s webhook delivery semantics — all IP, all worth keeping.

The internal MCP Manager is the second piece of CLAWD-SALES-AGENT-adjacent IP that survives, and the book has so far underweighted it by an order of magnitude. It is not “plumbing.” It is a 30,000-line TypeScript Electron tray app (four pnpm packages, twenty-five minor releases between March and April 2026, public landing page at mcp-manager.hestiia.com) that runs a localhost daemon on 127.0.0.1:3100 proxying every Claude Code tool call through a six-stage middleware pipeline of permission resolution, immutable SQLite audit, native approval popup with five-minute timeout, token-bucket rate limiting, and execution. It ships ten typed first-party connectors — Pipedrive, Gmail, Notion, Webflow, Pennylane, Payfit, Mender, Timestream, AWS IoT Core, Hestiia Manufacturing — and a remote-proxy catalog with full OAuth 2.1 plus CIMD plus PKCE for GitHub, Notion, and ClickUp. The category it occupies is not on the §3.15 vendor map: it is a local-first MCP gateway for the single-developer-per-machine case, the local analogue of Composio (Hobby $29/month, Business $229/month) and Arcade (Growth $25/month) for the use case where the agent is internal and the operator is an employee, not an end user. Its closest commercial neighbour is Docker MCP Toolkit, which offers neither typed connectors nor the Hestiia-internal Timestream and IoT Core tools and which would require the typed-connector layer to be rebuilt anyway. The pragmatic posture is to treat the typed-connector layer and the policy-and-audit pipeline as IP worth maintaining, and to treat the eventual end-user-OAuth use case (the day MyEko Pro ships an in-app assistant) as the explicit point at which Composio or Arcade displaces this layer for the customer-facing agent farm — not before.

What does not work in CLAWD-SALES-AGENT. The plumbing, on its own terms. The argument is structural rather than throughput-based, and the migration case does not depend on event volume going up. There is no durable execution layer: the four-step pipeline runs as in-memory function calls in a FastAPI worker, and a pod restart mid-pipeline drops state — masked today by Pipedrive’s webhook retries and the rarity of restarts, but a correctness bug at any volume rather than a scale bug. claude -p is invoked as a subprocess per event, so there is no shared in-process state with the worker, no streaming surface for tracing, and no fixture-injection path that does not stub the entire subprocess. SQLite is doing duty as event queue, IPC bus, debounce store, and insight database simultaneously, which is the right substrate for the last of those and the wrong substrate for the first three. The trace surface is a Svelte dashboard reading the same SQLite tables the pipeline writes to, which couples the operational view to the schema and offers no filter-by-deal-ID, no cost-per-run view, and no replay. There is no eval pipeline, so regressions in agent behaviour — the dimension the 668 unit tests do not cover — are caught when a customer notices. And the Anthropic Team-plan billing, useful as a prototyping shortcut, is explicitly forbidden by Anthropic’s terms of service for shipped products and is not a production path. None of this is a comment on the engineer; it is what a one-person Python prototype shipped under deadline in 2025 looks like when the only success criterion that mattered was “does it work end-to-end,” and by that criterion it succeeded. The IP is real and it survives the migration. The plumbing is what gets replaced.

The reimplementation question is, therefore, not “what do we replace CLAWD-SALES-AGENT with” but “what plumbing do we move CLAWD-SALES-AGENT’s IP onto, and how do we do it without throwing away the IP in the process.” That framing collapses the search space dramatically.

5.2 The Recommended Stack

The recommended stack is the one that minimises the migration cost from the current shape, maximises the portability of the IP, and survives the four 2028 worlds described in §4.4. Every element below is justified against those three criteria.

Runtime: the Anthropic Agent SDK (Python) on API-key billing. The existing CLAWD-SALES-AGENT shape — claude -p plus Skills plus the Agent tool plus Pipedrive MCP — is precisely what the Anthropic Agent SDK is the production version of. The SDK loads the same SKILL.md files, the same .claude/agents/*.md definitions, the same .mcp.json, and the same hooks. The migration from claude -p subprocess to SDK in-process is roughly 1.5–2 engineer-weeks because the artifacts do not change shape; only the execution boundary does. The lock-in profile is acceptable: SDK plus Skills plus MCP is roughly 70% portable to another runtime, and the prompts and agent decompositions — the actual IP — port cleanly.

Durability: DBOS with the Conductor control plane. Postgres is already in production via the NestJS backend. DBOS is durable execution as a Python library, decorating workflow and step functions, recording every step into the same Postgres in the same transaction as the business writes. No new cluster, no new broker, no new deployment unit. The @DBOS.workflow decorator wraps the four-step pipeline; the @DBOS.step decorator wraps each individual model call and tool call. A pod crash mid-pipeline resumes on a fresh worker exactly where it left off, with exactly-once semantics for each step. The Conductor control plane at $99/month Pro provides the recovery UI, alerting, and versioning that the bare library lacks. DBOS ships first-class wrappers for the OpenAI Agents SDK, PydanticAI, and LangChain; an Anthropic SDK wrapper does not exist as a first-class adapter today, but the integration is straightforward — the durable boundary is the step decorator, and Anthropic SDK calls slot in as steps.
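A minimal sketch of that shape, using DBOS’s documented @DBOS.workflow/@DBOS.step decorators; the function names, event fields, and stubbed bodies are illustrative, not CLAWD’s actual code:

```python
from dbos import DBOS

DBOS()  # reads dbos-config.yaml; workflow state lands in the existing Postgres

@DBOS.step()
def triage(event: dict) -> bool:
    # Haiku-class routing call goes here; stubbed for the sketch
    return bool(event.get("stage_changed"))

@DBOS.step()
def orchestrate(event: dict) -> dict:
    # Sonnet orchestrator call plus tool use; the result is recorded
    # exactly-once in Postgres when the step completes
    return {"deal_id": event["deal_id"], "report": "..."}

@DBOS.workflow()
def followup_pipeline(event: dict) -> None:
    # A pod crash anywhere in here resumes on a fresh worker at the
    # last completed step, without re-running finished steps
    if triage(event):
        orchestrate(event)

# After DBOS.launch(), the webhook handler enqueues durably:
#   DBOS.start_workflow(followup_pipeline, event)
```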

Observability and eval: Langfuse self-hosted, instrumented via OpenInference, with Inspect AI in CI. Langfuse self-hosted on a small ECS service with ClickHouse and Postgres costs roughly $1,200 per year of infrastructure and produces every trace, every prompt version, every dataset, and every cost report on Hestiia-owned infrastructure with no vendor pricing surprises. OpenInference is the OTel-aligned instrumentation library that emits GenAI semconv-compliant spans, which means the same instrumentation works against Phoenix (free local debug), Datadog (if Datadog ever displaces Langfuse), Logfire (the no-Pydantic-needed escape hatch), or Honeycomb. Inspect AI, the UK AISI’s MIT-licensed eval framework, runs in CI on every prompt or model change, gating PRs on regression-set scoring. Promptfoo is the lightweight prompt-diff harness layered alongside. Total observability and eval cost: roughly $1,500 per year plus an engineer-week of setup.
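A sketch of the wiring, assuming Langfuse’s standard OTLP ingest path; the internal hostname is an assumption to adjust for the actual ECS service:

```python
import base64
import os

from openinference.instrumentation.anthropic import AnthropicInstrumentor
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Langfuse ingests OTLP over HTTP with basic auth on its public/secret key pair
auth = base64.b64encode(
    f"{os.environ['LANGFUSE_PUBLIC_KEY']}:{os.environ['LANGFUSE_SECRET_KEY']}".encode()
).decode()

exporter = OTLPSpanExporter(
    endpoint="https://langfuse.internal.example/api/public/otel/v1/traces",  # assumed host
    headers={"Authorization": f"Basic {auth}"},
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Every Anthropic SDK call now emits a GenAI-semconv span; retargeting the
# exporter at Phoenix, Datadog, or Logfire requires no application changes
AnthropicInstrumentor().instrument()
```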

Integration: keep the MCP-first discipline, build new servers on FastMCP 3. The existing pipedrive-managed plus internal MCP Manager architecture is correct and stays. Move the Pipedrive MCP from a stdio subprocess into an in-process @tool wrapper using create_sdk_mcp_server — fifty lines of Python that eliminate the subprocess boundary while preserving the MCP semantics for any future framework migration. New internal servers (Slack actions beyond what the SDK provides, the Hestiia-internal customer database, any future tool exposed to agents) are built on FastMCP 3 with OTel instrumentation from day one. Composio, Arcade, and Pipedream Connect are not adopted today; they wait for the use case where end-user OAuth at SaaS-breadth becomes the binding constraint, which is the day MyEko Pro ships an in-app assistant.
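A sketch of the in-process shape, using the Agent SDK’s @tool and create_sdk_mcp_server helpers; the tool name, schema, and the stubbed Pipedrive call are illustrative:

```python
from claude_agent_sdk import ClaudeAgentOptions, create_sdk_mcp_server, tool

async def fetch_deal(deal_id: int) -> dict:
    # Stand-in for the real Pipedrive API client call
    return {"id": deal_id, "title": "..."}

@tool("get_deal", "Fetch a Pipedrive deal by ID", {"deal_id": int})
async def get_deal(args: dict) -> dict:
    deal = await fetch_deal(args["deal_id"])
    # MCP tool results are content blocks, the same shape a stdio server returns
    return {"content": [{"type": "text", "text": str(deal)}]}

# In-process server: no subprocess boundary, but still MCP to the model,
# so a future framework migration sees an ordinary MCP server
pipedrive = create_sdk_mcp_server(name="pipedrive", version="1.0.0", tools=[get_deal])

options = ClaudeAgentOptions(mcp_servers={"pipedrive": pipedrive})
```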

Gateway: none, with a one-file abstraction. A 20-person company on a single provider with five deployed agents does not need a model gateway. What it needs is the cheap abstraction at the code layer so that adding a gateway later is not a two-week refactor. One Python file, three functions: complete(), complete_streaming(), complete_with_tools(). All three call the Anthropic SDK directly today. The day Hestiia introduces a second provider, has more than five deployed agents needing shared budgets, or faces a SOC2 audit asking for centralised audit logs, replace the three functions with calls to LiteLLM proxy and add the proxy as one more service on ECS. Estimated migration cost: one engineer-day, because the application code never sees the difference.
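A sketch of that one file, assuming the three function names above are the entire public surface; the pinned model string is a placeholder for whatever the current dated Sonnet identifier is:

```python
# llm.py: the whole "gateway" today, three functions over the Anthropic SDK.
# Swapping these bodies for LiteLLM-proxy calls later is the one-day migration.
from typing import Iterator

import anthropic

_client = anthropic.Anthropic()  # ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-6"      # placeholder; pin the dated string in production

def complete(system: str, user: str, max_tokens: int = 1024) -> str:
    msg = _client.messages.create(
        model=MODEL, max_tokens=max_tokens, system=system,
        messages=[{"role": "user", "content": user}],
    )
    return msg.content[0].text

def complete_streaming(system: str, user: str, max_tokens: int = 1024) -> Iterator[str]:
    with _client.messages.stream(
        model=MODEL, max_tokens=max_tokens, system=system,
        messages=[{"role": "user", "content": user}],
    ) as stream:
        yield from stream.text_stream

def complete_with_tools(system: str, user: str, tools: list[dict],
                        max_tokens: int = 1024) -> anthropic.types.Message:
    return _client.messages.create(
        model=MODEL, max_tokens=max_tokens, system=system, tools=tools,
        messages=[{"role": "user", "content": user}],
    )
```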

Memory: defer; pgvector on RDS plus a hand-rolled summary loop covers current need. The insight database is already a proto-semantic-memory store. Until deal volume or per-deal interaction volume crosses the threshold where the prompt-stuffing approach breaks, the dedicated memory category does not earn its keep. When the threshold arrives — likely 2027 — Zep Cloud Starter at $39/month is the strongest commercial pick because temporal knowledge graphs are the right shape for “what was true about this MoA last quarter that may not be true now.” Letta self-hosted is the OSS escape hatch.
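For the day the prompt-stuffing approach does break, the hand-rolled pgvector lookup is small. A sketch, assuming a hypothetical insights table with an embedding column and a query vector computed upstream:

```python
import psycopg

def similar_insights(conn: psycopg.Connection, query_vec: list[float], k: int = 5):
    # Cosine-distance nearest neighbours over the insight store; `insights`
    # and its `embedding vector(1024)` column are assumed schema, not CLAWD's
    return conn.execute(
        "SELECT deal_id, summary FROM insights "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (str(query_vec), k),
    ).fetchall()
```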

Sandboxes, browser automation, computer use, voice: defer with named defaults. The sales agent does not need any of these today. The named defaults for the day they become real: E2B at $150/month for code sandboxes when a code-emitting agent ships; Browserbase plus Stagehand at $99–$499/month for the French-admin-portal workflow; Anthropic Computer Use for legacy desktop fallback (token-cost only); Vapi for voice prototypes when an inbound-callback or first-touch-outbound use case gets prioritised. All four are wrapped as internal MCP servers when they are introduced, so the runtime layer never sees the difference.

Engineering productivity: Cursor Pro at $20/seat for the engineering team. Not part of the agent farm, but worth naming as the rational individual-license spend that should land on the same procurement decision. Devin at $500/month for the team is harder to justify until a specific autonomous-coding use case is identified.

The total infrastructure budget for the agent farm under this stack, at the “Real” volume of 5 agents and 1,000 events per day per agent, is approximately:

  • Anthropic API tokens (70% Haiku, 30% Sonnet, 70% cache hit): ~$15,000 per year.
  • DBOS Conductor Pro: $1,188 per year.
  • Langfuse self-hosted infrastructure: ~$1,200 per year.
  • Inspect AI plus Promptfoo: $0.
  • MCP infrastructure (FastMCP servers run inside existing services): $0 incremental.
  • Cursor Pro per engineer: $20 per month per seat, separate procurement.

Total agent-farm infrastructure: approximately $17,500 per year at Real volume, dominated by inference. At 10x scale, the inference line scales linearly to ~$150,000 per year, the DBOS line stays bounded by the Pro tier ceiling (with Teams at $5,988/year as the next step), and the Langfuse line scales with the trace volume but remains an order of magnitude cheaper than commercial alternatives.

5.3 The Migration Sequence

The migration is sequenced to preserve the IP, reduce risk through staged cutover, and produce visible progress at each weekly checkpoint. Total elapsed time: approximately 8–10 engineer-weeks for one senior engineer with a junior pairing for the test-migration phase.

Week 1: SDK migration, dev/prod symmetry preserved. Replace the claude -p subprocess with the Anthropic Agent SDK in-process. Set setting_sources=["project", "user"] so the SDK loads the existing .claude/skills/*/SKILL.md and .mcp.json verbatim. Convert each .claude/agents/*.md sub-agent definition to an AgentDefinition (one line each). Replace bash hooks with HookMatcher declarations. Cutover happens behind a feature flag; the existing claude -p path remains as fallback for the first week of production. Acceptance criterion: the existing 668 tests pass against the SDK path.
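A sketch of the in-process call after the cutover; setting_sources is the SDK option named above, while the prompt is illustrative:

```python
import anyio
from claude_agent_sdk import ClaudeAgentOptions, query

options = ClaudeAgentOptions(
    # Load the existing .claude/skills/*/SKILL.md and .mcp.json verbatim
    setting_sources=["project", "user"],
)

async def run(prompt: str) -> None:
    # Same artifacts as claude -p, but in-process: streamed messages,
    # traceable spans, injectable fixtures in tests
    async for message in query(prompt=prompt, options=options):
        print(message)

anyio.run(run, "Summarise deal 4711 and draft the follow-up")
```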

Week 2: durability layer. Wrap the four-step pipeline in @DBOS.workflow; wrap each model call and tool call in @DBOS.step. Migrate the existing SQLite insight database to RDS Postgres (it is already a Postgres-shaped schema; the migration is essentially a pg_dump and a connection string change). The DBOS workflow tables live alongside the insight tables in the same Postgres instance, so workflow state and business state can be written in one transaction. Acceptance criterion: kill a worker mid-pipeline and observe the workflow resume on a fresh worker with no duplicated side effects.

Week 3: observability cutover. Stand up Langfuse on a small ECS service with ClickHouse and Postgres. Instrument the agent code with OpenInference. Dual-write traces to the existing Svelte dashboard and to Langfuse for one week to verify parity. After parity is confirmed, deprecate the Svelte dashboard. Wire OpenTelemetry → Langfuse so every model call, every tool call, every workflow step is a span tagged with the agent name, the deal ID, and the cost. Acceptance criterion: on-call can find any agent run by deal ID and see every model call, every tool result, and the cost in under 90 seconds.

Week 4: eval harness, prompt versioning. Sample 200 production traces from the prior month into an Inspect AI dataset, scored by a combination of an LLM-as-judge rubric and human review. Wire CI to run the dataset on every prompt or model change, blocking the merge if scoring regresses by more than a configured threshold. Start versioning prompts in Langfuse (for the editing UI) with the canonical version still in prompts/ (for the runtime). Acceptance criterion: a deliberate prompt regression in CI gets caught before merge.
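A sketch of the Inspect task, using Inspect AI’s Task/json_dataset/model_graded_qa primitives; the dataset path and task name are illustrative:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate

@task
def followup_regression() -> Task:
    # 200 sampled production traces exported as JSONL (input/target per record)
    return Task(
        dataset=json_dataset("evals/followup_traces.jsonl"),
        solver=generate(),
        scorer=model_graded_qa(),  # the LLM-as-judge rubric; human review sits alongside
    )
```

CI then invokes something like `inspect eval evals/followup.py --model anthropic/<pinned-model>` and a small script compares the resulting score against the configured regression threshold before allowing the merge.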

Week 5: cutover. Promote the new stack to be the only path. Remove the claude -p fallback. Move production billing from Anthropic Team plan to API-key billing. Verify cost projections against the first week of API-key invoicing. Acceptance criterion: zero production incidents over the cutover week, and the invoice within ±15% of the pre-cutover forecast.

Weeks 6–7: hardening and the second agent. Backport CLAWD’s IP into a generalisable pattern: the four-step Haiku-then-Sonnet pipeline becomes a reusable template; the insight database becomes a shared Postgres schema; the Slack Block Kit reporting becomes a shared module. Use the second agent (CCTP analyzer or executive briefing — whichever is next on the product roadmap) to validate that the second agent costs roughly a tenth of the first to ship. Acceptance criterion: the second agent reaches a working prototype in under one engineer-week, against the first’s months.

Weeks 8–10: documentation, runbook, and the operational handoff. Write the runbook for “what does on-call do when an agent gets stuck.” Document the stack for new engineers in a 4-page architecture brief. Verify the disaster-recovery posture: restore from backup, replay a week of webhooks, watch the durable workflow engine reconcile state. Hand off operational ownership from the engineer who built the stack to the on-call rotation.

The total engineering investment is approximately one engineer-quarter at one senior FTE, with a junior pairing for two of the ten weeks. The IP — prompts, agent decompositions, eval datasets, operational learnings — survives the entire migration. The plumbing is replaced.

5.4 Alternative Stacks (Sensitivity Analysis)

The recommended stack assumes that Anthropic remains the right model provider, that Postgres remains the right durability backbone, and that self-hosted observability is acceptable to operate. If any of those assumptions breaks, the stack adjusts. Three alternative configurations are worth defending in advance.

Alternative A: PydanticAI plus Logfire (the “Anthropic-lock-in-is-a-dealbreaker” stack). If the IT team’s vendor-lock-in concern outweighs the migration-friction argument for the Anthropic Agent SDK, the right substitute is PydanticAI as the runtime, Logfire as the observability layer paired with it, and the rest of the stack unchanged (DBOS for durability, MCP for integration). PydanticAI is provider-agnostic by design, runs every model through a unified Agent[Deps, Output] typed interface, and is officially supported by DBOS as a durability backend. Logfire is OTel-pure and at Hestiia’s volume sits inside the free tier. The migration cost from CLAWD-SALES-AGENT is comparable to the recommended path (~2–4 engineer-weeks for a senior, mostly because PydanticAI is a thinner abstraction than the Anthropic Agent SDK and the Skills format does not port directly — Skills become PydanticAI tool functions instead). The IP — prompts, decomposition, insight schema, tests — still ports cleanly. The trade-off: PydanticAI is a library, not a platform, so when the team eventually needs human-in-the-loop primitives, multi-day workflow suspension, or graph-shaped control flow, more is built in-house. For Hestiia’s current shape, this is acceptable.
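A sketch of the PydanticAI shape, per recent releases of its API; the output model and prompt are illustrative and the model string is a placeholder:

```python
from pydantic import BaseModel
from pydantic_ai import Agent

class FollowUp(BaseModel):
    summary: str
    next_action: str

# Provider-agnostic by construction: the model string is the only
# provider-specific line, and changing it is the whole model migration
agent = Agent(
    "anthropic:claude-sonnet-4-6",  # placeholder model string
    output_type=FollowUp,
    system_prompt="You are the sales-followup orchestrator.",
)

result = agent.run_sync("Deal 4711 moved to 'Proposal sent'. What next?")
print(result.output.next_action)
```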

Alternative B: LangGraph plus Langfuse self-hosted (the “we want enterprise-procurement cover” stack). If the company is preparing for an acquisition, a Series B with enterprise-credibility-conscious investors, or a customer base that asks “what framework are you on” as a procurement gate, LangGraph is the answer that does not require explanation. Klarna, LinkedIn, Uber, and roughly a third of the Fortune 500 give it cover that Mastra and PydanticAI cannot. The migration cost is higher (~4–6 engineer-weeks) because the StateGraph abstraction is meaningfully thicker than the Anthropic Agent SDK. The pricing risk on LangSmith is real and load-test-mandatory, which is why the recommendation pairs LangGraph with Langfuse self-hosted rather than the LangSmith managed stack. The IP ports; the disaster scenarios in §4.2 are real and should be priced into the decision.

Alternative C: Strands on ECS plus AgentCore (the “deeper into AWS” stack). If the procurement direction is to consolidate everything onto AWS — same bill, same IAM, same compliance posture — the Strands SDK on ECS with Anthropic-direct calls (and an optional graduation to AgentCore Memory plus Runtime later) is a credible second answer. The Strands SDK is provider-agnostic and Apache-licensed, so the lock-in profile is reasonable; AgentCore’s I/O-wait-free billing model is genuinely well-shaped for agents that wait on tool calls. The migration cost is comparable to Alternative A. The trade-off: Strands is younger than the Anthropic Agent SDK or PydanticAI, with a smaller community, and the AgentCore product is itself a 2026 GA whose maturity has not been stress-tested at Hestiia’s scale.

Alternative D: PydanticAI on DBOS Transact with self-hosted everything (the “no closed-source SaaS in the loop” stack). If procurement, an enterprise customer’s security questionnaire, or an EU sovereignty mandate puts a 90-day deadline on removing every closed-source SaaS dependency from the runtime path, the stack reorganises into a configuration that is fully OSS-licensed and self-hostable, at a measurable but not catastrophic cost premium.

  • Runtime: PydanticAI (MIT) on DBOS Transact (MIT), backed by self-hosted Postgres on RDS or on EC2 with EBS, depending on how strict the “no managed SaaS” reading goes.
  • Observability: Langfuse self-hosted (MIT) on its standard ClickHouse-plus-Postgres footprint, instrumented through OpenLLMetry (Apache-2.0) rather than the closed Logfire SaaS.
  • Eval: Inspect AI (MIT) in CI alongside Promptfoo (MIT) for prompt-and-model regression tests.
  • Gateway and models: LiteLLM (MIT) running as the proxy layer, fronting vLLM serving Llama 4 70B on a reserved H100 (on-prem or on a single-tenant cloud GPU contract), with the Anthropic API kept wired as a fallback for the orchestrator step until OSS frontier-model quality catches up — a pragmatic concession that the procurement team can ratify or refuse, and that the rest of the stack does not require.
  • Tools: FastMCP (MIT) servers; browser automation via Microsoft Playwright MCP (Apache-2.0) in place of Browserbase or Stagehand-on-Browserbase.
  • Memory: Graphiti (Apache-2.0) on self-hosted Neo4j, or Letta (Apache-2.0), both substituted in for any closed memory product.
  • Voice, if Hestiia ever revisits it: LiveKit Agents (Apache-2.0) rather than Vapi or Retell.
  • Engineering productivity: opencode (MIT) in place of Cursor, on the same principle.

The cost arithmetic is the load-bearing part of the argument. At Real volume — five agents at one thousand events per day each — the OSS-purist configuration runs to roughly $32,000 per year all-in, against the recommended stack’s roughly $17,500 per year at the same volume. The roughly 2× delta is structural, not an artefact of any single component: GPU inference does not amortise at small fleet size, and a reserved H100 (~$2.50/hour committed, ~$22,000/year before utilisation) is the dominant fixed cost. The crossover point — where self-hosted vLLM on Llama 4 actually beats Anthropic API tokens — lands at roughly 10× current volume, which is “Stretch” or beyond on the projection in §5.3. The premium today buys provider independence, not unit-economics improvement.

The quality trade-off is the other load-bearing part, and is honest. There is no OSS frontier model in April 2026 at Sonnet 4.6 quality on the long tool-use-and-reasoning chains that the orchestrator step relies on. Llama 4 70B closes most of the gap; on tool-use specifically, public benchmarks place it at roughly Claude 3.7 Sonnet equivalent, which is one model generation behind. For the router step (Haiku-class), the gap is irrelevant — Llama 3.3 70B beats Haiku 4.5 on most tool-routing benchmarks at a fraction of inference cost when self-hosted, and that substitution is a free win even in the recommended stack. For the orchestrator, it is a measurable quality regression that shows up in eval scores, not a wash.

The verdict: Alternative D is the right answer for the day procurement requires it, and the wrong answer for today. It is built around the structural ceiling — OSS frontier-model quality — that is forecast to close on a 12–24 month horizon, not a same-quarter horizon. The discipline today is to keep the recommended stack architecturally compatible with this configuration (PydanticAI is a 2–4 engineer-week swap per Alternative A, Langfuse self-hosted is already the recommendation, FastMCP is already the recommendation), so that if the deadline arrives, the migration is mechanical. Bookmark this stack. Do not build it.

The recommended stack is the right answer for Hestiia today specifically because the existing CLAWD-SALES-AGENT is the on-ramp for it. A greenfield 20-person company with no existing prototype could rationally pick any of the four alternatives. The path of least migration friction wins for a team that has already paid the cost of building the IP. The deployment-topology question — whether the recommended stack runs on AWS or on the Mac Studio already on the team’s desk — is a separable decision and is treated in §5.5.

5.5 Deployment topology: where does the harness live

The §5.2 stack — Anthropic Agent SDK plus DBOS Transact plus Postgres plus Langfuse plus FastMCP servers — is a stack composition, not a deployment plan. The composition is the same in both paths below; what changes is the substrate it runs on, and the substrate decision is separable enough to deserve its own section because the trade-offs do not collapse to a single answer. The cloud-hosted version of the stack is the lowest-friction onboarding for a team that is already deep in AWS and already running Postgres and ECS in production. The single-box local version is materially cheaper, has a smaller AWS bill, and benefits from the fact that the team already has a Mac Studio (M2 Max, 64 GB unified memory) on a desk. The agent farm is an internal tool serving a 20-person company; 99.99% reliability is not the target, and the topology decision should be sized to that reality rather than to a hypothetical larger company’s risk posture.

Path 1: cloud-hosted, the §5.2 stack as-deployed. The recommended stack runs on AWS ECS for the agent runtime, RDS Postgres for durable workflow state and the insight database, and a small ECS service for Langfuse self-hosted with its ClickHouse-and-Postgres footprint. At the current low volume — two salespeople, webhook-driven, single-digit events per hour at peak — the AWS plumbing floor dominates the bill: roughly $100/month for the Langfuse service, ~$50/month for RDS Postgres (db.t4g.small with backups), ~$30–60/month for the agent runtime on Fargate plus an ALB plus NAT egress, and the Anthropic API line on top of that at $20–50/month under the §5.7 caching-and-batch-and-output-discipline optimisations. Total: roughly $200–260/month, of which ~$180 is plumbing that does not get cheaper at lower volume because of AWS minimum-instance pricing. The configuration inherits Hestiia’s existing AWS operational practice, survives French regional power and internet outages affecting the office (because the harness is not in the office), and scales linearly as inference dominates the bill past the recommended-stack break-even at a few thousand events per day.

Path 2: single-box local on existing hardware. The same stack collapses onto the Mac Studio that is already on the team’s desk. Postgres runs natively on Apple Silicon. Langfuse self-hosted runs through Docker Desktop on its standard ClickHouse-and-Postgres footprint with ARM64 images for both. The Anthropic Agent SDK, DBOS Transact, FastMCP servers, the OpenInference instrumentation, and the Inspect AI eval pipeline all run as native Python processes. The resident set of all the plumbing — Postgres, ClickHouse, Langfuse, OTel collector, agent process, MCP servers — lands at roughly 10 GB on the 64 GB box, leaving generous headroom for whatever local-inference experimentation the team eventually wants. The capex line is zero because the hardware is already amortised against the developer-tooling budget. Electricity under sustained load runs at ~$15/month. The Anthropic API line is identical to Path 1 at $20–50/month for the orchestrator and triage steps under the same §5.7 optimisations, because the model layer does not change between paths. Total monthly opex: $35–65, against Path 1’s $200–260 — the delta is entirely the AWS plumbing floor that the local box collapses into hardware-already-on-hand.

The operational posture is the load-bearing trade-off and is worth stating honestly. The configuration runs on a single piece of hardware, and the failure modes are the obvious ones: thermal shutdown, power cut, upstream internet outage, the developer-grade Mac receiving an OS update at the wrong moment. The recovery story is rebuild-from-backup: Postgres dumps, Langfuse ClickHouse snapshots, and the configuration manifests live in a versioned S3 bucket on a documented rotation, and the rebuild procedure is rehearsed quarterly so the steps are known and bounded rather than discovered under pressure. The honest framing is that an extended office power outage or upstream internet outage will take the agent farm down for the duration of the outage; the deliberate choice is to accept that this is fine for an internal tool on a 20-person team where no salesperson is sitting in the agent farm’s loop in real time, where the agent farm’s outputs land in Slack as advisory reports rather than blocking actions, and where the same office outage would also disconnect the team from a cloud deployment for any practical purpose anyway. The two 32 GB Mac Studios on the same desk are not a production-redundancy story — a standby box would not survive a power cut any better than the primary — but they earn their keep as engineer dev workstations and as a CI runner for the eval suite, both of which benefit from being shaped identically to production while sitting outside the production runtime.

The triggers for re-evaluating Path 2 against Path 1 are explicit and worth naming so the decision is reviewable rather than sticky. First, if webhook volume crosses the orchestrator-concurrency ceiling on the M2 Max — concretely, the §5.7 second breakpoint of roughly 5,000–10,000 events per day, where serial inference on a single box starts queueing — the unit-economics argument flips and Path 1 wins on throughput rather than cost. Second, if the team grows past one or two operators who can run the box, or if operational ownership needs to land somewhere outside engineering, the cloud configuration’s “ECS and RDS, just like the rest of our infrastructure” framing earns its premium. Third, if anything downstream of the agent farm starts depending on it — a customer-facing surface, the mobile app’s recommendation pipeline, anything where downtime is visible to someone outside the salespeople reading the Slack reports — the reliability bar lifts and the single-box configuration is no longer sized correctly. Until any of those triggers fires, the box on the desk is the lower-friction answer; once any of them fires, the migration to the cloud configuration is mechanical because the stack composition has not changed.

5.6 Residual Risks

Every stack choice has known failure modes. The recommended stack’s residual risks are the ones to watch and mitigate.

Risk 1: Anthropic-specific outage. A region-wide Anthropic outage cuts the agent farm’s availability. The mitigation is a documented Bedrock fallback configuration — kept current quarterly, tested twice a year — that points the SDK at Bedrock-hosted Sonnet with a single config flag. The fallback is not zero-cost (Bedrock and Anthropic-direct have parity on token rates and on cache-read pricing for Sonnet 4.6 as of Q1 2026, but the IAM and VPC configuration is non-trivial). It does, however, mean that the answer to “what do we do during a four-hour Anthropic outage” is “flip the flag and run on Bedrock,” not “wait it out.”
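A sketch of the flag, using the anthropic SDK’s Bedrock client; the env-var name and region are assumptions, and the detail that Bedrock model IDs differ from Anthropic-direct ones is exactly what the quarterly test keeps current:

```python
import os

import anthropic

def make_client() -> anthropic.Anthropic | anthropic.AnthropicBedrock:
    # USE_BEDROCK is an assumed flag name; Bedrock resolves credentials
    # via IAM rather than ANTHROPIC_API_KEY, and the model ID also changes
    if os.environ.get("USE_BEDROCK") == "1":
        return anthropic.AnthropicBedrock(aws_region="eu-west-1")
    return anthropic.Anthropic()
```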

Risk 2: Anthropic deprecates Skills, the Agent SDK, or Managed Agents. The Agent SDK and Skills are GA with multi-quarter track records and no signs of deprecation. Managed Agents is in public beta and explicitly deferred in this recommendation. Anthropic has previously deprecated APIs (the original Claude completions endpoint) on roughly twelve-month notice, which is enough time to migrate. The mitigation is the portability discipline: prompts in prompts/, not embedded in the SDK; agent decompositions as readable Markdown, not framework-specific YAML; tools exposed via MCP, not as SDK-native function decorators where possible. The Skills format is an open standard registered at agentskills.io, which reduces the deprecation risk further.

Risk 3: DBOS the company hits a bad cycle. The DBOS Transact library is MIT-licensed and works without the commercial Conductor control plane; the worst case is that Conductor goes away and Hestiia continues on the OSS library, losing the recovery UI and alerting but keeping the durability semantics. Stonebraker and Zaharia are not the bus factor most teams worry about. The deeper risk is that DBOS does not ship an official Anthropic Agent SDK adapter — the Anthropic SDK is wrapped as @DBOS.step-decorated functions today, which works but means Hestiia is ahead of the official integration. Mitigation: contribute the adapter upstream when the wrapper stabilises.

Risk 4: trace volume explodes faster than projected and Langfuse self-hosted falls over. ClickHouse plus Postgres on a small ECS service handles low millions of spans per year comfortably; tens of millions per year requires a non-trivial ClickHouse cluster. The mitigation is the OpenInference instrumentation: at the moment Langfuse self-hosted becomes the bottleneck, point the OTLP exporter at a managed backend (Logfire, Datadog, Phoenix-managed) without changing application code.

Risk 5: the stack is correct but the team lacks the operational muscle to run it. This is the under-named risk. Self-hosted Langfuse, self-hosted Postgres at scale, custom MCP servers, OpenTelemetry pipelines — these are real operational surface area for a small team. The mitigation is to grow into them deliberately: add Langfuse self-hosted in Week 3, give it three months to prove out before adding the eval pipeline, give the eval pipeline three months before adding the second agent. The 90-day plan in §5.8 sequences this explicitly.

Risk 6a: plumbing lock-in to Anthropic, which is mostly mythical. The framing that “the recommendation locks Hestiia into Anthropic” conflates two different exposures, and the plumbing half does not survive scrutiny. The Anthropic Agent SDK is MIT-licensed; DBOS Transact is MIT; Langfuse is MIT; FastMCP is MIT; OpenInference is Apache-2.0; Inspect AI is MIT. None of these are Anthropic-controlled, and none of them stop working if the relationship with Anthropic-the-company changes. The runtime layer is the only Anthropic-authored component in the stack, and the migration cost out of it — to PydanticAI, per Alternative A — is 2–4 engineer-weeks for a senior engineer, with the IP (prompts, decomposition, eval datasets, Skills, MCP servers) porting directly. The mitigation is the portability discipline already named in Risk 2: keep prompts in prompts/ rather than embedded in SDK syntax, keep decompositions as readable Markdown, keep tools exposed via MCP rather than SDK-native function decorators where possible, keep eval datasets and prompt-decomposition logic in version control. Anthropic-specific syntax must not be allowed to ooze into the prompts themselves. With that discipline, the plumbing is portable, the lock-in is a feature flag and a sprint, and the framing collapses.

Risk 6b: model lock-in to Sonnet 4.6, which is structurally real. The harder risk is the one underneath the plumbing risk: Hestiia’s orchestrator step depends on Claude Sonnet 4.6 specifically for the kinds of long tool-use-and-reasoning chains that drive the agent farm’s quality, and there is no drop-in substitute in April 2026. Swapping to GPT-5 is a quality regression on this workload class, not a portability problem; swapping to Llama 4 70B is a deeper regression, roughly Claude 3.7 Sonnet-equivalent on tool-use benchmarks per Alternative D. The model is not the SDK. The SDK is a sprint; the model is the hard ceiling. The mitigation is layered. First, pin the model string explicitly in code — claude-sonnet-4.6-20250515 or whatever the dated string is, never an implicit claude-latest — so a silent Anthropic-side rename does not silently change behaviour. Second, keep the Bedrock fallback wired from Risk 1, so a multi-hour Anthropic-the-company outage does not take Hestiia down even though the model is the same. Third, instrument with portable OTel via OpenInference or OpenLLMetry so observability survives a model swap untouched, and the eval suite continues to measure quality across providers on identical traces. Fourth, keep an evergreen swap-test in CI that periodically runs the eval suite against GPT-5 and Llama 4 70B — perhaps weekly — so the quality gap is measured, not assumed, and the day Llama 5 closes it, the data is already on the dashboard. Revisit the model choice annually as OSS frontier-model quality closes.

The honest framing for the board is that the OSS frontier-model gap is the structural ceiling on 100% provider independence today. A team that wants to remove Anthropic-the-vendor entirely from the runtime path must either (a) accept the roughly 2× cost premium and operate vLLM on H100s per Alternative D, or (b) accept the quality regression and run Llama 4 for everything including the orchestrator, or (c) wait for OSS frontier to close the Sonnet gap, which is a 12–24 month bet, not a same-quarter switch. The recommended stack picks none of those because none of them are necessary today. The discipline is to stay ready to pick (a) or (c) on a quarter’s notice the moment the procurement reality changes.

Risk 7: the recommendation is right for today and wrong for 2028. The single most-important architectural commitment in this book is the MCP-first integration discipline. Frameworks change every eighteen months; MCP is now Linux Foundation infrastructure. Even if the runtime, durability, observability, and gateway choices all turn out to be wrong in retrospect, the MCP servers Hestiia builds today work against whatever framework wins in 2028. The IP — prompts, decomposition, eval datasets, MCP servers — is portable. The plumbing is replaceable. That asymmetry is the bet.

5.7 Cost Optimisation and the claude -p Question

A common reflex when sizing an agent farm’s running cost is to assume that a Claude Max subscription at $200/month is dramatically cheaper than equivalent Anthropic API consumption, and that wrapping claude -p in a webhook handler is therefore the right cost play for an internal automation. The reflex is correct in the case it was designed for — a single developer doing interactive coding — and structurally wrong in the case it is most often invoked for, which is low-volume server-side automation. The arithmetic flips, and the discipline of running it before deciding is worth a section of its own.

The CLAWD-SALES-AGENT base case. The agent runs three Claude calls per inbound Pipedrive event after debouncing: a Haiku 4.5 triage step, a Sonnet 4.6 orchestrator step, and a Sonnet 4.6 insight-update step. Per-event token shape is roughly 1K input + 100 output for triage, 17K input + 1K output for the orchestrator (a ~5K-token static header — agent intent, skills catalogue, process knowledge base — plus ~12K of variable deal context), and 6K input + 1K output for insight-update. At Sonnet 4.6 list pricing of $3 input / $15 output per MTok and Haiku 4.5 list pricing of $1 input / $5 output per MTok, that lands at approximately $0.10 per processed event and $0.0015 per skipped event. With two salespeople and the resulting traffic — estimated at 10–30 inbound events per day after debouncing, of which the triage step skips roughly a quarter — the steady-state monthly bill is $25–80 in API tokens, plus a few dollars per new deal in one-shot bootstrap calls. The Max sub at $200/month is 3–8× more expensive than the API at this volume, not cheaper. The arbitrage many CTOs assume runs the wrong direction; the only scenario in which claude -p wins on cost at this scale is when the sub is already sunk cost for an engineer’s interactive coding work, in which case the marginal cost of one more claude -p invocation is zero up to throttling.

The API discount stack — what is actually on the menu. Four optimisations layer on the naive Sonnet 4.6 baseline, each with a defensible savings figure and each with a different failure mode.

  • Prompt caching. Cache reads bill at $0.30 per MTok (10% of input cost); cache writes bill at $3.75 per MTok (1.25× input) for the default 5-minute TTL or at $6 per MTok (2× input) for the 1-hour TTL — the 1-hour write only pays back above roughly five reads per write, so it is a deliberate choice for hot prompts, not a default. The mechanism marks specific message blocks with cache_control: {"type": "ephemeral"}. The right targets are stable system prompts, tool definitions, and reused RAG context. For an agent like the CLAWD-SALES-AGENT orchestrator with a ~5K-token static header and ~12K tokens of per-call deal context, caching the static block saves ~21% of input cost per cache hit — a 10–15% net reduction at realistic hit rates with the 1-hour TTL. The maximum theoretical savings is ~90% on the cached portion when the static block dominates the prompt and the cache stays warm; that is rare for variable-context agents and routine for chatbots with long instruction blocks and short user turns. (A code sketch of the mechanism follows this list.)

  • Batch API. Submitting requests in a JSONL batch returns results within 24 hours at 50% off both input and output. The use case is async enrichment, retrospective scoring, nightly classification — anything where the user is not waiting on the response. For CLAWD-SALES-AGENT, the orchestrator is real-time-feedback (it drives the Slack notification a salesperson sees within seconds) and is therefore not batchable; the insight-update step is batchable (it persists deal memory after the fact, and a 2–10 minute lag is invisible to the user) and would cleanly halve that line item.

  • Model tiering. Haiku 4.5 is roughly 3× cheaper than Sonnet 4.6 on both input and output, and is genuinely strong at routing, classification, and short extractions. The pattern is Haiku as dispatcher, Sonnet 4.6 for reasoning, Opus 4.7 only when the workload demonstrably needs it. CLAWD-SALES-AGENT already runs this pattern correctly on its triage step. Any agent that does not is leaving 50–60% on the table by paying Sonnet rates for traffic Haiku could route equivalently.

  • Output discipline. Output costs 5× input on Sonnet, and agents over-narrate by default. A conservative max_tokens cap, structured-output schemas via JSON mode or tool-calling, and explicit “respond in N words” framing in the system prompt clamp this at the source. The largest savings in the stack at the lowest engineering cost; it tends to be left on the table because it is unglamorous.
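A minimal sketch of the cache_control mechanism from the first bullet, shaped like the CLAWD orchestrator call; the variable names are illustrative and the model string is a placeholder:

```python
import anthropic

client = anthropic.Anthropic()

STATIC_HEADER = open("prompts/orchestrator_header.md").read()  # ~5K tokens, stable
deal_context = "..."   # ~12K tokens, varies per call: never cached
event_summary = "..."  # the inbound Pipedrive event

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder for the dated pin
    max_tokens=1024,            # output discipline: cap the narration
    system=[
        {
            # Everything up to this marker is cached; reads bill at 10% of input
            "type": "text",
            "text": STATIC_HEADER,
            "cache_control": {"type": "ephemeral"},
        },
        {"type": "text", "text": deal_context},
    ],
    messages=[{"role": "user", "content": event_summary}],
)
```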

Stacked, those four take a typical Sonnet workload to roughly 20–30% of naive cost for caching-friendly chatbots and to roughly 70–85% of naive cost for variable-context multi-tool agents like CLAWD-SALES-AGENT. The variance matters: prompt caching pays out on agents with stable headers, not on ones whose every call is novel.

What claude -p actually gives you, and what it doesn’t. The CLI invocation backs onto the Max subscription’s billing model, gets the same models, exposes them via stdin/stdout JSON, and supports OTel telemetry through CLAUDE_CODE_ENABLE_TELEMETRY=1 plus standard OTLP exporter env vars — so observability via Langfuse plus OpenInference is achievable without rewriting the call layer. What it does not give you, regardless of how thin the wrapper is: explicit cache_control on message blocks, the Batch API endpoint, fine-grained model pinning across versions, sub-second cold-start latency (each subprocess invocation pays a Node startup tax of 200–500 ms, three times per event in CLAWD-SALES-AGENT’s pipeline — under user-perceptible 2-second latency budgets, this matters), and a clean commercial relationship for usage that is not “one developer typing.” The Consumer Terms are written for interactive personal use, and Anthropic’s February 2026 clarification put third-party-tool access to Max-subscription auth tokens explicitly outside the consumer license. The CLI itself is the exemption — it is Anthropic’s own scripted-use product — but a webhook handler invoking it on inbound business events is not the use case the exemption was written to cover, and the gap widens with concurrent users, fanned-out webhooks, and traffic shapes that look API-like to Anthropic’s anti-abuse signal. For a 20-person company that will eventually field an enterprise security questionnaire, this is a procurement liability, not a grey zone.

Hestiia’s answer today. At two salespeople and webhook volume in the tens-per-day range, the right cost-and-correctness combination is a direct Anthropic API integration with prompt caching wired on the orchestrator’s static header (~10–15% savings, ~$3–10/month at this scale), Batch API wired on the insight-update step (~50% off that line item, ~$5–10/month), the existing Haiku-triage-into-Sonnet-orchestrator tiering preserved as-is, and tight output caps applied across all three steps. Total monthly bill projects to $20–60 with optimisations against the Max sub’s $200 — not just a structural win on cost but a structural win on latency (kill three subprocess spawns per event), observability cleanliness (per-call cost attribution, model pinning, structured outputs natively), and procurement positioning (commercial terms, not consumer terms). The migration is part of the §5.8 90-day plan’s Day 15–28 window. claude -p continues to be the right choice for the engineering team’s interactive coding work, where Anthropic priced it specifically to win.

How this changes at higher scale. The arithmetic crosses three breakpoints worth naming in advance.

  • At ~75–150 events per day across the agent farm, the optimised API bill crosses the Max sub’s $200 fixed cost. From here forward, the question is no longer “API or sub” — the API is unambiguously cheaper per unit, and the sub becomes a fixed-cost ceiling on a single developer’s interactive use, not a savings vehicle for the farm.

  • At ~5,000–10,000 events per day, Batch API and prompt caching savings start to compound visibly enough that re-architecting calls around them — moving non-real-time agents fully onto Batch, raising cache TTLs from 5 minutes to 1 hour, splitting hot-path and cold-path agents into separate code paths — becomes a quarter-of-engineer-time investment that pays out within months. This is also the volume at which a model gateway (LiteLLM, Portkey, or one of the OpenAI-compatible proxies named in §3.13) starts earning its keep, by allowing Sonnet and Haiku rate limits to be load-balanced across regional deployments and surge-routed across providers.

  • At ~50,000+ events per day with a stable workload shape, the per-step substitution argument from §5.4 Alternative D becomes tractable: self-hosted vLLM on a reserved H100 starts paying out for the router fleet specifically — per Alternative D’s router-step concession — where Llama 3.3 70B sits at-or-above Haiku 4.5 quality on tool-routing benchmarks at a fraction of the per-token rate when amortised across continuous load, even though the orchestrator stays on Sonnet 4.6 because no OSS model closes the quality gap on long tool-use chains in April 2026. This is the volume at which “leave Anthropic-the-vendor for parts of the farm” stops being a procurement story and starts being a unit-economics story.

The discipline at every breakpoint is the same. Run the actual numbers off the actual costs table in the actual production database — for CLAWD-SALES-AGENT, the costs rollup in data/agent.db, which records input_tokens, output_tokens, and cost_usd per call — before deciding the structural answer. The sub-vs-API question’s answer changes with volume, with workload shape, and with the agent farm’s mix of real-time and async traffic. The intuition that the sub is “obviously cheaper” is a category error that costs CTOs a multiple of the bill they thought they were avoiding.
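A sketch of that discipline against the rollup, assuming the table is named costs and carries a created_at timestamp alongside the three columns named above:

```python
import sqlite3

conn = sqlite3.connect("data/agent.db")
for day, cost, tok_in, tok_out in conn.execute(
    """SELECT date(created_at), ROUND(SUM(cost_usd), 2),
              SUM(input_tokens), SUM(output_tokens)
       FROM costs GROUP BY 1 ORDER BY 1 DESC LIMIT 30"""
):
    # Thirty days of actuals beats any projection in the sub-vs-API argument
    print(f"{day}  ${cost:>7}  in={tok_in:,}  out={tok_out:,}")
```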

5.8 The 90-Day Plan

A concrete 90-day plan, sequenced to land the migration in the order that produces visible value at each checkpoint while keeping operational risk bounded.

Days 1–14: foundation. Stand up Langfuse self-hosted on an ECS service with ClickHouse and Postgres. Instrument CLAWD-SALES-AGENT with OpenInference, dual-writing to the existing Svelte dashboard and to Langfuse. Stand up the API-key billing relationship with Anthropic; verify cost projections with a one-week shadow run. Acceptance gate: any engineer can find any production agent run by deal ID in Langfuse in under 90 seconds.

Days 15–28: durability and SDK migration. Wrap the four-step pipeline in @DBOS.workflow; migrate SQLite to RDS Postgres; replace claude -p with the Anthropic Agent SDK behind a feature flag. Cutover behind the flag for a week of canary traffic before the flag flips for all events. Acceptance gate: a deliberate process kill mid-pipeline produces zero duplicated side effects and the workflow resumes on a fresh worker.

Days 29–42: cutover and eval. Flip the feature flag for all events. Decommission the Svelte dashboard. Stand up Inspect AI in CI with a 200-trace dataset sampled from production. Wire prompt versioning in Langfuse. Acceptance gate: a deliberately-regressed prompt fails CI before merge.

Days 43–60: hardening and runbook. Write the on-call runbook. Verify disaster recovery (restore from backup, replay a week of webhooks). Move the Pipedrive MCP from a stdio subprocess to an in-process @tool wrapper. Document the architecture in a 4-page brief for new engineers. Acceptance gate: an engineer who joined a month ago can deploy a code change to the agent without supervision.

Days 61–90: the second agent. Use the cleared substrate to ship the second agent — whichever is next on the product roadmap (CCTP analyzer or executive briefing). The engineering investment for the second agent should be roughly one engineer-week against the first’s months, because the substrate is shared. Acceptance gate: the second agent ships to production and produces measurable business value within 90 days of the migration starting.

The 90-day plan is sequenced so that every two-week window produces a visible artifact: a Langfuse dashboard at Day 14, a durable workflow at Day 28, a passing eval suite at Day 42, a runbook at Day 60, a second production agent at Day 90. The plan is also sequenced so that the highest-risk operations (durability cutover, billing migration) happen on weeks where the team is otherwise unencumbered.

What exists at the end of 90 days is a working agent farm whose IP is the same as CLAWD-SALES-AGENT’s, whose plumbing is professional, whose observability is portable, whose durability is non-negotiable, whose integration discipline is MCP-first, and whose marginal cost of shipping the third, fourth, and fifth agent has dropped by an order of magnitude. The board has a defensible answer to “what is your AI strategy.” The IT team has a defensible answer to “how do we operate this in production.” The engineering team has a stack that survives the next two years of vendor churn.

The recommendation is opinionated, but the opinions are defensible. The alternatives are real. The migration sequence is concrete. The 90-day plan is testable. That is the answer to the question this book set out to address.


Appendix A — Vendor Index (with pricing)

A consolidated index of every vendor named in the book, organised by category, with current pricing, licensing, and the section where each is treated in depth. All pricing is as of April 2026 and will go stale; verify before quoting any figure in a procurement document.

Agent runtimes (Section 3.1–3.9)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| Mastra | TypeScript agent framework, XState workflows, Replit/Brex/Adobe in production | OSS Apache-2.0; Cloud Starter $0, Teams $250/team/mo + $10/100k events + $10/GB egress | Apache 2.0 + Enterprise | 3.1 |
| LangGraph | Python/TS stateful agent framework, Pregel runtime, durable interrupts | OSS MIT; Plus $39/seat/mo, +$2.50/1k traces; Platform $0.001/node + standby fees | MIT | 3.2 |
| LangSmith | LangChain Inc.’s observability + prompt management | Plus $39/seat/mo; Enterprise custom (peer reports $2–5k/mo+) | Closed (cloud) / OSS-self-host gated | 3.2, 3.11 |
| Anthropic Agent SDK | Python/TS SDK; loads Skills, MCP, hooks; the claude -p twin | Free; per-token Anthropic API pricing applies | MIT | 3.3 |
| Anthropic Managed Agents | Hosted runtime, public beta April 8 2026 | $0.08/session-hour + tokens; no GA pricing committed | Closed | 3.3 |
| Anthropic Claude Code / claude -p | Mature CLI; Skills + MCP filesystem-loaded | Team plan $20/$100/seat/mo; API key required for production | Closed | 3.3 |
| Agent Skills | SKILL.md open standard, registered at agentskills.io | Free | Open standard | 3.3 |
| OpenAI Agents SDK | Python/TS, Sessions/Handoffs/Tools/Guardrails; built on Responses API | Free SDK; tokens per OpenAI rates; hosted tools metered separately | MIT | 3.4 |
| Inngest | Event-driven durable execution + AgentKit (TS-only) | Hobby $0; Pro $75/mo (1M execs); Enterprise custom | SSPL (server) + Apache (SDK) | 3.5, 3.10 |
| Trigger.dev | TypeScript-first durable execution with CRIU | Hobby $10/mo; Pro $50/mo + per-second compute | Apache 2.0 | 3.6, 3.10 |
| PydanticAI | Type-safe Python agent framework from the Pydantic team | Free | MIT | 3.7 |
| Logfire | OTel-pure observability paired with PydanticAI | Personal $0 (10M spans); Team $49/mo + $2/M; Growth $249/mo | Closed (cloud) | 3.7, 3.11 |
| AWS Bedrock AgentCore | Decomposable AWS agent runtime | $0.0895/vCPU-hr + $0.00945/GB-hr; Memory + Gateway + Identity metered | Closed | 3.8 |
| AWS Strands SDK | Apache-2.0 SDK paired with AgentCore; provider-agnostic | Free SDK; AgentCore meters apply if hosted | Apache 2.0 | 3.8 |
| Azure AI Foundry Agent Service | Microsoft’s consolidated agent stack | “No additional charge”; Bing grounding $14/1k transactions | Closed | 3.8 |
| Vertex AI Agent Builder | Google’s agent builder; Agent Engine runtime | $0.0864/vCPU-hr + $0.009/GiB-hr; 50 vCPU-hr free | Closed | 3.8 |
| Cloudflare Agents | Durable-Objects-as-agents; Project Think | Workers Paid $5/mo; DO + Workers AI + AI Gateway metered | Open SDK / closed runtime | 3.9 |

Durable execution (Section 3.10)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| Temporal | Category-defining external orchestrator, polyglot | Cloud Essentials $100/mo (1M actions); Business $500/mo (2.5M); $25–50/M overage | MIT (server) | 3.10 |
| Restate | Single-binary Rust durable execution + virtual objects | Cloud Free 50K actions; Starter $75/mo (5M); Premium $1,000/mo (50M) | BUSL 1.1 → Apache 2.0 | 3.10 |
| DBOS | Postgres-native durable execution as a library | OSS Transact MIT; Conductor Pro $99/mo, Teams $499/mo | MIT | 3.10 |
| Hatchet | Postgres-backed task queue + DAG + durable execution | Free dev tier; Team $500/mo; Scale $1,000/mo (HIPAA) | MIT | 3.10 |
| Prefect | Python data-orchestration veteran; Marvin agent kit | Hobby free; Team historically ~$400/mo; Pro custom | Apache 2.0 | 3.10 |
| AWS Step Functions | AWS-native state-machine orchestrator | Standard $25/M state transitions; Express $1/M requests | Closed | 3.10 |
| Cloudflare Workflows | TypeScript durable execution on Workers | Workers Paid $5/mo + per-million overages | Closed runtime | 3.10 |
| Airflow | Apache batch-and-scheduler, not real durable execution | Free OSS; managed clouds vary | Apache 2.0 | 3.10 |
| Dagster | Asset-graph data orchestrator | Solo $10/mo + $0.04/credit; Starter $100/mo | Apache 2.0 | 3.10 |

Observability and eval (Section 3.11)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| Langfuse | OSS LLM observability + prompt mgmt + evals; ClickHouse-acquired | Hobby $0; Core $29/mo; Pro $199/mo; Enterprise $2,499/mo; self-host free | MIT (core) | 3.11 |
| LangSmith | LangChain Inc. observability + datasets | Plus $39/seat/mo + $2.50/1k traces; Enterprise custom | Closed | 3.11 |
| Braintrust | Premium evals-first platform with Brainstore | Starter $0; Pro $249/mo + $3/GB; Enterprise custom | Closed | 3.11 |
| Phoenix / Arize AX | Source-available observability + commercial managed; OpenInference | Phoenix $0; AX Free 25K spans/mo; Pro $50/mo | ELv2 (Phoenix) / Apache 2.0 (OpenInference) / Closed (AX) | 3.11 |
| Helicone | Proxy-based observability; status uncertain (possible Mintlify acquisition) | Hobby $0; Pro $79/mo; Team $799/mo (verify status before adoption) | Apache 2.0 | 3.11, 3.13 |
| Logfire | Pydantic OTel-pure observability | Personal $0; Team $49/mo + $2/M; Growth $249/mo | Closed | 3.11 |
| Lunary | OSS observability with EU-residency focus | Free 10K events; Team $20/user/mo | Apache 2.0 | 3.11 |
| PromptLayer | Prompt-management-first | Free 5K req; Pro $50/seat/mo | Closed | 3.11, 3.23 |
| Datadog LLM Observability | Bolt-on to Datadog; per-LLM-span billing | Per-span billing on top of Datadog seats; bill increases of 40–200% reported | Closed | 3.11 |
| Honeycomb | High-cardinality OTel observability | Free 20M events/mo; Pro $130/mo+ | Closed | 3.11 |
| Inspect AI | UK AISI MIT eval framework | Free | MIT | 3.11 |
| Promptfoo | OSS prompt-and-model testing in CI | Free; paid Cloud tier exists | MIT | 3.11, 3.23 |

Multi-agent frameworks (Section 3.12)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| CrewAI | Role-playing multi-agent Python framework | Basic $99/mo; Standard $500/mo; Pro $1,000/mo; Enterprise $30–60k/yr | MIT | 3.12 |
| AutoGen / AG2 / Microsoft Agent Framework | MS Research → forked → consolidated as MAF 1.0 (April 2026) | Free | MIT | 3.12 |
| Agno (formerly Phidata) | Python SDK + AgentOS runtime | Free OSS; Pro $150/mo; Enterprise custom | Apache 2.0 | 3.12 |
| LlamaIndex Workflows | RAG-anchored multi-agent on LlamaIndex | Free OSS; LlamaCloud consumption-priced | MIT | 3.12 |

Model gateways (Section 3.13)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| LiteLLM | OSS Python SDK + proxy server | Free OSS; Enterprise custom | MIT | 3.13 |
| OpenRouter | Hosted multi-provider gateway, ~300 models | 5.5% credit fee; no token markup claimed | Closed | 3.13 |
| Portkey | Hosted gateway + observability with OSS core | Self-host free; Production $49/mo | MIT (core) / Closed (cloud) | 3.13 |
| Cloudflare AI Gateway | Edge-hosted gateway, free on every Cloudflare plan | Free; Logpush $0.05/M past quota | Closed | 3.13 |
| Vercel AI SDK | TypeScript code-layer abstraction | Free | Apache 2.0 | 2.5, 3.13, 3.16 |
| Kong AI Gateway | Kong’s existing API gateway with AI plugins | Per Kong pricing | Closed (commercial) | 3.13 |

Low-code / no-code (Section 3.14)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| n8n | Self-hostable fair-code workflow with AI Agent node | Community free; Cloud Starter €20/mo; Business €667/mo | Sustainable Use License | 3.14 |
| Activepieces | OSS AI-first Zapier alternative; 400+ MCP servers | Self-host free; Plus $25/mo; Business $150/mo | MIT (core) | 3.14 |
| Pipedream | Code-first workflow with hosted MCP server (10K+ tools) | Free 100 credits/day; Basic $29/mo; Advanced $79/mo | Closed (cloud) | 3.14 |
| Windmill | Code-first scripts-as-primitive, AGPLv3, AI Agent steps | Self-host free; Cloud tiers | AGPLv3 | 3.14 |
| Make (Integromat) | Credit-based iPaaS, AI Agents | Free 1K ops; Core $9/mo; Teams $29/mo | Closed | 3.14 |
| Zapier | Default integrator; Zapier Agents now separate billing | Free 400 activities; Pro 1,500 + add-ons | Closed | 3.14 |
| Lindy.ai | Agents-as-a-product; voice (Gaia) + computer-use; on Pipedream Connect | Free 400 credits; Pro $49.99/mo; Business $299/mo | Closed | 3.14 |
| Tray.io | Enterprise iPaaS | Pro ~$99/mo list; realistic deployments $1k+/mo | Closed | 3.14 |
| Workato | Enterprise iPaaS heavyweight | Standard $10k+/yr; Workato One (agentic) $144–216k/yr | Closed | 3.14 |

MCP ecosystem (Section 3.15)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| MCP spec 2025-11-25 | The protocol; donated to Linux Foundation Dec 2025 | Free open standard | Open standard | 2.6, 3.15 |
| Smithery | MCP registry/marketplace + hosting | Free to list; hosted execution usage-based | Closed (cloud) | 3.15 |
| Composio | Managed integration layer, ~250+ apps + per-user OAuth | Hobby $29/mo (200K calls); Business $229/mo (2M calls) | Closed | 3.15 |
| Arcade.dev | Composio competitor; startup program for sub-100-employee | Growth $25/mo + overages; $0.05/hr hosted MCP | Closed | 3.15 |
| Cloudflare Workers MCP | Workers/Durable-Objects-as-MCP-servers; Code Mode pattern | Workers Paid $5/mo base | Closed runtime / open SDK | 3.15 |
| FastMCP | Dominant Python MCP-server framework, 70% share, ~1M daily downloads | Free | Apache 2.0 | 2.6, 3.15 |
| Pipedream Connect | Hosted MCP server fronting 3K+ APIs, 10K+ tools | Per Pipedream pricing | Closed | 3.15 |
| mcp.run | WebAssembly-servlet registry, capability-restricted | Per platform tiers | Closed | 3.15 |
| mcp-use | Fullstack Python/TS MCP framework | Free | MIT | 3.15 |
| rmcp | Rust MCP implementation | Free | Apache 2.0 / MIT | 3.15 |

Minor frameworks (Section 3.16)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| Smolagents | Hugging Face’s “code-as-actions” minimalist library | Free | Apache 2.0 | 3.16 |
| DSPy | Stanford NLP “programming not prompting” + optimisers | Free | Apache 2.0 | 3.16 |
| Burr | State-machine framework, Apache-incubating | Free | Apache 2.0 | 3.16 |
| Haystack | deepset’s retrieval-strong production framework | Free OSS; Studio free 1 user; Enterprise custom | Apache 2.0 | 3.16 |
| Atomic Agents | “Anti-framework framework” Python on Pydantic + Instructor | Free | MIT | 3.16 |
| Raw LangChain | Original building-blocks library | Free | MIT | 3.16 |
| OpenHands (formerly OpenDevin) | OSS autonomous-coding agent, ~72K stars | Free | MIT | 3.16 |
| CAMEL-AI | Most-cited multi-agent research framework | Free | Apache 2.0 | 3.16 |
| ControlFlow | Discontinued; merged into Prefect Marvin | n/a | n/a | 3.16 |
| Semantic Kernel | Microsoft .NET-first SDK; merged into MAF 1.0 | Free | MIT | 3.16 |

Memory (Section 3.17)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| Mem0 | YC OSS-core dedicated memory product | Free tier; Pro $19/mo | Apache 2.0 (OSS core) / Closed (cloud) | 3.17 |
| Letta (formerly MemGPT) | Tiered memory architecture from Berkeley paper | OSS-first; cloud product available | Apache 2.0 | 3.17 |
| Zep | Temporal-knowledge-graph memory; production-mature | Cloud Starter $39/mo; OSS Community Edition | Apache 2.0 (CE) / Closed (Cloud) | 3.17 |
| Graphiti | Standalone OSS temporal-knowledge-graph engine (powers Zep Cloud) | Free | Apache 2.0 | 3.17 |
| Pinecone Assistant | Pinecone wrapper for chatbots | Per Pinecone tiers | Closed | 3.17 |

Sandboxes and browser automation (Section 3.18)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| E2B | Hosted Firecracker VMs for agent code execution | Pro $150/mo; pay-as-you-go ~$0.000014/CPU-sec | Closed (cloud) | 3.18 |
| Modal sandboxes | Sandboxes inside Modal’s serverless GPU platform | Modal per-second compute pricing | Closed | 3.18 |
| Daytona | OSS dev-environment-as-a-service with sandbox API | OSS free; cloud per usage | Apache 2.0 | 3.18 |
| Pyodide + Deno | In-browser/WASM bounded execution | Free | OSS | 3.18 |
| Browserbase | Managed headless-browser-for-agents, $40M Series B | Free tier; ~$0.05/min; team plans $99–$499/mo | Closed | 3.18 |
| Stagehand | Browserbase’s TypeScript framework | Free | Apache 2.0 | 3.18 |
| Browser Use | OSS Python browser-driving framework, ~91K stars | Free | MIT | 3.18 |
| Skyvern | OSS forms-and-workflow browser automation, ~21K stars | Free | AGPL 3.0 | 3.18 |
| Microsoft Playwright MCP | Official Playwright-as-MCP-server, ~32K stars | Free | Apache 2.0 | 3.18 |
| microsandbox | Lightweight Rust microVM sandbox runner | Free | Apache 2.0 | 3.18 |

Computer Use (Section 3.19)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| Anthropic Computer Use | Claude beta tool for screen-driven automation | Token rates apply (screenshot tokens dominate) | Closed | 3.19 |
| OpenAI Computer Use | Operator product line, Responses API tool | GPT-4o image-input rates | Closed | 3.19 |

Voice (Section 3.20)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| Vapi | Developer-platform real-time voice agents | $0.05/min Vapi side + provider passthrough | Closed | 3.20 |
| Retell | Vapi competitor; templates + voice cloning | $0.07–$0.10/min Retell side + passthrough | Closed | 3.20 |
| Bland AI | End-to-end voice stack, fastest setup | $0.09–$0.12/min typical | Closed | 3.20 |
| LiveKit Agents | OSS WebRTC + voice-agent framework, ~10K stars | Free; pay underlying STT/LLM/TTS | Apache 2.0 | 3.20 |
| Pipecat | Daily.co’s OSS voice-agent framework, ~12K stars | Free; pay underlying STT/LLM/TTS | BSD 2-Clause | 3.20 |
| ElevenLabs Conversational AI | TTS leader’s voice-agent product | Per-minute on top of ElevenLabs voice cost | Closed | 3.20 |
| OpenAI Realtime API | GPT-4o-mini-realtime endpoint underneath most vendors | GPT-4o realtime token rates | Closed | 3.20 |

Coding-agent reference (Section 3.21)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| Devin (Cognition) | Canonical autonomous-coding agent | $500/mo team + ACU usage | Closed | 3.21 |
| Cursor Agent | IDE-native agent | Cursor Pro $20/mo; Business $40/seat | Closed | 3.21 |
| Replit Agent | In-browser-IDE agent, runs on Mastra | Replit Core $25/mo | Closed | 3.21 |
| Lovable | No-code product builder | Paid tiers from ~$20/mo | Closed | 3.21 |
| GitHub Copilot Workspace | Planning + execution next to GitHub repos | Bundled in Copilot Business $19/seat/mo | Closed | 3.21 |
| opencode | OSS provider-agnostic terminal-native coding CLI, ~151K stars | Free | MIT | 3.21 |
| Aider | Original OSS coding-agent CLI, Git-aware, ~44K stars | Free | Apache 2.0 | 3.21 |
| OpenAI Codex CLI | OpenAI’s official OSS twin to hosted Codex, ~78K stars | Free | Apache 2.0 | 3.21 |
| Cline | Leading VS Code-resident agent extension, ~61K stars | Free | Apache 2.0 | 3.21 |
| Continue | VS Code agent extension, autocomplete + chat, ~33K stars | Free | Apache 2.0 | 3.21 |
| Goose | Block (Square) Inc.’s SOC-2-aware OSS agent, ~43K stars | Free | Apache 2.0 | 3.21 |
| Plandex | Plan-first OSS agent for multi-file long-horizon tasks, ~15K stars (slowing) | Free | MIT | 3.21 |
| OpenHands (formerly OpenDevin) | OSS Devin reference, sandboxed-VM coding+computer-use, ~72K stars | Free | MIT | 3.21 |
| Claude Code | Anthropic’s closed-source CLI counterpart to opencode | Bundled with Anthropic API key | Closed | 3.21 |

Research-as-a-service (Section 3.22)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| OpenAI Deep Research API | Multi-step web research with citations | ~$10 per 1K tool calls + token charges | Closed | 3.22 |
| Perplexity Sonar API | Lower-cost web-research API with citations | ~$5 per 1K queries Sonar; Pro/Reasoning higher | Closed | 3.22 |
| You.com API | AI-search API with citations | Comparable to Perplexity | Closed | 3.22 |

Prompt management (Section 3.23)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| PromptLayer | Prompt-management-first UI for non-engineers | Free 5K req; Pro $50/seat/mo; Enterprise custom | Closed | 3.23 |
| Humanloop | Anthropic-owned (2025) prompt+eval platform | Free trial; production tier Enterprise custom | Closed | 3.23 |

Personal-agent runtimes (Section 3.24)

| Vendor | What it is | Pricing | License | § |
|---|---|---|---|---|
| OpenClaw | Most-starred AI-agent project on GitHub; messaging-fronted local daemon, ~366K stars | Free | MIT | 3.24 |
| Claworc | Multi-instance OpenClaw manager, Docker + K8s-shaped, ~222 stars | Free | Apache 2.0 | 3.24 |
| NemoClaw | NVIDIA’s enterprise-readiness layer over OpenClaw with NeMo Guardrails, ~20K stars (alpha) | Free OSS; NVIDIA hardware tie-in | NVIDIA-owned OSS | 3.24 |
| AutoGPT | Original autonomous-agent project; pivoted to AutoGPT Platform, ~184K stars | Free OSS; Platform tier on cloud | MIT / Polyform Shield | 3.24 |
| Open Interpreter | Desktop-agent with --os computer-use mode, ~63K stars | Free | AGPL 3.0 | 3.24 |
| Aeon | Claude-Code-on-cron pattern for engineer-productivity agents | Free | OSS | 3.24 |
| SOUL.md | Community standard identity/personality/style markdown file | Free | MIT | 3.24 |

Appendix B — Glossary

A working glossary of acronyms, terms, and patterns used throughout the book.

Agent. A long-running LLM workflow that consumes events and produces side effects via tools. Distinguished from a chat completion by statefulness and tool use.

Agent farm. A fleet of agents sharing infrastructure (billing, observability, integration, durability, runtime). The unit of architectural decision in this book.

Agent loop. Perceive → reason → act → observe → loop or terminate. The five-step cycle every agent runs regardless of framework.

AgentCore. AWS’s decomposable agent infrastructure (Runtime, Memory, Gateway, Identity, Browser, Code Interpreter, Observability, Evaluations, Policy). Paired with the Strands SDK.

ASL (Amazon States Language). AWS Step Functions’ JSON-based state-machine definition language.

BUSL (Business Source License). A non-OSS license that converts to OSS after a delay. Used by Restate (BUSL 1.1 → Apache 2.0 after four years).

Cache read / cache write. Anthropic’s prompt-caching mechanism. Cache reads are billed at 10% of input rates; cache writes at 1.25× (5-min) or 2× (1-hr) of input rates. Relevant for long system prompts and tool definitions: at (say) a $3/M-token input rate, a 50K-token cached prefix costs $0.15 per call uncached, $0.1875 to write into the 5-minute cache, and $0.015 per subsequent read.

Camp one / camp two (runtime). The §2.2 distinction. Camp one: agent code runs anywhere (you own the process). Camp two: compute provider runs the agent (the vendor owns the process).

Checkpointer. LangGraph’s term for the durability layer that snapshots StateGraph state after every super-step.

CIMD (Client ID Metadata Documents). MCP 2025-11-25’s replacement for Dynamic Client Registration. Clients publish a JSON file at an HTTPS URL; the URL is the client ID.

Claude Code / claude -p. Anthropic’s CLI; -p is its headless print mode, and the Agent SDK is the production counterpart of that mode.

Code Mode. Cloudflare’s MCP pattern (2026) where a server exposes search() and execute() and agents write small TypeScript snippets that run in a sandboxed isolate. Cuts input tokens by ~99.9% on large APIs.

CRIU (Checkpoint/Restore in Userspace). Linux process-snapshot/restore mechanism. Trigger.dev uses it to suspend and resume long-running tasks.

Durable execution. A runtime guarantee that workflow state is persisted to durable storage, every step is recorded before and after execution, and a crashed workflow resumes from its last recorded step with exactly-once side effects. Distinct from retry-with-backoff.

FastMCP. The dominant Python framework for building MCP servers. Roughly 70% of Python MCP servers use it.

GenAI semconv. OpenTelemetry’s GenAI semantic conventions. Standardised span attributes for LLM and agent observability (gen_ai.system, gen_ai.request.model, etc.). Still officially experimental but stabilising.

Hooks (Anthropic SDK). PreToolUse, PostToolUse, Stop, SessionStart interception points for deterministic policy enforcement.

iPaaS. Integration Platform as a Service. The umbrella term for Zapier, Tray, Workato, Make, n8n, etc.

MCP (Model Context Protocol). The Anthropic-originated, now-Linux-Foundation-governed protocol for exposing tools, resources, and prompts to agents. Spec 2025-11-25 is the current version.

OpenInference. Arize/Phoenix’s OTel-aligned instrumentation library. Emits GenAI semconv-compliant spans portable across observability backends.

OTel (OpenTelemetry). The CNCF observability standard. The base layer of the GenAI semconv hedge.

Pregel. The execution model LangGraph uses; processes nodes in super-steps with concurrent fan-out via Send.

Resource Indicators (RFC 8707). Required by MCP 2025-11-25. Tokens are bound to a specific MCP server and cannot be replayed against another.

Skill (Anthropic). A SKILL.md file with frontmatter (name, description, model, tools) and Markdown body. Open standard registered at agentskills.io. Loaded by the Agent SDK and claude -p from .claude/skills/*/.

SSPL (Server Side Public License). A non-OSS license used by Inngest’s server (with a delayed conversion to Apache).

StateGraph. LangGraph’s core abstraction. Typed channels with optional reducers, plus nodes that update state.

Stdio transport. MCP’s local-process transport (process-to-process via stdin/stdout). Used for local servers.

Streamable HTTP transport. MCP’s remote transport, with optional SSE for server-initiated messages. Replaces the deprecated HTTP+SSE two-endpoint pattern.

Strands. AWS’s agent SDK. Apache-licensed and runnable anywhere; pairs with AgentCore as the AWS hosted runtime.

Super-step. LangGraph’s unit of execution. Multiple node functions run concurrently in one super-step; checkpointers snapshot after each.

Tasks (MCP). An experimental MCP 2025-11-25 primitive for long-running operations.

Tier (Anthropic). API rate-limit tier. Tier 4 is the high-volume production tier most production agent farms target.

Virtual object (Restate). A keyed stateful actor with single-writer guarantees. The right primitive for “this deal owns a durable agent forever.”


Appendix C — Sources

The book is synthesised from twenty parallel research subagents plus a reviewer pass. The verbatim research outputs are saved at /Users/asac/Projects/hestiia/research/raw/.

Raw research files (the working bibliography).

  • 00-INDEX.md — Directory index with cross-references and gap list
  • 01-toc-cto-pov.md — CTO/decision-maker structure proposal
  • 02-toc-practitioner-pov.md — AI engineer structure proposal
  • 03-toc-skeptic-pov.md — Skeptical IT VP structure proposal
  • 04-mastra-deep-dive.md — Mastra
  • 05-langgraph-deep-dive.md — LangGraph + LangSmith
  • 06-anthropic-stack.md — Claude Agent SDK + Managed Agents + Skills
  • 07-inngest-agentkit.md — Inngest + AgentKit
  • 08-trigger-dev.md — Trigger.dev v4
  • 09-pydanticai-logfire.md — PydanticAI + Logfire
  • 10-openai-agents-sdk.md — OpenAI Agents SDK
  • 11-observability-roundup.md — Observability and eval platforms
  • 12-hyperscaler-platforms.md — AWS / Azure / Vertex / Cloudflare
  • 13-durable-execution.md — Temporal / Restate / DBOS / Hatchet / Prefect / Step Functions / Cloudflare Workflows
  • 14-multi-agent-frameworks.md — CrewAI / AutoGen / AG2 / MAF / Agno / LlamaIndex
  • 15-model-gateways.md — LiteLLM / OpenRouter / Portkey / Cloudflare AI Gateway / Vercel AI SDK / Kong
  • 16-low-code-platforms.md — n8n / Activepieces / Pipedream / Windmill / Make / Zapier / Lindy / Tray / Workato
  • 17-mcp-ecosystem.md — MCP spec 2025-11-25 / Smithery / Composio / Arcade / Cloudflare MCP / FastMCP / Pipedream Connect / mcp.run / mcp-use
  • 18-minor-frameworks.md — Smolagents / DSPy / Burr / Haystack / Atomic Agents / OpenHands / CAMEL-AI
  • 19-disaster-scenarios.md — Month-12 failure modes per framework + 90-day early warning signs
  • 20-futurist-2027-2028.md — Per-vendor 2-year projections + macro trajectory
  • 21-reviewer-gaps.md — Reviewer audit identifying 7 MUST-add categories and stale-claim flags
  • 22-oss-self-hosted-gap-analysis.md — Open-source / self-hosted layer-by-layer audit, Phoenix and Helicone factual corrections, the OSS-purist Hestiia stack costed at ~$32K/year vs. ~$17.5K/year for the recommended stack, and the model-vs-plumbing lock-in split that became §5.6 Risks 6a/6b
  • 23-mcp-manager-book-fit.md — Codebase-verified analysis of /Users/asac/Projects/hestiia/mcp-manager — 30K LOC, four pnpm packages, ten typed connectors, three OAuth-2.1 remote proxies — and the IP-vs-plumbing recategorisation that became the new §5.1 paragraph and the §3.15 sentence

Key external references cited in the book.