Agentic Workflows That Work in Production (and 3 Anti-Patterns)

Why this post exists

Most agentic workflows that go viral never ship. The architecture looks clean in a Jupyter notebook, the demo runs flawlessly against a curated dataset, and then it hits a real user and falls apart within forty-eight hours. We've watched it happen enough times, both in our own builds and in client systems we've been brought in to stabilise, that we stopped being surprised.

This post is a companion to our Agentic Development 2026 piece, which covers the broader landscape. Here we get specific. Across the six agentic systems we shipped in production this past year, three patterns showed up consistently in the ones that held up, and three others showed up consistently in the ones that didn't. We'll cover all six.

No hedging. No "it depends on your use case" cop-outs where they don't apply. These are opinions earned by running things in prod.

Pattern 1: Orchestrator + workers

You have a central LLM — the orchestrator — whose only job is to delegate. It receives the user's goal, breaks it into sub-tasks, and hands each sub-task to a specialised worker agent: one for research, one for code generation, one for summarisation, one for output formatting. Each worker has a narrow scope, a defined input schema, and a defined output schema. The orchestrator stitches results together.

Why it works: bounded scope per worker means each agent is easy to evaluate, easy to swap, and easy to debug when something goes wrong. "The code agent returned malformed JSON" is a tractable problem. "The agent did something unexpected" is not. Clear interfaces between orchestrator and workers give you that tractability.
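
To make the "tractable problem" point concrete, here is a minimal sketch of a worker contract check. The names (WorkerContract, ContractError, check_input, parse_output) are hypothetical and not taken from any framework; the only idea it illustrates is that every failure gets reported against one named worker.

```python
import json
from dataclasses import dataclass


class ContractError(ValueError):
    """Carries the worker's name so a failure points at one component."""


@dataclass(frozen=True)
class WorkerContract:
    # Hypothetical contract: required keys on the way in and on the way out.
    name: str
    input_keys: frozenset
    output_keys: frozenset


def _require(name: str, side: str, required: frozenset, data: dict) -> None:
    missing = required.difference(data)
    if missing:
        raise ContractError(f"{name} {side} missing keys: {sorted(missing)}")


def check_input(contract: WorkerContract, payload: dict) -> dict:
    """Validate what the orchestrator is about to hand to a worker."""
    _require(contract.name, "input", contract.input_keys, payload)
    return payload


def parse_output(contract: WorkerContract, raw: str) -> dict:
    """Parse a worker's raw text reply and check it against the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"{contract.name} returned malformed JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ContractError(f"{contract.name} returned a JSON {type(data).__name__}, expected an object")
    _require(contract.name, "output", contract.output_keys, data)
    return data
```

With this in place, a bad reply surfaces as an error that names the worker, rather than as a vague downstream failure.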

When it earns its complexity: workflows with three to five genuinely distinct sub-tasks. If you can squint and see that your workflow is really just research → draft → review, this pattern fits. If you're forcing it onto a two-step workflow, you're paying orchestration overhead for no gain.

A rough sketch of what the orchestrator prompt looks like in practice: it receives the goal, emits a structured plan (JSON with task type, inputs, and expected output format for each step), then dispatches workers in sequence or parallel depending on dependencies. The workers never talk to each other. Only the orchestrator holds state.
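
The plan-then-dispatch flow described above can be sketched in a few lines. This is an illustration under assumptions, not a reference implementation: the plan format (id, task, inputs, depends_on), the two worker functions, and run_plan are all made up for the example, and the workers are plain functions standing in for LLM-backed agents.

```python
import json
from typing import Callable, Dict


# Hypothetical workers: plain functions standing in for LLM-backed agents.
def research(inputs: dict) -> dict:
    return {"notes": f"notes on {inputs['topic']}"}


def summarise(inputs: dict) -> dict:
    return {"summary": inputs["notes"].upper()}


WORKERS: Dict[str, Callable[[dict], dict]] = {"research": research, "summarise": summarise}

# Example of a structured plan the orchestrator might emit (the format is an assumption).
PLAN_JSON = """
[
  {"id": "s1", "task": "research", "inputs": {"topic": "vendor X"}, "depends_on": []},
  {"id": "s2", "task": "summarise", "inputs": {}, "depends_on": ["s1"]}
]
"""


def run_plan(plan_json: str) -> dict:
    steps = json.loads(plan_json)
    results: dict = {}  # only the orchestrator holds state
    pending = list(steps)
    while pending:
        # A step is ready once every step it depends on has produced a result.
        ready = [s for s in pending if all(d in results for d in s["depends_on"])]
        if not ready:
            raise ValueError("plan has a cycle or an unknown dependency")
        for step in ready:  # ready steps are independent and could run in parallel
            inputs = dict(step["inputs"])
            for dep in step["depends_on"]:
                inputs.update(results[dep])  # the orchestrator forwards upstream outputs
            results[step["id"]] = WORKERS[step["task"]](inputs)
        pending = [s for s in pending if s["id"] not in results]
    return results
```

Calling run_plan(PLAN_JSON) runs s1, then feeds its output into s2. The workers never see each other; every hand-off goes through the results dictionary the orchestrator owns.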

We've shipped this successfully on Anthropic's Claude models, with Sonnet as the orchestrator and the smaller Haiku as workers for high-volume steps. The cost profile is much better than running everything through Sonnet end-to-end.

Pattern 2: ReAct + bounded tools

Single agent. Small tool set — five to ten tools maximum. The agent thinks, then acts, then thinks again based on what it observed. Think-act-observe, looped until the goal is met or a stop condition fires. This is the ReAct pattern, and it's been around long enough that the tooling is mature.

Thought: I need to check the current pricing for this vendor.
Action: web_search("vendor X enterprise pricing 2026")
Observation: [search results]
Thought: The pricing page is behind a form. I'll try their docs instead.
Action: web_fetch("https://docs.vendor-x.com/pricing")
Observation: [page content]
Thought: I have the pricing. Now I can draft the comparison.
Action: draft_comparison(...)
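
The loop behind a trace like the one above can be sketched as follows. This is a minimal, offline illustration rather than how LangChain or LlamaIndex implement it: the tools return canned strings, scripted_policy is a hypothetical stand-in for the LLM call, and max_steps plays the role of the stop condition.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tools returning canned strings; real ones would hit the network.
TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": lambda query: "pricing page is behind a form",
    "web_fetch": lambda url: "enterprise tier: contact sales",
}

# A policy maps the trace so far to (thought, action, argument).
# An action of None means the agent is done and the argument is the answer.
Step = Tuple[str, Optional[str], str]


def scripted_policy(trace: List[str]) -> Step:
    """Stand-in for the LLM call, scripted so the sketch runs offline."""
    observations = sum(1 for line in trace if line.startswith("Observation:"))
    if observations == 0:
        return ("I need to check pricing.", "web_search", "vendor X enterprise pricing")
    if observations == 1:
        return ("I'll try their docs instead.", "web_fetch", "https://docs.vendor-x.com/pricing")
    return ("I have the pricing.", None, "enterprise tier: contact sales")


def react(policy: Callable[[List[str]], Step], max_steps: int = 6) -> Tuple[str, List[str]]:
    trace: List[str] = []
    for _ in range(max_steps):
        thought, action, arg = policy(trace)
        trace.append(f"Thought: {thought}")
        if action is None:
            return arg, trace  # goal met
        if action in TOOLS:
            observation = TOOLS[action](arg)
        else:
            observation = f"unknown tool {action!r}"  # feed the mistake back to the policy
        trace.append(f"Action: {action}({arg!r})")
        trace.append(f"Observation: {observation}")
    raise RuntimeError(f"stop condition: no answer within {max_steps} steps")
```

The returned trace is the debugging artefact: it records every thought, action, and observation in order, and the step limit keeps a confused agent from looping indefinitely.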

Why it works: it's predictable, it's cheap relative to multi-agent setups, and it's debuggable. You can read the thought trace and understand exactly where it went wrong. LangChain and LlamaIndex both have solid ReAct implementations. LlamaIndex's ReActAgent with a custom tool set is where we've landed for most "do X with Y" workflows that don't need specialised sub-agents.

The constraint that makes it work is the tool count. Five to ten tools. Not twenty-five. Past roughly twelve to fifteen tools in the context, LLMs start making worse tool-selection decisions. That's not because they're incapable, but because the tool descriptions start competing and the signal degrades. More on this in anti-pattern 3.

When to use it: most single-goal workflows that need external data or action. "Research competitors and produce a one-pager." "Check our database for anomalies and summarise the findings." "Given this support ticket, look up the customer record, identify the issue, and draft a response." These are ReAct problems.

Pattern 3: Supervised pipelines

The LLM is one node in a deterministic graph. Before the LLM does anything, upstream steps validate and normalise the input. After the LLM returns output, downstream steps verify it — schema checks, business rule assertions, format validation — before anything goes to the user or the next system.

This is the most boring pattern on the list. It's also the most reliable one, by a wide margin.

LangGraph is the tool we reach for here. You define nodes (some deterministic Python, some LLM calls), edges between them, conditional routing based on output, and retry logic for verification failures. The graph is explicit. You can draw it on a whiteboard. You can test every node independently.

Why it works: you're not trusting the LLM to get it right on the first try without guardrails. The pre-LLM steps mean the model only ever sees clean, normalised input. The post-LLM steps catch the cases where it returned something technically coherent but factually wrong or structurally invalid. That second check is the difference between "the agent hallucinated a number and it propagated downstream" and "the agent hallucinated a number and we caught it before it touched the database."
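
Stripped of any framework, the shape is normalise, call, verify, retry. The sketch below uses plain Python rather than LangGraph so it stays self-contained; the function names, the "total" field, and the non-negative rule are all hypothetical examples of a schema check and a business rule.

```python
import json
from typing import Callable


def normalise(raw: str) -> str:
    """Pre-LLM step: collapse whitespace and reject empty input."""
    text = " ".join(raw.split())
    if not text:
        raise ValueError("empty input")
    return text


def verify(output: str) -> dict:
    """Post-LLM step: a schema check plus one business rule (both hypothetical)."""
    data = json.loads(output)  # structural validity
    total = data.get("total") if isinstance(data, dict) else None
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("missing numeric 'total'")
    if total < 0:
        raise ValueError("total must be non-negative")
    return data


def run_pipeline(raw: str, llm: Callable[[str], str], max_attempts: int = 3) -> dict:
    text = normalise(raw)  # the model only ever sees normalised input
    last_error = None
    for _ in range(max_attempts):
        try:
            return verify(llm(text))  # nothing leaves the pipeline unverified
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
    raise RuntimeError(f"verification failed after {max_attempts} attempts: {last_error}")
```

Here llm is any callable that takes the normalised text and returns a string, which also makes each node easy to test with a fake.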

When to use it: regulated industries, high-stakes decisions, any workflow where a wrong output has real downstream consequences. Healthcare data processing, financial document extraction, contract review pipelines. If you'd want a human to double-check it, build the double-check into the graph.

The overhead is real: more code to write, more nodes to maintain, more tests to cover each node. It's worth it every time.

Anti-pattern 1: Multi-agent debate for simple tasks

This one got popular after a run of Stanford and MIT papers showing that having multiple LLMs "debate" each other improves reasoning quality on benchmarks. It does, in some narrow conditions. What that research doesn't cover is what happens when you deploy it.

The pattern: spawn three agents with slightly different system prompts, have them each produce an answer, then have a "judge" agent synthesise the best response. For complex reasoning tasks where you need multiple perspectives, there's a real case for this. For most production tasks, it's theatre.

Three debaters plus a judge means up to four model calls where one would do, so inference cost is three to four times higher. Latency is roughly three times worse. The quality lift, when we've measured it on actual production workloads (not benchmarks), ranges from marginal to undetectable. A single well-prompted Claude Sonnet call with chain-of-thought beats the debate setup on most tasks we've tested, at a fraction of the cost.

Where it genuinely helps: adversarial red-teaming, edge-case exploration during development, tasks where you actually want diverse perspectives rather than a single best answer. That's a small fraction of production workflows.

The Stanford-paper-meets-production gap exists because papers optimise for benchmark performance and researchers don't pay the inference bill. You do.

Anti-pattern 2: Agent-of-agents recursion

The architecture looks like this: a top-level agent receives a goal, decides it needs to spawn a sub-agent to handle part of the work, and that sub-agent spawns another sub-agent for its own sub-tasks, and so on. In a demo with a tightly scripted goal and perfectly cooperative tools, this looks like emergent intelligence. In production, it looks like a bill.

Two things happen reliably as you go deeper into the recursion:

First, you lose signal. Each level of indirection introduces a new opportunity for the goal to get paraphrased, compressed, or subtly reinterpreted. By the time a leaf agent is working, it may be operating on a goal description that's three transformations away from what the user actually asked for. Errors compound. Hallucinated intermediate results get passed down as facts.

Second, debugging becomes near-impossible. "The final output was wrong" triggers an investigation where you're tracing execution across five levels of nested agent calls, each with its own context window, tool history, and intermediate state. Good luck. Even with comprehensive logging (which most teams don't have, because recursive agent-spawning systems are hard to instrument), the trace is bewildering.

The OpenAI Swarm framework and some LangGraph patterns make it easy to build recursive agent trees. Ease of construction is not a signal that the architecture is sound. We've seen two client projects get abandoned specifically because the recursive agent setup became unmaintainable before it shipped.

If you think you need recursion, you probably need Pattern 1 instead.
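
One way to hold a system to that flat shape is a depth guard on spawning. The sketch below is a hypothetical illustration (AgentContext and DepthExceeded are invented names, not part of Swarm or LangGraph): the orchestrator sits at depth 0, workers at depth 1, and anything deeper fails loudly. It also carries the user's original goal verbatim, so a worker never operates on a paraphrase of a paraphrase.

```python
from dataclasses import dataclass


class DepthExceeded(RuntimeError):
    """Raised when an agent tries to spawn below the allowed depth."""


@dataclass(frozen=True)
class AgentContext:
    goal: str            # the user's goal, carried verbatim rather than paraphrased
    task: str = ""       # this agent's own sub-task
    depth: int = 0       # 0 = orchestrator, 1 = worker
    max_depth: int = 1   # assumption: one level of delegation, as in Pattern 1

    def spawn(self, subtask: str) -> "AgentContext":
        if self.depth >= self.max_depth:
            raise DepthExceeded(f"agent at depth {self.depth} may not spawn {subtask!r}")
        return AgentContext(self.goal, subtask, self.depth + 1, self.max_depth)
```

A worker that calls spawn gets an exception instead of a sub-agent, which turns an architectural rule into something a test can catch.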

Anti-pattern 3: "Just give it more tools and it'll figure it out"

The fifty-tool agent. Every possible action your system can take, crammed into a single tool manifest and handed to an LLM. "It's general-purpose. It can handle anything."

It can't, reliably.

LLMs degrade noticeably past roughly twelve to fifteen tools in the prompt. Not in a dramatic, catastrophic way — in a subtle, insidious way. The model starts picking tools that are almost right rather than exactly right. It starts combining tools in sequences that technically work but are inefficient. It occasionally hallucinates a tool that would have been useful, rather than using the closest available one. These failures are hard to catch because the outputs still look plausible.

The root cause is attention and context competition. Tool descriptions take up tokens. More tool descriptions mean less attention on the actual task. The model's ability to reason about which tool is correct degrades as the choice space grows.

The fix is Pattern 1: split the large tool set across specialised agents, each with five to ten tools matched to their specific scope. A research agent has search and fetch tools. A code agent has execution and linting tools. A data agent has database query and transform tools. The orchestrator routes to the right worker. No single agent drowns in options.

We've audited several teams running large-scale LlamaIndex deployments where the tool set had grown organically to thirty-plus entries. Splitting into specialised agents with bounded tool sets improved task completion accuracy by measurable margins every time. It's not magic — it's just not asking the model to pick the right tool from a crowd.

The production-shipping pattern is boring, bounded, and observable, not autonomous, emergent, and multi-agent. The latter makes great conference demos. The former pays rent. Clear scope per agent, limited tool surfaces, deterministic guardrails around LLM calls: that's where reliability comes from.

Reveronix builds agentic systems for clients who need them to actually work, not to impress in a slide deck. If you're evaluating where agentic architecture fits your product, that's the conversation we're built for.
