Every fortnight someone in the industry releases a new agent framework. Each one promises autonomous reasoning, multi-step tool use, self-correction, and the kind of demo video that makes a CFO start asking when the company is going to "do AI agents." Most of them are genuinely impressive on the demo stage. Very few of them survive contact with a real production workflow.

This is not because the underlying models are bad. It is because the layer between the model and the user — the orchestration, the state management, the retry semantics, the observability — is treated as a creative space when it should be treated as plumbing. And plumbing, when it is done well, is profoundly boring.

The agents that ship and stay shipped are the ones built like distributed systems first and AI products second.

The demo-to-production gap is wider than most teams expect

A working demo of an agent typically depends on the happy path: a user types a clean request, the model picks a reasonable tool, the tool returns a clean response, and the model synthesises an answer. The whole thing takes eight seconds and looks like magic.

A real production agent has to handle: malformed inputs, ambiguous instructions, tool calls that time out, tool calls that succeed but return unexpected schemas, downstream APIs that go down, rate limits, retries that compound costs, model responses that occasionally invent tools that don't exist, context windows that overflow on the third tool call, users who cancel mid-execution, and the auditors who arrive six weeks later asking why a particular decision was made.

None of those problems are solved by picking a better orchestration framework. They are solved by treating your agent as a stateful distributed system and applying the engineering disciplines that distributed systems have required for forty years.

The boring foundations that actually matter

State management is not optional

An agent is, by definition, a stateful entity. It accumulates context. It makes decisions based on previous tool calls. It needs to remember where it is in a workflow. The temptation in early prototypes is to keep this state in-memory, in the language model's context window, or — worst of all — in both with no clear source of truth.

The first production discipline is to externalise state. Pick a durable store. Postgres works for most use cases. Redis works when you need lower latency on hot keys. Whatever you pick, the agent's working memory should live somewhere your team can query, replay, and audit. Our generative AI engagements almost always start by sketching the state model on a whiteboard before anyone touches the model API.
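
To make this concrete, here is a minimal sketch of externalised run state, assuming Postgres accessed through psycopg 3. The table name, columns, and helper functions are illustrative, not a prescribed schema.

```python
# Minimal externalised agent state, assuming Postgres via psycopg 3.
# All names here are illustrative.
import json
import uuid

import psycopg  # assumed dependency: pip install psycopg

# Run once at startup; the database, not the process, owns the state.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_runs (
    run_id     UUID PRIMARY KEY,
    status     TEXT NOT NULL,                       -- 'running' | 'done' | 'failed'
    step       INT NOT NULL DEFAULT 0,
    memory     JSONB NOT NULL DEFAULT '[]'::jsonb,  -- accumulated tool calls and results
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
"""

def start_run(conn: psycopg.Connection) -> uuid.UUID:
    """Create a new run row before the first model call."""
    run_id = uuid.uuid4()
    conn.execute(
        "INSERT INTO agent_runs (run_id, status) VALUES (%s, 'running')",
        (run_id,),
    )
    return run_id

def record_step(conn: psycopg.Connection, run_id: uuid.UUID, event: dict) -> None:
    """Append one event (a tool call, a result, a decision) to working memory."""
    conn.execute(
        """
        UPDATE agent_runs
        SET memory = memory || %s::jsonb, step = step + 1, updated_at = now()
        WHERE run_id = %s
        """,
        (json.dumps([event]), run_id),
    )
```

Because the store owns the state, a crashed worker can be replaced and the run resumed from its last recorded step, and an auditor can query the memory column directly rather than reconstructing a run from process logs.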

Retries are a design decision, not a default

Every tool call in an agent stack is a network operation that can fail. The naive approach is to wrap each tool call in a try/except and retry on failure. The slightly less naive approach is to add exponential backoff. The actually correct approach is to ask a different question entirely: what does idempotency mean in this context, and how do we encode it?

If your agent calls a tool that books a flight, and that tool times out, you do not want to retry blindly: the booking may have gone through. If it calls a tool that fetches a customer record and times out, you absolutely want to retry. The retry semantics for every tool in your stack should be an explicit design decision, documented somewhere a junior engineer can find them. Most teams discover this after the first incident.
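
One way to encode those decisions is a policy table that the agent loop consults before every call, and that refuses to call any tool it has no entry for. This is a sketch under assumed names (ToolPolicy, call_tool), not a framework API.

```python
# Per-tool retry semantics as explicit data, not a blanket try/except.
# ToolPolicy and call_tool are illustrative names.
import time
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ToolPolicy:
    idempotent: bool        # is this call safe to repeat?
    max_attempts: int = 1   # 1 means "never retry"
    backoff_s: float = 0.5  # base for exponential backoff

POLICIES = {
    "fetch_customer_record": ToolPolicy(idempotent=True, max_attempts=4),
    "book_flight":           ToolPolicy(idempotent=False),  # a timeout may still have booked
}

def call_tool(name: str, fn: Callable[[], dict]) -> dict:
    policy = POLICIES[name]  # no documented policy, no call
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn()
        except TimeoutError:
            if not policy.idempotent or attempt == policy.max_attempts:
                raise  # surface to the agent loop; a visible failure beats a duplicate booking
            time.sleep(policy.backoff_s * 2 ** (attempt - 1))
```

For non-idempotent writes, the natural next step is an idempotency key passed to the downstream API, so that a retry after a timeout cannot produce a second booking.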

Observability comes first, not last

Agentic systems are uniquely hard to debug because failures are often emergent. The model made a slightly odd decision three turns ago, which led to a slightly odd tool call, which returned a slightly odd response, which caused the final answer to be wrong. Without comprehensive trace logging — every prompt, every model response, every tool call, every tool response, with timing — debugging is essentially impossible.

Build the observability layer before you build the second tool. We have learned this the hard way enough times that our standard engagement now treats trace infrastructure as a first-week deliverable.
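
As a sketch of the shape this takes, here is a span helper that writes one structured record per prompt, model response, or tool call, with timing. The field names and the emit() sink are placeholders for whatever trace store you use.

```python
# One structured trace record per model call or tool call, with timing.
# Field names and the emit() sink are placeholders.
import json
import time
import uuid
from contextlib import contextmanager

def emit(record: dict) -> None:
    # In production this goes to your trace store; stdout keeps the sketch runnable.
    print(json.dumps(record))

@contextmanager
def span(run_id: str, kind: str, payload: dict):
    """Wrap every prompt, model response, and tool call in a timed span."""
    record = {"run_id": run_id, "span_id": str(uuid.uuid4()), "kind": kind, "input": payload}
    started = time.time()
    try:
        yield record  # the caller attaches record["output"] before the block exits
        record["status"] = "ok"
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = int((time.time() - started) * 1000)
        emit(record)

# Usage:
#   with span(run_id, "tool_call", {"tool": "fetch_customer_record"}) as rec:
#       rec["output"] = fetch_customer_record(customer_id)
```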

When clever is actually justified

None of this is an argument against sophisticated agent design. It is an argument for sequencing. The boring infrastructure has to exist before the clever orchestration is useful. Once you have:

  - externalised state in a durable store your team can query and replay
  - explicit retry and idempotency semantics for every tool
  - trace logging of every prompt, model response, and tool call

Then — and only then — you have earned the right to add cleverness. Multi-step planning, dynamic tool selection, self-correction loops, sub-agents, hierarchical reasoning: these are all genuinely useful techniques, but they amplify both the strengths and the failure modes of your underlying infrastructure. If the infrastructure is shaky, cleverness makes things worse, not better.

Cleverness amplifies infrastructure. Good infrastructure makes a clever agent reliable. Bad infrastructure makes a clever agent unpredictable.

A pragmatic sequence for shipping an agent

If we were starting an agent project today, the rough sequence would look something like this:

  1. Weeks one and two. Define the workflow precisely. Identify the tools needed. Build a non-agentic version using deterministic code and a single LLM call. If this version is not useful, no amount of agentic sophistication will save it.
  2. Weeks three and four. Externalise state into a durable store. Wrap every tool with explicit retry and idempotency semantics. Add structured logging.
  3. Weeks five and six. Introduce the agent loop. Start with a single tool. Measure failure modes. Add the second tool only after the first is reliable.
  4. Weeks seven and eight. Build the evaluation harness. Define quality metrics. Establish a baseline. From this point onward, no change ships without an eval delta; a minimal gate is sketched after this list.
  5. Week nine and beyond. Add complexity carefully, one capability at a time, with the eval harness catching regressions at each step.
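
As one illustration of the gate in step four, here is a minimal harness, assuming a JSONL test set with input and expected fields. The exact-match scorer is a deliberate stand-in; real scoring is usually rubric-based or model-graded.

```python
# A minimal eval gate: no change ships without a before-and-after number.
# load_examples, score, and eval_gate are illustrative names.
import json
from typing import Callable

def load_examples(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

def score(actual: str, expected: str) -> bool:
    # Simplest possible check; swap in rubric- or model-based grading in practice.
    return actual.strip() == expected.strip()

def evaluate(run_agent: Callable[[str], str], examples: list[dict]) -> float:
    """Fraction of historical examples the agent handles acceptably."""
    passed = sum(score(run_agent(ex["input"]), ex["expected"]) for ex in examples)
    return passed / len(examples)

def eval_gate(candidate: Callable[[str], str], baseline: float, examples: list[dict]) -> bool:
    """Block any change that regresses the established baseline."""
    candidate_score = evaluate(candidate, examples)
    print(f"baseline={baseline:.3f}  candidate={candidate_score:.3f}")
    return candidate_score >= baseline
```

The useful property here is not the scorer but the gate itself: every change, however small, produces a comparable number against the same fixed examples.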

This sequence is slower than the alternative — which is to grab a framework, wire up six tools, and chase demos — but it produces systems that ship and stay shipped. We have used variations of this sequence on every meaningful AI engagement we have led, and the difference in long-term reliability is not subtle.

What this looks like in practice

For one of our retail clients, we spent the first six weeks of an agent project building exactly none of the agentic logic. We built a state store, instrumented six existing internal APIs with consistent retry semantics, designed the trace schema, and stood up an evaluation harness with a 200-example test set drawn from real historical workflows.

The actual agent loop came together in the seventh week and took three engineers four days to write. It worked first time, because every interesting failure had already been thought through in the infrastructure layer. Six months later, the system is still in production, still passing the same eval suite, and the team has added five new tools without any of them destabilising the original workflow.

That is what boring infrastructure buys you. Not slower delivery. Faster delivery, with less drama, at a lower cost to maintain. If you are building an agent right now, the most useful thing you can do this week is to stop adding tools and start writing down what happens when each one fails.

Work with us

Have a project that needs senior engineering attention?

We work with founders and enterprise teams across Dubai, the US, and India. If something here resonates with what you're building, we'd be glad to talk.

Start a conversation →