Stop Building State Machines for Your AI Agents (use Durable Functions instead)

I built a sample that I think captures something important: AI agents that interact with the real world need workflows that pause, and Durable Functions make this much easier than current alternatives.

The Problem

Say you’re building a support agent. A customer asks for a refund. The agent can look up the order, check the return policy, and decide a refund is warranted — but it can’t just issue the refund. A human needs to approve it.

[Screenshots: the user’s conversation in Teams, and the supervisor dashboard]

So now you need to:

Save the pending request somewhere
Pause the workflow
Wait for a supervisor to approve or reject (could be hours or days)
Resume exactly where you left off
Process the refund and notify the customer

The typical approach? A state machine. You model every state (pending_approval, approved, processing, completed), every transition, and wire up polling or webhooks to detect when things change. You write a bunch of glue code to serialize context, handle edge cases, and coordinate between services.
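To make that concrete, here is a minimal sketch of what the hand-rolled version tends to look like. It is not taken from the sample: the event names (approve, reject, start_refund, refund_done) and the rejected state are hypothetical, and a real version would also need persistence, polling or webhooks, and retry handling around it.

```typescript
// Hypothetical hand-rolled state machine for a refund case (illustration only).
type CaseStatus = 'pending_approval' | 'approved' | 'processing' | 'completed' | 'rejected';
type CaseEvent = 'approve' | 'reject' | 'start_refund' | 'refund_done';

// Every legal transition has to be listed by hand.
const transitions: Record<CaseStatus, Partial<Record<CaseEvent, CaseStatus>>> = {
  pending_approval: { approve: 'approved', reject: 'rejected' },
  approved: { start_refund: 'processing' },
  processing: { refund_done: 'completed' },
  completed: {},
  rejected: {},
};

// Returns the next status, or throws if the event is not allowed in this state.
function nextStatus(current: CaseStatus, event: CaseEvent): CaseStatus {
  const next = transitions[current][event];
  if (next === undefined) {
    throw new Error(`Illegal transition: ${event} while ${current}`);
  }
  return next;
}

const afterApproval = nextStatus('pending_approval', 'approve'); // 'approved'
const afterStart = nextStatus(afterApproval, 'start_refund');    // 'processing'
```

Even this small table only covers the transitions. Where each case currently sits, and what wakes it up, still has to live somewhere else.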

It works. It’s also tedious, error-prone, and obscures what’s actually a simple workflow.

The Durable Functions Approach

Let’s start with the diagram. A customer asks the bot for a refund. The bot uses AI to look up the order, creates a case, and starts a Durable Functions orchestration that pauses until a supervisor approves or rejects it. Once approved, the orchestrator processes the refund and notifies the customer, all without polling or a state machine.

[Figure: sequence diagram showing the full refund workflow, from customer message through AI tool calling, Durable Functions orchestration, supervisor approval, and proactive notification]

Here’s the entire approval workflow in my sample:

			
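import { OrchestrationHandler } from 'durable-functions';

export const supportCaseOrchestrator: OrchestrationHandler = function* (context) {
  const { caseId, action } = context.df.getInput();
  // Mark as pending
  yield context.df.callActivity('updateCase', { caseId, status: 'pending_approval' });
  // Wait for a human (costs nothing while paused)
  const approvalTask = context.df.waitForExternalEvent('Approval');
  // Derive the deadline from the orchestration clock so replay stays deterministic
  const sevenDaysFromNow = new Date(context.df.currentUtcDateTime.getTime() + 7 * 24 * 60 * 60 * 1000);
  const timeoutTask = context.df.createTimer(sevenDaysFromNow);
  const winner = yield context.df.Task.any([approvalTask, timeoutTask]);
  // Cancel the timer if the approval arrived first, so the orchestration can complete
  if (!timeoutTask.isCompleted) {
    timeoutTask.cancel();
  }
  const decision = approvalTask.result as { approved?: boolean } | undefined;
  if (winner === approvalTask && decision?.approved) {
    yield context.df.callActivity('updateCase', { caseId, status: 'approved' });
    if (action === 'refund') {
      yield context.df.callActivity('issueRefund', { caseId });
    }
    yield context.df.callActivity('notifyBot', { caseId, message: 'Approved!' });
  } else {
    // Reached on an explicit rejection or when the seven-day timer fires first
    yield context.df.callActivity('updateCase', { caseId, status: 'rejected' });
    yield context.df.callActivity('notifyBot', { caseId, message: 'Rejected.' });
  }
};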

		

That’s it. Read it top to bottom: it’s just the workflow. No state machine. No polling. No webhook plumbing. The orchestrator pauses at waitForExternalEvent, serializes its state, and stops executing entirely.

When a supervisor clicks “Approve” in the dashboard, the dashboard raises an event through the Durable Functions HTTP API, passing the case ID to identify the instance:

raiseEvent('Approval', { approved: true })

The framework matches this to the paused orchestration instance, restores its state, and resumes execution from the exact yield where it was waiting. The orchestrator then runs the remaining steps (update the case, process the refund, notify the customer) as if no time had passed.
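For readers who want to see what that call looks like on the wire, here is a rough sketch of building the raise-event request by hand. It assumes the orchestration was started with the case ID as its instance ID, and the host URL, the `case-123` ID, and the `buildRaiseEventRequest` helper are all hypothetical; the sample’s dashboard may well do this differently, for example through the SDK client.

```typescript
// Describes the HTTP request that raises an external event on a waiting instance.
interface RaiseEventRequest {
  method: 'POST';
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds (but does not send) the raise-event request.
function buildRaiseEventRequest(
  baseUrl: string,     // Functions host, e.g. a local dev host
  instanceId: string,  // assumed here to be the case ID
  eventName: string,
  payload: unknown,
  systemKey?: string,  // typically needed in Azure, usually omitted locally
): RaiseEventRequest {
  const path =
    `/runtime/webhooks/durabletask/instances/${encodeURIComponent(instanceId)}` +
    `/raiseEvent/${encodeURIComponent(eventName)}`;
  const query = systemKey ? `?code=${encodeURIComponent(systemKey)}` : '';
  return {
    method: 'POST',
    url: `${baseUrl}${path}${query}`,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  };
}

const request = buildRaiseEventRequest('http://localhost:7071', 'case-123', 'Approval', {
  approved: true,
});
// To send it: await fetch(request.url, request);
```

The event name in the URL has to match the name passed to waitForExternalEvent, and the JSON body is what the orchestrator sees as the task result.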

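Key: waitForExternalEvent costs nothing while waiting. No process is running and no compute is billed. Even the seven-day timeout is a durable timer that the framework persists, not a process kept alive to count down. Each customer’s case gets its own orchestration instance, waiting independently.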

Why This Matters for AI Agents

As we build agents that do more than just answer questions, agents that take actions, trigger workflows, and interact with external systems, we’re going to hit this pattern constantly:

Refund approvals: agent submits, human approves
Deployment requests: agent prepares a change, human confirms
Escalations: agent triages, human takes over
Multi-step processes: agent starts, waits for external data, continues

Every one of these is a “pause and wait” problem. You could solve each one with a state machine, a database, and some glue code. Or you could write the workflow as a straight-line function and let the infrastructure handle the rest.

What About the Alternatives?

Approach	How it works	Why it hurts
Polling loop	Bot checks a “pending” flag in a database every N seconds	Wastes compute. 1,000 pending cases = 1,000 polling loops. Latency depends on poll interval.
Queue + worker	Bot writes to a queue; worker picks up after approval	You build the state machine yourself: track which step each case is on, handle retries, deal with poison messages. “Wait for approval” doesn’t map naturally to a queue.
Webhook callback	Bot registers a callback URL; approval service calls it	Bot must be running when the callback arrives hours later. If it restarts, the callback URL may be stale. No built-in retry or state tracking.
Database + cron	Store pending cases in DB, cron job checks for approved ones	Same polling problem. Cron frequency = latency floor. State machine lives in application code. Error handling is manual.
Durable Functions	`waitForExternalEvent` pauses at zero cost; `raiseEvent` resumes instantly	Requires Azure Functions runtime. But: no polling, no state machine code, built-in retry, scales to thousands of concurrent cases.
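To put a rough number on the polling row, here is a back-of-the-envelope calculation. The 1,000 cases and the three-day wait come from this post; the 30-second poll interval is an assumption picked purely for illustration.

```typescript
// Counts how many database checks a polling design performs in total.
function totalPolls(
  pendingCases: number,
  waitSeconds: number,
  pollIntervalSeconds: number,
): number {
  const pollsPerCase = Math.ceil(waitSeconds / pollIntervalSeconds);
  return pendingCases * pollsPerCase;
}

const threeDaysInSeconds = 3 * 24 * 60 * 60; // 259,200 seconds
const polls = totalPolls(1000, threeDaysInSeconds, 30);
console.log(polls); // 8640000 checks, nearly all of which find nothing new
```

Lengthening the interval cuts that number, but only by raising the latency floor the table mentions.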

Durable Functions win here because:

Zero-cost waiting: a case pending for 3 days uses no compute until approved
No state machine: the orchestrator reads like a sequential function, but the framework handles checkpointing, replay, and fault tolerance
Parallel independence: Alice’s refund and Bob’s escalation are separate instances; approving one doesn’t affect the other
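If “replay” sounds abstract, the toy model below may help. It is not how Durable Functions is implemented, just a sketch of the general idea under simplifying assumptions: the orchestrator is a generator, the results of completed steps are kept in a history array, and resuming means re-running the generator from the top while feeding it those recorded results. The names `refundFlow` and `replay` are made up for this example.

```typescript
// A toy orchestrator: yields step names and receives each step's result.
type Orchestrator = () => Generator<string, string, unknown>;

const refundFlow: Orchestrator = function* () {
  yield 'updateCase:pending_approval';
  const approval = (yield 'waitForExternalEvent:Approval') as { approved: boolean };
  if (!approval.approved) {
    return 'rejected';
  }
  yield 'issueRefund';
  return 'approved';
};

type ReplayOutcome =
  | { done: false; waitingOn: string }
  | { done: true; result: string };

// Re-runs the orchestrator from the start, answering each yield from history.
// Returns the step it is blocked on, or the final result if history suffices.
function replay(orchestrator: Orchestrator, history: unknown[]): ReplayOutcome {
  const run = orchestrator();
  let step = run.next();
  for (const recorded of history) {
    if (step.done) {
      break;
    }
    step = run.next(recorded);
  }
  return step.done
    ? { done: true, result: step.value }
    : { done: false, waitingOn: step.value };
}

// After updateCase has completed, the flow is parked on the approval event.
const parked = replay(refundFlow, [undefined]);
// Days later the approval arrives; replaying with the longer history moves on.
const resumed = replay(refundFlow, [undefined, { approved: true }]);
```

Nothing stays in memory between the two replay calls, which is why a waiting case needs no running process. It is also why orchestrator code has to be deterministic: each replay must make the same decisions given the same history.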

The Full Sample

The durable-support-agent sample has three pieces:

A Teams bot that uses GPT-4o with tool calling to handle customer support — order lookups, knowledge base search, refund requests, escalations
Azure Durable Functions that orchestrate the approval workflow with zero-cost pausing
A Next.js dashboard where supervisors approve or reject pending cases

The whole thing runs locally. The bot creates cases, the orchestrator pauses, the dashboard lets you approve, and the customer gets notified, all coordinated through a workflow you can read in 30 lines.

If you’re building agents that need human-in-the-loop workflows, give Durable Functions a look.

Learn More

Azure Durable Functions overview — what they are and how they work
Human interaction pattern — the exact pattern used in this sample (waitForExternalEvent + raiseEvent)
Durable Functions for JavaScript/TypeScript — quickstart for the Node.js SDK
Orchestrator function constraints — rules for deterministic replay (important to understand before writing orchestrators)
Timers in Durable Functions — how createTimer works for timeouts and deadlines
durable-support-agent sample — the full source code for this post
