Building an AI sales agent for Shopify with Claude and the AI SDK

Most Shopify storefronts have a conversion gap that nobody really talks about. A shopper lands on a product page, has a question (will this fit? does it ship to my country? do you have it in black?), and there is nobody there to answer. The default chat widget is usually a glorified "leave us a message" form, which is fine for support tickets but useless for a buying decision with a five-minute window before the tab closes.

I wanted something that could actually behave like a salesperson on the floor. Greet the shopper, search the catalog, pull up an order, push a code when it makes sense, and quietly tag the lead for follow-up if they bounce. That became Chatflo, a Shopify app I shipped on Claude Sonnet 4.6 using the Vercel AI SDK and Cloud Run.

This post is a tour of the parts that mattered.

The shape of the stack

The whole thing is a single React Router app (forked from Shopify's React Router template), running in a container on Cloud Run, talking to:

  • Claude Sonnet 4.6 via @ai-sdk/anthropic for the agent loop, and Haiku 4.5 for cheap follow-up suggestions.
  • Shopify Admin GraphQL for product search, order lookups, and minting real discount codes.
  • Postgres + Prisma for chatbot config, conversation state, leads, and per-shop usage counters.
  • Server-Sent Events (SSE) to stream tool calls and token deltas to a theme app extension on the storefront.

The interesting part isn't the framework choice. It's how the agent loop, the Shopify tools, and the streaming surface all fit together without melting the bill or the latency budget.

The agent loop is just streamText with stop conditions

A quick note on naming, because this trips people up: the "AI SDK" I keep referring to is Vercel's AI SDK (the ai package), not Anthropic's Claude Agent SDK. They solve overlapping problems from different ends. Vercel's SDK is a provider-agnostic streaming primitive: you bring your own tools and your own loop, and it gives you a clean stream of tool calls and text deltas. Anthropic's Agent SDK is "Claude Code as a library": opinionated, batteries included (file I/O, bash, MCP, subagents), and excellent if you're building something Claude-Code-shaped. For a shopper-facing chatbot with custom Shopify tools and no filesystem in sight, Vercel's SDK was the cleaner fit.

I went back and forth between writing a hand-rolled tool loop on top of the Anthropic SDK and just using the AI SDK's streamText. The AI SDK won because it gives you three things that are tedious to build correctly: multi-step tool execution, a normalized stream you can fan out, and provider-agnostic prompt caching hooks.

The core of runAgent is roughly this:

import { streamText, stepCountIs, hasToolCall } from "ai";
import { aiAnthropic, DEFAULT_AGENT_MODEL } from "./ai.server";
 
const model = aiAnthropic(DEFAULT_AGENT_MODEL);

const result = streamText({
  model,
  messages: [systemMessage, ...withMessageCaching(history, model)],
  tools: withToolCaching(tools, model),
  temperature: chatbot.temperature,
  maxOutputTokens: chatbot.maxTokens,
  stopWhen: [
    stepCountIs(MAX_STEPS),
    hasToolCall("show_lead_form"),
    hasToolCall("soft_capture_email"),
  ],
});

stopWhen is the part I wish I'd discovered sooner. By default the SDK will keep looping until the model stops calling tools, which for a sales agent is exactly not what you want when a UI tool fires. If the model calls show_lead_form, the next step shouldn't be more text; the conversation should pause and wait for the shopper to submit the form. hasToolCall lets you express that as data instead of as a hand-written state machine.

The same pattern works as a safety rail. stepCountIs(5) caps a runaway tool loop at five steps, which is plenty for "search → narrow → recommend → add to cart" and short enough that a confused model can't burn through a shop's monthly token cap in a single turn.

Tools are where the product actually lives

The model on its own is a chatbot. The tools are what make it a salesperson. Chatflo registers ~10 tools per conversation, gated by what the merchant has turned on:

  • search_products, recommend_products, browse_collections — read-only catalog access.
  • add_to_cart — returns a payload the storefront client uses to call /cart/add.js.
  • lookup_order, get_my_orders, get_customer_profile — only registered when the shopper is signed in (verified via Shopify App Proxy HMAC, more on that below).
  • show_lead_form, soft_capture_email — UI tools that halt the loop and ask the shopper for an email.
  • give_discount — mints a real, single-use Shopify discount code via discountCodeBasicCreate.
  • request_human_handoff — flags the conversation for the merchant's inbox.
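
Since each edit must stand on its own, here is a minimal sketch of how per-conversation tool gating might look. The names (buildToolset, the config flags) are illustrative, not the app's actual API; the point is that tool availability is decided in code, before the model ever sees the tool list:

```typescript
// Hypothetical sketch: buildToolset and the config field names are
// assumptions, not Chatflo's real code.
type MerchantConfig = {
  discountsEnabled: boolean;
  leadCaptureEnabled: boolean;
};

function buildToolset(config: MerchantConfig, isLoggedIn: boolean): string[] {
  // Read-only catalog tools are always on.
  const tools = [
    "search_products",
    "recommend_products",
    "browse_collections",
    "add_to_cart",
  ];
  if (isLoggedIn) {
    // Customer-scoped tools only exist once the App Proxy HMAC has
    // verified logged_in_customer_id.
    tools.push("lookup_order", "get_my_orders", "get_customer_profile");
  }
  if (config.leadCaptureEnabled) {
    tools.push("show_lead_form", "soft_capture_email");
  }
  if (config.discountsEnabled) {
    tools.push("give_discount");
  }
  tools.push("request_human_handoff");
  return tools;
}
```

A tool the model never sees is a tool it can never call, which is a stronger guarantee than any prompt instruction.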

Each tool is a small object with a Zod schema and an execute function:

const giveDiscount: ToolDefinition = {
  name: "give_discount",
  description:
    "Issue a discount code to the shopper. Gated: only call when the shopper has hesitated about price OR has been engaged for 5+ turns without converting, AND an email has been captured. Never offer unprompted on the first message.",
  schema: z.object({
    offer_id: z.string().optional(),
  }),
  execute: async (args, ctx) => {
    const offers = ctx.sales?.offers ?? [];
    if (offers.length === 0) {
      return {
        ok: false,
        error: "no_offers_configured",
        summary: "No offers configured",
      };
    }
    const hasEmail = await emailCapturedForConversation(ctx.conversationId);
    if (!hasEmail) {
      return {
        ok: false,
        error: "gate_no_email",
        summary: "Discount blocked — capture an email first",
      };
    }
 
    const target = offers[0];
    const minted = await mintDynamicDiscountCode(ctx, target);
    // ... persist conversion event, return the live code
  },
};

Two things worth pulling out from this:

The first is that the gates live in the tool, not in the prompt. You can write "only offer a discount after capturing an email" into the system prompt and the model will mostly listen, but "mostly" is not what you want when the alternative is bleeding margin. By returning gate_no_email from the tool itself, the model gets a clear, machine-readable signal that the offer can't go out yet, and the tool can never fire prematurely no matter how the prompt drifts.

The second is that give_discount calls Shopify's discountCodeBasicCreate mutation and returns a real, redeemable code. The shopper sees CHAT-7K2P9X in the chat, copies it to checkout, and Shopify honors it because it actually exists. That sounds obvious in retrospect, but the lazy version (let the model invent a code and pray) is what most AI chatbot demos ship with.
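
For flavor, here is a rough sketch of the minting step. The discountCodeBasicCreate mutation is real Admin GraphQL, but the input fields are simplified here and the code generator is an assumption about how Chatflo builds codes like CHAT-7K2P9X:

```typescript
// Hedged sketch. The mutation name is real; the exact DiscountCodeBasicInput
// shape is abbreviated, and generateCode is illustrative.
const MINT_DISCOUNT = `#graphql
  mutation mintCode($discount: DiscountCodeBasicInput!) {
    discountCodeBasicCreate(basicCodeDiscount: $discount) {
      codeDiscountNode { id }
      userErrors { field message }
    }
  }`;

function generateCode(prefix = "CHAT"): string {
  // Unambiguous alphabet: no 0/O or 1/I, so the code survives being
  // read aloud or retyped at checkout.
  const alphabet = "23456789ABCDEFGHJKMNPQRSTUVWXYZ";
  let suffix = "";
  for (let i = 0; i < 6; i++) {
    suffix += alphabet[Math.floor(Math.random() * alphabet.length)];
  }
  return `${prefix}-${suffix}`;
}
```

The generated code goes into the mutation as a single-use code with a usage limit of one, so a leaked chat transcript can't be replayed for free discounts.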

The prompt is mostly a wrapper around merchant config

The system prompt is built per-turn from the merchant's onboarding config: business profile, tone, FAQs, configured offers, objection-handling notes, and social proof quotes. This is also where the live page context goes — the URL the shopper is on, the current product, what's in their cart — so the model doesn't have to ask "what are you looking at?" when it's literally in the request payload.
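
The assembly itself can be sketched as a pure function over merchant config plus page context. The field names below are assumptions for illustration, not Chatflo's actual schema:

```typescript
// Illustrative only: ChatbotConfig and PageContext field names are assumed.
type PageContext = { url: string; productTitle?: string; cartCount: number };
type ChatbotConfig = { businessProfile: string; tone: string; faqs: string[] };

function buildSystemPrompt(config: ChatbotConfig, page: PageContext): string {
  const sections = [
    `You are a sales assistant for: ${config.businessProfile}`,
    `Tone: ${config.tone}`,
    config.faqs.length
      ? `FAQs:\n${config.faqs.map((f) => `- ${f}`).join("\n")}`
      : "",
    // Live page context, so the model never has to ask
    // "what are you looking at?"
    `Shopper is on ${page.url}` +
      (page.productTitle ? `, viewing "${page.productTitle}"` : "") +
      `, with ${page.cartCount} item(s) in cart.`,
  ];
  return sections.filter(Boolean).join("\n\n");
}
```

Keeping it a pure function also makes the prompt trivially snapshot-testable per merchant config.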

I leaned hard on Anthropic's prompt caching to keep this affordable. The AI SDK exposes provider options on every message, which means you can mark specific blocks for caching:

const systemMessage: ModelMessage = {
  role: "system",
  content: system,
  providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
};

There's a four-breakpoint limit per request. I use three: the system prompt, the last tool definition (which caches the entire tool block), and the last message in history (which caches the conversation prefix incrementally as turns accumulate). The helpers look like this:

export function withMessageCaching(
  messages: ModelMessage[],
  model: LanguageModel
): ModelMessage[] {
  if (!isAnthropicModel(model)) return messages;
  if (messages.length === 0) return messages;
 
  return messages.map((message, index) =>
    index === messages.length - 1
      ? {
          ...message,
          providerOptions: {
            ...message.providerOptions,
            ...ANTHROPIC_CACHE,
          },
        }
      : message
  );
}
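
The companion helper for the tool block follows the same shape: mark only the last tool, which caches the entire tool prefix. This is a sketch under the assumption that the provider accepts cacheControl via providerOptions on a tool definition, mirroring the message helper above:

```typescript
// Sketch of withToolCaching: mark the LAST tool so the whole tool block is
// cached as one prefix. ANTHROPIC_CACHE mirrors the message helper; the
// exact shape here is an assumption, not Chatflo's verbatim code.
const ANTHROPIC_CACHE = {
  anthropic: { cacheControl: { type: "ephemeral" as const } },
};

function withToolCaching(
  tools: Record<string, object>
): Record<string, object> {
  const names = Object.keys(tools);
  if (names.length === 0) return tools;
  const last = names[names.length - 1];
  return {
    ...tools,
    [last]: { ...tools[last], providerOptions: ANTHROPIC_CACHE },
  };
}
```

One breakpoint on the last tool is enough because Anthropic caches everything up to the marked block, not just the block itself.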

On a typical 4-turn conversation with ~3k tokens of system prompt and tool definitions, this drops per-turn input cost by roughly 80%. The 5-minute TTL is short, but a shopper who is actively typing refreshes it on every turn, which is exactly the case you want to optimize for.

Streaming through Shopify's App Proxy

The storefront chat widget can't talk to my Cloud Run instance directly. Shopify proxies the request through /apps/chatflo/chat so it stays on the merchant's domain. Every proxied request is signed with HMAC, and the logged_in_customer_id query param is only trustworthy after you verify that signature.

export const action = async ({ request }: ActionFunctionArgs) => {
  const url = new URL(request.url);
  const secret = process.env.SHOPIFY_API_SECRET ?? "";
 
  if (!verifyAppProxySignature(url, secret)) {
    return unauthorized("bad_signature");
  }
 
  const { shop, loggedInCustomerId } = readAppProxyContext(url);
  // ... safe to use loggedInCustomerId for order lookups now
};
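
A minimal verifyAppProxySignature, following Shopify's documented scheme for App Proxy requests (drop the signature param, sort the remaining key=value pairs, concatenate them with no separator, HMAC-SHA256 with the app secret). This is a sketch, not Chatflo's exact implementation:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// App Proxy signatures cover the sorted query string, joined WITHOUT
// separators; repeated keys have their values joined with commas.
function verifyAppProxySignature(url: URL, secret: string): boolean {
  const params = new Map<string, string[]>();
  for (const [key, value] of url.searchParams) {
    if (key === "signature") continue;
    const existing = params.get(key) ?? [];
    existing.push(value);
    params.set(key, existing);
  }
  const message = [...params.entries()]
    .map(([key, values]) => `${key}=${values.join(",")}`)
    .sort()
    .join("");
  const digest = createHmac("sha256", secret).update(message).digest("hex");
  const provided = url.searchParams.get("signature") ?? "";
  // Constant-time comparison; lengths must match first.
  if (provided.length !== digest.length) return false;
  return timingSafeEqual(Buffer.from(digest), Buffer.from(provided));
}
```

Note this is a different scheme from OAuth callbacks (which use an hmac param and `&`-joined pairs); mixing the two up is a classic way to reject every proxied request.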

Once the request is verified, the agent runs as an async generator and each event is pushed down the wire as SSE:

export function agentEventStream(
  gen: AsyncGenerator<unknown>
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream({
    async start(controller) {
      try {
        for await (const ev of gen) {
          const line = `data: ${JSON.stringify(ev)}\n\n`;
          controller.enqueue(encoder.encode(line));
        }
      } catch (err) {
        // Surface agent failures as a typed event instead of silently
        // swallowing them after the stream has started.
        const ev = {
          type: "error",
          message: err instanceof Error ? err.message : "unknown",
        };
        controller.enqueue(encoder.encode(`data: ${JSON.stringify(ev)}\n\n`));
      } finally {
        controller.enqueue(encoder.encode("data: [DONE]\n\n"));
        controller.close();
      }
    },
  });
}

The events are typed: text_delta, tool_start, tool_end, tool_data, ui_component, suggestions, done, error. The widget renders each differently: tool_start becomes a little badge ("searching products..."), tool_data from search_products becomes a swipeable product carousel, and ui_component swaps the input out for an inline lead form. Everything is incremental, so the shopper sees the agent doing things rather than staring at a typing indicator while the model burns through a tool loop in the background.
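
On the client side, EventSource only supports GET, so the widget reads the POST response body and splits SSE frames itself. A sketch of the frame parser (parseSseChunk is an illustrative name, not the widget's actual code):

```typescript
// Parse a chunk of an SSE byte stream. Incomplete trailing frames are
// returned as `rest` so the caller can buffer them until more bytes arrive.
type AgentEvent = { type: string; [key: string]: unknown };

function parseSseChunk(buffer: string): {
  events: (AgentEvent | "done")[];
  rest: string;
} {
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? ""; // possibly incomplete frame stays buffered
  const events: (AgentEvent | "done")[] = [];
  for (const frame of frames) {
    const data = frame.replace(/^data: /, "");
    if (data === "[DONE]") events.push("done");
    else if (data) events.push(JSON.parse(data) as AgentEvent);
  }
  return { events, rest };
}
```

The caller accumulates `rest` across reads, since a JSON event can easily straddle two network chunks.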

Cloud Run was the boring, correct choice

I tried Vercel first. It works, but a streaming agent with 30-60 second tool loops fights the serverless model. Cold starts compound with first-token latency, and you end up paying for connection time you didn't budget for. Cloud Run gives you a long-lived container, generous concurrency per instance, and a simple Dockerfile contract.

The Dockerfile is a stock multi-stage Node 22 build:

FROM node:22-alpine AS deps
RUN apk add --no-cache openssl
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --legacy-peer-deps
 
FROM node:22-alpine AS build
RUN apk add --no-cache openssl
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npx prisma generate
RUN npm run build
 
FROM node:22-alpine AS runtime
WORKDIR /app
ENV NODE_ENV=production
COPY package.json package-lock.json ./
RUN npm ci --omit=dev --legacy-peer-deps && npm cache clean --force
COPY --from=build /app/build ./build
COPY --from=build /app/node_modules/.prisma ./node_modules/.prisma
COPY --from=build /app/node_modules/@prisma ./node_modules/@prisma
 
ENV PORT=8080
EXPOSE 8080
CMD ["npx", "react-router-serve", "./build/server/index.js"]

A few things I learned the hard way:

  • openssl is not in node:22-alpine by default. Prisma needs it; the build silently produces a broken client without it.
  • Cloud Run injects $PORT. react-router-serve already respects it, but if you hardcode 3000 anywhere, the health check will hang and the deploy will roll back with a vague error.
  • Set min instances to 1 for the chat path. Cold starts on the agent route are painful. The shopper is right there, watching. The cost of one always-on instance is a rounding error compared to losing a sale because the first token took eight seconds.
  • Concurrency 80 works fine. The agent is mostly I/O-bound (Anthropic, Shopify GraphQL, Postgres) so a single small instance handles a lot of conversations.

Deploys are gcloud run deploy chatflo --source . from CI. Secrets live in Secret Manager and get mounted as env vars.

What I'd do differently

Three things, in order of regret:

Start with the eval harness, not the agent. I spent two weeks tweaking prompts before I had a way to measure whether tweaks actually made the agent better. A lightweight harness that replays 50 canned conversations and diffs tool-call sequences would have saved me from a lot of vibes-based prompt engineering.
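
The harness I have in mind is small. Each canned conversation pins an expected tool-call sequence, and a run fails at the first divergence. The names below are illustrative, not an existing harness:

```typescript
// Compare the agent's actual tool-call sequence against the pinned one.
// Returns null on a match, or a human-readable description of the first
// divergent step.
function diffToolSequence(expected: string[], actual: string[]): string | null {
  const len = Math.max(expected.length, actual.length);
  for (let i = 0; i < len; i++) {
    if (expected[i] !== actual[i]) {
      return `step ${i}: expected ${expected[i] ?? "(end)"}, got ${actual[i] ?? "(end)"}`;
    }
  }
  return null; // sequences match
}

const failure = diffToolSequence(
  ["search_products", "show_lead_form"],
  ["search_products", "give_discount"]
);
// `failure` describes the first divergent step
```

Even this crude exact-match diff turns "the agent feels worse" into "the agent now skips lead capture on case 12", which is the difference between prompt engineering and guessing.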

Treat the conversation state machine as a first-class concept. The agent's stage — greeting, qualifying, recommending, closing — ended up scattered across the prompt, the tool gates, and an intent classifier. I should have modeled it explicitly from day one and let the agent transition through it via a tool, the way the lead-form halt already works.

Cache the tool block aggressively from the start. I left ~40% on the table for the first month because I assumed prompt caching only mattered for the system prompt. The tool definitions are usually the largest stable prefix in an agent request. Cache them.

Closing

The thing that surprised me most about building on Claude through the AI SDK is how little glue code there ended up being. The agent loop is a single streamText call. The "is this a real salesperson" feel comes from the tools and the gates, not from clever prompting. Once the tool surface is right, the model is mostly trying to be helpful in the direction you've already pointed it.

If you're looking at a similar build — an embedded agent on someone else's platform, with real side effects and real money on the line — the parts that took the most thought were the boring ones: HMAC verification, idempotent single-flight per conversation, PII redaction before persisting, gating discounts behind email capture. The model is the easy part now. The infrastructure that makes it safe to deploy is where the work is.

Chatflo is live on the Shopify App Store. If you're a merchant, you can install it. If you're building something similar and want to compare notes, my DMs are open.