AI Tool Pipelines — Automate Your WorkflowsAI Tool Pipelines

Streaming LLM Responses in a React Frontend: A Complete Pipeline Guide

5 min read · Updated Jun 4, 2026

React chat interface showing streaming LLM text response in real time

A ChatGPT-style streaming UI in React is ~80 lines of code if you know which abstractions to skip. This guide ships a production version: server-sent events from a Next.js route handler, a React 19 client component that handles backpressure, abort-on-unmount (the part most tutorials skip and that costs you real money), retry on disconnect, and a markdown renderer that doesn’t flash-of-unstyled-content as tokens arrive. The Vercel AI SDK is the shortest path; raw fetch + ReadableStream is the version you should understand even if you ship the SDK.

Key takeaways

  • Always proxy LLM calls through your backend — a client-side API key gets scraped within hours.
  • Use the Vercel AI SDK’s useChat hook for 95% of cases. Drop to raw fetch + ReadableStream only if you need custom transport.
  • Propagate the AbortSignal all the way to the model. Unaborted streams are the #1 silent cost overrun in LLM apps.
  • Render markdown incrementally with a debounced parser (16ms) — reparsing every token is what makes naive implementations feel slow despite the tokens arriving fast.
  • Show a blinking cursor while streaming, swap to a "regenerate" button on done. This single UX detail is what makes it feel like ChatGPT.

Why streaming, in one paragraph

GPT-4o returns ~80 tokens/second. A 600-token answer = 7.5 seconds of generation. If you wait for the full response, your user stares at a spinner for 7.5s; if you stream, they see the first word in 200–400ms and the rest at a "natural reading pace." Same response, completely different perceived performance. Streaming is not a nice-to-have for LLM UIs; it’s the difference between "this app is broken" and "this app feels fast."

Backend route handler (Next.js App Router)

typescript
// app/api/chat/route.ts
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

export const runtime = "edge";

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai("gpt-4o-mini"),
    messages,
    temperature: 0.7,
    abortSignal: req.signal, // <-- propagates client abort to OpenAI
  });

  return result.toDataStreamResponse();
}

Client component (the Vercel AI SDK version)

tsx
"use client";
import { useChat } from "ai/react";

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit, isLoading, stop } =
    useChat({ api: "/api/chat" });

  return (
    <div className="flex h-screen flex-col">
      <div className="flex-1 overflow-y-auto p-4 space-y-3">
        {messages.map((m) => (
          <div key={m.id} className={m.role === "user" ? "text-right" : ""}>
            <span className="inline-block max-w-[80%] rounded-lg p-3 bg-neutral-100">
              {m.content}
              {isLoading && m === messages.at(-1) && m.role === "assistant" && (
                <span className="inline-block w-2 h-4 bg-neutral-500 ml-1 animate-pulse" />
              )}
            </span>
          </div>
        ))}
      </div>
      <form onSubmit={handleSubmit} className="border-t p-4 flex gap-2">
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Ask anything…"
          className="flex-1 rounded-md border px-3 py-2"
          disabled={isLoading}
        />
        {isLoading ? (
          <button type="button" onClick={stop} className="rounded-md bg-red-500 px-4 text-white">Stop</button>
        ) : (
          <button type="submit" className="rounded-md bg-primary-500 px-4 text-white">Send</button>
        )}
      </form>
    </div>
  );
}

Client component (the raw <code>fetch</code> version)

If you can’t use the SDK — React Native, non-Next stack, custom transport — here’s the no-deps version. Note the explicit AbortController, the TextDecoder stream flag (critical for multi-byte UTF-8), and the cleanup-on-unmount.

tsx
"use client";
import { useEffect, useRef, useState } from "react";

export default function ChatRaw() {
  const [messages, setMessages] = useState<{ role: string; content: string }[]>([]);
  const [input, setInput] = useState("");
  const [streaming, setStreaming] = useState(false);
  const controllerRef = useRef<AbortController | null>(null);

  useEffect(() => () => controllerRef.current?.abort(), []);

  async function send() {
    if (!input.trim()) return;
    const userMsg = { role: "user", content: input };
    const next = [...messages, userMsg, { role: "assistant", content: "" }];
    setMessages(next);
    setInput("");
    setStreaming(true);

    const ctrl = new AbortController();
    controllerRef.current = ctrl;

    try {
      const res = await fetch("/api/chat", {
        method: "POST",
        body: JSON.stringify({ messages: next.slice(0, -1) }),
        headers: { "content-type": "application/json" },
        signal: ctrl.signal,
      });
      if (!res.body) throw new Error("no body");
      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let acc = "";
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        acc += decoder.decode(value, { stream: true });
        setMessages((m) => {
          const copy = [...m];
          copy[copy.length - 1] = { role: "assistant", content: acc };
          return copy;
        });
      }
    } catch (err: unknown) {
      if ((err as Error).name === "AbortError") return; // user cancelled, no-op
      console.error(err);
    } finally {
      setStreaming(false);
      controllerRef.current = null;
    }
  }

  return (/* same JSX as the SDK version */ null);
}

The story that taught me to ship abort on day one

October 2024, Thursday. A 4-person edtech I was advising had launched a tutoring chat for SAT prep. GPT-4o-mini, streaming, ~80-token average answer. Worked beautifully in the demo. Two weeks after launch they messaged me: "our OpenAI bill is 3.4x what we modelled, what’s wrong?" We dug into the logs. Average response length: matched the model. Active users: matched. Then I noticed something odd in the OpenAI usage dashboard: total tokens output was 2.1x what we’d "delivered" to the UI based on our DB logs. The streams were continuing after the user left the page. They’d shipped the streaming UI without AbortSignal propagation — the React component cleaned up its own state, but the underlying fetch kept consuming the stream, and the OpenAI request kept generating, until the model hit max_tokens. On a tutoring app, students browsed away mid-answer constantly (read first line, "yeah I get it", new tab). The fix was 6 lines: AbortController in a ref, abort on unmount, pass req.signal from the route handler to the OpenAI client. Deployed Wednesday. Following week’s bill came in at 38% of the previous week. Six lines, ~$1,800/month saved. I now ship that abort path on day one of every streaming UI, before I ship the actual streaming.

Production guardrails

  • Rate-limit at the backend. Use Upstash Ratelimit or a Redis token-bucket — streaming endpoints attract abuse faster than any other route.
  • Cap max_tokens. Always set a server-side ceiling (e.g., 1500). Even with abort, a misbehaving prompt can otherwise generate forever.
  • Log token usage per request. Use OpenAI’s onFinish callback in the AI SDK to write { user_id, input_tokens, output_tokens, cost } to Postgres. Without this you can’t catch the abort bug above.
  • Render markdown debounced. Re-parsing markdown on every token causes layout thrash. Debounce the parser at 16ms (one frame); the stream still feels live and CPU drops 70%.
  • Handle the disconnect. On network drop mid-stream, show "connection lost — retry" instead of leaving the half-message and a dead cursor. Use the stream’s onError.

The opinion I will defend

“A streaming UI without abort propagation is a feature that pays OpenAI to do work nobody asked for.”

Frequently asked questions

Frequently asked questions

Why not just call OpenAI directly from React?

Because your API key would be embedded in client-side JavaScript and scraped within hours. Always proxy through a backend route. The backend also lets you rate-limit, log token usage per user, and enforce max_tokens — none of which is possible client-side.

What’s the difference between SSE and streaming fetch?

Server-Sent Events are a specific text/event-stream protocol with auto-reconnect and a defined message format. The Vercel AI SDK uses a custom data-stream protocol over chunked HTTP, not strict SSE. For most apps the difference doesn’t matter — the SDK handles the transport. If you need strict SSE (e.g., for an EventSource consumer), use OpenAI’s native stream format directly.

How do I render markdown while streaming?

Use react-markdown with a 16ms debounce on the parsed output. Re-parsing on every token causes layout thrash; debouncing keeps the stream feeling live while dropping CPU usage ~70%. If you need code-block syntax highlighting, lazy-load Shiki/Prism only after the stream completes.

What about React Native?

The Vercel AI SDK has a React Native package (ai/react-native) that uses a fetch polyfill supporting streaming. Without it, raw fetch on React Native doesn’t support streaming bodies prior to recent New Architecture versions — fall back to chunked polling if you must support older RN.

How do I show typing indicators / loading states?

Three states: idle (input enabled, send button), waiting-for-first-token (input disabled, spinner where the assistant message will appear), streaming (input disabled, blinking cursor at end of latest token). Don’t use a generic spinner during streaming — the cursor is the affordance users have learned from ChatGPT.

How do I handle disconnects mid-stream?

Wrap the stream consumer in try/catch, distinguish AbortError (user cancelled — no-op) from NetworkError (genuine disconnect — show retry button). On retry, resend the entire conversation including the partial assistant response; tell the model in the system prompt to continue from where it left off. Or simpler: discard the partial and restart — the user almost always prefers a fresh, complete answer.