Stream Local LLM Responses to React with Ollama and Server-Sent Events

5 min read · Updated Jun 4, 2026

To stream a local LLM into a React app, run Ollama on localhost, call its /v1/chat/completions endpoint with stream: true from a thin server route, and read the Server-Sent Events stream in the browser using fetch with a ReadableStream reader. The native EventSource API is not an option here because it can only issue GET requests. About sixty lines of code, no extra dependencies.

Key takeaways

Ollama serves an OpenAI-compatible /v1/chat/completions endpoint, so the client code is the same as for a cloud API: just point baseURL at http://localhost:11434/v1.
Always proxy through a server route (Next.js or Express). Never expose Ollama directly to the browser; it has no auth.
Use fetch + ReadableStream rather than EventSource so you can POST the prompt, send custom headers (auth), and propagate AbortSignal on cleanup.
Tokens/sec on consumer hardware for an 8B model: ~25–40 on Apple Silicon, ~45 on an RTX 3060 in my setup and ~60–80 on faster cards, vs ~80 for cloud GPT-4o-mini. Perceptually similar.
Propagate the abort signal to the Ollama request on component unmount, or you waste GPU cycles after the user closes the tab.

Why SSE and not WebSockets

SSE (Server-Sent Events) is one-way, text-based, runs over plain HTTP, reconnects automatically, and was designed for exactly this: server pushes tokens to a browser. WebSockets are the right tool when you also need the client to send messages mid-stream. For LLM completions, you do not. Pick the simpler thing.

The afternoon a typewriter effect saved a product

In April 2024, I was helping a friend named Alex with a small writing-assistant tool. The model answered in 6 to 9 seconds. Users would type a prompt, see a spinner, and tab away. Bounce rate on the answer page was around 41%. We added streaming on a Tuesday. By Friday the bounce rate on the same page was 17%. The model was no faster. Nothing else changed. People will wait six seconds if they can see something happening, and they will leave in two if they cannot. Perceived performance is performance.

The opinion

If your LLM response takes more than 800 milliseconds, you should be streaming. Full stop. The mechanism: the human brain reads a stream of tokens and reacts to the first ones while later ones are still generating, which collapses subjective wait time. The cost of being wrong: users assume the app is broken. The only reasons to skip streaming are when you need the full JSON before you can do anything with it (function calling, structured extraction), or when your output is genuinely under that threshold.

[Image: laptop showing a chat interface where tokens stream into the message bubble one at a time]

The server route

In a Next.js route handler or an Express endpoint, POST to http://localhost:11434/v1/chat/completions with stream: true in the body. The response is a ReadableStream of chunks in OpenAI's SSE format: lines starting with "data: " followed by a JSON object containing the delta. Pipe that stream straight to your client response with the headers Content-Type: text/event-stream, Cache-Control: no-cache, and Connection: keep-alive. No transformation needed; the format already matches what the browser expects.
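
The route described above could look roughly like the sketch below, written for a Next.js App Router handler. The file path, the proxyChat helper, its injectable fetchImpl parameter, the default model tag llama3.1:8b, and the 502 fallback are all assumptions made for illustration; adapt them to your own project.

```typescript
// app/api/chat/route.ts (hypothetical path in a Next.js App Router project)

// Assumption: Ollama is listening on its default port on the same machine.
const OLLAMA_URL = "http://localhost:11434/v1/chat/completions";

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// fetchImpl is injectable so the handler can be exercised without a running Ollama.
export async function proxyChat(
  req: Request,
  fetchImpl: typeof fetch = fetch,
): Promise<Response> {
  // Assumption: the client sends { messages, model? } as JSON.
  const { messages, model = "llama3.1:8b" } = (await req.json()) as {
    messages: ChatMessage[];
    model?: string;
  };

  const upstream = await fetchImpl(OLLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, stream: true }),
    // Forward the incoming request's abort signal to the Ollama request.
    signal: req.signal,
  });

  if (!upstream.ok || !upstream.body) {
    return new Response("Upstream model server error", { status: 502 });
  }

  // Pass the SSE bytes through untouched.
  return new Response(upstream.body, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  });
}

// Next.js route handlers export a function named after the HTTP method.
export const POST = (req: Request) => proxyChat(req);
```

Keeping the logic in a plain function that takes a Request and returns a Response means the same code can be wrapped by an Express adapter or called from a test with a fake fetch.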

The React side

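Keep the accumulating text in useState and the AbortController in a useRef so you can cancel. Use fetch with a streaming body reader, not EventSource, because EventSource cannot POST. Loop with reader.read(), decode each chunk with TextDecoder (pass stream: true so multi-byte characters split across chunks decode correctly), and append it to a buffer. Split the buffer on \n\n and keep the last, possibly incomplete piece for the next read, because network chunks do not align with event boundaries. For each complete event, strip the "data: " prefix, stop if the payload is [DONE], otherwise JSON.parse it and append choices[0].delta.content to state. The whole component fits in a single screen.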
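
Here is a minimal, framework-agnostic sketch of that read loop. The names createSseParser and streamChat and the /api/chat route are assumptions for illustration, and the parser only handles the "data: " lines and the [DONE] sentinel of the OpenAI-compatible format. It carries a buffer across reads so that events split between network chunks are not lost.

```typescript
// Shape of one streamed chunk in the OpenAI-compatible format (only the fields used here).
type ChatChunk = { choices: { delta: { content?: string } }[] };

// Incremental parser: feed it decoded text, get back the tokens from complete events.
export function createSseParser() {
  let buffer = "";
  let done = false;

  return {
    push(text: string): string[] {
      if (done) return [];
      buffer += text;
      const events = buffer.split("\n\n");
      // The last piece may be an incomplete event; keep it for the next call.
      buffer = events.pop() ?? "";

      const tokens: string[] = [];
      for (const event of events) {
        for (const line of event.split("\n")) {
          if (!line.startsWith("data: ")) continue;
          const payload = line.slice("data: ".length).trim();
          if (payload === "[DONE]") {
            done = true;
            return tokens;
          }
          const chunk = JSON.parse(payload) as ChatChunk;
          const token = chunk.choices[0]?.delta?.content;
          if (token) tokens.push(token);
        }
      }
      return tokens;
    },
    get finished() {
      return done;
    },
  };
}

// Reads the stream from a hypothetical /api/chat route and reports each token.
export async function streamChat(
  messages: { role: string; content: string }[],
  onToken: (token: string) => void,
  signal?: AbortSignal,
): Promise<void> {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
    signal,
  });
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  const parser = createSseParser();

  while (!parser.finished) {
    const { value, done } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries.
    for (const token of parser.push(decoder.decode(value, { stream: true }))) {
      onToken(token);
    }
  }
}
```

In a component, onToken would typically be something like (t) => setText((prev) => prev + t), with the signal taken from the AbortController stored in the ref.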

Numbers from a 3060

Llama 3.1 8B at 4-bit quantisation on an RTX 3060 streams roughly 45 tokens per second in my setup as of August 2025. Time to first token is about 180 milliseconds on warm cache. For an average 200-token answer, the user sees text begin appearing in under a quarter of a second and the full answer in about four and a half seconds. Mistral 7B is slightly faster, and Phi-3 Mini is roughly twice as fast on the same hardware.
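
The arithmetic behind those figures is simple enough to check with your own numbers. The helper below is a hypothetical back-of-envelope estimate that assumes a constant generation rate after the first token, which real hardware only approximates.

```typescript
// Rough estimate: time to first token, plus tokens divided by throughput.
export function estimateStream(tokens: number, tokensPerSecond: number, ttftMs: number) {
  const firstTokenSeconds = ttftMs / 1000;
  const totalSeconds = firstTokenSeconds + tokens / tokensPerSecond;
  return { firstTokenSeconds, totalSeconds };
}

// The figures quoted above: 200 tokens at 45 tokens/sec with a 180 ms first token.
const { firstTokenSeconds, totalSeconds } = estimateStream(200, 45, 180);
console.log(firstTokenSeconds.toFixed(2), totalSeconds.toFixed(2)); // prints "0.18 4.62"
```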

The cancellation problem

Users will click stop. They will navigate away. They will close the tab. Without proper cancellation, your Ollama server keeps generating tokens nobody is reading, which costs GPU cycles and blocks the next request. Wire the AbortController on the client and propagate the cancellation to the upstream fetch on the server. Ollama drops the generation as soon as the connection closes. Verify this by opening Ollama's logs and watching for the cancel message; if you do not see it, the abort is not reaching Ollama, typically because the server route is not forwarding the signal or a proxy in between is holding the connection open.
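
On the client, the wiring can be as small as the sketch below. startStream is a hypothetical helper, not part of React or any library: it hands an AbortSignal to whatever function performs the streaming fetch and returns a cancel function, which has the same shape as a useEffect cleanup or a stop-button handler.

```typescript
// Starts a streaming task and returns a function that cancels it.
export function startStream(
  run: (signal: AbortSignal) => Promise<void>,
  onError: (err: unknown) => void = console.error,
): () => void {
  const controller = new AbortController();

  run(controller.signal).catch((err) => {
    // A rejection after cancel() is the expected outcome, not a failure.
    if (controller.signal.aborted) return;
    onError(err);
  });

  // Calling this aborts the in-flight fetch that was given the signal.
  return () => controller.abort();
}
```

Inside a component this might be used as useEffect(() => startStream((signal) => fetchAndRender(signal)), []), where fetchAndRender stands in for your own streaming function. For the abort to reach Ollama, the server route also has to pass the incoming request's signal to its upstream fetch.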

Proxy buffering, the silent killer

Nginx buffers responses by default. So do some CDN configurations. You will spend a confused hour wondering why the stream arrives in one chunk at the end. Set proxy_buffering off in your Nginx server block for the streaming route. On Cloudflare, the Pro plan or above is required for proper SSE pass-through; the free plan buffers.
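
As a complement to the Nginx setting, the application can also ask for unbuffered delivery per response: Nginx honours an X-Accel-Buffering: no response header from the upstream application. The sketch below adds it to the SSE headers used by the server route; sseHeaders is a hypothetical helper name, and other proxies or CDNs may ignore this header entirely.

```typescript
// Response headers for a streaming route sitting behind Nginx.
export function sseHeaders(): Record<string, string> {
  return {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
    // Honoured by Nginx: turns off proxy buffering for this one response.
    "X-Accel-Buffering": "no",
  };
}

// Example: attach the headers to a streamed response body.
export function toSseResponse(body: ReadableStream<Uint8Array>): Response {
  return new Response(body, { headers: sseHeaders() });
}
```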

The CORS detail every Ollama-in-the-browser guide buries

If you ever call Ollama directly from the browser during development, you will hit a CORS wall. The fix, documented at docs.ollama.com/api/streaming and laid out in detail in the ML Journey guide "How to Use Ollama in a React or Next.js App", is to set OLLAMA_ORIGINS to the origins you trust, for example OLLAMA_ORIGINS="http://localhost:3000" before launching the Ollama server. In production you should not call Ollama from the browser at all; proxy it through a Next.js route handler so your model URL never ships to a client. Pavel Espitia's dev.to piece "Streaming Ollama Responses in Next.js: The SSE Pattern That Actually Works" walks through that proxy pattern on the App Router. Same shape as what I described above, with one extra useful detail: pass through the AbortSignal so cancellations propagate to Ollama and free the GPU.

Frequently asked questions

Can I do this without a backend route?

Only in local development. In production you almost never want the browser hitting Ollama directly because it exposes your model server to the public internet. The thin server route also lets you add auth, rate limiting, and request logging.

How do I render Markdown as it streams?

Re-render the accumulated text through a Markdown component (react-markdown is fine) on every chunk. The cost is small for short answers. For very long answers, debounce the re-render to every 50 milliseconds or every 5 chunks to avoid layout thrash.
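
One way to implement the "every 5 chunks" variant is a small batcher between the stream and the state update. createChunkBatcher is a hypothetical helper, and the render callback stands in for whatever sets the text that your Markdown component receives.

```typescript
// Accumulates chunks and calls render with the full text every N chunks.
export function createChunkBatcher(render: (text: string) => void, everyN = 5) {
  let text = "";
  let pending = 0;

  return {
    push(chunk: string): void {
      text += chunk;
      pending += 1;
      if (pending >= everyN) {
        pending = 0;
        render(text);
      }
    },
    // Call when the stream ends so the tail of the answer is rendered.
    flush(): void {
      if (pending > 0) {
        pending = 0;
        render(text);
      }
    },
  };
}
```

A time-based version would replace the counter with a timestamp check, at the cost of needing a timer to flush the final chunks.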

What about React Server Components?

RSC can stream UI from the server, but for a chat-style typewriter effect the client component approach is simpler and more flexible. Use RSC for the surrounding layout and a client component for the streaming message bubble.

Will this work with the AI SDK from Vercel?

Yes. On the server, the Vercel AI SDK can talk to Ollama's /v1 endpoint through its OpenAI-compatible provider, and on the client its useChat hook consumes the resulting stream. You get streaming, cancellation, and message history with about ten lines of glue.

Build the stream on a quiet afternoon. Watch the first token appear in 200 milliseconds. Try to go back to spinners. You will not.