AI Tool Pipelines: Automate Your Workflows

LLM integration

How to wire LLMs — cloud or local — into your existing systems without blowing the budget or leaking data. Streaming, function calling, RAG, and the gotchas nobody mentions until you ship.

Key takeaways

  • Use a local LLM (Ollama + Llama 3 or DeepSeek) for the easy 80% of tasks: classification, extraction, simple rewriting. Reserve cloud frontier models for genuine reasoning chains.
  • Stream LLM responses by default. Users perceive a 60-token/sec stream as faster than a 1.5s synchronous response of the same content.
  • Always propagate AbortSignal from the client to the LLM provider. Users browse away mid-answer; without abort, you pay for tokens they never read.
  • Function calling / tool use beats prompt-only "extract this JSON" by 5–10x in reliability. Always validate the args against a Zod / Pydantic schema before executing.
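The last takeaway, validating tool arguments before executing anything, can be sketched with the standard library alone. This is a minimal illustration, not a replacement for Pydantic or Zod: the `issue_refund` tool, the `RefundArgs` shape, and the `parse_refund_args` helper are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
import json


@dataclass(frozen=True)
class RefundArgs:
    """Hypothetical arguments for an illustrative `issue_refund` tool."""
    order_id: str
    amount_cents: int


def parse_refund_args(raw_json: str) -> RefundArgs:
    """Validate model-supplied tool arguments; raise ValueError on any mismatch."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("arguments must be a JSON object")

    # Reject fields the schema does not know about instead of silently ignoring them.
    unexpected = set(data) - {"order_id", "amount_cents"}
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")

    order_id = data.get("order_id")
    amount = data.get("amount_cents")
    if not isinstance(order_id, str) or not order_id:
        raise ValueError("order_id must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount_cents must be a positive integer")
    return RefundArgs(order_id=order_id, amount_cents=amount)
```

Only call the real tool after `parse_refund_args` returns. On a `ValueError`, one reasonable option is to send the error text back to the model and ask for corrected arguments.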

Frequently asked questions about this category

Should I use OpenAI, Anthropic, Google, or an open-source model?

Start with OpenAI behind a feature flag, ship in a week, then measure for two weeks. Add Anthropic when you need 200K-token context windows. Add Google Gemini for research with sourced citations. Switch to open-source (Llama, DeepSeek, Mistral) for tasks where data must stay on your network or per-call cost dominates.

What is the cheapest way to integrate an LLM into my app?

GPT-4o-mini at $0.15 / 1M input tokens covers most classification, summarisation, and extraction tasks. To go cheaper still, run Llama 3 8B on a $400 mini-PC with a 3060 GPU; the marginal cost per call then drops to the electricity it draws.
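To make the input price concrete, here is a rough cost estimate. The traffic figures (10,000 calls a day, 800 input tokens per call) are assumptions chosen for illustration, and the helper counts input tokens only. Output tokens are billed separately and are not included.

```python
def input_cost_usd(
    calls: int,
    input_tokens_per_call: int,
    usd_per_million: float = 0.15,  # the input price quoted above
) -> float:
    """Input-token cost only; output tokens are not counted here."""
    return calls * input_tokens_per_call / 1_000_000 * usd_per_million


# Assumed workload: 10,000 calls per day at 800 input tokens each.
daily = input_cost_usd(calls=10_000, input_tokens_per_call=800)
print(f"${daily:.2f} per day")  # prints "$1.20 per day"
```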

How do I stop my LLM from hallucinating?

You cannot eliminate it. You can reduce it with: retrieval-augmented generation (give the model your source documents in-context), schema-constrained outputs (function calling), self-critique loops (have a second LLM call grade the first), and confidence-gated fallbacks (route low-confidence answers to a human).
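The confidence-gated fallback can be sketched as a small routing function. This is one possible shape under stated assumptions: `grade` stands in for the second LLM call that scores the first answer, and the 0.7 threshold is an arbitrary placeholder to tune against your own data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Routed:
    destination: str  # "user" or "human_review"
    answer: str
    score: float


def gate_answer(
    answer: str,
    grade: Callable[[str], float],  # hypothetical grader, e.g. a second LLM call
    threshold: float = 0.7,         # assumed cut-off, not a recommended value
) -> Routed:
    """Send the answer to the user only if the grader's score clears the threshold."""
    score = grade(answer)
    if not 0.0 <= score <= 1.0:
        raise ValueError("grader must return a score in [0, 1]")
    destination = "user" if score >= threshold else "human_review"
    return Routed(destination, answer, score)
```

Keeping the grader as a plain callable lets you swap in a stub for tests and a real model call in production without touching the routing logic.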

Is fine-tuning worth it, or is prompting enough?

Prompting + few-shot examples + RAG covers 95% of use cases as of 2026. Fine-tune only when (a) you have 1,000+ high-quality training examples, (b) the task is narrow and stable, and (c) latency or cost matters more than flexibility. Most "we should fine-tune" instincts are premature.

How do I handle LLM API failures gracefully?

Use exponential backoff with jitter, capped at 2–3 retries, then switch to a fallback model (e.g. Claude if OpenAI is down). Surface a "service degraded" message to users rather than leaving them waiting silently. Log every failure to a queue you can replay when the provider recovers.
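A minimal sketch of that retry-then-fallback flow follows. Each provider is modelled as a plain callable that takes a prompt and returns text. `ProviderError`, `call_with_fallback`, and the delay values are hypothetical names and numbers for illustration, not any SDK's API. The jitter used here is the "full jitter" variant, which sleeps for a random time up to the capped exponential delay.

```python
import random
import time
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error type standing in for a provider's transient failures."""


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    max_retries: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests do not wait
) -> str:
    """Try each provider in order; retry each with exponential backoff and full jitter."""
    failures: list[Exception] = []
    for provider in providers:
        for attempt in range(max_retries + 1):  # first try plus max_retries retries
            try:
                return provider(prompt)
            except ProviderError as exc:
                failures.append(exc)  # in production, also push to a replayable queue
                if attempt == max_retries:
                    break  # this provider is exhausted; move on to the fallback
                # Full jitter: random wait up to the capped exponential delay.
                sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
    raise RuntimeError(
        f"service degraded: all providers failed ({len(failures)} errors)"
    )
```

Catch the final `RuntimeError` at the request boundary and turn it into the user-facing "service degraded" message.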