Most Teams Are Still Pretending to Build With AI

March 26, 2026

There's a number that keeps coming up in every AI report right now: 11%.

That's the share of organizations using agentic AI in production, according to Gartner. Meanwhile, 30% are "exploring" and 38% are "piloting." Everyone has a demo. Almost nobody has shipped anything real.

We've spent the last year integrating AI into real products, including our own open source tools like tanam, a self-hosted CMS with AI ghostwriting built on Firebase Genkit, and genkitx-deepseek, our Genkit plugin for DeepSeek models. So when we read trend reports about where AI is heading, we tend to filter them through a different lens.

It stopped being about autocomplete a while ago

The real shift in 2026 isn't that models got smarter. It's that they can now reason across multi-step workflows: invoking tools, interpreting results, looping back, and completing tasks without a human watching over every step.
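That loop can be sketched in a few lines of deterministic code. Everything here is a hypothetical stand-in (the `Model` function, the `search` tool, the step budget); the point is the shape: the model proposes, code executes, the result feeds back in, and a hard step limit prevents runaway loops.

```typescript
// Minimal sketch of an agentic loop. `Model` and the tool set are
// illustrative stubs, not any real API.

type Action =
  | { kind: "tool"; name: string; input: string }
  | { kind: "done"; answer: string };

type Model = (history: string[]) => Action;

const tools: Record<string, (input: string) => string> = {
  search: (q) => `results for "${q}"`, // stub tool
};

function runAgent(model: Model, task: string, maxSteps = 5): string {
  const history = [task];
  for (let step = 0; step < maxSteps; step++) {
    const action = model(history);
    if (action.kind === "done") return action.answer; // model decided it's finished
    const tool = tools[action.name];
    if (!tool) throw new Error(`unknown tool: ${action.name}`);
    history.push(tool(action.input)); // loop the result back into the context
  }
  throw new Error("step budget exhausted"); // hard stop, never an infinite loop
}
```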

Gartner predicts 40% of enterprise apps will include task-specific AI agents by the end of the year, up from less than 5% twelve months ago. We're seeing that in our own work. Clients stopped asking "should we add AI?" around the middle of last year. The question now is "where exactly should the agent hand off to a human?" That's a much harder question to answer well.

Because here's the thing: the engineering challenge isn't prompt engineering or model selection. It's system design. Where do you draw the line between what the agent decides on its own and what a person still needs to confirm? What happens when a workflow fails halfway through? How do you get any real visibility into a chain of model calls spread across multiple services? These are architecture problems. Teams that go in thinking they're solving an AI problem usually end up stuck and frustrated.
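One way we've found useful to think about that boundary: make it an explicit policy that deterministic code consults before any side effect runs, rather than a judgment buried in a prompt. The action names and confidence threshold below are illustrative, not a recommendation for any particular product.

```typescript
// A sketch of an agent/human handoff gate. Irreversible or
// customer-facing actions always escalate; everything else escalates
// when the model's self-reported confidence is low.

type Decision = "auto" | "needs_human";

interface ProposedAction {
  name: string;        // e.g. "send_email", "tag_ticket" (illustrative)
  confidence: number;  // 0..1
}

const ALWAYS_CONFIRM = new Set(["send_email", "issue_refund"]);

function gate(action: ProposedAction, minConfidence = 0.9): Decision {
  if (ALWAYS_CONFIRM.has(action.name)) return "needs_human";
  if (action.confidence < minConfidence) return "needs_human";
  return "auto";
}
```

Because the gate is plain code, it can be unit-tested and audited, which a prompt cannot.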

Smaller and more focused is winning

One shift that doesn't get talked about enough: smaller, purpose-built models are consistently outperforming large generalists for specific production use cases.

If you're building a feature with one clear job (classifying support tickets, pulling structured data out of messy input, summarizing transcripts), a focused model will be faster, cheaper, and more reliable than routing everything through the biggest model you can access. We built genkitx-deepseek partly because of this. DeepSeek's models punch well above their weight for code and reasoning tasks at a fraction of the cost. When you're optimizing for latency and cost in a real production environment, that gap adds up fast.
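In practice this often reduces to a routing function. The model names and task taxonomy below are made up for illustration; what matters is that the routing decision lives in plain, testable code instead of defaulting every request to the largest model.

```typescript
// Illustrative task-based model routing. Names are placeholders.

type Task = "classify" | "extract" | "summarize" | "open_ended";

function pickModel(task: Task): string {
  switch (task) {
    case "classify":
    case "extract":
      return "small-focused-model"; // one clear job: fast, cheap, reliable
    case "summarize":
      return "mid-sized-model";
    case "open_ended":
      return "large-generalist";    // only when the task truly needs it
  }
}
```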

The question worth asking isn't which model scores best in a benchmark. It's which model works best for this specific job, within this latency budget, at this price point. Those are different questions.

You still need deterministic code around it

The teams building the most reliable AI features right now aren't running everything through a language model and crossing their fingers. They're combining LLMs with structured, deterministic systems: retrieval layers, validation logic, code that doesn't hallucinate.

We do this in tanam. The AI ghostwriting feature isn't a black box where content just appears. The model generates a draft, but publishing logic, schema validation, and user review are all handled by code that behaves predictably. That boundary is intentional, and it's what makes the feature trustworthy enough to put in front of users. Once you let a model own the whole pipeline end to end, things get unpredictable in ways that are hard to debug.
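The shape of that boundary looks roughly like this. The schema and field rules below are illustrative, not tanam's actual validation logic: the model fills in a draft, but only deterministic code decides whether it gets published.

```typescript
// Sketch: the model produces a draft; a predictable validation gate
// decides whether publishing can proceed. Field rules are illustrative.

interface Draft {
  title?: unknown;
  body?: unknown;
}

interface ValidationResult {
  ok: boolean;
  errors: string[];
}

function validateDraft(draft: Draft): ValidationResult {
  const errors: string[] = [];
  if (typeof draft.title !== "string" || draft.title.trim() === "") {
    errors.push("title must be a non-empty string");
  }
  if (typeof draft.body !== "string" || draft.body.length < 50) {
    errors.push("body must be at least 50 characters");
  }
  return { ok: errors.length === 0, errors };
}

// Publishing only ever happens through this gate; the model never
// triggers side effects directly.
function publishIfValid(draft: Draft, publish: (d: Draft) => void): ValidationResult {
  const result = validateDraft(draft);
  if (result.ok) publish(draft);
  return result;
}
```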

What we've seen go wrong

The pattern we see most often is teams building on hype timelines. The gap between "this looks great in a demo" and "this holds up for thousands of real users" is bigger than almost anyone budgets for, and closing it takes longer than expected.

The AI features that have shipped well in our projects share one thing: a tight, well-defined scope. The broader the automation, the harder it becomes to test, monitor, and recover from when something breaks. Starting narrow isn't a compromise. It's the smarter way in. And before you spend time designing what the AI does, it's worth spending at least as much time designing what happens when it's wrong, or slow, or uncertain. Teams that skip that conversation tend to find out the hard way, usually in production.
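Designing for "wrong, slow, or uncertain" can be as simple as never letting a model call run without a deadline and a deterministic fallback. This is a generic sketch, not a pattern from any specific framework:

```typescript
// Give every model call a deadline and a deterministic fallback,
// so a slow or hung call degrades gracefully instead of blocking.

function withDeadline<T>(
  call: () => Promise<T>,
  ms: number,
  fallback: T,
): Promise<T> {
  const timeout = new Promise<T>((resolve) =>
    setTimeout(() => resolve(fallback), ms), // slow → fall back, don't hang
  );
  return Promise.race([call(), timeout]);
}
```

The fallback might be a cached answer, a simpler heuristic, or an honest "we couldn't do this automatically" handoff to a human; the important part is deciding it before production, not during an incident.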

Where we think things are heading

2026 feels like the year AI stops being a feature and starts being closer to infrastructure. The teams treating it with the same discipline they'd bring to a database migration or an API contract are the ones building things that will hold up. The rest are still iterating on demos.

We were building on Firebase, Genkit, Flutter, and Google Cloud long before any of this got the "agentic" label. That foundation hasn't changed. What's changed is how much of the work that used to need constant human attention can now be carefully delegated, and how much judgment it still takes to decide what's worth delegating.

If you're figuring out where AI fits in your product, we're happy to think through it with you.
