Cyberax AI Playbook
cyberax.com
Explainer · Foundations

What "AI agents" actually are (and aren't)

The "AI agent" pitch is now applied to everything from a smart chatbot to a multi-step workflow tool to true autonomous software. This explainer covers what the term actually means in 2026, where the capability lives today, and how to evaluate the gap between agent demos and production-ready agents before you buy.

At a glance
Last verified · May 2026
Problem solved Cut through the "AI agent" marketing — what the term actually means in 2026, where capabilities really sit, and how to evaluate agent products against the work they claim to do
Best for Founders evaluating AI agent vendors, ops leaders deciding whether to buy, technical operators building agentic workflows
Tools ChatGPT, Claude, Manus, LangChain, AutoGen, Replit Agent
Difficulty Beginner
Cost Varies — agent products span free-tier consumer to $50,000+/year enterprise

An AI agent is AI that can take actions in sequence, not just generate one response. Instead of producing a single answer, an agent decides what to do next, calls a tool or an API (application programming interface — the way one piece of software calls another), reads the result, and chooses the next step. That’s the core idea. Everything else is a matter of how much the system actually decides on its own.
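
The decide → act → observe loop described above can be sketched in a few lines of Python. Everything here — the tool names, the `choose_next_step` helper, the order ID — is an illustrative placeholder, not any vendor's or framework's actual API:

```python
# Minimal agent loop sketch: the model decides, a tool runs, the result
# feeds back into the next decision. All names are hypothetical.

def lookup_order(order_id: str) -> str:
    """Stand-in for a real API call to an order system."""
    return f"Order {order_id}: shipped, arriving Thursday"

TOOLS = {"lookup_order": lookup_order}

def choose_next_step(goal: str, history: list) -> dict:
    """Stand-in for the LLM call that picks the next action.
    A real agent would send the goal and history to a model here."""
    if not history:
        return {"action": "lookup_order", "input": "A-1042"}
    return {"action": "finish", "input": history[-1]}

def run_agent(goal: str, max_steps: int = 5) -> str:
    history = []
    for _ in range(max_steps):
        step = choose_next_step(goal, history)
        if step["action"] == "finish":
            return step["input"]           # the agent decides it is done
        result = TOOLS[step["action"]](step["input"])
        history.append(result)             # observe the result, loop again
    return "Stopped: step budget exhausted"

print(run_agent("Where is order A-1042?"))
```

The loop itself is simple; the hard part in production is the `choose_next_step` call being right often enough, which is exactly where the rungs below diverge.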

The label has been stretched. Vendors now apply “agent” to anything from a chatbot with one tool call, to a multi-step workflow runner, to a genuinely autonomous research system. When someone says “we built an AI agent for X,” the listener has no clear idea what was actually built. The marketing-versus-reality gap is the largest of any current AI category.

This piece is the honest taxonomy: the five rungs of what “agent” actually covers, where capability genuinely sits in 2026, and the questions that cut through vendor claims.

The mental model

The spectrum of what 'agent' covers

The label applies across a wide capability range:

  • Tool-using chatbots. A chatbot that can call one or two external functions (look up an order, schedule a meeting) is increasingly called an “agent.” This is the lightest version — a slight extension of conversational AI, not autonomous behaviour.

  • Multi-step workflow runners. A system that takes a goal (“send a follow-up email to the prospect after the call”) and runs through a defined sequence of steps. The sequence is mostly pre-determined; the AI fills in the language and small decisions. Tools like Zapier’s AI features, Make’s agent modules, and many “AI workflow” products fit here.

  • Tool-orchestrating agents. Systems that decide which tools to call and in what order based on the situation. Given a goal, they plan, call APIs, observe results, adapt. ReAct-style agents in LangChain, OpenAI’s function-calling-loop patterns, Claude with tool use. This is where the term most rigorously applies; the capability is real and increasingly production-ready for well-defined tasks.

  • Autonomous multi-step research / problem-solvers. Systems that handle open-ended goals — “research this company and produce a brief,” “investigate this customer issue and recommend an action.” These chain together many tool calls, manage long contexts, and self-correct. Products like Manus and the agent capabilities in Claude Sonnet 4.5+/Opus 4.7 are here. The capability is real but still bounded; multi-hour autonomous runs work sometimes and fail in interesting ways.

  • Truly autonomous software systems. Software that operates without human intervention on production work over long horizons. This is mostly future tense — demos exist, production deployments are narrow.

The vendor’s “agent” can be any of these. The customer’s expectation is usually the third or fourth; the product is usually the first or second. Hence the disappointment cycle.

Where agent capability genuinely sits in 2026

What works today, what doesn't

Works reliably: Tool-using chatbots, multi-step workflows with clear paths, tool-orchestrating agents on well-defined tasks (single-domain customer service, structured data extraction, defined research questions). These are production-deployed at scale across industries.

Works with care: Autonomous research and problem-solving on open-ended tasks within defined boundaries. The agent can produce useful work but needs human oversight on outputs; long-horizon runs are not reliably correct without intervention.

Mostly doesn’t work yet: Truly autonomous software handling consequential decisions over weeks. Demos exist; production at scale doesn’t. The agent layer can plan and execute but the failure modes are still substantial enough that consequential autonomy is rare in 2026.

How to evaluate vendor 'agent' claims

The questions that cut through marketing

When a vendor pitches an “agent” product, ask:

  • What does the agent actually decide? If the answer is “fills in language for a pre-defined workflow,” it’s a workflow runner with AI augmentation. If the answer is “picks tools and chains them based on the situation,” it’s a real orchestration agent. Both can be useful; understand which you’re buying.

  • What’s the failure rate on a real task? Don’t accept demo videos. Ask for the agent’s success rate on a representative task set; ask what failure modes look like. Honest vendors will share this; pitch-heavy vendors will deflect.

  • What’s the human-in-the-loop pattern? Most production agents need oversight on outputs. Where in the workflow does the human check? If “nowhere” is the answer, either the use case is trivial or the system is over-claiming.

  • How does the agent handle the long tail? The 80% of common cases work in demos; the 20% of edge cases are where production agents struggle. Ask specifically about the edge cases in your domain.

  • What’s the cost per run? Multi-step agent runs consume many LLM calls. Cost per task can be substantial at production volume. Get the realistic number, not the demo number.
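
The cost-per-run question is easy to sanity-check yourself before trusting a vendor's number. A rough back-of-envelope, with every price and token count an assumption you should replace with your own measurements:

```python
# Back-of-envelope cost per agent run. All numbers are placeholder
# assumptions; substitute your model's actual pricing and your
# measured token usage per step.

PRICE_IN_PER_MTOK = 3.00    # $ per million input tokens (assumed)
PRICE_OUT_PER_MTOK = 15.00  # $ per million output tokens (assumed)

def cost_per_run(steps: int, in_tok_per_step: int, out_tok_per_step: int) -> float:
    """Each agent step is one LLM call; context grows over a run, so
    treat in_tok_per_step as an average across the whole run."""
    in_cost = steps * in_tok_per_step / 1e6 * PRICE_IN_PER_MTOK
    out_cost = steps * out_tok_per_step / 1e6 * PRICE_OUT_PER_MTOK
    return in_cost + out_cost

# A 10-step run averaging 4k input / 500 output tokens per step:
c = cost_per_run(steps=10, in_tok_per_step=4000, out_tok_per_step=500)
print(f"${c:.2f} per run")  # then multiply by your monthly task volume
```

Multiply the result by monthly task volume before comparing build-vs-buy; per-run pennies become real money at production scale.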

The numbers

What agent deployments actually look like at scale

Typical cost per agent run (multi-step tool-using) $0.05–$5 per task depending on complexity
Success rate on well-defined tasks (single domain, clear goal) 80–95% in production
Success rate on open-ended tasks (research, broad reasoning) 50–75% — needs review
Failure mode at long horizon (multi-hour autonomous runs) Material; long autonomous runs are not yet reliably correct without intervention
Production-ready agent verticals (categories where it works today) Customer service triage, structured data extraction, defined research, sales qualification
Categories where agents are still experimental Open-ended product development, autonomous engineering, complex multi-party negotiations, regulated decision-making
Build vs buy tipping point Buy when the agent task is well-defined and a vendor product matches; build when the workflow is bespoke and you have engineering capacity

The “agent” framing is real in some categories and mostly aspirational in others. The bounded-capability framing — “this works for X, doesn’t yet work for Y” — beats both the hype and the dismissal.

What's next

Related work

For the specific case of AI agents in customer-facing roles, see AI agents for inbound qualification. For the live-chat AI framework, see Live-chat AI: when it works and when it actively hurts trust. For the broader hallucination risk in agentic systems, see AI hallucinations explained. For the open-source-vs-proprietary lens, see Open-source vs proprietary AI — practical tradeoffs.

Common questions

FAQ

Should we build our own agent or buy a vendor product?

Buy when your task matches a vendor's domain and the cost is reasonable; build when your workflow is bespoke enough that no vendor matches. Common pattern: buy for the standard tasks (customer support, lead qualification, meeting scheduling), build for your unique competitive workflows. Building is meaningfully more work than the demos suggest; budget conservatively.

Are AI agents replacing jobs?

Some, with caveats. Agents work well in narrow, well-defined task categories — those jobs see automation pressure. Open-ended, judgement-heavy, relationship-driven, high-stakes work is mostly safe through 2026 and beyond. The job-impact picture is uneven; honest scenarios involve transformation more than wholesale replacement.

How do I distinguish a real agent from a marketing-labelled chatbot?

Three questions. (1) Does it call tools? Real agents interact with systems. (2) Does it plan? Real agents decide multi-step sequences based on the situation. (3) Can it recover from failure? Real agents observe results and adapt; chatbots execute regardless. If a product fails one of these tests, it's a different category than what the agent label suggests.
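
The three-question test maps cleanly onto the taxonomy above. A toy sketch — the answers are things you establish from a demo, and the labels are this piece's rungs, not an industry standard:

```python
# Toy classifier for the three-question test. Each boolean is an answer
# you establish from a vendor demo; labels mirror the taxonomy above.

def classify(calls_tools: bool, plans_steps: bool, recovers: bool) -> str:
    if not calls_tools:
        return "chatbot"              # no system interaction at all
    if not plans_steps:
        return "tool-using chatbot"   # fixed tool calls, no sequencing
    if not recovers:
        return "workflow runner"      # executes a plan, can't adapt
    return "orchestrating agent"      # plans, observes, self-corrects

print(classify(calls_tools=True, plans_steps=True, recovers=False))
```

A product that fails question (3) but passes (1) and (2) lands on the workflow-runner rung — useful, but priced and evaluated differently from a true orchestrating agent.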

What's the realistic horizon for autonomous agents handling consequential work?

Categories vary. Narrow, well-defined consequential work (specific medical decisions with strong oversight, structured trading within bounds, defined manufacturing-control tasks) is partially happening now. Open-ended consequential work (running a business autonomously, complex legal decisions) remains future-tense. The arc points toward more autonomy in defined domains; sweeping general-purpose autonomy is further away than vendor pitches suggest.

Change history (1 entry)
  • 2026-05-13 Initial publication.