Cyberax AI Playbook
cyberax.com
Explainer · Foundations

What "AI agents" actually are (and aren't)

The "AI agent" pitch is now applied to everything from a smart chatbot to a multi-step workflow tool to true autonomous software. This explainer covers what the term actually means in 2026, where the capability lives today, and how to evaluate the gap between agent demos and production-ready agents before you buy.

At a glance
Last verified · May 2026
Problem solved Cut through the "AI agent" marketing — what the term actually means in 2026, where capabilities really sit, and how to evaluate agent products against the work they claim to do
Best for Founders evaluating AI agent vendors, ops leaders deciding whether to buy, technical operators building agentic workflows
Tools ChatGPT, Claude, Manus, LangChain, AutoGen, Replit Agent
Difficulty Beginner
Cost Varies — agent products span free-tier consumer to $50,000+/year enterprise

An AI agent is AI that can take actions in sequence, not just generate one response. Instead of producing a single answer, an agent decides what to do next, calls a tool or an API (application programming interface — the way one piece of software calls another), reads the result, and chooses the next step. That’s the core idea. Everything else is a matter of how much the system actually decides on its own.
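
The decide → act → observe loop described above can be sketched in a few lines of Python. Everything here — the tool names, the `choose_next_step` helper, the order ID — is an illustrative placeholder, not any vendor's or framework's actual API:

```python
# Minimal agent loop sketch: the model decides, a tool runs, the result
# feeds back into the next decision. All names are hypothetical.

def lookup_order(order_id: str) -> str:
    """Stand-in for a real API call to an order system."""
    return f"Order {order_id}: shipped, arriving Thursday"

TOOLS = {"lookup_order": lookup_order}

def choose_next_step(goal: str, history: list) -> dict:
    """Stand-in for the LLM call that picks the next action.
    A real agent would send the goal and history to a model here."""
    if not history:
        return {"action": "lookup_order", "input": "A-1042"}
    return {"action": "finish", "input": history[-1]}

def run_agent(goal: str, max_steps: int = 5) -> str:
    history = []
    for _ in range(max_steps):
        step = choose_next_step(goal, history)
        if step["action"] == "finish":
            return step["input"]           # the agent decides it is done
        result = TOOLS[step["action"]](step["input"])
        history.append(result)             # observe the result, loop again
    return "Stopped: step budget exhausted"

print(run_agent("Where is order A-1042?"))
```

The loop itself is simple; the hard part in production is the `choose_next_step` call being right often enough, which is exactly where the rungs below diverge.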

The label has been stretched. Vendors now apply “agent” to anything from a chatbot with one tool call, to a multi-step workflow runner, to a genuinely autonomous research system. When someone says “we built an AI agent for X,” the listener has no clear idea what was actually built. The marketing-versus-reality gap is the largest of any current AI category.

This piece is the honest taxonomy: the five rungs of what “agent” actually covers, where capability genuinely sits in 2026, and the questions that cut through vendor claims.

The mental model

The spectrum of what 'agent' covers

The label applies across a wide capability range:

  • Tool-using chatbots. A chatbot that can call one or two external functions (look up an order, schedule a meeting) is increasingly called an “agent.” This is the lightest version — a slight extension of conversational AI, not autonomous behaviour.

  • Multi-step workflow runners. A system that takes a goal (“send a follow-up email to the prospect after the call”) and runs through a defined sequence of steps. The sequence is mostly pre-determined; the AI fills in the language and small decisions. Tools like Zapier’s AI features, Make’s agent modules, and many “AI workflow” products fit here.

  • Tool-orchestrating agents. Systems that decide which tools to call and in what order based on the situation. Given a goal, they plan, call APIs, observe results, adapt. ReAct-style agents in LangChain, OpenAI’s function-calling-loop patterns, Claude with tool use. This is where the term most rigorously applies; the capability is real and increasingly production-ready for well-defined tasks.

  • Autonomous multi-step research / problem-solvers. Systems that handle open-ended goals — “research this company and produce a brief,” “investigate this customer issue and recommend an action.” These chain together many tool calls, manage long contexts, and self-correct. Products like Manus and the agent capabilities in Claude Sonnet 4.5+/Opus 4.7 are here. The capability is real but still bounded; multi-hour autonomous runs work sometimes and fail in interesting ways.

  • Truly autonomous software systems. Software that operates without human intervention on production work over long horizons. This is mostly future tense — demos exist, production deployments are narrow.

The vendor’s “agent” can be any of these. The customer’s expectation is usually the third or fourth; the product is usually the first or second. Hence the disappointment cycle.

Where agent capability genuinely sits in 2026

What works today, what doesn't

Works reliably: Tool-using chatbots, multi-step workflows with clear paths, tool-orchestrating agents on well-defined tasks (single-domain customer service, structured data extraction, defined research questions). These are production-deployed at scale across industries.

Works with care: Autonomous research and problem-solving on open-ended tasks within defined boundaries. The agent can produce useful work but needs human oversight on outputs; long-horizon runs are not reliably correct without intervention.

Mostly doesn’t work yet: Truly autonomous software handling consequential decisions over weeks. Demos exist; production at scale doesn’t. The agent layer can plan and execute but the failure modes are still substantial enough that consequential autonomy is rare in 2026.

How to evaluate vendor 'agent' claims

The questions that cut through marketing

When a vendor pitches an “agent” product, ask:

  • What does the agent actually decide? If the answer is “fills in language for a pre-defined workflow,” it’s a workflow runner with AI augmentation. If the answer is “picks tools and chains them based on the situation,” it’s a real orchestration agent. Both can be useful; understand which you’re buying.

  • What’s the failure rate on a real task? Don’t accept demo videos. Ask for the agent’s success rate on a representative task set; ask what failure modes look like. Honest vendors will share this; pitch-heavy vendors will deflect.

  • What’s the human-in-the-loop pattern? Most production agents need oversight on outputs. Where in the workflow does the human check? If “nowhere” is the answer, either the use case is trivial or the system is over-claiming.

  • How does the agent handle the long tail? The 80% of common cases work in demos; the 20% of edge cases are where production agents struggle. Ask specifically about the edge cases in your domain.

  • What’s the cost per run? Multi-step agent runs consume many LLM calls. Cost per task can be substantial at production volume. Get the realistic number, not the demo number.
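
The cost-per-run question is easy to sanity-check yourself before trusting a vendor's number. A rough back-of-envelope, with every price and token count an assumption you should replace with your own measurements:

```python
# Back-of-envelope cost per agent run. All numbers are placeholder
# assumptions; substitute your model's actual pricing and your
# measured token usage per step.

PRICE_IN_PER_MTOK = 3.00    # $ per million input tokens (assumed)
PRICE_OUT_PER_MTOK = 15.00  # $ per million output tokens (assumed)

def cost_per_run(steps: int, in_tok_per_step: int, out_tok_per_step: int) -> float:
    """Each agent step is one LLM call; context grows over a run, so
    treat in_tok_per_step as an average across the whole run."""
    in_cost = steps * in_tok_per_step / 1e6 * PRICE_IN_PER_MTOK
    out_cost = steps * out_tok_per_step / 1e6 * PRICE_OUT_PER_MTOK
    return in_cost + out_cost

# A 10-step run averaging 4k input / 500 output tokens per step:
c = cost_per_run(steps=10, in_tok_per_step=4000, out_tok_per_step=500)
print(f"${c:.2f} per run")  # then multiply by your monthly task volume
```

Multiply the result by monthly task volume before comparing build-vs-buy; per-run pennies become real money at production scale.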

The numbers

What agent deployments actually look like at scale

Typical cost per agent run (multi-step tool-using) $0.05–$5 per task depending on complexity
Success rate on well-defined tasks (single domain, clear goal) 80–95% in production
Success rate on open-ended tasks (research, broad reasoning) 50–75% — needs review
Failure mode at long horizon (multi-hour autonomous runs) Material; long autonomous runs are not yet reliably correct without intervention
Production-ready agent verticals (categories where it works today) Customer service triage, structured data extraction, defined research, sales qualification
Categories where agents are still experimental Open-ended product development, autonomous engineering, complex multi-party negotiations, regulated decision-making
Build vs buy tipping point Buy when the agent task is well-defined and a vendor product matches; build when the workflow is bespoke and you have engineering capacity

The “agent” framing is real in some categories and mostly aspirational in others. The bounded-capability framing — “this works for X, doesn’t yet work for Y” — beats both the hype and the dismissal.

What's next

Related work

For the specific case of AI agents in customer-facing roles, see AI agents for inbound qualification. For the live-chat AI framework, see Live-chat AI: when it works and when it actively hurts trust. For the broader hallucination risk in agentic systems, see AI hallucinations explained. For the open-source-vs-proprietary lens, see Open-source vs proprietary AI — practical tradeoffs.

Common questions

FAQ

Should we build our own agent or buy a vendor product?

Buy when your task matches a vendor's domain and the cost is reasonable; build when your workflow is bespoke enough that no vendor matches. Common pattern: buy for the standard tasks (customer support, lead qualification, meeting scheduling), build for your unique competitive workflows. Building is meaningfully more work than the demos suggest; budget conservatively.

Are AI agents replacing jobs?

Some, with caveats. Agents work well in narrow, well-defined task categories — those jobs see automation pressure. Open-ended, judgement-heavy, relationship-driven, high-stakes work is mostly safe through 2026 and beyond. The job-impact picture is uneven; honest scenarios involve transformation more than wholesale replacement.

How do I distinguish a real agent from a marketing-labelled chatbot?

Three questions. (1) Does it call tools? Real agents interact with systems. (2) Does it plan? Real agents decide multi-step sequences based on the situation. (3) Can it recover from failure? Real agents observe results and adapt; chatbots execute regardless. If a product fails one of these tests, it's a different category than what the agent label suggests.
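
The three-question test maps cleanly onto the taxonomy above. A toy sketch — the answers are things you establish from a demo, and the labels are this piece's rungs, not an industry standard:

```python
# Toy classifier for the three-question test. Each boolean is an answer
# you establish from a vendor demo; labels mirror the taxonomy above.

def classify(calls_tools: bool, plans_steps: bool, recovers: bool) -> str:
    if not calls_tools:
        return "chatbot"              # no system interaction at all
    if not plans_steps:
        return "tool-using chatbot"   # fixed tool calls, no sequencing
    if not recovers:
        return "workflow runner"      # executes a plan, can't adapt
    return "orchestrating agent"      # plans, observes, self-corrects

print(classify(calls_tools=True, plans_steps=True, recovers=False))
```

A product that fails question (3) but passes (1) and (2) lands on the workflow-runner rung — useful, but priced and evaluated differently from a true orchestrating agent.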

What's the realistic horizon for autonomous agents handling consequential work?

Categories vary. Narrow, well-defined consequential work (specific medical decisions with strong oversight, structured trading within bounds, defined manufacturing-control tasks) is partially happening now. Open-ended consequential work (running a business autonomously, complex legal decisions) remains future-tense. The arc points toward more autonomy in defined domains; sweeping general-purpose autonomy is further away than vendor pitches suggest.

Change history (1 entry)
  • 2026-05-13 Initial publication.