Key takeaways
- An AI agent inside a mobile app uses a large language model as its brain to plan and take real actions (calling your APIs, fetching data, completing multi-step tasks) toward a goal, not just to chat.
- 2026 is the inflection point. Gartner projects 40% of enterprise apps will feature task-specific AI agents by the end of 2026, up from under 5% in 2025. On-device models from Apple and Google shipped through 2025, and frontier model prices fell by roughly two-thirds at the top tier.
- The same Gartner forecast also says over 40% of agentic AI projects will be cancelled by the end of 2027. The teams that succeed pick one narrow use case, keep a human in the loop, scope tools tightly, and choose the cheapest model that clears their quality bar.
- Costs run from cents per session on small models to dollars per session on frontier models. Custom build budgets commonly land between $20,000 for a simple agent feature and $250,000+ for complex, regulated products.
What an AI agent actually is, in plain English
An AI agent is a piece of software that can understand a goal stated in natural language, decide what steps are needed, take actions using tools you give it (search a database, call a payment API, book a slot), observe the results, and keep going until the task is done. Anthropic draws the key line between workflows and agents: workflows are systems where LLMs and tools are orchestrated through predefined code paths, while agents are systems where LLMs dynamically direct their own processes and tool usage. OpenAI's framing is similar: agents are systems that independently accomplish tasks on your behalf, using a model to execute instructions and tools to take action, always inside clearly defined guardrails.
The practical test for a mobile app: if a feature only answers questions, it is a chatbot or an LLM feature. If it can decide what to do next and take action across your app's systems, it is an agent. That distinction matters because the engineering, the cost model, and the risk profile all change once the model starts doing things instead of just saying them. We dig into how to govern that safely in our guide to audit-ready AI agents.
Agent vs chatbot vs LLM feature vs automation
Four similar-looking categories that are different products:
- Traditional chatbot: reactive. Answers a question and stops. Salesforce frames the gap nicely as a vending machine versus a personal chef.
- LLM-powered feature: a single model call. Summarise this, classify that, rewrite this. The cost is predictable and the failure modes are bounded.
- Simple automation: if-this-then-that rules with no reasoning.
- AI agent: reasons in a loop, chooses tools, adapts to what it observes. Anthropic describes agents as language models working in loops with tools and environmental feedback.
If you are scoping a first version and the answer to your problem is closer to category two or three, build that. Adding agentic complexity before you need it is one of the most common ways teams burn cash. The MVP vs full product framework we use for app scoping applies just as cleanly here.
Key concepts a non-technical founder needs
- Agentic AI: AI that acts autonomously toward goals rather than only responding to prompts.
- Tool use / function calling: the mechanism that lets a model call your code. The model returns a structured request, your backend runs the function, and the result feeds back into the loop.
- Reasoning loop (the agent loop): plan, act, observe, repeat.
- Memory and context: short-term (the current session) and long-term (persists across sessions).
- RAG (retrieval-augmented generation): AWS defines it as the process of optimising the output of a language model so it references an authoritative knowledge base outside its training data before generating a response. It grounds answers in your data and reduces hallucination.
- Multi-agent systems: several specialised agents collaborating, often via handoffs where one agent delegates to another.
- On-device vs cloud: on-device models run on the phone (private, offline-capable, free inference, but limited). Cloud agents run on servers (powerful, but cost money per call and need connectivity).
Why 2026 is the inflection point
Several things changed at once.
Capability. Frontier models got materially better at multi-step reasoning and reliable tool calling. We covered the practical implications of the latest jump in our Claude Opus 4.8 deep dive.
On-device models shipped. Apple announced its Foundation Models framework at WWDC on June 9, 2025, giving developers direct Swift access to the on-device language model behind Apple Intelligence with as few as three lines of code, and free inference. Google shipped Gemini Nano through ML Kit GenAI APIs and the AICore system service on Android, running inference locally with no cloud cost.
Frameworks matured. OpenAI Agents SDK, Anthropic tool use and Model Context Protocol (MCP), Google's Agent Development Kit, AWS Bedrock AgentCore, LangGraph, CrewAI, and LiveKit for voice. The plumbing got boring in a good way.
Prices fell. Anthropic's flagship Opus dropped from $15/$75 per million tokens (Opus 4.1) to $5/$25 for the current Opus generation, a roughly two-thirds cut. Cheaper Haiku and Flash models now make routine agent work near-free.
The market signals are loud. Per Gartner, 40% of enterprise applications will feature task-specific AI agents by the end of 2026 (up from less than 5% in 2025). Gartner also projects 33% of enterprise software applications will include agentic AI by 2028 (up from less than 1% in 2024). McKinsey's State of AI survey of 1,993 participants across 105 countries (June to July 2025) found 62% of organisations are at least experimenting with AI agents, 23% are scaling an agentic system in at least one function, but most are still early, with no more than 10% of respondents scaling agents in any single function. McKinsey's data is a useful counterweight to vendor hype.
Treat the market-size figures (Grand View Research's $7.63B in 2025 to $183B by 2033, MarketsandMarkets' $7.84B to $52.62B by 2030, BCC Research's $8B to $48.3B by 2030) as estimates. The wide spread is a reason to be sceptical of any single number.
Real AI agents shipping in mobile apps right now
The category is no longer theoretical. A sample of what's running in production:
- Customer support (fintech): Klarna's OpenAI-powered assistant handled 2.3M conversations (two-thirds of customer service chats) and the equivalent work of 700 full-time agents, with resolution time dropping from 11 minutes to under 2 and a 25% drop in repeat inquiries, across 23 markets and 35+ languages, per Klarna's February 2024 release. Important caveat: by 2025 Klarna re-expanded human support for complex cases after quality concerns. That is the lesson, not a footnote, and it's exactly the kind of pattern fintech teams should plan for. We work this into every project across our fintech industry practice.
- Personal shopping: Amazon Rufus (folding into Alexa for Shopping in 2026). Per Andy Jassy on Amazon's Q4 2025 earnings call, more than 300M customers used Rufus during 2025, driving nearly $12B in incremental annualised sales, and Rufus users were 60% more likely to complete a purchase.
- Travel booking: Expedia's Romie (an AI travel buddy that plans, shops, and monitors trips, joining group chats to suggest plans) and Kayak AI Mode (conversational ChatGPT-powered search, launched October 2025).
- Banking: Bank of America's Erica. Per BoA's August 2025 release, Erica is assisting nearly 50M users, has surpassed 3B client interactions, averages 58M+ interactions per month, and has delivered 1.7B proactive personalised insights.
- Budgeting fintech: Cleo (4M+ downloads) auto-categorises spending, sets adaptive budgets, and offers cash advances with a deliberately informal tone aimed at younger users.
- Healthcare triage: Ada Health uses adaptive questioning and a clinical reasoning engine to return possible causes and next steps. Healthcare AI carries the highest stakes; we cover the regulatory side in our healthcare practice.
- Fitness coaching: Apple's Workout Buddy (unveiled at WWDC 2025) gives spoken generative coaching from real-time workout metrics, processed privately on-device. WHOOP Coach (built with OpenAI/GPT-4) answers health questions and builds bespoke training plans in 50+ languages. For real-world costing on this category, see our fitness app development cost guide.
- Grocery / food: Instacart's Ask Instacart AI search and agentic Cart Assistant recommend items and fill carts. Instacart also integrated agentic shopping into Google Gemini and ChatGPT checkout.
- Real estate: Zillow AI Mode and Redfin Conversational Search (built with Sierra AI, launched November 2025) let users refine listings conversationally with Fair Housing guardrails. Redfin reported conversational-search users viewed nearly 2x as many listings and were 47% more likely to request a tour. Our real estate practice sees similar patterns on listing flows we build.
- Education / tutoring: Duolingo Max's Video Call with Lily (GPT-4) gives real-time spoken conversation practice and remembers prior calls. It expanded to Android in January 2025.
The common pattern: each of these started narrow (one workflow, one user persona, one clear KPI), proved value, and then expanded. The teams that started broad mostly do not appear on any list.
How they work under the hood
A mobile AI agent has seven layers:
- Perception / input: text, voice, images, or in-app context from the user.
- The LLM brain: interprets the goal and plans the steps.
- Tools / function calling: the model calls functions you define to fetch data or take actions, usually returning structured JSON.
- Memory: session state plus optional long-term memory and RAG retrieval from your knowledge base.
- The agent loop: plan, act, observe, repeat until the task is complete.
- Guardrails: input/output validation, content filters, permission limits, budget caps.
- Backend connection: tools call your APIs, increasingly via Model Context Protocol (MCP), which lets you expose services as agent tools in a standard way.
On-device vs cloud (the 2026 hybrid pattern)
On-device: Apple Foundation Models, Gemini Nano via ML Kit, Core ML, LiteRT/TensorFlow Lite, plus cross-platform options like MLC LLM and React Native ExecuTorch. Private, works offline, no per-call inference cost, no app-size penalty for OS-built-in models. The tradeoff is limited capability. Apple's on-device model is roughly a 3-billion-parameter, 2-bit quantised model with about a 4,096-token context window. Apple recommends against using it for code generation or maths and frames it as a highly-efficient formatter and extractor, explicitly not ChatGPT in your pocket.
Cloud: Claude, GPT, Gemini via API. Powerful, large context windows, strong reasoning and tool use, but you pay per token, you need connectivity, and data leaves the device (a privacy and compliance consideration).
Hybrid: run simple, private, or offline tasks on-device and route hard tasks to the cloud via a coordinator that decides per request. This is the emerging 2026 default, including for voice agents.
How to add an AI agent to your mobile app (step by step)
- Pick one narrow, high-value use case. Clear success criteria. A feedback loop. Anthropic's guidance is to add agentic complexity only when simpler solutions fall short.
- Choose a model / provider. Cloud: Anthropic Claude, OpenAI GPT, Google Gemini, or open-source Llama / Qwen. On-device: Apple Foundation Models (iOS 26+ on supported devices) or Gemini Nano on Android.
- Choose an agent framework / SDK. OpenAI Agents SDK (Agents, handoffs, guardrails, sessions, built-in tracing), Anthropic tool use plus MCP, Google ADK with Vertex AI Agent Engine, AWS Bedrock AgentCore, LangGraph, CrewAI. For real-time voice, LiveKit Agents or Vapi.
- Define your tools / functions. Keep them few and clearly named. OpenAI's docs echo Anthropic's guidance: ideally fewer than 10 functions per namespace, because overlapping or vague tools confuse the model.
- Add memory and RAG. Ground the agent in your help centre, catalogue, or docs through a vector store so answers stay accurate. AWS frames RAG as the cost-effective alternative to retraining.
- Integrate into your app. Agent orchestration usually lives server-side. The mobile client (React Native or Flutter) streams tokens over SSE or WebSocket. React Native options include react-native-ai (Vercel AI SDK), React Native ExecuTorch, and MLC LLM for on-device inference. Flutter can use the firebase_ai package plus HTTP streaming. On iOS, Apple's Foundation Models framework is a native Swift API for on-device work. The same delivery discipline we cover in mobile app deployment strategy applies on the release side.
- Add guardrails and budget caps. Input/output validation, content filters, permission scoping, hard spend limits.
- Design the UX. Stream output. Label AI clearly. Make human handoff easy. Show confidence cues for high-stakes answers.
- Test and evaluate. Build evals early. Red-team for prompt injection before launch.
- Deploy with observability. Tracing, logging, monitoring of tool calls, latency, and cost.
For teams that don't have the in-house bench to handle the orchestration, observability, and red-teaming layers, our mobile application development and AI development practice ship the full stack as one engagement.
What it actually costs in 2026
Token-based API pricing, per 1M tokens (input / output), from official vendor pages:
- Anthropic Claude: Haiku 4.5 $1 / $5. Sonnet 4.6 $3 / $15. Current Opus (4.5 through 4.8) $5 / $25. The older Opus 4.1 was $15 / $75, a roughly two-thirds cut at the top. Batch API is 50% off. Prompt-cache hits cost about 10% of base input (roughly 90% off). Opus 4.7 and later use a new tokeniser that can use up to 35% more tokens for the same text, which raises effective cost.
- OpenAI: flagship GPT-5.5 $5 / $30. GPT-5.4 $2.50 / $15. GPT-5.4-mini $0.75 / $4.50. GPT-5.4-nano $0.20 / $1.25 (cheapest current GPT). Cached input is about 90% cheaper. Batch API is 50% off.
- Google Gemini: Gemini 3.1 Pro Preview $2 / $12 (for prompts up to 200K tokens). Gemini 3 Flash Preview $0.50 / $3. Gemini 2.5 Flash-Lite $0.10 / $0.40 (cheapest production model). Free tier exists for Flash-class models (with content used to improve products). Batch API is 50% off.
Three concepts to internalise:
- Prototyping is cheap. Production at scale is where bills grow. Output tokens cost several times more than input across all three vendors, so output-heavy or long-looping agents cost more.
- Routing matters. Use cheap models (Haiku, Flash-Lite, nano) for routine tasks and reserve frontier models for genuinely hard ones. We dug into this lever in detail when GitHub Copilot moved to token billing, covered in our June 2026 Copilot token billing piece.
- On-device inference has zero per-call cost, but is constrained by device hardware.
Tool calls can add fees (hosted web search billed per 1,000 calls, for example). And the development cost (agency / market estimates, not vendor figures): a simple AI agent feature built on existing models commonly runs $20,000 to $80,000. Mid-complexity builds $80,000 to $150,000. Complex or regulated products $150,000 to $250,000+, occasionally exceeding $1M. Ongoing maintenance often runs 15% to 25% of build cost per year. Ongoing API spend for a mid-scale app is frequently cited in the several-hundred to several-thousand dollars per month range, often reducible 50% to 70% through caching, routing, and lighter models. These ranges vary widely; treat them as estimates. For a fuller cost breakdown on the wider app side, see our mobile app design cost guide and affordable app development in the USA piece.
Risks and limitations every team should plan for
- Hallucination. Confident but wrong output. For high-stakes mobile use (medical, financial, legal), this is a safety issue, not a quality issue. Mitigate with RAG grounding, structured output schemas, confidence cues, and mandatory human verification for high-stakes actions.
- Reliability and non-determinism. Agents are stochastic. The same prompt can produce different results, which complicates testing.
- Latency. Real-time voice agents need roughly sub-500ms perceived latency to feel natural. Multi-step loops add delay.
- Cost unpredictability at scale. A single agent task can consume many tokens across multiple loop iterations.
- Security and prompt injection. In the OWASP Top 10 for LLM Applications 2025, prompt injection holds the top spot for the second consecutive edition. Excessive agency (LLM06:2025) is one of the most significantly expanded entries, broken into excessive functionality, excessive permissions, and excessive autonomy the exact risk of giving an agent more tools or permissions than its task requires.
- Privacy and compliance. GDPR, HIPAA, and CCPA all apply when personal data flows to third-party AI. On-device processing avoids the off-device transfer entirely.
- App store review. Apple's November 13, 2025 guideline update (5.1.2(i)) requires you to clearly disclose where personal data will be shared with third parties, including third-party AI, and obtain explicit permission, via in-app consent rather than a buried privacy policy. Google Play's AI-Generated Content policy requires generative AI apps to prevent restricted or offensive output and to include in-app user reporting / flagging.
- The demo-to-production gap. Gartner predicts over 40% of agentic AI projects will be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. Most current projects are early-stage experiments driven by hype, not measured outcomes.
Best practices that separate ship from scrap
- Start narrow with one high-value use case that has clear success criteria and a feedback loop.
- Keep a human in the loop for high-stakes or irreversible actions.
- Scope tools tightly (few, clearly named functions). Limits both attack surface and model confusion.
- Add guardrails and hard budget caps.
- Measure ROI with concrete metrics: resolution rate, cost per task, escalation quality.
- Design for failure: clear escalation paths, stop conditions, graceful fallbacks.
- Choose the right model per task. Cheap for routine, frontier only for hard.
- Classify queries by risk and fully automate only the low-risk bands. Klarna's experience proves the point: a 50% automation rate at 98% accuracy beats a 75% automation rate at 90% accuracy in regulated contexts.
A six-week plan that actually ships
If you are building this against a real budget instead of a research deck, here is the cadence that works for most teams:
- Weeks 1 to 2 (scope and prototype): Pick the one use case. Define success criteria and a human fallback. Prototype with a mid-tier model (Claude Sonnet 4.6, GPT-5.4, or Gemini Flash) to validate quality before optimising cost.
- Weeks 3 to 6 (build the agent): Add tools (few, tightly scoped). Ground the agent in your own content with RAG. Wire in guardrails and budget caps. Build evaluations and red-team for prompt injection before any launch.
- Launch (controlled): Ship to a small cohort. Label AI clearly. Make human escalation one tap away. Instrument tracing and cost monitoring from day one.
- Scale only when metrics hold: Expand tools and traffic only if automated-subset accuracy and cost-per-task clear your thresholds. Pull back scope if accuracy drops below your bar, per-task token cost trends up, or red-teaming surfaces prompt-injection issues.
- Pick the deployment model deliberately: Choose on-device when privacy, offline use, or zero inference cost matter and the task is simple. Choose cloud for complex reasoning and large context. Use hybrid routing to get both.
How Brandrums helps you ship one
The June 2026 shift in AI coding tools (covered in our Copilot token billing piece) made the same point that the agent build path makes here: AI features are a metered cloud service, not a fixed perk. Brandrums helps engineering teams stand up the right discipline around them. We scope and build agent features through our mobile application and AI practices, layer in RAG, evals, and guardrails, and ship into your app with a clear cost model and rollout plan. You can see how we approach delivery in our project portfolio, and our AI and machine learning in modern app development guide covers the practical patterns we reach for first.
Key takeaways
- An AI agent in a mobile app plans and takes real actions toward a goal. If your feature only answers questions, it's a chatbot, not an agent.
- 2026 is the inflection year. Capability is up, on-device models from Apple and Google shipped, frameworks matured, frontier-tier prices fell roughly two-thirds at the top.
- The hybrid pattern wins: route simple, private, or offline tasks to on-device models, send hard problems to the cloud, and design the coordinator deliberately.
- Costs are tractable if you route well. Cheap models for routine work, frontier only for hard problems. Budget caps and observability are non-negotiable.
- The teams that ship pick one narrow use case, keep a human in the loop, scope tools tightly, and measure relentlessly. The 40%+ project cancellation rate Gartner predicts hits the teams that don't.
FAQ
What is an AI agent in a mobile app?
An AI agent is software inside the app that uses a language model to plan and take real actions toward a goal. It can call your APIs, fetch data, complete multi-step tasks, and adapt based on what it observes. The simplest test: a chatbot answers questions; an agent decides what to do next and acts.
What is the difference between an AI agent and a chatbot?
A chatbot is reactive. It answers a question and stops. An agent reasons in a loop, chooses tools, and takes actions across your systems to complete a multi-step task. Salesforce frames it as a vending machine versus a personal chef.
Should I run my AI agent on-device or in the cloud in 2026?
Use on-device for simple, private, or offline tasks (summarising, tagging, parsing) where Apple Foundation Models or Gemini Nano can handle the workload. Use cloud for complex reasoning, large context, or tool-heavy agentic loops. Most production apps in 2026 use a hybrid coordinator that routes per request.
How much does it cost to add an AI agent to a mobile app?
API inference ranges from cents per session on small models (Haiku, Flash-Lite, GPT-5.4-nano) to several dollars per session on frontier models with long loops. Custom build budgets commonly run $20,000 to $80,000 for a simple agent feature, $80,000 to $150,000 for mid-complexity, and $150,000 to $250,000+ for complex or regulated products. Ongoing maintenance often runs 15% to 25% of build cost per year.
What are the main risks of shipping an AI agent in a mobile app?
Hallucination, prompt injection (top of the OWASP LLM Top 10), excessive agency (the agent doing more than its task requires), cost unpredictability at scale, latency on voice and real-time use, and app-store review on data-sharing disclosure. Build guardrails, budget caps, RAG grounding, and a clear human-escalation path before launch.
Which AI agent framework should I use?
OpenAI Agents SDK (agents, handoffs, guardrails, sessions, built-in tracing) is the easiest path on OpenAI models. Anthropic tool use plus Model Context Protocol (MCP) is the cleanest on Claude. Google ADK pairs with Vertex AI Agent Engine. AWS Bedrock AgentCore offers a managed runtime, memory, identity, and policy controls. LangGraph and CrewAI are framework-agnostic. For real-time voice, LiveKit Agents.
Ready to add an AI agent to your app?
The teams that ship agent features in 2026 are the ones that pick a single narrow use case, prototype against a mid-tier model, layer RAG and guardrails, and ship into a small cohort with full observability before scaling. We do this work daily through our mobile application and AI development practices. Tell us what your team wants to ship and we will scope the right agent design, cost model, and rollout plan. Or check our pricing options if you are evaluating engineering support for an upcoming build.



