Edge and On-Device Agents
Running an agent on a phone, a robot, or a laptop with no internet connection changes every assumption you have about agent design. The model is smaller, memory is scarce, latency budgets are tighter, and there is no cloud to fall back on — or at least not reliably. This article covers how to build agents that operate under these constraints: choosing and compressing models, managing limited context, handling connectivity gaps, and designing orchestration loops that respect hardware ceilings.
Why Move Agents to the Edge
The default agent architecture assumes abundant compute and a fast network path to a large model. That works for a coding assistant or a research agent, but it breaks down in several important scenarios.
Privacy and data residency are the most common motivators. A medical device agent that processes patient vitals should not stream raw data to a remote endpoint. An on-device assistant handling biometric unlock patterns has no business sending those patterns anywhere. Running inference locally means sensitive data never leaves the device.
Latency matters when the agent is embedded in a physical control loop. A drone performing obstacle avoidance cannot wait 200 ms for a round trip to a remote model. Neither can a factory-floor robot interpreting visual defects in real time. With a small enough model, on-device inference can respond within a few milliseconds to a few tens of milliseconds, and it does so without the latency variance that any network call carries.
Availability is the third driver. Field agents — think agricultural monitoring, disaster response, or satellite communication equipment — operate in environments where connectivity is intermittent or nonexistent. If your agent cannot function offline, it cannot function at all in these settings.
Cost rounds out the picture. Inference at scale on large cloud models is expensive. If your agent handles millions of interactions per day — like a smart keyboard prediction engine or an in-car assistant — local inference eliminates per-token API costs entirely.
Small Models and Model Selection
On-device agents do not get to use frontier models. A 70B-parameter model requires roughly 140 GB of memory in half precision (2 bytes per parameter), and still about 35 GB when quantized to 4 bits, far beyond what a phone or embedded device can handle.
The good news: small language models have improved dramatically. Models in the 1B–3B range now handle tool calling, structured output, and multi-turn conversation with reasonable quality. The key is matching the model to the task complexity:
- Simple classification and routing — models under 1B parameters work well for intent detection, sentiment, or binary decisions.
- Structured tool calling — 1B–3B models can reliably produce JSON tool calls when fine-tuned on tool-use data.
- Multi-step reasoning — 3B–7B models handle light planning and multi-hop reasoning, though they degrade on tasks that would challenge even large models.
Model selection for the edge is about finding the smallest model that reliably works for your specific task distribution — then adding guardrails for the cases where it does not.
┌─────────────────────────────────────────────────────┐
│                   Task Complexity                   │
│                                                     │
│  Simple (intent, classify)     ──► < 1B params      │
│  Moderate (tool calls, JSON)   ──► 1B – 3B params   │
│  Complex (planning, reasoning) ──► 3B – 7B params   │
│  Beyond device capability      ──► Cloud fallback   │
│                                                     │
└─────────────────────────────────────────────────────┘
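The mapping in the chart can be sketched as a small lookup that returns the smallest tier covering a task. This is a minimal illustration, not a recommended API: the tier names, the task labels (`classify`, `route`, `tool_call`, `plan`), and the `device_limit_b` parameter are all hypothetical, and the coverage sets should come from your own evaluations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    max_params_b: float   # upper bound on parameter count, in billions
    handles: frozenset    # task kinds this tier is assumed to cover


# Hypothetical tiers mirroring the chart; ordered smallest first.
TIERS = [
    ModelTier("tiny", 1.0, frozenset({"classify", "route"})),
    ModelTier("small", 3.0, frozenset({"classify", "route", "tool_call"})),
    ModelTier("medium", 7.0, frozenset({"classify", "route", "tool_call", "plan"})),
]


def pick_tier(task_kind: str, device_limit_b: float) -> str:
    """Return the smallest tier that covers the task and fits the device."""
    for tier in TIERS:
        if task_kind in tier.handles and tier.max_params_b <= device_limit_b:
            return tier.name
    # Nothing on the device covers this task.
    return "cloud_fallback"
```

Iterating from the smallest tier upward encodes the "smallest model that reliably works" rule directly; anything the device cannot cover falls through to the cloud path.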
Quantization and Compression
Even a 3B-parameter model at FP16 (half precision, 2 bytes per weight) demands around 6 GB of memory for its weights alone. On a mobile device with 4–8 GB total RAM (shared with the OS, apps, and GPU), that is unworkable. Quantization is the primary technique for fitting models into tight memory envelopes.
Post-training quantization (PTQ) reduces weight precision after training. Common bit-widths:
- INT8 — halves model size with minimal quality loss. Widely supported on mobile NPUs and GPUs.
- INT4 — quarters model size. Quality degrades slightly on complex reasoning tasks but remains acceptable for structured outputs.
- Mixed precision — keeps attention layers at higher precision (INT8) while quantizing feed-forward layers more aggressively (INT4 or lower). This preserves the model's ability to attend to context while saving memory in the bulk of parameters.
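To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization on a plain list of floats. It is only an illustration of the arithmetic: production toolchains usually quantize per channel or per group and use calibration data, and the function names here are made up for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; guard against an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 0.31]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding to the nearest integer bounds the per-weight error by scale / 2.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The stored model keeps only the integer codes and the scale, which is where the size reduction comes from. The error bound depends on the largest weight in the tensor, which is one reason finer-grained scales tend to lose less quality.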
Quantization-aware training (QAT) incorporates quantization into the training loop itself, producing models designed to operate at reduced precision. QAT typically yields better accuracy than PTQ at the same bit-width (on the order of 1–2 points on task benchmarks, though the gap varies with the model and the bit-width), but requires access to training infrastructure.
Beyond quantization, several other compression techniques apply:
- Knowledge distillation — train the small model to mimic a larger teacher model's outputs. Covered in detail in the discussion on fine-tuning and distillation.
- Pruning — remove redundant weights or attention heads. Structured pruning (removing entire neurons or layers) is more hardware-friendly than unstructured pruning.
- Weight sharing and low-rank factorization — reduce parameter count by sharing weights across layers or decomposing weight matrices.
The trade-off is always quality versus size versus inference speed. A well-quantized 3B model at INT4 occupies roughly 1.5 GB and runs comfortably on a modern phone's NPU.
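The size figures in this section follow from simple arithmetic: parameter count times bits per weight. The sketch below reproduces them; the `fits` helper and its 20% overhead allowance for KV cache and activations are assumptions for illustration, not measured values.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: int,
         ram_budget_gb: float, overhead: float = 0.2) -> bool:
    """Check a model against a RAM budget, with an assumed runtime overhead."""
    needed = weight_memory_gb(params_billions, bits_per_weight) * (1 + overhead)
    return needed <= ram_budget_gb


fp16_gb = weight_memory_gb(3, 16)   # 6.0
int8_gb = weight_memory_gb(3, 8)    # 3.0
int4_gb = weight_memory_gb(3, 4)    # 1.5
```

The budget passed to `fits` should be the memory the app can actually claim, which on a phone is well below total RAM.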
The On-Device Agent Loop
The orchestration loop for an on-device agent looks structurally similar to a cloud agent — observe, reason, act — but every step is constrained.
┌─────────────────────────────────────────────┐
│            On-Device Agent Loop             │
│                                             │
│  ┌─────────┐    ┌─────────┐    ┌────────┐   │
│  │  Input  │───►│  Model  │───►│  Tool  │   │
│  │ (local) │    │ (quant) │    │ (local)│   │
│  └─────────┘    └─────────┘    └────────┘   │
│       ▲              │              │       │
│       │              ▼              │       │
│       │        ┌───────────┐        │       │
│       │        │  Context  │        │       │
│       └────────│  Manager  │◄───────┘       │
│                └───────────┘                │
│                      │                      │
│                      ▼                      │
│               ┌────────────┐                │
│               │   Cloud    │ (when online)  │
│               │  Fallback  │                │
│               └────────────┘                │
└─────────────────────────────────────────────┘
Context management is more aggressive on-device. With context windows often limited to 2K–4K tokens (to keep memory and latency manageable), the agent must be ruthless about what enters the context. Techniques include:
- Summarizing conversation history rather than keeping raw turns.
- Using a fixed-size sliding window and evicting older turns.
- Caching tool results as compact key-value pairs rather than full responses.
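The sliding-window technique can be sketched in a few lines. This is a minimal illustration under two assumptions: the class name `SlidingContext` is invented for the example, and `count_tokens` uses a whitespace word count as a stand-in for the model's real tokenizer.

```python
from collections import deque


def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace word count. Use the model's tokenizer in practice.
    return len(text.split())


class SlidingContext:
    """Keeps the most recent turns that fit within a fixed token budget."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.turns = deque()
        self.used = 0

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        self.used += count_tokens(turn)
        # Evict the oldest turns until within budget, always keeping the newest.
        while self.used > self.max_tokens and len(self.turns) > 1:
            self.used -= count_tokens(self.turns.popleft())

    def render(self) -> str:
        return "\n".join(self.turns)


ctx = SlidingContext(max_tokens=6)
for turn in ["a b c", "d e", "f g h"]:
    ctx.add(turn)
```

A summarizing variant would replace the evicted turns with a short summary entry instead of dropping them outright, at the cost of one extra inference call per eviction.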
Tool execution is local wherever possible. On-device tools include file system access, sensor readings, local databases, and platform APIs. Each tool invocation avoids a network round-trip, but tools must be lightweight — you cannot spawn a heavy subprocess on a resource-constrained device without affecting responsiveness.
Inference scheduling matters on shared hardware. The NPU (neural processing unit) or GPU might also serve the camera pipeline, the display compositor, or other ML tasks. The agent's inference requests must be queued and prioritized to avoid starving other system components.
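One way to picture this queueing is a priority queue in front of the accelerator. The sketch below is a single-threaded illustration only: `InferenceScheduler` and the priority numbers are hypothetical, and on a real device much of this arbitration is done by the OS or the inference runtime rather than by application code.

```python
import heapq
import itertools


class InferenceScheduler:
    """Lower priority number runs first; ties run in arrival order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves FIFO order

    def submit(self, priority: int, name: str, job) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), name, job))

    def run_next(self):
        """Run the highest-priority pending job, or return None if idle."""
        if not self._heap:
            return None
        _, _, name, job = heapq.heappop(self._heap)
        return name, job()


sched = InferenceScheduler()
sched.submit(2, "agent_step", lambda: "plan")
sched.submit(0, "camera_frame", lambda: "frame")   # latency-critical work
sched.submit(2, "agent_summary", lambda: "summary")
order = [sched.run_next() for _ in range(4)]
```

Giving the agent a lower priority than the camera pipeline means agent latency absorbs the contention, which is usually the right trade when the other workload is user-visible in real time.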
Offline Fallback and Hybrid Architectures
Real-world on-device agents use a hybrid approach: handle what you can locally, escalate to the cloud when connectivity is available and the task exceeds local capability.
The design pattern looks like this:
- Classify locally — determine whether the request can be handled on-device. Simple intent classification tells you if the task is within the local model's competence.
- Execute locally when possible — if the local model can handle it, run the full agent loop on-device. No network required.
- Queue for cloud when offline — if the task requires cloud capability but connectivity is unavailable, queue it. Provide the user a degraded but useful response now, and upgrade it when connectivity returns.
- Escalate to cloud when online — for complex tasks that exceed local model capability, route to a larger cloud model. The local agent acts as a pre-processor: summarizing context, extracting key information, and sending a compact payload rather than raw history.
from collections import deque


class HybridRouter:
    def __init__(self, local_model, cloud_client, connectivity_monitor):
        self.local = local_model
        self.cloud = cloud_client
        self.monitor = connectivity_monitor
        self.queue = deque()  # requests waiting for connectivity

    def route(self, request):
        complexity = self.local.classify_complexity(request)
        if complexity == "simple":
            return self.local.run(request)
        if self.monitor.is_online():
            # Send a compact summary rather than the raw history.
            summary = self.local.summarize_context(request)
            return self.cloud.run(summary)
        # Offline: provide degraded local response, queue full request
        self.queue.append(request)
        return self.local.run_best_effort(request)

    def sync_queue(self):
        """Called when connectivity is restored."""
        while self.queue and self.monitor.is_online():
            request = self.queue[0]
            result = self.cloud.run(request)
            # Dequeue only after the cloud call succeeds, so a failure loses nothing.
            self.queue.popleft()
            self.notify_user(result)

    def notify_user(self, result):
        # App-specific delivery: push notification, UI update, local event, etc.
        raise NotImplementedError
The offline queue pattern introduces eventual consistency into agent behavior. The user gets an immediate (possibly lower-quality) answer, then receives an upgraded answer later. This requires careful UX design — you need to communicate that a better answer is coming, without confusing the user with contradictory responses.
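One way to keep the two answers from contradicting each other is to track each response as provisional until the cloud result arrives. The sketch below is an assumption about how such bookkeeping might look; `Answer`, `AnswerStore`, and the request IDs are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    source: str        # "local" or "cloud"
    provisional: bool  # True while a better answer may still arrive


class AnswerStore:
    """Tracks the current answer per request so upgrades replace, not duplicate."""

    def __init__(self):
        self._answers = {}

    def record_local(self, request_id: str, text: str) -> Answer:
        answer = Answer(text, "local", provisional=True)
        self._answers[request_id] = answer
        return answer

    def upgrade(self, request_id: str, cloud_text: str) -> bool:
        """Store the cloud answer; return True if the user should be notified."""
        previous = self._answers.get(request_id)
        self._answers[request_id] = Answer(cloud_text, "cloud", provisional=False)
        # Only interrupt the user when the content actually changed.
        return previous is None or previous.text != cloud_text

    def get(self, request_id: str):
        return self._answers.get(request_id)


store = AnswerStore()
first = store.record_local("req-1", "Likely leaf rust.")
changed = store.upgrade("req-1", "Leaf rust, early stage; treat within a week.")
```

The `provisional` flag gives the UI something to render ("refining this answer when back online"), and the return value of `upgrade` lets the app skip a notification when the cloud model simply confirms the local answer.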
Hardware Considerations
On-device inference runs on a variety of hardware targets, each with different characteristics:
- NPUs (Neural Processing Units) — dedicated silicon for matrix operations. Found in most modern phones and edge SoCs. Offer the best performance-per-watt for inference but have limited programmability.
- Mobile GPUs — more flexible than NPUs, support a wider range of operations, but consume more power. Useful when the model uses operations not supported by the NPU.
- CPUs — the fallback. Slower for inference but universally available. Optimized runtimes (GGML, llama.cpp) make CPU inference viable for small models.
The inference runtime matters as much as the hardware. Frameworks like ONNX Runtime Mobile, TensorFlow Lite, Core ML, and llama.cpp handle the mapping from model to hardware. Each has different operator coverage, quantization support, and platform availability.
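Operator coverage is what usually decides which target a model lands on. A minimal sketch of that decision, assuming made-up operator sets (real coverage comes from each runtime's documentation and varies by device):

```python
# Hypothetical operator coverage per backend, ordered by power efficiency.
BACKENDS = [
    ("npu", {"matmul", "softmax", "layernorm"}),
    ("gpu", {"matmul", "softmax", "layernorm", "gelu", "rope"}),
    ("cpu", None),  # None means "assume every operator is supported"
]


def pick_backend(model_ops):
    """Return the first backend in preference order that supports every op."""
    needed = set(model_ops)
    for name, supported in BACKENDS:
        if supported is None or needed <= supported:
            return name
    raise RuntimeError("no backend available")
```

This all-or-nothing choice is a simplification. Some runtimes can instead split a model, running supported subgraphs on the accelerator and the rest on the CPU, which trades transfer overhead for better utilization.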
Thermal throttling is a real constraint that cloud agents never face. Sustained inference on a phone will heat the device, triggering thermal management that reduces clock speeds. An agent that runs a tight reasoning loop for 30 seconds may find its inference time doubling mid-task as the device throttles. Designing for this means limiting loop iterations and batching inference calls rather than running them continuously.
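A bounded loop with a latency check is one simple way to act on this. The sketch below uses a slowdown in step latency as a proxy for throttling, which is an assumption: a production agent would more likely read the platform's thermal state directly. The function name and the `slowdown_limit` parameter are hypothetical.

```python
import time


def run_bounded_loop(step, max_iters=8, slowdown_limit=2.0, clock=time.monotonic):
    """Run step() until it returns True, the cap is hit, or latency degrades.

    Returns a (status, iterations_run) tuple.
    """
    baseline = None
    for i in range(max_iters):
        start = clock()
        done = step()
        elapsed = clock() - start
        if done:
            return "done", i + 1
        if baseline is None:
            baseline = elapsed  # first step sets the reference latency
        elif baseline > 0 and elapsed > slowdown_limit * baseline:
            # Steps are getting much slower: likely throttling, so back off.
            return "throttled", i + 1
    return "cap_reached", max_iters
```

On a "throttled" result the caller can return a partial answer, pause before resuming, or hand the remaining work to the cloud path if the device is online.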
Trade-Offs and When to Use On-Device Agents
On-device agents are not a universal replacement for cloud agents. They excel in specific scenarios and struggle in others.
Use on-device agents when:
- Privacy requirements prevent data from leaving the device.
- Latency requirements are below what network round-trips can deliver.
- The task is well-defined and within small-model capability (classification, structured extraction, simple tool use).
- Connectivity is unreliable or nonexistent.
- Scale makes per-token API costs prohibitive.
Stay with cloud agents when:
- The task requires frontier-model reasoning (complex planning, nuanced generation, large-context synthesis).
- The tool ecosystem requires network access (web search, API calls, database queries against remote systems).
- The agent needs to coordinate with other agents or services in real time.
- Model updates need to deploy instantly without device-side downloads.
The hybrid approach — local for speed and privacy, cloud for capability — is where most production systems land. The challenge is making the transition seamless: the user should not notice whether their request was handled locally or remotely.
Conclusion
Edge and on-device agents trade model capability for privacy, latency, cost, and availability. Building them well requires choosing the right model size for your task, compressing it aggressively through quantization and distillation, managing context within tight memory budgets, and designing hybrid architectures that gracefully degrade when the local model is insufficient or connectivity drops.
The key insight is that on-device does not mean on-device only. The most effective architectures use local inference as the fast path and cloud inference as the escape hatch — with a clean routing layer that makes the decision transparent to the rest of the system. As small models continue to improve, more tasks will shift to the edge, but the hybrid pattern will remain relevant for any scenario where the task complexity exceeds what a constrained device can handle alone.