Multimodal Agents
Text-only agents operate in a narrow slice of reality. They read strings and produce strings. But the world humans inhabit is rich with images, audio, video, diagrams, charts, screenshots, and spatial layouts. A support agent that cannot see the error screenshot a user uploaded is limited. A coding agent that cannot read a whiteboard diagram of the desired architecture is incomplete. A monitoring agent that cannot interpret a dashboard screenshot is blind to the most common way humans communicate system state.
Multimodal agents process and reason across multiple input modalities — vision, audio, video, and structured media — within the same agent loop. The model accepts these inputs and grounds its reasoning in them, extracts structured information, and uses that information to drive tool calls and decisions.
Multimodal agent architecture treats rich media as a first-class input to reasoning, planning, and action, with the same rigor as text.
How Multimodality Changes the Agent Loop
The standard agent loop — reason, act, observe — assumes textual observations. A tool returns a JSON object; the model reads it and decides the next step. Multimodal agents extend this contract: observations can be images, audio clips, video frames, or combinations. The model must interpret non-textual data, extract relevant information, and fold it into its reasoning.
┌────────────────────────────────────────────────────────────────┐
│                     Multimodal Agent Loop                      │
│                                                                │
│  ┌──────────┐    ┌──────────┐    ┌───────────────────────┐     │
│  │  Reason  │───►│   Act    │───►│        Observe        │     │
│  │          │    │          │    │                       │     │
│  │ (text +  │    │ (tools)  │    │ text │ image │ audio  │     │
│  │ vision)  │    │          │    │      │       │        │     │
│  └──────────┘    └──────────┘    └───────────────────────┘     │
│       ▲                                      │                 │
│       │                                      │                 │
│       └──────────────────────────────────────┘                 │
│                                                                │
│  Context Window:                                               │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ system prompt │ text history │ images │ audio │ tool out │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘
The architectural implications are significant:
Context budget shifts. A single high-resolution image can consume thousands of tokens. An agent that freely ingests images without managing their cost will exhaust its context window far faster than a text-only agent. You need explicit policies: when to include full images, when to downscale, when to extract text and discard the visual.
Observation types multiply. The orchestrator must handle heterogeneous observation formats. A tool might return a screenshot (image), a transcription (text), or a spectrogram (image of audio). The routing logic must tag and format each observation correctly for the model.
Grounding becomes explicit. When the model references "the red button in the top-right corner," it is grounding language in visual space. The agent architecture must support this — passing coordinates, bounding boxes, or region annotations back to tools that act on the visual.
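The routing requirement can be illustrated with a minimal sketch. The `Observation` type, the `to_content_parts` function, and the content-part layout (dicts with `type`, `text`, and `data` keys) are assumptions made for illustration, not any particular provider's message format. The only point is that each tool result is tagged by modality before it reaches the model.

```python
import base64
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical tool result: a raw payload plus its MIME type."""
    payload: bytes
    mime_type: str
    source_tool: str


def to_content_parts(obs: Observation) -> list[dict]:
    """Tag an observation by modality and format it as message content parts."""
    major = obs.mime_type.split("/", 1)[0]
    header = {"type": "text", "text": f"[{obs.source_tool} returned {obs.mime_type}]"}
    if major == "text" or obs.mime_type == "application/json":
        body = {"type": "text", "text": obs.payload.decode("utf-8", errors="replace")}
    elif major in ("image", "audio"):
        body = {
            "type": major,
            "mime_type": obs.mime_type,
            "data": base64.b64encode(obs.payload).decode("ascii"),
        }
    else:
        # Unknown modality: describe it rather than pass opaque bytes to the model
        body = {"type": "text", "text": f"(unsupported media, {len(obs.payload)} bytes)"}
    return [header, body]


parts = to_content_parts(Observation(b'{"status": "ok"}', "application/json", "http_get"))
shot = to_content_parts(Observation(b"\x89PNG", "image/png", "browser"))
```

Keeping a short text header next to every media part gives the model a textual handle on where the media came from, which also helps when the media itself is later evicted from context.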
Vision in the Agent Loop
Vision is the most mature multimodal capability in current models. An agent with vision can interpret screenshots, read diagrams, parse charts, understand UI layouts, and extract information from photographs or scanned documents.
Patterns for Visual Input
There are three common patterns for how visual data enters the agent loop:
Direct observation. A tool returns an image as its output. For example, a browser tool returns a screenshot after navigating to a page. The model sees the screenshot and decides the next action (click a button, fill a form, scroll).
User-provided media. The user attaches an image to their request ("Here is a photo of the error message — what does it mean?"). The image is part of the initial task context.
Agent-initiated capture. The agent decides it needs visual information and calls a tool to obtain it. For example, "Take a screenshot of the current application state" or "Render this HTML and return the visual output."
import math
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class MediaType(Enum):
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


@dataclass
class MediaAttachment:
    media_type: MediaType
    content: bytes
    mime_type: str
    metadata: dict = field(default_factory=dict)

    def token_estimate(self) -> int:
        """Estimate context tokens this media will consume."""
        if self.media_type == MediaType.IMAGE:
            # Rough heuristic: the real cost depends on resolution and model
            width = self.metadata.get("width", 1024)
            height = self.metadata.get("height", 1024)
            # Many models tile images into patches. Round up so that partial
            # tiles, and images smaller than one tile, are still counted.
            tiles = math.ceil(width / 512) * math.ceil(height / 512)
            return max(tiles * 170, 85)  # minimum 85 tokens per image
        elif self.media_type == MediaType.AUDIO:
            duration_sec = self.metadata.get("duration_seconds", 0)
            return int(duration_sec * 25)  # ~25 tokens per second
        # Video is not estimated here; sample frames and cost them as images
        return 0


class MultimodalObservation:
    """An observation that may contain text, media, or both."""

    def __init__(self, text: str = "", media: Optional[list[MediaAttachment]] = None):
        self.text = text
        self.media = media or []

    def total_token_estimate(self) -> int:
        text_tokens = len(self.text) // 4  # rough char-to-token estimate
        media_tokens = sum(m.token_estimate() for m in self.media)
        return text_tokens + media_tokens
Visual Grounding and Coordinate Systems
When an agent sees a screenshot and needs to interact with it (clicking, typing, selecting), it must communicate spatial positions. This requires a grounding protocol — a shared coordinate system between the model's visual understanding and the tools that execute actions.
Two approaches dominate:
Pixel coordinates. The model outputs raw (x, y) coordinates referencing the image dimensions. Simple and direct, but brittle — coordinates shift if the resolution changes or the UI rescales.
Semantic references. The model outputs a description ("the Submit button" or "the third item in the list"), and a downstream tool resolves the description to coordinates using accessibility trees, DOM inspection, or visual element detection.
@dataclass
class VisualTarget:
    """A target location in a visual observation."""
    # Normalized coordinates (0.0 to 1.0)
    x: float
    y: float
    # Optional semantic label for debugging and logging
    label: str = ""
    # Confidence that this is the correct target
    confidence: float = 1.0


@dataclass
class BoundingBox:
    """A rectangular region in normalized coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str = ""


class VisualGroundingTool:
    """Resolves model references to actionable coordinates."""

    def __init__(self, vision_model):
        # Any client that exposes find() and segment() as used below
        self.vision_model = vision_model

    async def locate_element(
        self, screenshot: bytes, description: str
    ) -> VisualTarget | None:
        """Find an element in a screenshot by natural language description."""
        # Could use a dedicated vision model, OCR + layout analysis,
        # or accessibility tree matching
        result = await self.vision_model.find(
            image=screenshot,
            query=f"Find the exact center of: {description}",
        )
        if result.found:
            # Convert pixel coordinates to normalized coordinates
            return VisualTarget(
                x=result.x / result.image_width,
                y=result.y / result.image_height,
                label=description,
                confidence=result.confidence,
            )
        return None

    async def extract_regions(
        self, screenshot: bytes
    ) -> list[BoundingBox]:
        """Identify all interactive regions in a screenshot."""
        regions = await self.vision_model.segment(
            image=screenshot,
            prompt="Identify all clickable elements, input fields, and buttons",
        )
        # Assumes segment() already returns normalized (0.0 to 1.0) corners;
        # divide by the image dimensions here if it returns pixels
        return [
            BoundingBox(
                x_min=r.x1, y_min=r.y1,
                x_max=r.x2, y_max=r.y2,
                label=r.label,
            )
            for r in regions
        ]
The tradeoff: pixel coordinates are precise but model-dependent (the model must be good at spatial reasoning). Semantic references are more robust but add a resolution step that can fail. Production systems often combine both — the model outputs a semantic description and approximate coordinates, and the system uses whichever resolves more reliably.
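One way to combine the two approaches is to treat each as a candidate and keep the more confident one. The sketch below is an illustration under assumptions: the `Candidate` type, the `pick_target` and `to_pixels` helpers, and the 0.5 confidence floor are hypothetical, and a real system may need a more careful notion of "resolves more reliably" than a single score.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical click target in normalized (0.0 to 1.0) coordinates."""
    x: float
    y: float
    confidence: float
    source: str  # "semantic" or "coordinates"


def pick_target(
    semantic: Optional[Candidate],
    direct: Optional[Candidate],
    min_confidence: float = 0.5,
) -> Optional[Candidate]:
    """Return the more confident usable candidate, or None if neither qualifies."""
    usable = [c for c in (semantic, direct) if c and c.confidence >= min_confidence]
    if not usable:
        return None
    return max(usable, key=lambda c: c.confidence)


def to_pixels(target: Candidate, width: int, height: int) -> tuple[int, int]:
    """Convert normalized coordinates to pixels for the current viewport size."""
    x = min(width - 1, max(0, round(target.x * width)))
    y = min(height - 1, max(0, round(target.y * height)))
    return x, y


semantic = Candidate(x=0.90, y=0.10, confidence=0.8, source="semantic")
direct = Candidate(x=0.88, y=0.12, confidence=0.6, source="coordinates")
chosen = pick_target(semantic, direct)
click_at = to_pixels(chosen, 1920, 1080)
```

Because the target stays normalized until the last step, the same `chosen` value maps to a valid pixel position if the viewport is later resized.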
Audio and Speech in the Agent Loop
Audio adds two capabilities: understanding spoken input and generating spoken output. An agent might listen to a customer service call, transcribe and analyze it, then respond with synthesized speech. Or it might process voice commands in a hands-free workflow.
Speech-to-Agent Pipeline
The simplest architecture transcribes audio to text before feeding it to the agent — the model never sees raw audio. This works when the linguistic content is all that matters. But it loses information: tone, emotion, speaker identity, background sounds, and non-verbal cues.
A more capable architecture passes audio directly to a model that understands it natively, preserving the full signal:
class AudioProcessor:
    """Process audio observations for the agent loop."""

    # Assumes raw 16 kHz, 16-bit mono PCM: 16,000 samples x 2 bytes per second.
    # Read the real duration from the container header for other formats.
    BYTES_PER_SECOND = 32_000

    def __init__(self, transcription_model, audio_model):
        self.transcription_model = transcription_model
        self.audio_model = audio_model

    async def process_audio(
        self, audio: bytes, strategy: str = "native"
    ) -> MultimodalObservation:
        duration_seconds = len(audio) / self.BYTES_PER_SECOND

        if strategy == "transcribe_first":
            # Lose non-verbal info, but cheaper and works with text-only models
            transcript = await self.transcription_model.transcribe(audio)
            return MultimodalObservation(
                text=f"[Audio transcription]: {transcript.text}",
                media=[],
            )
        elif strategy == "native":
            # Pass raw audio to a model that understands it
            return MultimodalObservation(
                text="",
                media=[MediaAttachment(
                    media_type=MediaType.AUDIO,
                    content=audio,
                    mime_type="audio/wav",
                    metadata={"duration_seconds": duration_seconds},
                )],
            )
        elif strategy == "hybrid":
            # Transcribe for text reasoning, but keep audio for tone analysis
            transcript = await self.transcription_model.transcribe(audio)
            return MultimodalObservation(
                text=f"[Transcript]: {transcript.text}",
                media=[MediaAttachment(
                    media_type=MediaType.AUDIO,
                    content=audio,
                    mime_type="audio/wav",
                    metadata={
                        "duration_seconds": duration_seconds,
                        "speaker_count": transcript.speaker_count,
                    },
                )],
            )
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unknown audio strategy: {strategy!r}")
When to Use Native Audio vs. Transcription
Use transcription when:
- The agent only needs the words spoken, not how they were spoken.
- The downstream model does not support audio input.
- Cost is a concern — audio tokens are expensive relative to text tokens.
- The audio is clear, single-speaker, and in a well-supported language.
Use native audio when:
- Tone, emotion, or urgency matter (customer service escalation detection).
- Multiple speakers must be distinguished in real-time.
- Non-speech sounds carry meaning (machine diagnostics from audio, environmental monitoring).
- Transcription quality is poor for the domain (heavy accents, technical jargon, noisy environments).
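The two lists above can be folded into a small selection function. This is a sketch under stated assumptions: the `AudioContext` fields and the `choose_audio_strategy` function are hypothetical, and the precedence between criteria (for example, preferring "native" over "hybrid" when cost matters) is one reasonable reading of the lists, not a rule the lists themselves state.

```python
from dataclasses import dataclass


@dataclass
class AudioContext:
    """Hypothetical description of an audio input and the task around it."""
    model_supports_audio: bool
    needs_tone_or_emotion: bool = False
    needs_non_speech_sounds: bool = False
    needs_speaker_separation: bool = False
    transcription_reliable: bool = True
    cost_sensitive: bool = False


def choose_audio_strategy(ctx: AudioContext) -> str:
    """Return "transcribe_first", "native", or "hybrid" for this input."""
    if not ctx.model_supports_audio:
        return "transcribe_first"  # a text-only model leaves no alternative
    if not ctx.transcription_reliable:
        return "native"  # a poor transcript would only add noise
    needs_signal = (
        ctx.needs_tone_or_emotion
        or ctx.needs_non_speech_sounds
        or ctx.needs_speaker_separation
    )
    if needs_signal:
        # Keep the audio; add a transcript as well unless cost rules it out
        return "native" if ctx.cost_sensitive else "hybrid"
    return "transcribe_first"


escalation_check = choose_audio_strategy(
    AudioContext(model_supports_audio=True, needs_tone_or_emotion=True)
)
```

The returned strings match the `strategy` values used by the audio processing code earlier in this section, so the result can be passed straight through.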
Video as an Agent Input
Video is the most expensive modality. A single minute of video at 30 fps produces 1,800 frames. Feeding all of them to a model is impractical from both a cost and a context-window perspective. Agents that process video must be aggressive about frame selection.
Frame Sampling Strategies
class VideoFrameSampler:
    """Extract representative frames from video for agent consumption.

    The _get_duration, _extract_frame, _detect_scenes, and _score_relevance
    helpers are left to the implementation.
    """

    async def sample_uniform(
        self, video_path: str, num_frames: int = 8
    ) -> list[bytes]:
        """Sample frames at uniform intervals."""
        duration = await self._get_duration(video_path)
        timestamps = [duration * i / num_frames for i in range(num_frames)]
        return [await self._extract_frame(video_path, t) for t in timestamps]

    async def sample_keyframes(self, video_path: str) -> list[bytes]:
        """Extract only keyframes (scene changes)."""
        # Use scene detection to find meaningful transitions
        scenes = await self._detect_scenes(video_path)
        return [await self._extract_frame(video_path, s.start) for s in scenes]

    async def sample_adaptive(
        self, video_path: str, query: str, max_frames: int = 12
    ) -> list[bytes]:
        """Sample frames relevant to a specific query."""
        # First pass: coarse uniform sample (ideally decoded at low resolution)
        coarse_count = 30
        coarse_frames = await self.sample_uniform(video_path, num_frames=coarse_count)
        # Score each frame's relevance to the query
        scores = await self._score_relevance(coarse_frames, query)
        # Select the top-K most relevant frames by score
        top_indices = sorted(
            range(len(scores)), key=lambda i: scores[i], reverse=True
        )[:max_frames]
        # Restore chronological order so the model sees frames in sequence
        top_indices.sort()
        # Re-extract at full resolution
        duration = await self._get_duration(video_path)
        timestamps = [duration * i / coarse_count for i in top_indices]
        return [await self._extract_frame(video_path, t) for t in timestamps]
The tradeoffs:
- Uniform sampling is simple and unbiased but wastes budget on redundant frames (long static shots).
- Keyframe extraction captures transitions but misses important moments within a scene.
- Adaptive sampling is the most efficient but requires a scoring pass — an additional model call that adds latency and cost.
For most agent use cases, a hybrid works best: extract keyframes to understand the video's structure, then do targeted full-resolution extraction around moments the agent determines are relevant.
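The second half of that hybrid, targeted extraction around moments of interest, reduces to picking timestamps. The sketch below assumes a keyframe pass has already happened and that the agent has named the moments it cares about, in priority order. The function name `targeted_timestamps` and its `window`, `per_moment`, and `max_frames` parameters are hypothetical.

```python
def targeted_timestamps(
    moments: list[float],
    duration: float,
    window: float = 2.0,
    per_moment: int = 3,
    max_frames: int = 12,
) -> list[float]:
    """Spread timestamps across a window centred on each moment of interest."""
    if per_moment < 1:
        raise ValueError("per_moment must be at least 1")
    # Fractions of the window to sample: just the middle, or evenly spaced ends included
    fractions = (
        [0.5] if per_moment == 1
        else [k / (per_moment - 1) for k in range(per_moment)]
    )
    stamps: list[float] = []
    for moment in moments:  # assumed to be in priority order
        start = max(0.0, moment - window / 2)
        end = min(duration, moment + window / 2)
        for frac in fractions:
            t = round(start + frac * (end - start), 3)
            # Skip duplicates and stop adding once the frame budget is spent
            if t not in stamps and len(stamps) < max_frames:
                stamps.append(t)
    # Return in chronological order for extraction
    return sorted(stamps)


stamps = targeted_timestamps([34.0, 5.0], duration=60.0)
tight = targeted_timestamps([34.0, 5.0], duration=60.0, max_frames=4)
```

Processing moments in priority order before sorting means that when the frame budget runs out, it is the lower-priority moments that lose frames, not simply the ones that happen to occur later in the video.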
Tool Design for Rich Media
Tools in a multimodal agent must handle media as both input and output. This changes tool schema design in important ways.
Media-Aware Tool Schemas
A traditional tool schema describes text parameters. A multimodal tool must also declare what media it accepts and produces:
@dataclass
class MultimodalToolSchema:
    name: str
    description: str
    text_parameters: dict  # Standard JSON schema for text args
    # Use default_factory so the fields are always lists, never None
    media_inputs: list[dict] = field(default_factory=list)   # What media the tool accepts
    media_outputs: list[dict] = field(default_factory=list)  # What media the tool produces
    token_cost_estimate: str = ""  # Help the agent budget context


# Example: a chart analysis tool
chart_analyzer = MultimodalToolSchema(
    name="analyze_chart",
    description="Extract data points and trends from a chart image",
    text_parameters={
        "question": {
            "type": "string",
            "description": "What to analyze in the chart",
        }
    },
    media_inputs=[{
        "name": "chart_image",
        "type": "image",
        "formats": ["png", "jpg", "svg"],
        "max_resolution": "2048x2048",
        "required": True,
    }],
    media_outputs=[{
        "name": "annotated_chart",
        "type": "image",
        "description": "Chart with highlighted regions and annotations",
    }],
    token_cost_estimate="~800 tokens for input image, ~200 for output",
)
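A declaration like this is only useful if something checks media against it before the tool runs. The following sketch shows one possible check. The function `validate_media_input` is hypothetical, and it assumes the declaration uses the same `type`, `formats`, and `max_resolution` keys as the chart example, with `max_resolution` written as "WIDTHxHEIGHT".

```python
def validate_media_input(
    spec: dict, mime_type: str, width: int, height: int
) -> list[str]:
    """Return a list of problems; an empty list means the media matches the spec."""
    problems = []
    major, _, subtype = mime_type.lower().partition("/")
    fmt = subtype.split("+", 1)[0]       # "svg+xml" -> "svg"
    fmt = {"jpeg": "jpg"}.get(fmt, fmt)  # the spec lists "jpg", MIME says "jpeg"

    if major != spec["type"]:
        problems.append(f"expected {spec['type']}, got {major}")
    formats = spec.get("formats")
    if formats and fmt not in formats:
        problems.append(f"format {fmt!r} not in {formats}")
    max_res = spec.get("max_resolution")
    if max_res:
        max_w, max_h = (int(v) for v in max_res.split("x"))
        if width > max_w or height > max_h:
            problems.append(f"{width}x{height} exceeds {max_res}")
    return problems


chart_spec = {
    "name": "chart_image",
    "type": "image",
    "formats": ["png", "jpg", "svg"],
    "max_resolution": "2048x2048",
    "required": True,
}
ok = validate_media_input(chart_spec, "image/jpeg", 1024, 768)
too_big = validate_media_input(chart_spec, "image/png", 4096, 100)
```

Returning every problem at once, instead of raising on the first, lets the orchestrator decide whether a fix such as downscaling or format conversion can make the media acceptable.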
The Media Pipeline Pattern
Complex multimodal workflows often require a media pipeline — a sequence of transformations that prepare media for the model or convert model output into actionable media:
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Ingest  │───►│ Transform│───►│  Model   │───►│  Output  │
│          │    │          │    │          │    │          │
│ raw file │    │ resize,  │    │ reason   │    │ annotate │
│ URL      │    │ crop,    │    │ about    │    │ generate │
│ stream   │    │ denoise  │    │ content  │    │ media    │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
class MediaPipeline:
    """Transform media for optimal agent consumption."""

    def __init__(self, config: dict):
        self.max_image_tokens = config.get("max_image_tokens", 1600)
        self.max_audio_seconds = config.get("max_audio_seconds", 120)

    async def prepare_image(self, image: bytes, mime_type: str) -> bytes:
        """Resize and optimize an image for model consumption."""
        width, height = self._get_dimensions(image)
        # Downscale if token cost is too high
        estimated_tokens = self._estimate_image_tokens(width, height)
        if estimated_tokens > self.max_image_tokens:
            # Token cost scales with area, so scale each side by the square root
            scale_factor = (self.max_image_tokens / estimated_tokens) ** 0.5
            new_width = int(width * scale_factor)
            new_height = int(height * scale_factor)
            image = await self._resize(image, new_width, new_height)
        return image

    async def prepare_audio(self, audio: bytes, mime_type: str) -> bytes:
        """Trim and normalize audio for model consumption."""
        duration = self._get_audio_duration(audio)
        if duration > self.max_audio_seconds:
            # Intelligent truncation: keep beginning and end
            head = await self._trim(audio, 0, self.max_audio_seconds * 0.7)
            tail = await self._trim(
                audio, duration - self.max_audio_seconds * 0.3, duration
            )
            audio = await self._concatenate([head, tail])
        return await self._normalize_volume(audio)
Cross-Modal Grounding
The most powerful capability of multimodal agents is cross-modal grounding — connecting information across modalities to build a unified understanding. An agent that sees a chart, reads the associated text, and hears a verbal explanation can triangulate meaning in ways that single-modality processing cannot.
Grounding Strategies
Visual-to-text grounding. The agent sees an image and produces structured text that downstream reasoning can use. This is the most common pattern — converting visual information into a format the text-reasoning pipeline can manipulate.
import json


class CrossModalGrounder:
    """Ground information across modalities into unified representations."""

    async def ground_visual_to_structured(
        self, image: bytes, context: str, model
    ) -> dict:
        """Extract structured data from a visual observation."""
        response = await model.generate(
            messages=[
                {"role": "system", "content": (
                    "Extract structured information from this image. "
                    "Return valid JSON with the relevant data points."
                )},
                {"role": "user", "content": [
                    {"type": "text", "text": context},
                    {"type": "image", "data": image},
                ]},
            ],
            response_format={"type": "json_object"},
        )
        # Raises json.JSONDecodeError if the model did not return valid JSON
        return json.loads(response.content)

    async def ground_audio_with_visual(
        self, audio: bytes, screenshot: bytes, model
    ) -> dict:
        """Combine audio and visual context for richer understanding."""
        response = await model.generate(
            messages=[
                {"role": "system", "content": (
                    "You are observing both audio and a visual scene. "
                    "Describe what is happening by combining both sources. "
                    "Note any contradictions between what is said and shown."
                )},
                {"role": "user", "content": [
                    {"type": "audio", "data": audio},
                    {"type": "image", "data": screenshot},
                    {"type": "text", "text": "What is the current situation?"},
                ]},
            ],
        )
        return {"grounded_description": response.content}
Text-to-visual grounding. The agent has a textual description and needs to locate it in a visual scene — "find the error message in this screenshot" or "locate the anomalous region in this medical scan."
Temporal grounding. In video or audio streams, the agent must connect events across time — "the sound that occurred at 0:34 corresponds to the visual flash at 0:35." This requires maintaining a timeline and correlating events across modalities.
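Temporal grounding can be sketched as a nearest-neighbour match on a shared timeline. The `Event` type, the `correlate` function, and the 1.5-second default tolerance below are illustrative assumptions. A real system would also need both streams to share a clock, which this sketch takes for granted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical timestamped event detected in one modality."""
    modality: str  # "audio" or "video"
    t: float       # seconds from the start of the stream
    label: str


def correlate(
    audio: list[Event], video: list[Event], tolerance: float = 1.5
) -> list[tuple[Event, Event]]:
    """Pair each audio event with the nearest video event within tolerance seconds."""
    pairs = []
    for a in audio:
        nearby = [v for v in video if abs(v.t - a.t) <= tolerance]
        if nearby:
            pairs.append((a, min(nearby, key=lambda v: abs(v.t - a.t))))
    return pairs


audio_events = [Event("audio", 34.0, "bang"), Event("audio", 50.0, "alarm")]
video_events = [Event("video", 35.0, "flash"), Event("video", 12.0, "door opens")]
pairs = correlate(audio_events, video_events)
```

Here the sound at 0:34 is paired with the flash at 0:35, while the alarm at 0:50 stays unpaired because no visual event falls within the tolerance.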
Context Budget Management for Multimodal Agents
Media is expensive in token terms. A single high-resolution image might cost 1,600 tokens. A 30-second audio clip might cost 750 tokens. At that rate, a 20-page scanned document would cost about 32,000 tokens: a quarter of a 128,000-token window, and the entire window of a smaller model, before any text or tool output is counted.
Effective multimodal agents implement media budget policies:
class MultimodalContextBudget:
    """Manage context allocation across modalities."""

    def __init__(self, total_budget: int = 128_000):
        self.total_budget = total_budget
        self.allocations = {
            "system_prompt": int(total_budget * 0.05),
            "text_history": int(total_budget * 0.30),
            "media": int(total_budget * 0.40),
            "tool_results": int(total_budget * 0.15),
            "generation": int(total_budget * 0.10),
        }
        self.used = {k: 0 for k in self.allocations}

    def can_add_media(self, token_cost: int) -> bool:
        return self.used["media"] + token_cost <= self.allocations["media"]

    def suggest_resolution(self, original_tokens: int) -> dict:
        """Suggest how to fit media within budget."""
        available = self.allocations["media"] - self.used["media"]
        if original_tokens <= available:
            return {"action": "include_full", "tokens": original_tokens}
        # Try downscaling
        scale = available / original_tokens
        if scale >= 0.25:
            return {
                "action": "downscale",
                # Tokens scale with area, so each side shrinks by sqrt(scale)
                "scale_factor": scale ** 0.5,
                "tokens": available,
            }
        # Too large even with downscaling, so extract text instead
        return {
            "action": "extract_text_only",
            "reason": f"Media ({original_tokens} tokens) exceeds budget ({available} available)",
        }

    def evict_oldest_media(self, needed_tokens: int) -> list[str]:
        """Remove oldest media from context to make room."""
        # Implementation depends on context management strategy
        raise NotImplementedError
The decision hierarchy for media inclusion:
- Is the media essential for the current step? If not, defer or skip it.
- Can the agent extract what it needs via text? OCR, transcription, or structured extraction may be cheaper than raw media inclusion.
- Can the media be downscaled without losing critical information? A 4K screenshot usually works fine at 1024px for UI interpretation.
- Should older media be evicted? The screenshot from step 3 may no longer be relevant at step 15.
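The last question in that hierarchy, eviction, can be sketched as an oldest-first policy. The `MediaLedger` class below is a hypothetical stand-in for whatever context manager the agent uses. It assumes each piece of media has an id and a known token cost, and that some items can be pinned so they are never evicted.

```python
from collections import OrderedDict


class MediaLedger:
    """Hypothetical tracker of media currently in context, oldest first."""

    def __init__(self, budget: int):
        self.budget = budget
        self.items: OrderedDict[str, int] = OrderedDict()  # media_id -> tokens

    @property
    def used(self) -> int:
        return sum(self.items.values())

    def add(
        self, media_id: str, tokens: int, pinned: frozenset = frozenset()
    ) -> list[str]:
        """Add media, evicting the oldest unpinned items if needed.

        Returns the ids that were evicted. Nothing is changed if the new
        media cannot be made to fit.
        """
        needed = self.used + tokens - self.budget
        to_evict = []
        for old_id, cost in self.items.items():  # insertion order = oldest first
            if needed <= 0:
                break
            if old_id not in pinned:
                to_evict.append(old_id)
                needed -= cost
        if needed > 0:
            raise ValueError("cannot fit media even after evicting all unpinned items")
        for old_id in to_evict:
            del self.items[old_id]
        self.items[media_id] = tokens
        return to_evict


ledger = MediaLedger(budget=3000)
ledger.add("screenshot_step_3", 1600)
ledger.add("screenshot_step_9", 1000)
evicted = ledger.add("screenshot_step_15", 1200)
```

Planning the evictions before applying them keeps the ledger consistent when the new media cannot fit, at which point the caller can fall back to downscaling or text extraction.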
Multimodal Tool Composition
Real-world multimodal tasks often chain tools that operate on different modalities. A document analysis agent might: photograph a whiteboard (vision tool), transcribe handwritten text (OCR tool), interpret the diagram structure (vision + reasoning), and generate a formatted document (text generation).
class MultimodalToolChain:
    """Compose tools across modalities for complex tasks."""

    def __init__(self, tools: dict):
        self.tools = tools

    async def analyze_document(self, document_image: bytes) -> dict:
        """Multi-step document analysis combining modalities."""
        # Step 1: Detect document regions
        regions = await self.tools["layout_detector"].run(
            image=document_image
        )
        results = {"text_blocks": [], "charts": [], "tables": []}
        # Step 2: Crop each region and route it to the matching tool
        for region in regions:
            cropped = await self.tools["crop"].run(
                image=document_image,
                bbox=region.bbox,
            )
            if region.type == "text":
                # OCR for text regions
                text = await self.tools["ocr"].run(image=cropped)
                results["text_blocks"].append(text)
            elif region.type == "chart":
                # Vision model for chart interpretation
                analysis = await self.tools["chart_analyzer"].run(
                    image=cropped,
                    question="Extract all data points and the trend",
                )
                results["charts"].append(analysis)
            elif region.type == "table":
                # Specialized table extraction
                table_data = await self.tools["table_extractor"].run(
                    image=cropped
                )
                results["tables"].append(table_data)
        return results
Trade-Offs in Multimodal Agent Design
Building multimodal agents involves several design tensions:
Fidelity vs. cost. Higher-resolution media provides more information but consumes more tokens. The agent must balance "seeing better" against running out of context or exceeding budget. There is no universal right answer — a medical imaging agent needs high fidelity; a UI navigation agent can work with lower resolution.
Native vs. extracted. Passing raw media to the model preserves all information but is expensive and may be wasted on simple cases. Extracting text or structure first is cheaper but lossy. The best approach is adaptive: try extraction first, and fall back to native media only when extraction quality is insufficient.
Latency vs. richness. Processing video frames or long audio adds significant latency. Real-time agents (voice assistants, live monitoring) must trade off response speed against the amount of media they can process per turn.
Generalist vs. specialist. A general-purpose multimodal model can handle many media types but may underperform on specific tasks. A pipeline of specialist models (dedicated OCR, dedicated audio transcription, dedicated chart reader) is more accurate but harder to orchestrate and more expensive to maintain.
| Approach | Strengths | Weaknesses |
|---|---|---|
| Native multimodal | Simple pipeline, preserves all info | Expensive, model-dependent quality |
| Extract-then-reason | Cheaper, works with text models | Lossy, extraction errors compound |
| Hybrid (adaptive) | Best quality/cost ratio | Complex routing logic |
| Specialist pipeline | Highest accuracy per task | Orchestration overhead, multiple models |
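The adaptive approach described under "Native vs. extracted" can be sketched as a simple gate: run the cheap extractor, and only send the raw media when the result looks weak. The `Extraction` type, the `prepare_input` function, and both thresholds are assumptions for illustration. Real extractors report confidence in different ways, if at all.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Extraction:
    """Hypothetical result of a cheap text extractor such as OCR."""
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the extractor


def prepare_input(
    image: bytes,
    extract: Callable[[bytes], Extraction],
    min_confidence: float = 0.8,
    min_chars: int = 20,
) -> dict:
    """Try extraction first; fall back to the raw image if the result looks weak."""
    result = extract(image)
    if result.confidence >= min_confidence and len(result.text.strip()) >= min_chars:
        return {"mode": "extracted", "text": result.text}
    # Keep the partial text as a hint alongside the native image
    return {"mode": "native", "image": image, "hint": result.text}


def clean_ocr(_: bytes) -> Extraction:
    return Extraction("Error 502: upstream gateway timed out", 0.95)


def noisy_ocr(_: bytes) -> Extraction:
    return Extraction("Err0r 5O2 ups", 0.41)


cheap = prepare_input(b"fake-image-bytes", clean_ocr)
fallback = prepare_input(b"fake-image-bytes", noisy_ocr)
```

The length check guards against an extractor that is confident about very little text, which often means the image carried its information visually (a chart or a layout) and not in words.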
Conclusion
Multimodal agents extend the reasoning-and-action loop beyond text into the full richness of human communication. The core architectural principles are: treat media as first-class observations with explicit token budgets, design tools that declare their media inputs and outputs, implement cross-modal grounding to unify information across modalities, and manage context aggressively because media is expensive.
The key design decisions are when to process media natively versus extracting structured text, how to sample video and audio without losing critical moments, and how to budget limited context across competing modalities. Getting these tradeoffs right determines whether a multimodal agent is genuinely more capable than its text-only counterpart — or simply more expensive without proportional benefit.