Tools & Function Calling
We covered the basics earlier: a tool is anything the runtime can call on the model's behalf, and the model picks which tool to use based on its name, description, and parameter schema. That is enough to understand the architecture, but it is not enough to build a reliable agent. The moment you give an agent real tools — tools that query databases, move money, send emails, or delete records — you run into a set of problems that toy examples never show you.
How do you design tool schemas so the model picks the right tool? How do you scope permissions so a misbehaving agent cannot do more damage than necessary? How do you make write operations safe when the same tool call might execute twice? These are the engineering decisions that separate a demo from a production agent.
How the Model Selects Tools
The model does not know what a tool does. It reads the tool's name, description, and parameter schema — the same way you read an API's documentation — and decides whether it fits the current step. This means tool selection is a language problem, not a code problem. The model is pattern-matching on text.
Three things influence selection: the name, the description, and the parameter schema. Get any of them wrong and the model will pick the wrong tool, hallucinate parameters, or skip a tool that would have been perfect.
Names matter more than you think. A tool called query tells the model almost nothing. A tool called search_customer_orders tells it exactly when to reach for it. If you have two tools with similar names — get_user and get_user_profile — the model will guess between them. Rename one to get_user_auth_status and the ambiguity disappears.
Descriptions are steering instructions. The description is not documentation for humans — it is a prompt for the model. "Returns user data" is vague. "Returns the user's name, email, and account status given a user ID. Use this when the agent needs to verify account details." is specific enough to steer selection. Including when to use a tool in the description is one of the highest-leverage things you can do.
Parameter schemas constrain hallucination. If a parameter is a string with no further description, the model will fill it with whatever seems plausible. If you add "enum": ["open", "closed", "pending"], the model is steered toward the listed values. If you add "pattern": "^[A-Z]{3}$", it is steered toward a three-letter code instead of a full name. These constraints are only guaranteed when the provider enforces the schema during decoding or your runtime validates the call, but even as hints, tight schemas produce fewer invalid calls.
Here is an example that puts all three together:
{
  "name": "search_customer_orders",
  "description": "Search for orders placed by a customer. Use when the user asks about order history, order status, or recent purchases. Returns up to 20 orders sorted by date.",
  "parameters": {
    "type": "object",
    "properties": {
      "customer_id": {
        "type": "string",
        "description": "The unique customer identifier (UUID format)"
      },
      "status": {
        "type": "string",
        "enum": ["open", "shipped", "delivered", "cancelled"],
        "description": "Filter by order status. Omit to return all statuses."
      },
      "limit": {
        "type": "integer",
        "minimum": 1,
        "maximum": 20,
        "description": "Max number of orders to return. Defaults to 10."
      }
    },
    "required": ["customer_id"],
    "additionalProperties": false
  }
}
Setting additionalProperties: false is a small detail with big consequences. Without it, the schema accepts extra fields the model invents, such as "sort_by": "price" or "include_deleted": true, and your tool handler will either ignore them or choke on them. Closing the schema lets validation reject that whole class of errors before the call reaches the handler.
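To make the effect of a closed schema concrete, here is a minimal sketch of a checker for the small subset of JSON Schema used in the example above (type, enum, minimum, maximum, required, additionalProperties). The function name check_arguments and the constant ORDER_SEARCH_SCHEMA are illustrative, not part of any library; a real system would more likely use a complete JSON Schema validator.

```python
# Minimal checker for a small subset of JSON Schema. Illustrative only:
# it ignores "pattern", "format", nested objects, and much else.
ORDER_SEARCH_SCHEMA = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string"},
        "status": {"type": "string", "enum": ["open", "shipped", "delivered", "cancelled"]},
        "limit": {"type": "integer", "minimum": 1, "maximum": 20},
    },
    "required": ["customer_id"],
    "additionalProperties": False,
}

_TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict}


def check_arguments(args: dict, schema: dict) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    props = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in args:
            errors.append(f"missing required field '{field}'")
    for field, value in args.items():
        if field not in props:
            # Only a closed schema rejects fields the model invented.
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected field '{field}'")
            continue
        rule = props[field]
        expected = _TYPES.get(rule.get("type"))
        # bool is a subclass of int in Python, so exclude it explicitly.
        wrong_type = expected is not None and (
            not isinstance(value, expected)
            or (expected is int and isinstance(value, bool))
        )
        if wrong_type:
            errors.append(f"'{field}' must be of type {rule['type']}")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"'{field}' must be one of {rule['enum']}")
        if "minimum" in rule and value < rule["minimum"]:
            errors.append(f"'{field}' must be >= {rule['minimum']}")
        if "maximum" in rule and value > rule["maximum"]:
            errors.append(f"'{field}' must be <= {rule['maximum']}")
    return errors
```

With the schema closed, a call that adds "sort_by": "price" comes back with a single "unexpected field" error. If the additionalProperties line is removed, the same call passes the check and the invented field reaches the handler.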
Validation at the Boundary
The model's output is text. Even when that text is structured JSON, it is still generated by a probabilistic system that can produce invalid arguments, wrong types, or missing required fields. Validation is the last line of defense before a tool call becomes an action in the real world.
A practical validation layer does three things. First, it checks that the tool name exists in the registry — a hallucinated tool name should not trigger any execution path. Second, it validates the arguments against the JSON schema, rejecting calls with wrong types, missing required fields, or values outside allowed ranges. Third, it applies runtime constraints that the schema cannot express, like "this user is not allowed to call the billing API."
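class ToolNotFoundError(Exception):
    """The model named a tool that is not in the registry."""

class ValidationError(Exception):
    """The arguments do not satisfy the tool's JSON schema."""

def validate_tool_call(name: str, arguments: dict, tool_registry: dict, user_context: dict) -> None:
    # 1. A hallucinated tool name must not reach any execution path.
    if name not in tool_registry:
        raise ToolNotFoundError(f"Unknown tool: {name}")
    # 2. Schema check. validate_json_schema is an assumed helper that returns
    #    a list of error messages (empty when the arguments are valid).
    schema = tool_registry[name]["parameters"]
    errors = validate_json_schema(arguments, schema)
    if errors:
        raise ValidationError(f"Invalid arguments for {name}: {errors}")
    # 3. Runtime constraints the schema cannot express, such as permissions.
    permissions = tool_registry[name].get("required_permissions", [])
    for perm in permissions:
        if perm not in user_context.get("permissions", []):
            raise PermissionError(f"Missing permission {perm} for tool {name}")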
When validation fails, the right response is to feed the error back to the model as an observation — the same way you handle tool failures. The model can often fix its own mistake on the next turn if you tell it what went wrong. What you should never do is silently coerce the arguments into something valid. Coercion hides the fact that the model is confused, and that confusion will show up later in worse ways.
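As a rough sketch of feeding the error back, the function below packages validation errors as a tool-result message for the next model turn. The message shape (role, tool_call_id, content) and the function name are assumptions made for illustration; adapt them to whatever your model API actually expects.

```python
import json


def validation_error_observation(call_id: str, tool_name: str, errors: list[str]) -> dict:
    """Package validation errors as an observation the model can read.

    The arguments are reported exactly as rejected; nothing is coerced.
    The message layout is hypothetical and not tied to a specific API.
    """
    return {
        "role": "tool",
        "tool_call_id": call_id,
        "content": json.dumps({
            "status": "error",
            "tool": tool_name,
            "errors": errors,
            "hint": "Fix the arguments and call the tool again.",
        }),
    }


message = validation_error_observation(
    "call_1", "search_customer_orders", ["unexpected field 'sort_by'"]
)
```

The important design choice is that the error text is specific. "unexpected field 'sort_by'" gives the model something to correct, whereas a generic "invalid call" usually produces another guess.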
Read Tools vs. Write Tools
Not all tools carry the same risk. A tool that searches a database is safe to call repeatedly — the worst case is wasted latency. A tool that sends an email, creates a record, or transfers money can cause real damage if called incorrectly or called twice.
This distinction between read and write tools should be explicit in your runtime, not just a convention. One way to implement it is with a classification on the tool definition itself:
{
  "name": "send_invoice_email",
  "description": "Send an invoice email to the customer. Use after an order is confirmed and payment is captured.",
  "effect": "write",
  "requires_confirmation": true,
  "parameters": {
    "type": "object",
    "properties": {
      "order_id": { "type": "string", "description": "The order to invoice" },
      "recipient_email": { "type": "string", "format": "email" }
    },
    "required": ["order_id", "recipient_email"],
    "additionalProperties": false
  }
}
The runtime uses these annotations to apply different guardrails. Read tools run immediately. Write tools can require an extra confirmation step — either a human-in-the-loop approval, or a second model call that reviews the action before committing. The confirmation cost is small compared to the cost of an accidental email blast or a double charge.
┌──────────────┐
│  Model says  │
│ "call tool"  │
└──────┬───────┘
       │
       ▼
┌──────────────┐      ┌─────────────┐
│   Validate   │────▶ │ Read tool?  │──── yes ──▶ Execute immediately
│  arguments   │      └──────┬──────┘
└──────────────┘             │
                             no
                             │
                             ▼
                     ┌───────────────┐
                     │ Confirmation  │──── denied ──▶ Return denial
                     │     gate      │                to model
                     └───────┬───────┘
                             │
                          approved
                             │
                             ▼
                        Execute tool
A practical detail: the confirmation gate does not always mean asking a human. For lower-stakes writes, you can use a programmatic check — "is the amount below $100?" or "is this a sandbox environment?" — and only escalate to human approval for actions above a threshold.
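The gate described above could be sketched as follows. The policy is hypothetical: the "amount" argument, the "sandbox" environment name, the 100.0 default limit, and both function names are assumptions chosen to mirror the examples in the text, and the "effect" and "requires_confirmation" fields are the annotations from the tool definition shown earlier.

```python
from typing import Callable


def requires_human_approval(tool: dict, args: dict, environment: str,
                            amount_limit: float = 100.0) -> bool:
    """Decide whether a call must be escalated to a person.

    Hypothetical policy: reads never escalate, sandbox writes never escalate,
    and other writes escalate unless they carry an amount below the limit.
    Tools without an "effect" annotation are treated as writes (fail closed).
    """
    if tool.get("effect", "write") == "read":
        return False
    if not tool.get("requires_confirmation", True):
        return False
    if environment == "sandbox":
        return False
    amount = args.get("amount")
    if isinstance(amount, (int, float)) and amount < amount_limit:
        return False
    return True


def execute_with_gate(tool: dict, args: dict, environment: str,
                      ask_human: Callable[[dict, dict], bool]) -> dict:
    """Run the handler, or return a denial the model can read as an observation."""
    if requires_human_approval(tool, args, environment) and not ask_human(tool, args):
        return {"status": "denied", "reason": "approval was not granted"}
    return {"status": "ok", "result": tool["handler"](**args)}
```

Passing ask_human as a callable keeps the gate independent of how approval is collected, whether that is a chat prompt, a ticket queue, or a second model call that reviews the action.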
Least Privilege and Permission Scoping
An agent should only have access to the tools it needs for its current task. This is the principle of least privilege, and it matters more for agents than for traditional software because the model decides at runtime which tools to call. You cannot predict at design time exactly which tools the agent will use, which makes a broad tool set riskier.
There are three levels where you can scope permissions.
Tool-set scoping means the agent only sees a subset of tools relevant to its task. A customer service agent gets search_orders, get_refund_status, and create_support_ticket. It does not see delete_account or modify_billing, even if those tools exist in the system. The model cannot call a tool it does not know about.
Per-tool permissions mean a tool checks the caller's identity and role before executing. Even if the agent knows about create_refund, the tool verifies that the current user session has refund authority. This is the same authorization layer you would build for a human-facing API — the agent does not get a free pass.
Argument-level constraints mean the tool's schema or validation logic restricts what values are allowed. A support agent can search orders for the customer in the current conversation but cannot pass an arbitrary customer ID to browse other accounts. The runtime injects the scoped customer ID rather than trusting the model to provide it.
def run_tool_scoped(name: str, model_args: dict, session: dict, tool_registry: dict):
    tool = tool_registry[name]
    # Inject scoped values the model should not control
    scoped_args = {**model_args}
    for field, source in tool.get("injected_fields", {}).items():
        scoped_args[field] = session[source]
    return tool["handler"](**scoped_args)
The injected_fields pattern is important. When a field like customer_id comes from the session rather than the model, you eliminate an entire class of attacks where a prompt injection tricks the model into operating on the wrong account.
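The first level, tool-set scoping, and a natural companion to injected fields can be sketched in a few lines. Both helper names are hypothetical. The second helper assumes that a field supplied by the runtime should also be hidden from the schema the model sees, so the model is never asked to provide it.

```python
def visible_tools(tool_registry: dict, allowed_names: set[str]) -> dict:
    """Tool-set scoping: return only the tools this agent is allowed to see."""
    return {name: tool for name, tool in tool_registry.items() if name in allowed_names}


def model_facing_schema(tool: dict) -> dict:
    """Return the parameter schema with runtime-injected fields removed."""
    injected = set(tool.get("injected_fields", {}))
    params = tool["parameters"]
    return {
        **params,
        "properties": {k: v for k, v in params["properties"].items() if k not in injected},
        "required": [k for k in params.get("required", []) if k not in injected],
    }
```

If the model-facing schema is also closed with additionalProperties: false, a model that tries to pass customer_id anyway fails validation instead of silently being overridden, which makes injection attempts visible in your logs.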
Idempotency
In any system with retries, network failures, or orchestration loops, the same tool call can execute more than once. If the tool is a read — a search, a lookup — this is harmless. If the tool creates an order or sends a payment, executing twice means a duplicate order or a double payment.
Idempotency means that calling a tool with the same arguments multiple times produces the same result as calling it once. The standard mechanism is an idempotency key — a unique identifier that the caller generates and the tool uses to detect duplicates.
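def create_order(customer_id: str, items: list, idempotency_key: str):
    # `db` is a placeholder for your database client.
    existing = db.query("SELECT * FROM orders WHERE idempotency_key = %s", idempotency_key)
    if existing:
        return existing  # already processed, return the original result
    # Check-then-insert can race under concurrent retries; a UNIQUE constraint
    # on idempotency_key makes the database reject the second insert.
    order = db.insert("INSERT INTO orders ...", customer_id, items, idempotency_key)
    return order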
The orchestration layer generates the key, not the model. This is critical: if you ask the model to generate an idempotency key, it will likely produce a different one on each retry, defeating the purpose. The runtime should derive the key from the step identity: a hash of the task ID, the step number, the tool name, and the tool arguments.
import hashlib
import json

def make_idempotency_key(task_id: str, step: int, tool_name: str, args: dict) -> str:
    payload = json.dumps({"task_id": task_id, "step": step, "tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
This way, if the orchestration loop retries step 3 with the same arguments, it produces the same key, and the tool recognizes the duplicate.
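A small end-to-end sketch shows the retry behaviour. InMemoryOrderStore is a hypothetical stand-in for a database table keyed by idempotency key, and make_key repeats the derivation from make_idempotency_key above so the example runs on its own.

```python
import hashlib
import json


def make_key(task_id: str, step: int, tool_name: str, args: dict) -> str:
    """Same derivation as make_idempotency_key: hash the step identity and arguments."""
    payload = json.dumps(
        {"task_id": task_id, "step": step, "tool": tool_name, "args": args},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


class InMemoryOrderStore:
    """Hypothetical stand-in for an orders table with a unique idempotency_key."""

    def __init__(self) -> None:
        self._orders_by_key: dict[str, dict] = {}

    def create_order(self, customer_id: str, items: list, idempotency_key: str) -> dict:
        existing = self._orders_by_key.get(idempotency_key)
        if existing is not None:
            return existing  # duplicate call: return the original order
        order = {
            "order_id": len(self._orders_by_key) + 1,
            "customer_id": customer_id,
            "items": items,
        }
        self._orders_by_key[idempotency_key] = order
        return order

    def count(self) -> int:
        return len(self._orders_by_key)


store = InMemoryOrderStore()
args = {"customer_id": "c-42", "items": ["sku-1"]}

# First attempt at step 3, then a retry of the same step with the same arguments.
first = store.create_order(**args, idempotency_key=make_key("task-7", 3, "create_order", args))
retry = store.create_order(**args, idempotency_key=make_key("task-7", 3, "create_order", args))
```

Because sort_keys=True canonicalizes the JSON, the key does not depend on the order in which the model happened to emit the arguments. A different step number or different arguments yields a different key, so a genuinely new order is still created.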
Designing a Tool Set
The number and shape of tools you give an agent directly affect its performance. Too few tools and the agent cannot do its job. Too many and the model struggles to pick the right one — selection accuracy drops as the tool count grows, because every tool competes for attention in the prompt.
A practical guideline: start with the smallest set of tools that covers the task, and add tools only when you observe the agent failing because it lacks a capability. If you find yourself with more than fifteen or twenty tools on a single agent, consider splitting the agent into specialized sub-agents with focused tool sets — that is a multi-agent pattern we will look at later.
Tool naming conventions help at scale. If all database tools start with db_ and all communication tools start with notify_, the model learns the pattern and routes more reliably. Consistent parameter naming helps too — if every tool uses customer_id for the same concept, the model does not have to guess whether this tool calls it user_id or cid or account.
When tools overlap in capability, the model will split its calls unpredictably between them. If search_orders and find_customer_purchases do the same thing with different names, remove one. If they do slightly different things, make the difference explicit in the descriptions.
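Some of these naming problems can be caught mechanically before the model ever sees the tool set. The sketch below is a hypothetical lint, not an established tool: the 0.65 threshold and the PARAMETER_ALIASES table are assumptions you would tune for your own system. String similarity is only a cheap proxy for what a model finds confusing, and it cannot detect two tools that overlap in capability under unrelated names.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical alias table: canonical parameter name -> variants to flag.
PARAMETER_ALIASES = {"customer_id": {"user_id", "cid", "account"}}


def confusable_names(names: list[str], threshold: float = 0.65) -> list[tuple[str, str]]:
    """Return pairs of tool names whose similarity ratio meets the threshold."""
    flagged = []
    for a, b in combinations(sorted(names), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            flagged.append((a, b))
    return flagged


def inconsistent_parameters(tools: list[dict]) -> list[str]:
    """Report parameters that use a known alias instead of the canonical name."""
    problems = []
    for tool in tools:
        for param in tool["parameters"].get("properties", {}):
            for canonical, aliases in PARAMETER_ALIASES.items():
                if param in aliases:
                    problems.append(f"{tool['name']}: rename '{param}' to '{canonical}'")
    return problems
```

With this threshold, get_user and get_user_profile are flagged as a pair, while get_user and get_user_auth_status are not. Running a check like this in CI is one way to keep a growing tool set consistent.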
Tool design is not just about which parameters a tool takes. The format the model has to write its arguments in matters just as much. Give the model a format that is close to what it has seen in natural text, and avoid formats that force it to do bookkeeping. A tool that accepts a diff, for example, requires the model to state an accurate count of changed lines in a chunk header before it has written any of the new code, a task models routinely get wrong. Having the model write the full updated content costs more tokens but produces fewer errors.
The same principle applies to real-world tool interfaces. Passing code as a JSON string forces the model to escape newlines and quotes, while writing it in a plain fenced block does not. A practical lesson from large-scale coding agents: when a file-editing tool accepted relative paths, the model made constant mistakes after changing directories; switching to absolute paths eliminated the problem entirely. Think of tool interfaces the same way you think about human interfaces: invest in making them obvious and hard to misuse.
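One way to make a file-editing tool hard to misuse is to reject relative paths at the boundary, with an error message that tells the model how to recover. This is a minimal sketch; the function name is hypothetical, and it uses POSIX path rules regardless of the host operating system.

```python
from pathlib import PurePosixPath


def validate_edit_path(path: str) -> str:
    """Accept only absolute POSIX paths; raise ValueError otherwise."""
    candidate = PurePosixPath(path)
    if not candidate.is_absolute():
        raise ValueError(
            f"path must be absolute (got '{path}'); "
            "prefix it with the repository root and try again"
        )
    return str(candidate)
```

The error text doubles as an observation for the model, so a rejected call usually turns into a corrected call on the next turn.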
Conclusion
Tools are the mechanism that lets an agent affect the real world, and the engineering around them is what determines whether it does so safely.
Key takeaways:
- Schema design steers model behavior — names, descriptions, and tight parameter schemas determine whether the model picks the right tool and fills in correct arguments
- Validation at the boundary catches mistakes before they become real-world actions; feed errors back to the model, never silently coerce
- The read/write distinction controls blast radius — classify tools explicitly and gate write operations with confirmation checks
- Permission scoping limits what an agent can touch: filter the tool set per task, enforce per-tool authorization, and inject scoped values the model should not control
- Idempotency keys, generated by the runtime from step identity, make retries safe for write operations
- Start with the smallest tool set that covers the task; consistent naming and non-overlapping capabilities improve selection accuracy