LLM Agent Fundamentals

TL;DR

An agent is a model using tools in a loop against an environment. The model supplies reasoning; the harness — the loop, tool definitions, context management, permissions, and sandbox you build around it — determines how much of that reasoning turns into useful work. Modern agents use native tool calling (typed JSON schemas, parallel calls) rather than text-parsed prompts, treat the context window as working memory backed by files and compaction, verify their own work against ground truth (tests, linters, screenshots), and run inside permission-gated sandboxes. Design the environment and the feedback loops first; the model is the most replaceable component.

What Makes an Agent Different from a Chatbot?

A chatbot maps one input to one output. A workflow chains LLM calls along a code path you wrote. An agent lets the model direct its own process: it decides which tool to call next based on what the last tool returned, and it keeps going until the goal is met or the harness stops it. That autonomy is the value and the risk — agents handle tasks you couldn't enumerate steps for, and they fail in ways you didn't enumerate either.

Aspect	Chatbot	Workflow	Agent
Control flow	None	Your code	The model
Actions	Text only	Predefined LLM calls	Tools chosen at runtime
Steps	1	Fixed	Open-ended, bounded by harness
Failure mode	Bad answer	Bad step output	Compounding drift across steps
Cost profile	Predictable	Predictable	Variable — budget in the harness
Right for	Q&A	Decomposable, known tasks	Open-ended tasks with verifiable outcomes

Start with the simplest form that solves the problem. Most production "agent" systems are workflows, and should be — see Orchestration Patterns for the taxonomy.

The Three Components

Model. Frontier models are post-trained specifically for agentic tool use: they natively interleave reasoning with tool calls, recover from errors, and sustain multi-hour tasks. Capability differences between models matter, but a mediocre harness wastes a frontier model — public coding benchmarks always report a model + harness pair, never a model alone.
Harness. Everything between the model and the world. This is where most engineering effort goes; see Harness Engineering for the full treatment.
Environment. What the agent can observe and change. The richer and more inspectable the environment (a real shell, a real filesystem, real test suites), the better the agent's feedback loops. Design environments so that progress is observable — an agent that can run the tests doesn't need to guess whether its change worked.

The Agent Loop

The 2023-era pattern — prompt the model to emit Thought: / Action: / Action Input: text and parse it with regexes — is obsolete. Every major API exposes native tool calling: you declare tools as JSON Schema, the model returns typed tool-call blocks, and you return results as structured messages. No parsing, no format drift, parallel calls for free.

python

import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "bash",
        "description": "Run a shell command in the project sandbox. "
                       "Returns stdout and stderr, truncated to 50KB.",
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "Command to execute"},
            },
            "required": ["command"],
        },
    },
    {
        "name": "edit_file",
        "description": "Replace an exact string in a file. Fails if the string "
                       "is not found or matches more than once.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "old": {"type": "string"},
                "new": {"type": "string"},
            },
            "required": ["path", "old", "new"],
        },
    },
]

def agent_loop(task: str, max_turns: int = 50) -> str:
    messages = [{"role": "user", "content": task}]

    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=8192,
            system=SYSTEM_PROMPT,          # stable across turns: prompt-cache friendly
            tools=TOOLS,
            messages=messages,
        )

        if response.stop_reason != "tool_use":
            return next(b.text for b in response.content if b.type == "text")

        # Append the assistant turn verbatim, then execute every tool call
        # in it (the model may request several in parallel).
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": execute_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})

    raise RuntimeError("max_turns exceeded — task did not converge")

The OpenAI equivalent uses tools=[{"type": "function", ...}] and tool_calls on the response; the shape of the loop is identical. The legacy functions / function_call parameters are deprecated.

What the production version adds on top of this skeleton:

Permission gate before execute_tool — classify actions as read / write / irreversible and require approval for the last class.
Token budget accounting — track context usage every turn; trigger compaction before overflow (see Context Management).
Checkpointing — persist messages so a crashed or interrupted run resumes instead of restarting.
Telemetry — record every turn as a trace span: tool name, latency, tokens, cache hit rate.
Streaming and interruption — surface partial output and let a human steer mid-run.

Extended thinking

Reasoning-capable models can emit internal thinking tokens before acting, and interleave thinking between tool calls — reflect on a result before choosing the next action. This replaced most prompt-level reasoning scaffolds (Chain-of-Thought, Tree-of-Thought); you control it with a thinking budget parameter rather than prompt tricks. Spend the budget where verification is hard and steps are irreversible; keep it low for mechanical tool-use sequences.

Tools

Tools are the agent's API to the world. Tool design is prompt engineering with a type system — the model reads your schemas and descriptions the way a new hire reads your docs, and ambiguity costs you real tokens and wrong calls.

Design principles

Few, orthogonal tools beat many overlapping ones. Every tool competes for the model's attention in every turn. If two tools can accomplish the same thing, the model will sometimes pick the worse one. Consolidate (search_logs(filters) rather than five log tools).
Descriptions are micro-prompts. State what the tool does, when to use it over its neighbors, what it returns, and its failure modes. The edit_file description above tells the model how to avoid the two common failure cases.
Token-efficient outputs. Return what the model needs to decide the next step — truncate, paginate, summarize. A tool that dumps a 200KB JSON blob into context does more damage than one that errors.
Errors that teach. "File not found: src/uesr.py — did you mean src/user.py?" lets the model self-correct in one turn. A bare stack trace often costs three.
Idempotent and safe to retry wherever possible — the loop will retry.

General-purpose vs. structured tools

The highest-leverage tools in practice are the general-purpose ones: a shell, file read/write/edit, and code execution. A bash tool subsumes hundreds of specialized tools, and writing a script is often more token-efficient than chaining ten tool calls — the "code execution as tool use" pattern has the agent write a program that calls APIs in a loop instead of making each call through the context window. Add structured tools where types and guardrails matter: payments, ticket updates, anything irreversible.

MCP: the integration standard

The Model Context Protocol (MCP) standardizes how tools, resources, and prompts are exposed to agents — an MCP server wraps a system (GitHub, Postgres, Slack, a browser) once, and any MCP-capable harness can use it. Treat MCP servers as third-party dependencies: review what they expose, pin versions, and remember that every connected server's tool descriptions enter your prompt (use deferred loading / tool search when the catalog is large). Coverage in depth: Coding Agent Tool Design.

Memory and Context

The context window is the agent's working memory, and it is the scarcest resource in the system. Two findings shape everything: effective attention degrades as context fills ("context rot" — models recall the middle of a long context worse than the edges), and inference cost scales with every token you keep resending. Long-horizon agents are therefore built on context hygiene, not maximal context.

Layer	Mechanism	Survives
Working memory	The message list itself	One run, until compaction
Compaction summary	Model summarizes the transcript; harness restarts the loop with the summary + recent turns	Context overflow
File-based memory	Agent writes notes, plans, TODOs to disk (`NOTES.md`, scratch files) and re-reads them	The session — and the next one
Project memory	Curated instruction files (`CLAUDE.md`-style) loaded every session	The project
Retrieval	Search tools over a corpus or past episodes	Everything else

Practical defaults:

Compaction preserves decisions, constraints, file paths, and learned gotchas; it discards raw tool output. Trigger it at a threshold (e.g., 80% of the window), not at overflow.
The filesystem is the agent's external memory. An agent that maintains its own plan.md and checks items off recovers from compaction and interruption almost for free.
Just-in-time retrieval beats preloading. Give the agent search tools (grep, semantic search) and let it pull what it needs, instead of stuffing everything that might be relevant into the prompt. See Agent Context Engineering.
Vector-store "agent memory" (embed every message, retrieve by similarity) is rarely the right first tool — explicit files the agent deliberately writes and reads are more debuggable and more faithful.

Verification: The Half of the Loop That Matters

Agents shine on tasks where checking an answer is cheaper than producing it — code with a test suite, data transformations with invariants, UI changes you can screenshot. The harness should make ground truth available:

Prefer objective verifiers (exit codes, diffs, pixel comparisons) over the model grading itself; self-evaluation without ground truth inflates success rates.
Make verification cheap and incremental: a fast targeted test the agent can run after every edit outperforms a 20-minute suite it runs once.
For tasks with no programmatic oracle, use rubric-based LLM-as-judge as a weak signal and route low-confidence results to a human.
A task with no verification signal at all is a poor fit for autonomy — keep a human in the loop.

This is also the honest framing for reliability: per-step success compounds. A 98%-per-step agent finishes a 30-step task ~55% of the time. Verification steps are how you stop the compounding — they convert silent drift into a visible, recoverable error.

Security and Sandboxing

An agent is an untrusted-code execution problem plus a confused-deputy problem. The harness, not the model, is the security boundary.

Sandbox the environment. Run tool execution in an ephemeral container or VM: project directory mounted read-write, everything else read-only or invisible; network egress through an allowlist; secrets injected per-tool, never placed in context.
Classify and gate actions. Reads auto-approve; writes inside the workspace auto-approve or batch for review; anything irreversible or outward-facing (push, deploy, send email, spend money) requires explicit approval until you have eval evidence to relax it.
Assume prompt injection. Any untrusted content the agent reads — web pages, issues, emails, tool outputs — may contain instructions. Pattern-matching filters do not solve this. The structural defense is to avoid the lethal trifecta: an agent that simultaneously (1) reads untrusted input, (2) has access to private data, and (3) can communicate externally is exfiltratable by design. Remove or gate at least one leg.
Provenance matters. Mark tool results as data, not directives, in the prompt; treat "the issue comment told me to" as a bug in your harness, not the model.

python

IRREVERSIBLE = {"deploy", "send_email", "git_push", "payment"}

async def execute_gated(tool: str, args: dict, policy: Policy) -> str:
    action_class = classify(tool, args)          # read | write | irreversible
    if action_class == "irreversible" and not policy.pre_approved(tool, args):
        approval = await request_human_approval(tool, args)
        if not approval.granted:
            return f"Denied by operator: {approval.reason}"   # the model adapts
    return await sandbox.run(tool, args)

Evaluating Agents

You cannot improve a harness you cannot measure. Public benchmarks calibrate expectations — SWE-bench Verified (real GitHub issues), Terminal-Bench (terminal tasks), τ-bench (tool use under policy constraints, with simulated users), OSWorld (computer use), GAIA (tool-augmented reasoning) — but your product needs its own eval set: 50–200 real tasks with programmatic graders, run on every harness change. Track task success, cost per solved task, turns to completion, and unsafe-action rate, not just model-level scores. Pass@k vs pass^k matters for agents: a task solved 1-in-8 runs is a very different product than one solved 8-in-8.

Best Practices

1. SIMPLEST THING FIRST
   Single call → workflow → agent. Earn each step up in autonomy
   with evidence the simpler form fails.

2. DESIGN THE ENVIRONMENT, NOT JUST THE PROMPT
   Fast tests, clear errors, inspectable state. Agents are only as
   good as their feedback loops.

3. SMALL ORTHOGONAL TOOL SURFACE
   Bash + files + code execution, plus structured tools for the
   irreversible stuff. Consolidate aggressively.

4. CONTEXT HYGIENE OVER CONTEXT SIZE
   Stable prompt prefix (cache), just-in-time retrieval, compaction,
   files as memory.

5. VERIFY WITH GROUND TRUTH
   Tests, typecheckers, screenshots. The model claims; the harness
   confirms.

6. SANDBOX BY DEFAULT, GATE THE IRREVERSIBLE
   Containerized execution, egress allowlists, approval for
   outward-facing actions. Avoid the lethal trifecta.

7. BOUND EVERYTHING
   Max turns, token budgets, wall-clock timeouts, spend limits.
   Autonomy without budgets is an incident report.

8. INSTRUMENT EVERY TURN
   Traces with tokens, tools, cache hits, interventions. Evals on
   every harness change.

References

Building Effective Agents — Anthropic's workflow/agent taxonomy
Effective Context Engineering for AI Agents — Anthropic
Writing Effective Tools for Agents — Anthropic
Code Execution with MCP — scripts instead of chained tool calls
Model Context Protocol — specification and SDKs
The Lethal Trifecta for AI Agents — Simon Willison
SWE-bench / τ-bench / GAIA — agent benchmarks
Context Rot: How Increasing Input Tokens Impacts LLM Performance — Chroma research
ReAct: Synergizing Reasoning and Acting in Language Models — the 2022 pattern that native tool calling productized

LLM Agent Fundamentals ​

TL;DR ​

What Makes an Agent Different from a Chatbot? ​

The Three Components ​

The Agent Loop ​

Extended thinking ​

Tools ​

Design principles ​

General-purpose vs. structured tools ​

MCP: the integration standard ​

Memory and Context ​

Verification: The Half of the Loop That Matters ​

Security and Sandboxing ​

Evaluating Agents ​

Best Practices ​

References ​