The word "agent" is everywhere in AI right now — but most explanations either wave their hands or throw you into a 400-line LangGraph workflow. Neither helps you actually understand what an agent is. So let's skip the abstractions. By the end of this post you'll have a working Python agent — one that reasons, picks tools, runs them, and loops until it has a final answer — built entirely from scratch.
No LangChain. No LlamaIndex. No magic. Just the OpenAI SDK, the inspect module, and roughly 80 lines of Python.
What Makes Something an "Agent"?
An LLM is a fancy text-completion function. You give it tokens, it gives you tokens. An agent is that same LLM wrapped in a loop that can observe its environment, make decisions, take actions, and remember what it has already done. Those four properties (observation, decision-making, action, and memory) define an agent.
The Loop is the Agent
The difference between an LLM and an agent is the loop. A raw LLM call is a single shot: in → out. An agent keeps calling the LLM, running tools, and feeding results back until the task is genuinely complete. That loop is the whole secret.
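To make the contrast concrete, here is a minimal sketch of such a loop with the LLM replaced by a scripted stand-in, so it runs without an API key. The names `fake_llm`, `TOOLS`, and `run_loop`, and the shape of the decision dicts, are hypothetical and chosen only for illustration; they are not part of the OpenAI SDK.

```python
# A toy agent loop. The "model" is a scripted stand-in, not a real LLM.

def fake_llm(messages: list) -> dict:
    """Hypothetical model: requests one tool call, then answers."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"type": "tool_call", "name": "add", "args": {"a": 2, "b": 3}}
    return {"type": "final", "content": f"The sum is {tool_results[-1]['content']}."}


TOOLS = {"add": lambda a, b: a + b}


def run_loop(question: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        decision = fake_llm(messages)                 # decide
        if decision["type"] == "final":
            return decision["content"]                # task complete
        result = TOOLS[decision["name"]](**decision["args"])        # act
        messages.append({"role": "tool", "content": str(result)})  # observe + remember
    return "Stopped: step limit reached."


answer = run_loop("What is 2 + 3?")
print(answer)  # The sum is 5.
```

A single-shot call would stop after the first `fake_llm` invocation; the loop is what lets the tool result flow back into the next decision.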
The ReAct pattern (Reason + Act) formalises this: at each step the model produces a thought (silent reasoning), an action (tool call), and an observation (tool result) — repeating until it emits a final answer. OpenAI's native tool-calling API implements exactly this pattern, just with structured JSON instead of free-text scratchpads.
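As a rough illustration of how one ReAct cycle maps onto chat messages, the sketch below builds a small history in the Chat Completions message format and labels each entry. The conversation content and the id `call_1` are made up, and `react_step` is a hypothetical helper, not an SDK function.

```python
import json

# One ReAct cycle expressed as chat messages (illustrative values only).
history = [
    {"role": "user", "content": "What is the weather in London?"},
    # Action: the assistant requests a tool call instead of answering.
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",  # made-up id
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": json.dumps({"city": "London"}),  # arguments travel as a JSON string
            },
        }],
    },
    # Observation: the tool result, linked back by tool_call_id.
    {"role": "tool", "tool_call_id": "call_1", "content": "12°C, overcast"},
    # Final answer: an assistant message with no tool_calls.
    {"role": "assistant", "content": "It is 12°C and overcast in London."},
]


def react_step(message: dict) -> str:
    """Hypothetical helper: name the ReAct step a message represents."""
    if message["role"] == "tool":
        return "observation"
    if message["role"] == "assistant":
        return "action" if message.get("tool_calls") else "final answer"
    return "task"


steps = [react_step(m) for m in history]
print(steps)  # ['task', 'action', 'observation', 'final answer']
```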
Step 1: Define Tools
Tools are plain Python functions. The only constraint is that they must have type-annotated parameters and a docstring — we'll use those to auto-generate the JSON schema the model expects.
```python
# tools.py: three plain functions, zero framework
import json, math, urllib.request, urllib.parse


def get_weather(city: str) -> str:
    """Return the current weather for a given city.

    Args:
        city: The name of the city, e.g. 'London' or 'New York'.
    """
    # In production, call a real weather API here.
    # For the demo we return canned data.
    weather_db = {
        "london": "12°C, overcast",
        "new york": "24°C, sunny",
        "tokyo": "18°C, partly cloudy",
        "dubai": "38°C, clear",
    }
    key = city.lower().strip()
    return weather_db.get(key, f"Weather data not available for {city}.")


def calculate(expression: str) -> str:
    """Evaluate a safe mathematical expression and return the result.

    Args:
        expression: A mathematical expression string, e.g. '2 ** 10' or 'sqrt(144)'.
    """
    # Expose only safe math symbols; never use bare eval() in production!
    allowed = {k: getattr(math, k) for k in dir(math) if not k.startswith("_")}
    allowed["__builtins__"] = {}
    try:
        result = eval(expression, allowed)  # noqa: S307
        return str(result)
    except Exception as e:
        return f"Error evaluating expression: {e}"


def search_web(query: str) -> str:
    """Search the web for up-to-date information on a topic.

    Args:
        query: The search query string, e.g. 'latest GPT-4o benchmarks'.
    """
    # Stub: swap in SerpAPI / Brave Search API / Tavily in production.
    encoded = urllib.parse.quote_plus(query)
    return (
        f"[Simulated search results for: {query}]\n"
        "1. Researchers at MIT publish new benchmark showing 40% improvement...\n"
        "2. Industry report: adoption of agentic AI up 3x in 2025...\n"
        "3. OpenAI releases o3-mini with improved reasoning at lower cost..."
    )


# Collect all tools in one list for easy import
ALL_TOOLS = [get_weather, calculate, search_web]
```
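The comment in `calculate` warns against `eval()` in production, and an empty `__builtins__` does not make it a sandbox. One possible alternative, sketched here under the assumption that only arithmetic and a few math functions are needed, is to walk the expression's syntax tree with the standard `ast` module and allow only whitelisted nodes. `safe_calculate` and the whitelists are hypothetical names, not a drop-in from any library.

```python
import ast
import math
import operator

# Whitelists (assumed; extend to taste).
_BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod, ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}
_FUNCS = {"sqrt": math.sqrt, "log": math.log, "sin": math.sin, "cos": math.cos}
_CONSTS = {"pi": math.pi, "e": math.e}


def _eval_node(node: ast.AST):
    """Recursively evaluate a whitelisted subset of Python expressions."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    if isinstance(node, ast.Name) and node.id in _CONSTS:
        return _CONSTS[node.id]
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in _FUNCS and not node.keywords):
        return _FUNCS[node.func.id](*[_eval_node(a) for a in node.args])
    raise ValueError(f"Unsupported expression element: {type(node).__name__}")


def safe_calculate(expression: str) -> str:
    """Evaluate arithmetic without eval(); errors come back as strings."""
    try:
        tree = ast.parse(expression, mode="eval")
        return str(_eval_node(tree.body))
    except Exception as e:
        return f"Error evaluating expression: {e}"


print(safe_calculate("2 ** 16"))    # 65536
print(safe_calculate("sqrt(144)"))  # 12.0
```

This rejects attribute access, imports, and arbitrary calls, but it does not bound computation cost: an expression with a huge exponent can still take a long time, so a timeout or size limit would be a sensible addition.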
Now let's auto-generate the JSON schema OpenAI needs. Instead of writing it by hand (error-prone and tedious), we'll introspect each function with the inspect module:
```python
# schema.py: build OpenAI tool schemas from function signatures
import inspect
import re
from typing import Callable, get_type_hints

# Map Python types → JSON Schema types
PYTHON_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}


def _parse_arg_docs(docstring: str) -> dict:
    """Extract per-argument descriptions from a Google-style docstring."""
    arg_docs = {}
    if not docstring:
        return arg_docs
    # Match lines under the 'Args:' section
    args_section = re.search(r"Args:\n(.*?)(?:\n\n|\Z)", docstring, re.DOTALL)
    if not args_section:
        return arg_docs
    # inspect.getdoc() dedents the docstring, so accept any leading
    # indentation instead of a fixed number of spaces.
    pattern = r"^[ \t]+(\w+):[ \t]+(.+)$"
    for match in re.finditer(pattern, args_section.group(1), re.MULTILINE):
        arg_docs[match.group(1)] = match.group(2).strip()
    return arg_docs


def function_to_schema(func: Callable) -> dict:
    """Convert a Python function into an OpenAI tool schema dict."""
    hints = get_type_hints(func)
    sig = inspect.signature(func)
    doc = inspect.getdoc(func) or ""
    arg_docs = _parse_arg_docs(doc)

    # First paragraph of the docstring = tool description
    description = doc.split("\n\n")[0].strip()

    properties, required = {}, []
    for name, param in sig.parameters.items():
        json_type = PYTHON_TO_JSON.get(hints.get(name), "string")
        properties[name] = {
            "type": json_type,
            "description": arg_docs.get(name, ""),
        }
        # Parameters without a default value are required
        if param.default is inspect.Parameter.empty:
            required.append(name)

    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
                "additionalProperties": False,
            },
        },
    }


def build_tool_schemas(functions: list) -> list:
    """Build a list of OpenAI-compatible tool schemas from a list of functions."""
    return [function_to_schema(f) for f in functions]
```
Why auto-generate schemas?
Hand-writing JSON schemas for every function is tedious, error-prone, and means you have two things to keep in sync. Generating from the signature means adding a new tool is just writing a typed Python function — no schema boilerplate needed.
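For comparison, this is roughly what the hand-written equivalent looks like for a single tool. The dict below is a sketch of the schema that `function_to_schema` is meant to produce for `get_weather`, typed out manually; every string in it duplicates something already stated in the function's signature or docstring.

```python
import json

# Hand-written tool schema for get_weather (the thing we want to avoid maintaining).
get_weather_schema = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The name of the city, e.g. 'London' or 'New York'.",
                },
            },
            "required": ["city"],
            "additionalProperties": False,
        },
    },
}

print(json.dumps(get_weather_schema, indent=2))
```

Rename the parameter or reword the docstring and this dict silently goes stale, which is the synchronisation problem that generation from the signature avoids.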
Step 2: Implement Tool Calling
OpenAI's tool-calling flow has four steps: send the tools list with the user message, receive a response that may contain tool_calls, dispatch each call to the matching Python function, then send the results back. Here's the dispatch logic in isolation:
```python
# dispatcher.py: call the right function given a tool_call object
import json

from tools import ALL_TOOLS

# Build a name → callable registry
TOOL_REGISTRY: dict = {fn.__name__: fn for fn in ALL_TOOLS}


def dispatch_tool_call(tool_call) -> str:
    """Execute a single OpenAI tool_call and return the string result."""
    name = tool_call.function.name
    if name not in TOOL_REGISTRY:
        return f"Error: unknown tool '{name}'"
    fn = TOOL_REGISTRY[name]
    try:
        # The model writes the arguments as a JSON string, which may be malformed
        arguments = json.loads(tool_call.function.arguments)
        result = fn(**arguments)
        return str(result)
    except (json.JSONDecodeError, TypeError) as e:
        return f"Error calling {name}: {e}"


def handle_tool_calls(response_message, messages: list) -> list:
    """Process all tool_calls in a response; append results to messages."""
    # 1. Append the assistant's message (which contains the tool_call requests)
    messages.append(response_message)
    # 2. Execute each tool call and append the result
    for tc in response_message.tool_calls:
        result = dispatch_tool_call(tc)
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": result,
        })
    return messages
```
Notice that the tool result message must include the original tool_call_id — this lets the model match each result back to its specific request when multiple tools are called in parallel.
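The pairing of results to requests can be exercised offline. The sketch below fakes two parallel tool calls with `types.SimpleNamespace` objects that expose only the attributes the dispatcher reads (`id`, `function.name`, `function.arguments`). The ids, the stand-in `get_weather`, and the helper names are assumptions for illustration, not SDK objects.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> str:
    """Stand-in tool with canned data."""
    return {"london": "12°C, overcast"}.get(city.lower(), f"No data for {city}")


REGISTRY = {"get_weather": get_weather}


def make_tool_call(call_id: str, name: str, arguments: dict):
    """Build a fake tool_call with the attributes a dispatcher needs."""
    return SimpleNamespace(
        id=call_id,
        function=SimpleNamespace(name=name, arguments=json.dumps(arguments)),
    )


def to_tool_message(tool_call) -> dict:
    """Run the tool and wrap the result, echoing the request's id."""
    args = json.loads(tool_call.function.arguments)
    result = REGISTRY[tool_call.function.name](**args)
    return {"role": "tool", "tool_call_id": tool_call.id, "content": str(result)}


# Two calls requested in the same assistant turn (made-up ids).
calls = [
    make_tool_call("call_a", "get_weather", {"city": "London"}),
    make_tool_call("call_b", "get_weather", {"city": "Oslo"}),
]
tool_messages = [to_tool_message(tc) for tc in calls]
for m in tool_messages:
    print(m["tool_call_id"], "->", m["content"])
```

Because each result carries its own `tool_call_id`, the two observations stay distinguishable even though they come from the same function.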
Step 3: The Agent Loop
The loop is the heart of the agent. It's simpler than most people expect: conceptually a while True, written here as a bounded for loop so there is always a safety exit:
```python
# agent.py: the core agent loop (~50 lines)
import json  # imported at module level so run_agent also works when imported
import os

from openai import OpenAI

from tools import ALL_TOOLS
from schema import build_tool_schemas
from dispatcher import handle_tool_calls

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
TOOL_SCHEMAS = build_tool_schemas(ALL_TOOLS)

SYSTEM_PROMPT = """You are a research assistant with access to tools.
Think step-by-step. Use tools whenever you need real data.
When you have enough information, write a clear final answer."""


def run_agent(user_query: str, max_iterations: int = 10) -> str:
    """Run the agent loop until a final answer is produced."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ]

    for iteration in range(max_iterations):
        print(f"\n── Iteration {iteration + 1} ──")
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=TOOL_SCHEMAS,
            tool_choice="auto",  # let the model decide
        )
        msg = response.choices[0].message
        stop_reason = response.choices[0].finish_reason

        # ── Case 1: the model chose to call one or more tools ──
        if stop_reason == "tool_calls":
            for tc in msg.tool_calls:
                args = json.loads(tc.function.arguments)
                print(f"  🔧 {tc.function.name}({args})")
            messages = handle_tool_calls(msg, messages)
            continue  # loop back with tool results appended

        # ── Case 2: the model is done, so return the final answer ──
        if stop_reason == "stop":
            return msg.content

        # ── Case 3: unexpected finish reason ──
        return f"Unexpected stop reason: {stop_reason}"

    return "Max iterations reached; partial answer is in the message history."


if __name__ == "__main__":
    answer = run_agent(
        "What's the weather in London and Tokyo? "
        "Also, what is 2 to the power of 16?"
    )
    print("\n── Final Answer ──\n")
    print(answer)
```
Always set max_iterations
Without a guard, a bug in your tool (e.g. one that always returns an error) can put the agent in an infinite loop and burn your API credits. max_iterations=10 is a safe default for most tasks; raise it for complex multi-step research.
When you run agent.py, you'll typically see the model call get_weather for London and Tokyo in the same iteration (parallel tool calling, which GPT-4o supports natively), then call calculate with an expression such as 2 ** 16 in the next iteration, and finally emit a clean prose answer that incorporates all three results. The exact grouping of calls can vary from run to run.
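The max_iterations guard described above can be seen in action without spending API credits. In this sketch a deliberately broken tool always returns an error and a scripted stand-in for the model keeps retrying; `broken_tool`, `stubborn_model`, and `run` are hypothetical names used only for the demonstration.

```python
def broken_tool(city: str) -> str:
    """A buggy tool that always fails."""
    return "Error: upstream service unavailable"


def stubborn_model(messages: list) -> dict:
    """Hypothetical stand-in for the LLM: retries while the last result is an error."""
    last = messages[-1]
    if last["role"] == "tool" and not last["content"].startswith("Error"):
        return {"final": f"Weather: {last['content']}"}
    return {"tool": "broken_tool", "args": {"city": "London"}}


def run(max_iterations: int) -> tuple[str, int]:
    """Return the final answer and the number of tool calls that were made."""
    messages = [{"role": "user", "content": "Weather in London?"}]
    calls = 0
    for _ in range(max_iterations):
        decision = stubborn_model(messages)
        if "final" in decision:
            return decision["final"], calls
        calls += 1
        messages.append({"role": "tool", "content": broken_tool(**decision["args"])})
    return "Max iterations reached.", calls


answer, calls = run(max_iterations=10)
print(answer, calls)  # Max iterations reached. 10
```

With a real model each of those ten retries would be a billed request, and without the bound the loop would never exit.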
Step 4: Adding Memory
The messages list is the agent's short-term memory. Every tool call, result, and assistant thought is appended — the model sees the full history at every step. But context windows are finite, so you need a strategy for long-running agents.
```python
# memory.py: two strategies for managing context length
from openai import OpenAI

client = OpenAI()


def _role(m) -> str:
    """Role of a message; the history mixes dicts and SDK message objects."""
    return m["role"] if isinstance(m, dict) else m.role


def _content(m) -> str:
    """Text of a message, or '' when there is none (e.g. a pure tool-call turn)."""
    content = m.get("content") if isinstance(m, dict) else m.content
    return content or ""


# ── Strategy 1: Sliding window, keep last N messages ──
def trim_messages_sliding(messages: list, keep_last: int = 20) -> list:
    """Keep the system prompt + the most recent `keep_last` messages."""
    system_msgs = [m for m in messages if _role(m) == "system"]
    other_msgs = [m for m in messages if _role(m) != "system"]
    window = other_msgs[-keep_last:]
    # A "tool" message is only valid after the assistant message that requested
    # it, so drop orphaned tool results at the start of the window.
    while window and _role(window[0]) == "tool":
        window = window[1:]
    return system_msgs + window


# ── Strategy 2: Summarisation, compress old turns into one message ──
def trim_messages_summarise(messages: list, keep_recent: int = 6) -> list:
    """Summarise everything except the last `keep_recent` messages."""
    system_msgs = [m for m in messages if _role(m) == "system"]
    other_msgs = [m for m in messages if _role(m) != "system"]
    if len(other_msgs) <= keep_recent:
        return messages  # nothing to trim

    # Split so the recent part never starts with an orphaned tool result
    split = len(other_msgs) - keep_recent
    while split < len(other_msgs) and _role(other_msgs[split]) == "tool":
        split += 1
    old_msgs, recent_msgs = other_msgs[:split], other_msgs[split:]

    # Ask the model to compress the old conversation
    summary_prompt = [
        {"role": "system", "content": "Summarise the following conversation history in 3-5 bullet points. Preserve key facts and decisions."},
        {"role": "user", "content": "\n".join(f"{_role(m)}: {_content(m)}" for m in old_msgs)},
    ]
    summary_resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=summary_prompt,
    )
    summary_text = summary_resp.choices[0].message.content
    summary_msg = {"role": "assistant", "content": f"[Earlier conversation summary]\n{summary_text}"}
    return system_msgs + [summary_msg] + recent_msgs


# ── Long-term memory (sketch) ──
# For persistent memory across sessions, store key facts in a vector DB.
# At the start of each session: retrieve relevant facts → inject into system prompt.
# Tools like Mem0 or a simple Chroma/Pinecone store work well here.
```
Here's a comparison of memory approaches to help you pick the right one:
| Strategy | Best For | Pros | Cons |
|---|---|---|---|
| Full history | Short tasks (< 20 turns) | Zero overhead, perfect recall | Hits context limit fast |
| Sliding window | Conversational agents | Simple, predictable token usage | Loses early context entirely |
| Summarisation | Long research sessions | Preserves key facts | Extra API call cost; summary may lose detail |
| Vector DB (long-term) | Persistent user profiles | Survives restarts, scales well beyond the context window | Infrastructure overhead, retrieval noise |
Short-term vs Long-term Memory
Short-term memory is the messages list — it lives inside a single agent run and is lost when the process ends. Long-term memory requires an external store (a database, a vector index) that persists between runs. Most agents only need short-term; add long-term when users expect the agent to remember them across sessions.
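As a minimal sketch of the "external store" idea, the example below persists facts to a JSON file and injects them into the system prompt at the start of a new session. It is a deliberately simple stand-in: there is no vector index and no relevance-based retrieval, and `FactStore` with its methods is a hypothetical class invented for this illustration.

```python
import json
import tempfile
from pathlib import Path


class FactStore:
    """Hypothetical long-term memory: a list of facts persisted to a JSON file."""

    def __init__(self, path: Path):
        self.path = path
        # Load facts left behind by earlier sessions, if any
        self.facts = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []

    def remember(self, fact: str) -> None:
        """Store a new fact and write the file immediately."""
        if fact not in self.facts:
            self.facts.append(fact)
            self.path.write_text(json.dumps(self.facts, indent=2), encoding="utf-8")

    def as_system_prompt(self, base_prompt: str) -> str:
        """Append remembered facts to the base system prompt."""
        if not self.facts:
            return base_prompt
        bullets = "\n".join(f"- {fact}" for fact in self.facts)
        return f"{base_prompt}\n\nKnown facts from earlier sessions:\n{bullets}"


# Two "sessions" sharing one file: the second sees what the first stored.
store_path = Path(tempfile.mkdtemp()) / "agent_memory.json"
FactStore(store_path).remember("User prefers temperatures in Celsius.")
prompt = FactStore(store_path).as_system_prompt("You are a research assistant.")
print(prompt)
```

Injecting every stored fact only works while the list is short; once it grows, retrieving just the relevant facts is where a vector index earns its place.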
Complete Working Example
Let's put everything together into a "Research Agent" that can search the web, do calculations, check weather, and then write a polished final report. This is the complete, runnable file:
```python
#!/usr/bin/env python3
"""
research_agent.py: A complete AI agent in ~80 lines.
No framework. No magic. Just the OpenAI API.

Requirements:
    pip install openai
    export OPENAI_API_KEY=sk-...
"""
import inspect, json, math, re
from typing import Callable, get_type_hints

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


# ── 1. Define tools ──
def get_weather(city: str) -> str:
    """Return current weather for a city.

    Args:
        city: City name, e.g. 'Paris'.
    """
    db = {"london": "12°C overcast", "new york": "24°C sunny", "tokyo": "18°C cloudy"}
    return db.get(city.lower(), f"No data for {city}")


def calculate(expression: str) -> str:
    """Evaluate a mathematical expression safely.

    Args:
        expression: Math expression, e.g. 'sqrt(2) * 100'.
    """
    ns = {k: getattr(math, k) for k in dir(math) if not k.startswith("_")}
    ns["__builtins__"] = {}
    try:
        return str(eval(expression, ns))  # noqa: S307
    except Exception as e:
        return f"Error: {e}"


def search_web(query: str) -> str:
    """Search the web for up-to-date information.

    Args:
        query: Search query string.
    """
    # Replace with Tavily / Brave / SerpAPI in production
    return (
        f"[Results for: {query}]\n"
        "• Study: agentic AI frameworks reduce task completion time by 60%\n"
        "• Report: Python remains #1 language for AI development in 2025\n"
        "• OpenAI o3 sets new records on SWE-bench (71.7% pass rate)"
    )


ALL_TOOLS = [get_weather, calculate, search_web]

# ── 2. Auto-generate schemas ──
PYTHON_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}


def fn_to_schema(fn: Callable) -> dict:
    sig = inspect.signature(fn)
    doc = inspect.getdoc(fn) or ""
    desc = doc.split("\n\n")[0].strip()
    # getdoc() dedents the docstring, so match any indentation before "name: text"
    arg_desc = {m.group(1): m.group(2).strip()
                for m in re.finditer(r"^[ \t]+(\w+):[ \t]+(.+)$", doc, re.MULTILINE)}
    hints = get_type_hints(fn)
    props, req = {}, []
    for name, param in sig.parameters.items():
        props[name] = {"type": PYTHON_TO_JSON.get(hints.get(name), "string"),
                       "description": arg_desc.get(name, "")}
        if param.default is inspect.Parameter.empty:
            req.append(name)
    return {"type": "function",
            "function": {"name": fn.__name__, "description": desc,
                         "parameters": {"type": "object", "properties": props, "required": req}}}


SCHEMAS = [fn_to_schema(f) for f in ALL_TOOLS]
REGISTRY = {fn.__name__: fn for fn in ALL_TOOLS}


# ── 3. Tool dispatcher ──
def run_tool(tc) -> dict:
    name = tc.function.name
    try:
        # Parsing sits inside the try so malformed arguments become an error result
        args = json.loads(tc.function.arguments)
        result = REGISTRY[name](**args)
    except Exception as e:
        result = f"Error: {e}"
    return {"role": "tool", "tool_call_id": tc.id, "content": str(result)}


# ── 4. The agent loop ──
SYSTEM = """You are a research assistant. Use tools to gather real data.
When you have enough information, write a well-structured final report."""


def run_agent(query: str, max_iter: int = 8) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": query}]
    for i in range(max_iter):
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=SCHEMAS, tool_choice="auto")
        msg = resp.choices[0].message
        reason = resp.choices[0].finish_reason
        if reason == "tool_calls":
            messages.append(msg)
            for tc in msg.tool_calls:
                print(f"  🔧 {tc.function.name}({json.loads(tc.function.arguments)})")
                messages.append(run_tool(tc))
        elif reason == "stop":
            return msg.content
        else:
            return f"Agent stopped: unexpected finish reason '{reason}'."
    return "Agent stopped: max iterations reached."


# ── 5. Run it ──
if __name__ == "__main__":
    report = run_agent(
        "Research report: Compare the weather in London vs Tokyo today, "
        "calculate what percentage warmer Tokyo is, and summarise recent "
        "trends in agentic AI development. Write a final 3-paragraph report."
    )
    print("\n═══ FINAL REPORT ═══\n")
    print(report)
```
A sample run against GPT-4o produces roughly this execution trace:
```text
  🔧 get_weather({'city': 'London'})
  🔧 get_weather({'city': 'Tokyo'})
  🔧 calculate({'expression': '(18 - 12) / 12 * 100'})
  🔧 search_web({'query': 'agentic AI development trends 2025'})

═══ FINAL REPORT ═══

**Weather Comparison: London vs Tokyo**
London currently sits at 12°C with overcast skies, while Tokyo is warmer at 18°C under partly cloudy conditions, making Tokyo 50% warmer than London today.

**Agentic AI Trends**
Recent research confirms that agentic frameworks reduce task completion time by up to 60% compared to single-shot LLM prompting. Python continues to dominate AI development, powering the majority of agent implementations in 2025. OpenAI's o3 model has pushed the frontier further, achieving 71.7% on SWE-bench...

**Conclusion**
...
```
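Four tool calls, four observations, a few trips around the loop, and the model synthesises a coherent multi-source report. The total token cost for this run is roughly 800 input tokens plus the length of the response; because the full message history is re-sent on every iteration, the exact figure depends on how many times the loop runs.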
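To measure cost rather than estimate it, one option is to add up the `usage` field that each Chat Completions response carries. The sketch below does this with fake response objects so it runs offline; `UsageTracker` and `fake_response` are hypothetical helpers, and the token counts are made-up numbers, not measurements from the run above.

```python
from types import SimpleNamespace


class UsageTracker:
    """Hypothetical helper: accumulates token usage across loop iterations."""

    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.calls = 0

    def add(self, response) -> None:
        usage = getattr(response, "usage", None)
        if usage is None:  # nothing to record for this response
            return
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens
        self.calls += 1


def fake_response(prompt_tokens: int, completion_tokens: int):
    """Stand-in exposing only the usage attributes read above."""
    return SimpleNamespace(
        usage=SimpleNamespace(prompt_tokens=prompt_tokens,
                              completion_tokens=completion_tokens)
    )


tracker = UsageTracker()
# Made-up numbers: the prompt grows each iteration as history accumulates.
for prompt, completion in [(200, 40), (290, 25), (350, 180)]:
    tracker.add(fake_response(prompt, completion))

print(tracker.calls, tracker.prompt_tokens, tracker.completion_tokens)  # 3 840 245
```

In the real loop, calling `tracker.add(resp)` right after each `client.chat.completions.create(...)` would give per-run totals.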
When to Add a Framework
Building from scratch is the best way to truly understand agents. But frameworks exist for good reasons — once you're past the learning stage, they save real engineering time.
| Scenario | Scratch | Framework (LangGraph / CrewAI) |
|---|---|---|
| Single agent, 2–5 tools | ✅ Ideal | Overkill |
| Multi-agent orchestration | Hard to scale | ✅ LangGraph |
| Complex branching state | Re-inventing the wheel | ✅ LangGraph |
| Role-based agent teams | Lots of glue code | ✅ CrewAI / AutoGen |
| Production tracing / monitoring | Build your own logger | ✅ LangSmith / Langfuse |
| Custom memory + retrieval | ✅ Full control | Often leaky abstractions |
The right mental model
Think of frameworks as pre-built plumbing for common agent patterns. If your use case fits the pattern, use the framework. If it doesn't, the framework fights you. The code in this post is what every framework is doing under the hood — knowing it cold means you're never confused by what any framework is doing.
Rule of thumb: if you have more than one agent, non-trivial state transitions, or you need production-grade observability — reach for LangGraph. If you have a single agent with a handful of tools and a simple loop, the approach in this post is cleaner, faster to iterate on, and has fewer dependencies to break.