The OpenAI API is the fastest path from "I have a cool idea" to a working AI-powered product. Whether you want to build a customer support bot, a code reviewer, an image-analysis tool, or a fully autonomous agent — it all starts with a handful of API calls. This guide takes you from account creation to production-ready patterns, with real Python code you can run today.
Getting Your API Key & Setup
Everything begins at platform.openai.com. Sign up or log in, navigate to API Keys in the left sidebar, and click Create new secret key. Copy it immediately — you won't see it again after closing the dialog.
Next, install the official Python SDK and set up your environment:
```bash
# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install the latest OpenAI SDK (v1+)
pip install openai python-dotenv
```
Store your key as an environment variable — never paste it directly in code. Create a .env file:
.envOPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Add .env to your .gitignore right now. One accidental commit can expose your key to the world and rack up charges you didn't expect.
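A missing key otherwise only shows up as an authentication error on the first request, so it can help to fail fast at startup. The following is a minimal sketch, assuming the variable name OPENAI_API_KEY from the .env file; the helpers require_api_key and mask are hypothetical names and not part of the SDK.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the key from the environment, or raise with a readable message."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to .env or export it in your shell")
    return key


def mask(key, visible=4):
    """Shorten a key for log output so the full secret is never printed."""
    return key[:visible] + "..." if len(key) > visible else "..."


# Demo with an explicit mapping instead of the real environment
demo_env = {"OPENAI_API_KEY": "sk-proj-example"}
print(mask(require_api_key(demo_env)))
```

In a real script you would call require_api_key() with no arguments right after load_dotenv(), before constructing the client.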
2025 Model Pricing Cheat Sheet
Choosing the right model is the single biggest cost lever you have. Here's the current pricing landscape (per 1 million tokens):
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Best For | Context Window |
|---|---|---|---|---|
| gpt-4o | $2.50 | $10.00 | Complex reasoning, vision, production quality | 128k tokens |
| gpt-4o-mini | $0.15 | $0.60 | High-volume tasks, prototyping, classification | 128k tokens |
| gpt-3.5-turbo | $0.50 | $1.50 | Legacy integrations, simple Q&A | 16k tokens |
| o4-mini | $1.10 | $4.40 | Math, code, multi-step reasoning tasks | 200k tokens |
| text-embedding-3-small | $0.02 | — | Semantic search, RAG, similarity | 8k tokens |
Start with gpt-4o-mini
At $0.15/1M input tokens, gpt-4o-mini is roughly 17× cheaper than gpt-4o and handles the vast majority of tasks excellently. Switch to gpt-4o only when you hit a real quality ceiling, not before.
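To see what the price gap means for a concrete workload, a rough estimate is enough. This sketch copies the two per-million-token prices from the table; the workload numbers (10,000 requests a day, 800 input and 200 output tokens each) are invented for illustration.

```python
# Prices in USD per 1 million tokens: (input, output), copied from the table
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Return the estimated cost in USD for the given token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 10,000 requests/day, 800 input + 200 output tokens each
requests_per_day = 10_000
for model in PRICES:
    daily = estimate_cost(model, 800 * requests_per_day, 200 * requests_per_day)
    print(f"{model}: ${daily:.2f}/day")
```

Under these assumptions the same traffic costs about $40.00 a day on gpt-4o and about $2.40 a day on gpt-4o-mini.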
Your First API Call
The OpenAI SDK v1+ uses a clean, synchronous client pattern. Every chat request sends an array of messages, each with a role (system, user, or assistant) and content. The system message sets the model's persona and constraints; the user message is what you're asking.
```python
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()  # loads OPENAI_API_KEY from .env

# The client automatically reads OPENAI_API_KEY from the environment
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a concise technical assistant. Answer in plain English with no markdown.",
        },
        {
            "role": "user",
            "content": "Explain what a transformer attention mechanism does in two sentences.",
        },
    ],
    temperature=0.3,  # lower = more deterministic
    max_tokens=200,   # cap output length to control costs
)

# The model's reply lives here
print(response.choices[0].message.content)

# Check how many tokens this request used
print(f"Tokens used: prompt={response.usage.prompt_tokens}, "
      f"completion={response.usage.completion_tokens}")
```
Running this prints something like: "Attention lets every token in a sequence directly look at every other token to figure out which ones matter most for understanding the current word. It does this by computing weighted sums of value vectors, where the weights come from similarity scores between query and key vectors." Clean, factual, two sentences.
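The assistant role comes into play once a conversation has more than one turn: the chat completions endpoint keeps no state between requests, so each call has to carry the earlier turns in its messages array. Below is a minimal sketch of that bookkeeping; add_turn and trim_history are hypothetical helpers, and trimming by message count (rather than by tokens) is a simplification.

```python
def add_turn(history, role, content):
    """Return a new history list with one more message appended."""
    if role not in ("system", "user", "assistant"):
        raise ValueError(f"unknown role: {role}")
    return history + [{"role": role, "content": content}]


def trim_history(history, max_recent):
    """Keep a leading system message plus the last max_recent other messages."""
    system = [m for m in history[:1] if m["role"] == "system"]
    rest = history[len(system):]
    recent = rest[-max_recent:] if max_recent > 0 else []
    return system + recent


history = [{"role": "system", "content": "You are a concise technical assistant."}]
history = add_turn(history, "user", "What is an embedding?")
history = add_turn(history, "assistant", "A vector that represents the meaning of a text.")
history = add_turn(history, "user", "How long is one?")
history = add_turn(history, "assistant", "It depends on the model.")
history = add_turn(history, "user", "Which model should I start with?")

# This trimmed list is what you would pass as messages= on the next request
trimmed = trim_history(history, max_recent=3)
print([m["role"] for m in trimmed])
```

Keeping the system message while dropping the oldest turns bounds the prompt size, and therefore the cost, of long conversations.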
Never Hardcode Your API Key
Writing api_key="sk-proj-..." directly in Python is a critical security mistake. Bots scan GitHub around the clock for leaked keys, so a key committed to a public repository can be found and abused quickly. Even in a private repo, the key is readable by everyone with access and stays in the git history. Always use environment variables or a secrets manager like AWS Secrets Manager or HashiCorp Vault.
Streaming Responses
By default, the API waits until the full response is generated before returning anything. For a 500-token response at modest speed, that could be 10+ seconds of staring at a blank screen. Streaming fixes this: the API sends tokens to your client as they're generated, just like you see in ChatGPT.
The SDK's stream() helper wraps the request in a context manager and yields typed events; generated text arrives in content.delta events:

```python
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

with client.chat.completions.stream(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about building software at 2am."},
    ],
    max_tokens=100,
) as stream:
    for event in stream:
        # Text arrives as "content.delta" events; other event types are skipped
        if event.type == "content.delta":
            print(event.delta, end="", flush=True)
print()  # newline after stream ends
```

The stream() helper is only available in newer v1 releases of the SDK, and some releases expose it as client.beta.chat.completions.stream instead, so check your installed version. The raw approach, passing stream=True and iterating over the returned object, works on every v1 release:
```python
# Low-level streaming (compatible with all SDK v1+ versions)
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "List 5 Python tips for beginners."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()
```
Each chunk is a partial response object. chunk.choices[0].delta.content holds the new token(s) for this chunk — it can be None on the final chunk, so always guard against that. In a web app, you'd pipe these chunks straight into a Server-Sent Events (SSE) response for a ChatGPT-like effect in the browser.
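To make the SSE step concrete, here is a framework-independent sketch of the framing: each delta becomes a "data:" line followed by a blank line. The sse_frames helper, the JSON payload shape, and the closing "[DONE]" marker are assumptions chosen for this example; the list of fake deltas stands in for the chunk.choices[0].delta.content values of a real stream.

```python
import json


def sse_frames(deltas):
    """Wrap text deltas as Server-Sent Events frames.

    Each frame is a "data: <payload>" line followed by a blank line. The delta
    is JSON-encoded so that newlines inside it cannot break the framing.
    """
    for delta in deltas:
        if delta:  # skip None and empty strings
            payload = json.dumps({"text": delta})
            yield f"data: {payload}\n\n"
    # End-of-stream marker; "[DONE]" is a convention assumed for this sketch
    yield "data: [DONE]\n\n"


# Stand-in for the content deltas of a real stream
fake_deltas = ["Hel", "lo", None, "\nworld"]
for frame in sse_frames(fake_deltas):
    print(repr(frame))
```

In a web framework you would return this generator as the response body with the Content-Type header set to text/event-stream.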
Vision: Analyzing Images with GPT-4o
GPT-4o is natively multimodal — it can look at images and answer questions about them. You pass images either as a public URL or as a base64-encoded string, both inline in the messages array.
Sending an Image URL
```python
response = client.chat.completions.create(
    model="gpt-4o",  # gpt-4o-mini also supports vision
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
                        "detail": "low",  # "low" or "high"; low saves tokens
                    },
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this image in one sentence.",
                },
            ],
        }
    ],
    max_tokens=150,
)
print(response.choices[0].message.content)
```
Sending a Local Image as Base64
```python
import base64

def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_data = encode_image("screenshot.png")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_data}"},
                },
                {
                    "type": "text",
                    "text": "Identify any errors or warnings visible in this screenshot.",
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```
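The data URL in the example hardcodes image/png, which is wrong for a JPEG or WEBP file. One possible refinement is to infer the MIME type from the file extension. This is a sketch using only the standard library; to_data_url is a hypothetical helper, and guessing by extension is an assumption that fails for misnamed files.

```python
import base64
import mimetypes
import os
import tempfile


def to_data_url(image_path):
    """Build a data URL whose MIME type matches the file extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image type from {image_path!r}")
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


# Demo on a throwaway file holding only the JPEG magic bytes (not a real image)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "photo.jpg")
    with open(path, "wb") as f:
        f.write(b"\xff\xd8\xff")
    print(to_data_url(path))
```

The returned string can be used directly as the "url" value of an image_url content part.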
Supported Image Formats & Limits
GPT-4o Vision accepts PNG, JPEG, WEBP, and GIF (non-animated). Maximum image size is 20 MB per image. You can include up to 10 images in a single request. The detail: "low" setting costs a flat 85 tokens per image — great for simple descriptions. Use detail: "high" when you need fine-grained analysis of charts, screenshots, or diagrams, which tiles the image and costs proportionally more.
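To get a feel for how much "proportionally more" is, the tiling can be approximated in a few lines. The specific figures here are assumptions based on the rules OpenAI has published for gpt-4o: downscale to fit 2048 by 2048, downscale again so the shortest side is at most 768, then charge 170 tokens per 512-pixel tile plus an 85-token base. Other models use different numbers, so treat this as an estimate only.

```python
import math


def high_detail_tokens(width, height):
    """Estimate image tokens for detail="high" (assumed gpt-4o rules)."""
    # Step 1: shrink (never enlarge) to fit inside a 2048 x 2048 square
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink (never enlarge) so the shortest side is at most 768
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512-pixel tiles needed to cover the image
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


LOW_DETAIL_TOKENS = 85  # flat cost with detail="low"

for size in [(1024, 1024), (2048, 4096)]:
    print(size, high_detail_tokens(*size), "vs", LOW_DETAIL_TOKENS)
```

Under these assumptions a 1024 by 1024 image costs 765 tokens in high detail, nine times the flat low-detail cost.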
Function Calling (Tool Use)
Function calling is the mechanism behind AI agents. Instead of just returning text, the model can decide to call a function you've defined — returning structured JSON arguments you can pass to real code. This is how you connect an LLM to databases, APIs, calculators, or any external service.
The flow works like this: you describe available tools in JSON Schema → the model decides which tool to call and with what arguments → you execute the function → you feed the result back to the model → the model produces a final response.
Step 1 — Define the Tool Schema
```python
import json
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

# Define the tools the model can call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given city. Call this whenever the user asks about weather.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city name, e.g. 'London' or 'Tokyo'",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit",
                    },
                },
                "required": ["city"],
            },
        },
    }
]
```
Step 2 — Let the Model Decide to Call a Tool
```python
messages = [
    {"role": "user", "content": "What's the weather like in Karachi right now?"}
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools,
    tool_choice="auto",  # let the model decide when to use tools
)

assistant_msg = response.choices[0].message

# Check if the model wants to call a function
if assistant_msg.tool_calls:
    tool_call = assistant_msg.tool_calls[0]
    func_name = tool_call.function.name
    func_args = json.loads(tool_call.function.arguments)
    print(f"Model wants to call: {func_name}({func_args})")
    # → Model wants to call: get_weather({'city': 'Karachi', 'unit': 'celsius'})
```
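The arguments string is generated by the model, so it is worth checking before it reaches real code: it may not be valid JSON, and it may omit or invent parameters. The following sketch shows one way to do that check, mirroring the get_weather schema from Step 1. parse_tool_args is a hypothetical helper; a production system might use a JSON Schema validator instead.

```python
import json


def parse_tool_args(raw, required, allowed):
    """Parse model-generated arguments and reject anything off-schema."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    missing = [k for k in required if k not in args]
    unknown = [k for k in args if k not in allowed]
    if missing or unknown:
        raise ValueError(f"missing={missing} unknown={unknown}")
    return args


# A string shaped like tool_call.function.arguments for the get_weather tool
raw_arguments = '{"city": "Karachi"}'
print(parse_tool_args(raw_arguments, required=["city"], allowed=["city", "unit"]))
```

When validation fails, a reasonable option is to send the error text back as the tool result so the model can correct its call.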
Step 3 — Execute the Function & Return the Result
```python
# Your real implementation would call a weather API here
def get_weather(city: str, unit: str = "celsius") -> dict:
    # Stub: replace with requests.get("https://api.openweathermap.org/...")
    return {"city": city, "temperature": 34, "unit": unit, "condition": "Sunny and humid"}

# Call our function with the model-provided arguments
result = get_weather(**func_args)

# Feed the function result back to the model for the final response
messages.append(assistant_msg)  # append the tool_call assistant message
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": json.dumps(result),
})

final_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)
print(final_response.choices[0].message.content)
# → "Right now in Karachi it's 34°C and sunny with high humidity. Quite hot!"
```
This Is How AI Agents Work
Agents are just this loop — repeated. The model calls a tool, gets a result, decides if it needs more information, calls another tool, and so on until it has enough to answer. Frameworks like LangChain, LlamaIndex, and OpenAI's own Assistants API automate this loop for you, but the underlying primitive is always the same function-calling mechanism shown above.
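The loop itself can be sketched without any network access by swapping the API call for a scripted stand-in. Everything here is an illustration under stated assumptions: fake_model replaces the chat completions call, the message dictionaries are simplified compared with the SDK's response objects, and run_agent and TOOLS are hypothetical names.

```python
import json


def get_weather(city, unit="celsius"):
    # Stub result with the same shape as the stub in Step 3
    return {"city": city, "temperature": 34, "unit": unit, "condition": "Sunny and humid"}


TOOLS = {"get_weather": get_weather}  # tool name -> Python callable


def fake_model(messages):
    """Scripted stand-in for the chat completions call (no network)."""
    if messages[-1]["role"] == "tool":
        data = json.loads(messages[-1]["content"])
        text = f"It is {data['temperature']} degrees in {data['city']}."
        return {"role": "assistant", "content": text}
    call = {"id": "call_1", "name": "get_weather", "arguments": '{"city": "Karachi"}'}
    return {"role": "assistant", "content": None, "tool_calls": [call]}


def run_agent(messages, model=fake_model, max_steps=5):
    """Call the model until it answers without requesting a tool."""
    for _ in range(max_steps):
        reply = model(messages)
        messages.append(reply)
        if not reply.get("tool_calls"):
            return reply["content"]  # no tool requested: this is the final answer
        for call in reply["tool_calls"]:
            result = TOOLS[call["name"]](**json.loads(call["arguments"]))
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
            )
    raise RuntimeError("agent did not finish within max_steps")


conversation = [{"role": "user", "content": "What's the weather in Karachi?"}]
print(run_agent(conversation))
```

The max_steps cap is a deliberate design choice: without it, a model that keeps requesting tools would loop, and bill, indefinitely.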
Generating Embeddings
Embeddings convert text into dense numerical vectors that capture semantic meaning. Similar texts end up close together in vector space — which makes them the foundation of semantic search, RAG pipelines, recommendation systems, and anomaly detection. OpenAI's text-embedding-3-small model gives you 1536-dimensional vectors at an extremely low cost ($0.02/1M tokens).
```python
import numpy as np
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
        encoding_format="float",
    )
    return response.data[0].embedding

def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed three sentences and find the most similar pair
sentences = [
    "How do I reset my password?",
    "I forgot my login credentials and can't access my account.",
    "What are the opening hours of your store?",
]
embeddings = [get_embedding(s) for s in sentences]

sim_01 = cosine_similarity(embeddings[0], embeddings[1])
sim_02 = cosine_similarity(embeddings[0], embeddings[2])

print(f"'Reset password' ↔ 'Forgot credentials': {sim_01:.3f}")  # → ~0.89
print(f"'Reset password' ↔ 'Store opening hours': {sim_02:.3f}")  # → ~0.31
```
The high similarity (~0.89) between the first two sentences means they'd be retrieved together in a semantic search, even though they share almost no keywords. This is the core insight behind RAG: instead of keyword matching, you match meaning. Absolute scores vary from one embedding model to another, so rank results by relative similarity rather than hard-coding a threshold. Store your embeddings in a vector database (Pinecone, Qdrant, pgvector) for production-scale retrieval.
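Retrieval is then a ranking problem: score every stored vector against the query vector and keep the best few. The sketch below does this as a brute-force scan in pure Python. The three-dimensional vectors are made up for illustration (real embeddings from text-embedding-3-small have 1536 dimensions), and top_k is a hypothetical helper that a vector database would replace at scale.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus, k=2):
    """corpus is a list of (text, vector) pairs; returns the k best (score, text)."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in corpus]
    return sorted(scored, reverse=True)[:k]


# Made-up 3-dimensional vectors standing in for real embeddings
corpus = [
    ("I forgot my login credentials.", [0.9, 0.1, 0.0]),
    ("What are your opening hours?", [0.0, 0.2, 0.9]),
    ("How do I change my email address?", [0.6, 0.6, 0.1]),
]
query = [1.0, 0.0, 0.0]  # stands in for the embedding of "How do I reset my password?"

for score, text in top_k(query, corpus):
    print(f"{score:.3f}  {text}")
```

A linear scan like this is fine for a few thousand documents; beyond that, the approximate indexes in a vector database pay off.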
Cost & Best Practices
The difference between a $50/month AI app and a $5,000/month one often comes down to a few engineering decisions made early. These are the levers that matter most:
| Strategy | Typical Savings | Implementation Effort | Notes |
|---|---|---|---|
| Use gpt-4o-mini instead of gpt-4o | Up to 94% | Low: change model string | Quality gap is smaller than you think for most tasks |
| Set a max_tokens limit | 10–40% | Low: one parameter | Prevents runaway outputs; tune per use case |
| Cache repeated prompts | 30–80% on cache-hit requests | Medium: add Redis/disk cache | OpenAI also auto-discounts reused prompt prefixes of 1,024 tokens or more |
| Trim system prompts | 5–20% | Low: review prompt length | Every token in every request adds up at scale |
| Batch embedding requests | Latency savings | Low: pass a list | Pass up to 2048 texts in one API call instead of N calls |
| Use temperature=0 for deterministic tasks | Indirect (fewer retries) | Low: one parameter | More consistent outputs in classification and extraction tasks |
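The application-side cache from the table can be as simple as a dictionary keyed by a hash of the request. The sketch below keeps it in memory and uses a canned fake_call in place of the real API so it runs offline. PromptCache is a hypothetical class, and a production version would more likely sit on Redis or disk and expire entries.

```python
import hashlib
import json


class PromptCache:
    """In-memory cache keyed by a hash of the model name and messages."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model, messages):
        blob = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call):
        """Return the cached answer, or invoke call(model, messages) once."""
        k = self.key(model, messages)
        if k in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[k] = call(model, messages)
        return self._store[k]


def fake_call(model, messages):
    # Stand-in for client.chat.completions.create; returns a canned string
    return f"answer to: {messages[-1]['content']}"


cache = PromptCache()
question = [{"role": "user", "content": "What is your refund policy?"}]
cache.get_or_call("gpt-4o-mini", question, fake_call)
cache.get_or_call("gpt-4o-mini", question, fake_call)
print(cache.hits, cache.misses)
```

This kind of cache only makes sense where an identical answer to an identical prompt is acceptable, for example FAQ-style questions answered at temperature=0. It is separate from the prompt discount OpenAI applies on its side.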
Essential Best Practices
- Set max_tokens appropriate to your use case: 100 for summaries, 500 for explanations, 2000 for code generation. This also limits the cost of prompt injection attacks that try to elicit huge outputs.
- Route simple tasks to gpt-4o-mini and complex reasoning or vision tasks to gpt-4o or o4-mini. Many companies achieve GPT-4-quality results at 80% lower cost this way.
- Retry with exponential backoff on RateLimitError and APITimeoutError. The tenacity library makes this trivial to implement and is worth adding on day one.
Production-Ready Error Handling
```python
import time
from openai import OpenAI, RateLimitError, APITimeoutError, APIConnectionError

client = OpenAI()

def chat_with_retry(messages: list, max_retries: int = 5) -> str:
    """Call the API with exponential backoff on transient errors."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                max_tokens=500,
                timeout=30.0,
            )
            return response.choices[0].message.content
        except RateLimitError:
            wait = 2 ** attempt  # 1, 2, 4, 8, 16 seconds
            print(f"Rate limited. Retrying in {wait}s… (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
        except (APITimeoutError, APIConnectionError) as e:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt
            print(f"Connection error: {e}. Retrying in {wait}s…")
            time.sleep(wait)
    raise RuntimeError("Max retries exceeded")

# Usage
answer = chat_with_retry([{"role": "user", "content": "What is 2+2?"}])
print(answer)
```
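The same backoff idea can be factored into a reusable wrapper. The variant below adds random jitter, which the function above does not use, so that many clients hitting a rate limit at once do not all retry at the same instant. It is a sketch under stated assumptions: retry_with_backoff is a hypothetical helper, and the demo uses a simulated failure and a recorded (not performed) sleep so it runs instantly and offline.

```python
import random
import time


def retry_with_backoff(func, retry_on, max_retries=5, base=1.0, cap=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call func(); on a retry_on exception wait and try again.

    The wait is base * 2**attempt (capped at cap), scaled by a random
    factor in [0.5, 1.0) so that concurrent clients spread out.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the original error
            delay = min(cap, base * 2 ** attempt) * (0.5 + rng() / 2)
            sleep(delay)


# Demo: a function that fails twice and then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"


waits = []  # delays are recorded here instead of actually sleeping
print(retry_with_backoff(flaky, TimeoutError, sleep=waits.append))
print(f"retried {len(waits)} times")
```

With the SDK you would pass a function that performs the API call as func and the tuple (RateLimitError, APITimeoutError, APIConnectionError) as retry_on.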
Where to Go Next
Once you're comfortable with the basics, explore the Assistants API for built-in thread management and file retrieval, Structured Outputs (response_format={"type": "json_schema"}) to get guaranteed-valid JSON, and the Realtime API for low-latency voice applications. Each of these unlocks an entirely new class of products.