OpenAI’s Offerings to Help Engineers Build Agents

September 29, 2025 / Naixian Zhang

Here is a complete, structured list of the capabilities / “tools” OpenAI currently provides for agents (through ChatGPT and the API), summarized by OpenAI itself:

1. Reasoning & General Models

GPT and reasoning models (GPT-4o, GPT-4.1, o3, o1, Codex variants)
- Text, code, multimodal reasoning (text, images, audio).
- Used as the “core brain” of agents.

2. Code & Data Execution

Code Interpreter / Python (a.k.a. Advanced Data Analysis, ADA)
- Run Python code in a sandbox.
- Upload, analyze, and transform files (CSV, Excel, JSON, images, etc.).
- Generate visualizations, do math/stats, automate file processing.
Codex / GPT-5-Codex
- Code generation, editing, refactoring, debugging, running tests.
- Can propose PRs, work in IDEs, or run tasks in cloud sandboxes.
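
To make the "run Python code in a sandbox" idea concrete, here is a minimal local sketch. It is not how OpenAI's Code Interpreter is implemented; it only illustrates the general pattern of running model-written code in a separate process with a time limit and capturing its output. The helper name `run_python` and its return shape are assumptions made for this example, and a child process with a timeout is not a real security sandbox.

```python
import subprocess
import sys


def run_python(code: str, timeout_s: float = 5.0) -> dict:
    """Run a code snippet in a separate Python process and capture its output.

    Hypothetical helper for illustration only: this gives process isolation
    and a time limit, NOT a secure sandbox.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timed out"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}


# An agent would pass model-generated code here; this snippet is hard-coded.
result = run_python("print(sum([2, 4, 9]) / 3)")
print(result["ok"], result["stdout"].strip())
```

An agent loop would feed `stdout` and `stderr` back to the model so it can correct its own code on the next turn.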

3. Search & Retrieval

Web Search Tool
- Real-time web browsing for up-to-date information.
File Search (vector store memory)
- Search across uploaded docs or persistent memory.
- Custom GPTs can be hooked to private knowledge bases (via RAG).
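
The retrieval step behind file search and RAG can be sketched in a few lines. This is a toy stand-in, not OpenAI's vector store: word counts replace learned embeddings, and `ToyVectorStore`, `embed`, and `cosine` are names invented for this example. The flow is the same, though: index text chunks, rank them against the query, and paste the best match into the prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (real systems use dense vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (chunk, vector) pairs."""

    def __init__(self) -> None:
        self.items = []

    def add(self, chunk: str) -> None:
        self.items.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 1) -> list:
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


store = ToyVectorStore()
store.add("Refunds are processed within 14 days of the return request.")
store.add("Our office is closed on public holidays.")
store.add("Shipping to Europe takes 3 to 5 business days.")

question = "How long do refunds take?"
context = store.search(question, k=1)[0]
# The retrieved chunk is placed in the prompt that would be sent to the model.
print(f"Answer using this context:\n{context}\n\nQuestion: {question}")
```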

4. Voice & Multimodal Interaction

Voice Mode / Voice Agents
- Natural real-time conversation with low latency.
- Speech-to-Text (Whisper) + expressive Text-to-Speech voices.
Image Input & Analysis
- Upload images for description, OCR, analysis.
Audio Input & Output
- Transcribe audio (Whisper).
- Synthesize voices (Expressive TTS in GPT-4o).

5. Customizability & Extensions

Custom GPTs
- No-code way to create specialized agents with instructions, knowledge, and API connections.
Function Calling / Tool Use
- Structured way for GPT to call APIs, run functions, or trigger external systems.
Memory
- Persistent long-term memory across conversations (can recall user preferences, past files).
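
Function calling, listed above, is easier to picture with a small example. The sketch below runs offline and makes no API request: it shows a tool description in the JSON Schema style that function calling uses, plus the application-side dispatcher that executes whatever call the model requests. The `get_weather` function, the `REGISTRY` mapping, and `dispatch` are hypothetical names for this example, and the tool call at the end is simulated rather than produced by a model.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical tool. Returns canned data so the sketch runs offline."""
    return {"city": city, "temperature": 21, "unit": unit}


# Tool description handed to the model so it knows what it may call.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# Maps tool names to the local functions that implement them.
REGISTRY = {"get_weather": get_weather}


def dispatch(name: str, arguments_json: str) -> str:
    """Execute the requested tool and return its result as a JSON string."""
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    return json.dumps(REGISTRY[name](**args))


# Simulated model output: a tool name plus JSON-encoded arguments.
print(dispatch("get_weather", '{"city": "Paris"}'))
```

In a real agent, the string returned by `dispatch` would be sent back to the model as the tool result, and the model would then write the final answer for the user.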

6. Ecosystem & Integration

Plugins (legacy, now replaced by function calling / custom GPTs)
- Connect to third-party apps (e.g., Slack, Zapier, Expedia).
Assistants API (for developers)
- Exposes the same building blocks (threads, tool use, file handling, code execution) for embedding agents into apps.
Multi-modal foundation (GPT-4o)
- One unified model that can handle text, code, images, audio, and voice in a single session.

For an accurate, up-to-date list of tools, though, check OpenAI’s developer website!
