AI Engineering

Inside ChatGPT's Intelligence Architecture: How One Chat Interface Routes Between Canvas, Deep Research, Code, Thinking, Web Search, and Image Generation

January 21, 2026
28 min read
ChatGPT · GPT-5 · Gemini · AI Architecture · Tool Orchestration · Mixture of Experts · Image Generation · Reasoning Models

When you type a message into ChatGPT, something remarkable happens behind the scenes. A single chat interface seamlessly routes your request to the right capability—whether that's opening Canvas for collaborative writing, launching Deep Research for multi-hour investigations, spinning up a Python interpreter for data analysis, engaging extended thinking for complex reasoning, searching the web for current information, or generating and editing images. All without you explicitly selecting a mode.

How does ChatGPT know which tool to use? And how does it decide whether to generate a new image or edit an existing one? The answers reveal one of the most sophisticated AI orchestration systems ever deployed to consumers.

The Paradigm Shift: From Plugins to Unified Intelligence

The Old World: Explicit Tool Selection

Early ChatGPT required users to manually enable and select plugins. Want web browsing? Toggle it on. Need code execution? Enable Code Interpreter. Want image generation? Call DALL-E explicitly. This created friction and required users to understand which tool suited their task.

The New World: Invisible Orchestration

Modern ChatGPT operates as what researchers call a "Mixture of Models" (MoM) architecture—if the breakthrough from GPT-3 to GPT-4 was Mixture of Experts, then the breakthrough from GPT-4o to GPT-5 is this intelligent routing system. The model itself decides which capabilities to engage, making tool selection invisible to users.

GPT-5 launched on August 7, 2025, marking the first "unified" system that combines reasoning abilities with fast responses under a single interface. Rather than requiring users to manually switch between models, GPT-5 automatically routes queries to the optimal backend.

The GPT-5 Router: The Traffic Controller

At the heart of ChatGPT's orchestration is the GPT-5 Router—a sophisticated classification system that evaluates every input and determines the optimal processing path.

The Four Decision Factors

The router evaluates four primary signals when deciding how to handle your request:

1. Conversation Type

Is this casual chit-chat, a code review, a math proof, a story draft, or an image request? GPT-5 has learned which model handles each best:

  • Quick back-and-forth about weekend plans → Fast mode
  • Step-by-step derivation of a theorem → Thinking mode
  • "Write a blog post about coffee" → Canvas trigger
  • "Create a cyberpunk cityscape" → Image generation

2. Task Complexity

If your prompt looks tricky, GPT-5 doesn't hesitate to bring in its heavyweight reasoning model. The router spots subtle signals of difficulty in your words and allocates the appropriate compute resources.

3. Tool Needs

Mention a task like "calculate," "look up," "analyze this data," or "draw me," and the router knows to bring in a tool-equipped model. Unlike earlier systems where plugins had to be explicitly enabled, GPT-5 handles this invisibly.

4. Explicit Intent

Sometimes the router simply listens to you. If you write "think hard about this," it'll spin up the deep reasoning model. Subtle phrasing tweaks like "quickly summarize" versus "deeply analyze" cause GPT-5 to adjust modes on the fly—a new "soft instruction" layer where your wording nudges the router.

Continuous Learning

The router isn't static—it continuously improves using live signals:

  • Thumbs up/down on responses
  • User edits and retry patterns
  • Whether follow-up prompts lead to rerouting
  • Measured correctness of answers
  • How people manually choose models and their preferences

This creates a feedback loop where the router learns from millions of interactions daily.
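The four factors above can be caricatured in a few lines of code. This is purely an illustrative sketch: the labels, keywords, thresholds, and priority order are invented here, and OpenAI has not published how the actual router works.

```python
# Toy sketch of the four-signal routing decision. All keywords, thresholds,
# and the priority order are invented; the real router is not public.
def route(prompt: str, complexity: float) -> str:
    """Map a prompt plus an externally estimated complexity score to a backend."""
    text = prompt.lower()
    # 4. Explicit intent: the user's wording overrides other signals.
    if "think hard" in text or "deeply analyze" in text:
        return "gpt-5-thinking"
    if "quickly" in text:
        return "gpt-5-main"
    # 3. Tool needs: certain verbs imply a tool-equipped model.
    if any(v in text for v in ("calculate", "look up", "analyze this data", "draw me")):
        return "tool-equipped"
    # 2. Task complexity: hard prompts get the reasoning model.
    if complexity > 0.7:
        return "gpt-5-thinking"
    # 1. Conversation type: everything else takes the fast path.
    return "gpt-5-main"
```

Even in this caricature, the "soft instruction" layer is visible: `route("think hard about this edge case", 0.1)` returns the thinking backend despite a low complexity estimate, because explicit intent is checked first.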

The Mixture-of-Experts Foundation

Architectural Overview

GPT-5 is widely reported to use a Mixture-of-Experts (MoE) architecture with an estimated 52.5 trillion total parameters (roughly 30x GPT-4's estimated 1.76 trillion), though OpenAI has not confirmed either figure. But not all those parameters fire for every query.

The MoE setup means GPT-5 is composed of multiple specialized sub-models. A routing network scores which experts are most relevant given the current token and its context, then activates only a small subset (typically 2–8 experts out of dozens).

Selective Activation in Practice

Ask GPT-5 to debug code, and the "code expert" engages. Request image interpretation, and the visual reasoning expert takes charge. Ask it to generate an image, and the multimodal generation expert activates. This approach delivers incredible efficiency—instead of activating the entire model, GPT-5 uses just what's needed for each specific task.

This is sparse activation: most compute is off for most tokens. It's why streaming feels snappier—tokens get processed by focused sub-networks, aggregated, and sent forward.
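The gating step behind sparse activation can be sketched with standard top-k routing. The expert count, logits, and k below are illustrative; GPT-5's real gating network is not public.

```python
import math

# Minimal top-k gating sketch of sparse expert routing.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]

# 8 experts, but only 2 fire for this token: that is the sparse activation.
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]
active = top_k_route(logits, k=2)
```

Here experts 4 and 1 carry all of the routed weight, and the other six contribute no compute for this token, which is where the efficiency comes from.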

The Model Family

The GPT-5 system card reveals the underlying model variants:

  • gpt-5-main: Fast, high-throughput responses
  • gpt-5-main-mini: Lightweight fast responses
  • gpt-5-thinking: Deep reasoning capabilities
  • gpt-5-thinking-mini: Efficient reasoning

The router dynamically selects among these based on the task at hand.

Canvas: Trained Trigger Detection

The Challenge

Canvas presents a unique UI challenge: when should the model open a collaborative editing interface versus just responding in chat? OpenAI had to train the model to distinguish between:

  • ✅ "Write a blog post about the history of coffee beans" → Open Canvas
  • ❌ "Help me cook a new recipe for dinner" → Stay in chat

Training Approach

OpenAI used novel synthetic data generation techniques, including distilling outputs from the o1 reasoning model. This enabled rapid improvement without relying on human-generated training data.

Key training challenges:

  1. Trigger accuracy: Defining when to open Canvas while avoiding over-triggering
  2. Edit vs. rewrite: Deciding when to make targeted edits versus rewriting entire content
  3. Comment quality: Generating high-quality inline comments

For writing tasks, OpenAI prioritized "correct triggers" and achieved 83% accuracy compared to baseline zero-shot GPT-4o. For coding, they intentionally biased against triggering to avoid disrupting power users.

The Resulting Behavior

The trained model knows:

  • When to open Canvas: Content greater than ~10 lines, or scenarios where an editing interface helps
  • When to make targeted edits: When users explicitly select text through the interface
  • When to fully rewrite: When modification scope is broad or unclear

The integrated Canvas model outperforms zero-shot GPT-4o with prompted instructions by 30% in accuracy and 16% in quality.

Web Search: Automatic Information Retrieval

Trigger Detection

ChatGPT automatically searches the web when it detects your question might benefit from current information. The system is powered by a fine-tuned GPT-4o model that determines when to fetch live information.

Queries that typically trigger search:

  • Recent information (sports scores, stock prices, news)
  • Current events and trending topics
  • Explicit requests ("search," "check online," "latest news")

Queries that typically don't trigger search:

  • Educational "how-to" questions ("How to boil an egg?")
  • Conceptual questions ("What is a dinosaur?")
  • Tasks using only the model's built-in knowledge
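The trigger patterns above can be mimicked with a crude freshness heuristic. This keyword list is invented for illustration; the production classifier is a fine-tuned GPT-4o model, not a word list.

```python
# Invented heuristic mirroring the search-trigger patterns; illustrative only.
FRESH_CUES = ("today", "latest", "news", "stock", "score", "search", "check online")
EVERGREEN_PREFIXES = ("how to", "what is")

def needs_search(query: str) -> bool:
    text = query.lower().strip()
    if text.startswith(EVERGREEN_PREFIXES):
        return False  # educational / conceptual: answer from built-in knowledge
    return any(cue in text for cue in FRESH_CUES)
```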

Scale of Web Search

Estimates suggest 20-35% of all ChatGPT prompts lead to live internet searches—roughly 500 million to 875 million queries per day.

Code Interpreter: Invisible Python Execution

How It Works

ChatGPT's Code Interpreter provides access to a Python interpreter in a sandboxed environment. The model can write code, execute it, and return answers—all invisibly to users.

Automatic Triggering

Code execution happens automatically when the model determines it would help:

  • Data analysis requests
  • Mathematical calculations
  • File processing (CSVs, images, PDFs)
  • Visualization requests

When Python is invoked, you'll see an "Analyzing" status indicator. The model can also self-correct: if code fails, it reads the error traceback and automatically enters a debugging loop, retrying up to 3 times to get the right output.

Session Persistence

For a given ChatGPT thread, all Python execution occurs in a single session with global state preserved. Subsequent messages don't redefine variables—they assume they're already there. This enables iterative data exploration without repetitive setup.
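This per-thread persistence can be simulated with a single shared namespace, where `exec()` stands in for the remote sandboxed interpreter. The variable names are invented.

```python
# One shared namespace per "thread": later messages reuse earlier state.
session: dict = {}

def run_in_session(code: str) -> None:
    exec(code, session)  # stands in for the remote interpreter

# Message 1: load data (stands in for parsing an uploaded CSV).
run_in_session("sales = [120, 95, 143, 180]")
# Message 2: a later prompt assumes `sales` is already defined.
run_in_session("total = sum(sales)")
print(session["total"])  # 538
```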

Thinking Modes: Reasoning on Demand

The Reasoning Architecture

OpenAI's reasoning models (o1, o3, o4-mini) represent a fundamental shift: they spend additional time "thinking" (generating chains of thought) before answering.

These models maintain an internal dialogue—a hidden "thinking block" where they work through potential solutions step by step before presenting the final answer.

How Thinking Decisions Work

Automatic (GPT-5.2 Auto):

When you select Auto mode, the system decides whether to use instant or thinking responses. The decision uses:

  • Signals from your prompt and conversation
  • Learned patterns from how people manually choose models
  • Historical accuracy data for similar queries

Manual Controls:

GPT-5.2 offers explicit thinking duration controls:

  • Light: Quick, low-risk tasks
  • Standard: Balanced everyday use
  • Extended: Complex multi-step reasoning
  • Heavy: Exhaustive analysis, high-stakes decisions

Scaling Inference-Time Compute

OpenAI's research shows a correlation between accuracy and the logarithm of compute spent thinking. At equal latency and cost with o1, o3 delivers higher performance—and if allowed to think longer, performance keeps climbing.

This represents a new paradigm: improving model outputs by spending more computing power during answer generation, not just during training.
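The reported log-linear relationship can be expressed as a toy formula, accuracy = a + b·log2(compute), capped at 1.0. The coefficients a and b below are invented, not OpenAI's numbers.

```python
import math

# Toy model of log-linear inference-time scaling; a and b are invented.
def accuracy(compute_units: float, a: float = 0.40, b: float = 0.06) -> float:
    return min(1.0, a + b * math.log2(compute_units))

scaling = {c: round(accuracy(c), 3) for c in (1, 16, 256, 4096)}
```

Under this model each doubling of thinking compute buys a constant accuracy increment (b), so performance keeps climbing with longer thinking, but the gain per extra unit of compute shrinks.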

Deep Research: Multi-Agent Planning

Architecture Overview

Deep Research represents ChatGPT's most sophisticated agentic capability—a multi-step research system powered by a specialized o3 variant optimized for web browsing and data analysis.

The Multi-Agent Pipeline

Deep Research orchestrates several specialized components:

1. Clarifying Intent and Scoping

  • GPT-4o and GPT-4.1 models clarify the question
  • Gather additional context if needed
  • Precisely scope the research task

2. Web Grounding

  • Securely invoke search tools
  • Gather curated, high-quality web data
  • Ensure authoritative, up-to-date sources

3. Research Execution

  • o3-deep-research handles the actual investigation
  • Reasons step-by-step, pivots as new insights emerge
  • Synthesizes information across hundreds of sources

Capabilities

Deep Research can:

  • Browse user-uploaded files
  • Plot and iterate on graphs using Python
  • Embed generated graphs and images from websites
  • Cite specific sentences or passages from sources
  • Backtrack and react to real-time information
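The three-stage pipeline above (scoping, grounding, synthesis) can be sketched as plain function composition. Every stage body here is a stand-in, not a real model or search call.

```python
# Faked three-stage Deep Research pipeline; stage internals are stand-ins.
def clarify(question: str) -> dict:
    """Stage 1: scope the task (GPT-4o/4.1 in the real system)."""
    return {"question": question, "scope": "current state + top companies"}

def ground(task: dict, n_sources: int = 3) -> list:
    """Stage 2: gather web sources for the scoped task."""
    return [f"source_{i}: finding about {task['question']}" for i in range(n_sources)]

def synthesize(task: dict, sources: list) -> str:
    """Stage 3: o3-deep-research synthesis over the gathered sources."""
    return f"Report on {task['question']} citing {len(sources)} sources"

task = clarify("quantum computing landscape")
report = synthesize(task, ground(task))
```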

ChatGPT Agent: The Ultimate Orchestrator

Tool Selection Architecture

The ChatGPT Agent (introduced July 2025) represents the pinnacle of tool orchestration, equipped with:

  • Visual browser: Interacts with the web through GUI (clicks, typing, scrolling)
  • Text-based browser: Fast reasoning over large text corpora
  • Terminal: Code execution and file manipulation
  • API access: Direct integration with public and private APIs

How It Chooses Between Tools

The agent intelligently selects the optimal path for each task:

  • Calendar information → API access (structured data, fast retrieval)
  • Reading long articles → Text browser (efficient text reasoning)
  • Filling web forms → Visual browser (GUI interaction required)
  • Data transformation → Terminal (code-level manipulation)
  • Modern UI navigation → Visual browser (human-designed interfaces)

The Power of Combination

The agent can chain tools within a single task:

  1. Open a page using text browser to find a download link
  2. Download a file from the web
  3. Manipulate it by running a terminal command
  4. View the output back in the visual browser

All this happens on a virtual computer that preserves context across tool switches.
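The four-step chain above can be faked as function composition, with each tool reduced to a stub. Every body below is a stand-in for the real browser or terminal action.

```python
# Faked versions of the four agent tools, chained in the order listed above.
def text_browser(url: str) -> str:
    return url + "/file.csv"            # 1. find the download link

def download(link: str) -> str:
    return "region,sales\nwest,120"     # 2. fetch the file

def terminal(data: str) -> str:
    return data.upper()                 # 3. transform it with a command

def visual_browser(content: str) -> str:
    return "rendered: " + content       # 4. view the result

result = visual_browser(terminal(download(text_browser("https://example.com"))))
```

In the real agent, each arrow crosses a tool boundary but stays on the same virtual computer, so intermediate files and state survive the switches.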

Image Generation vs Editing: The Multimodal Decision

One of the most sophisticated routing decisions happens when users work with images. How does ChatGPT know whether to generate a new image or edit an existing one?

GPT-4o: The Transfusion Architecture

OpenAI's GPT-4o image generation is built on what researchers call a Transfusion architecture—a hybrid approach that combines autoregressive transformers (like GPT) with diffusion models (like Stable Diffusion). The OpenAI team famously hinted at this with a whiteboard diagram: "tokens → [transformer] → [diffusion] → pixels."

The Transfusion paper from Meta, Waymo, and USC (August 2024) laid the theoretical groundwork. Unlike previous approaches that forced images into discrete tokens (losing quality), Transfusion keeps images in continuous space while still processing them alongside text in a unified transformer.

The BOI/EOI Token System

The key mechanism that enables generation vs editing decisions is the BOI/EOI token system (Begin-of-Image / End-of-Image):

How it works:

  1. Text and image data are concatenated into a single sequence during both training and inference
  2. Special marker tokens delineate modality boundaries:
    • <BOI> signals that subsequent elements are image content
    • <EOI> signals that image content has ended
  3. Everything outside BOI...EOI is treated as normal text
  4. Everything inside is treated as continuous image representation

The decision mechanism:

When GPT-4o generates tokens autoregressively, it makes a critical decision at each step: output a text token, or output a <BOI> token to begin image generation. This decision is made by the same transformer that processes your text prompt—meaning the model's understanding of your intent directly determines whether it generates an image.
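The decode loop can be sketched as follows. The hard-coded choice function stands in for the transformer's next-token distribution, and the patch handling mimics the continuous BOI..EOI block; nothing here is the real implementation.

```python
# Toy decode loop: each step emits either a text token or <BOI>.
def next_token(wants_image: bool, step: int) -> str:
    # Stand-in for the model's next-token decision.
    return "<BOI>" if (wants_image and step == 2) else f"text_{step}"

def decode(wants_image: bool, image_patches: int = 4, max_len: int = 8) -> list:
    seq, step = [], 0
    while len(seq) < max_len:
        tok = next_token(wants_image, step)
        seq.append(tok)
        if tok == "<BOI>":
            # Inside BOI..EOI: continuous latent patches, not discrete tokens.
            seq += [f"patch_{i}" for i in range(image_patches)]
            seq.append("<EOI>")
            break
        step += 1
    return seq
```

`decode(True)` yields `['text_0', 'text_1', '<BOI>', 'patch_0', ..., '<EOI>']`, while `decode(False)` stays pure text: the single choice at each step is what makes image generation an ordinary outcome of decoding.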

From Tokens to Pixels: The Diffusion Handoff

Once GPT-4o decides to generate an image (by producing <BOI>), a sophisticated process unfolds:

  1. Noise Initialization: The model appends a block of latent image tokens initialized with pure random noise
  2. Iterative Denoising: The transformer repeatedly processes the sequence, progressively denoising the image patches
  3. Bidirectional Attention: Within the BOI–EOI block, attention is bidirectional (unlike the causal attention for text), allowing the model to treat the image as a coherent 2D entity
  4. VAE Decoding: Once denoising completes, a Variational Autoencoder decodes the latent patches into actual pixels
  5. EOI Emission: The model emits <EOI> to mark completion

In empirical testing, binary classifiers trained to distinguish autoregressive from diffusion-generated images consistently classified GPT-4o's outputs as diffusion-based, providing evidence that GPT-4o uses a diffusion head for final image decoding.

How Context Enables Editing Decisions

The generation vs editing decision fundamentally relies on context. GPT-4o's 128,000-token context window maintains:

  • Image IDs: Every generated or uploaded image gets an internal identifier retained in the conversation
  • Compositional Memory: The model remembers structural elements, lighting, colors, and spatial relationships
  • Conversation History: Previous instructions and iterations inform how to interpret new requests

The decision logic:

  1. If no image exists in context and the prompt describes creating something → Generate new image
  2. If an image exists in context and the prompt implies modification → Edit existing image
  3. If ambiguous, the model uses semantic understanding to infer intent

This is why prompt placement matters—if you want to edit, placing your instruction directly after the image reference helps the model correctly interpret your intent as modification rather than fresh generation.
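The three-rule decision logic can be caricatured with keyword cues. The cue lists are invented; the real model infers intent semantically rather than from word lists.

```python
# Keyword sketch of the generate-vs-edit decision; cue lists are invented.
EDIT_CUES = ("change", "modify", "edit", "adjust", "make the", "instead of")
CREATE_CUES = ("create", "generate", "draw", "make a new")

def decide(prompt: str, image_in_context: bool) -> str:
    text = prompt.lower()
    if not image_in_context:
        return "generate"                        # rule 1: nothing to edit
    if any(cue in text for cue in EDIT_CUES):
        return "edit"                            # rule 2: modification implied
    if any(cue in text for cue in CREATE_CUES):
        return "generate"
    return "edit"  # rule 3 stand-in: ambiguous with an image present
```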

Automatic Prompt Rewriting

GPT-4o employs automatic prompt revision to improve generation quality. When you submit a prompt, the mainline model may rewrite it for better performance. You can access this revised prompt in the revised_prompt field of API responses.

This rewriting can sometimes cause issues in multi-turn editing—if the model rewrites your edit instruction into something closer to a generation prompt, you might get a new image instead of an edit.

Google Gemini: Thinking Models and Thought Signatures

Google's approach differs architecturally while solving the same problems. Gemini 2.5 Flash Image and Gemini 3 Pro Image represent Google's native multimodal generation and editing capabilities.

The "Thinking" Model Architecture

Gemini's image models are designed as thinking models—they incorporate step-by-step reasoning before generating or editing images:

The thinking process:

  1. Prompt Analysis: The model reasons through what the user is asking
  2. Intent Classification: Determines if this is generation, editing, or a hybrid task
  3. Interim "Thought Images": The model generates internal test compositions to refine the approach
  4. Final Rendering: Produces the high-quality output

These interim thought images are processed in the backend (not charged to users) and serve as compositional drafts before the final render.

You can enable "thinking mode" to see the step-by-step reasoning Gemini uses to arrive at its decisions—valuable for debugging and understanding why the model chose generation over editing (or vice versa).

Thought Signatures: Stateful Reasoning Across Turns

The Gemini API is stateless—each request is independent. This creates a challenge: how does the model remember what it generated in previous turns to enable editing?

Google's solution is Thought Signatures—encrypted representations of the model's internal thought process that preserve reasoning context across multi-turn interactions.

How thought signatures work:

  1. When Gemini generates an image, it returns a thoughtSignature alongside the output
  2. This signature encodes the composition logic, spatial relationships, and semantic understanding
  3. On subsequent turns, you pass this signature back to the API
  4. The model uses it to understand the original image's structure for precise editing

Validation requirements:

For Gemini 3 Pro Image, thought signatures are mandatory for conversational editing:

  • Missing signatures result in a 400 error
  • Signatures must be passed back exactly as received (immutable binary blobs)
  • The official SDKs handle this automatically when using the chat feature

Why this matters for generation vs editing:

The presence or absence of a thought signature is a primary signal:

  • No signature in context → Model interprets request as new generation
  • Signature present → Model interprets request as modification of that specific composition
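The round-trip can be mocked as below. The official `google-genai` SDK handles this automatically in chat sessions; apart from the `thoughtSignature` field name taken from the docs, everything here is a stand-in, not the real API.

```python
import base64

# Mock of the Gemini thought-signature round-trip; not the real API.
def mock_generate(prompt: str) -> dict:
    signature = base64.b64encode(f"composition:{prompt}".encode())
    return {"image": f"<pixels for {prompt}>", "thoughtSignature": signature}

def mock_edit(prompt: str, signature) -> dict:
    if signature is None:
        # Gemini 3 Pro Image rejects conversational edits without a signature.
        raise ValueError("400: thought signature required")
    return {"image": f"<edited: {prompt}>", "thoughtSignature": signature}

first = mock_generate("a red bicycle")
# Pass the signature back exactly as received: treat it as an opaque blob.
edited = mock_edit("make it blue", first["thoughtSignature"])
```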

Sparse Mixture-of-Experts in Gemini

Gemini 2.5 models use a sparse MoE (Mixture-of-Experts) architecture that dynamically routes tokens to specialized sub-networks:

  • Different "experts" may handle generation vs editing tasks
  • The routing mechanism learns to activate generation experts for fresh creation
  • Editing-focused experts activate when modification context is present
  • This allows the model to maintain specialized capabilities without the computational cost of a fully dense model

The Unified Model Research Landscape

Both ChatGPT and Gemini build on a rapidly evolving research foundation. Understanding these papers illuminates the decision mechanisms.

Transfusion (Meta, August 2024)

The foundational Transfusion paper introduced the core concepts:

Key innovations:

  • Continuous image representation: Unlike Chameleon (which discretizes images into tokens), Transfusion keeps images as continuous vectors, avoiding quantization information loss
  • VAE encoding: Images are encoded as latent patches using a Variational Autoencoder, with patches sequenced left-to-right, top-to-bottom
  • Dual loss functions: Text tokens use next-token prediction loss; image patches use diffusion loss (denoising objective)
  • Bidirectional image attention: While text uses causal attention, image regions use bidirectional attention within their blocks

Scaling results:

Transfusion achieved better text-to-image generation with less than 1/3 the computational cost of Chameleon, and matched image-to-text performance with only 21.8% of the compute.

Show-o (ICLR 2025)

Show-o unified autoregressive and discrete diffusion in a single transformer:

Architecture:

  • Built on a pre-trained LLM foundation
  • Text tokens processed autoregressively with causal attention
  • Image tokens processed via discrete diffusion with full attention
  • Supports understanding (captioning, VQA) and generation (text-to-image, inpainting) in one model

Efficiency:

Show-o requires approximately 20x fewer sampling steps than fully autoregressive image generation, making it practical for real-world applications.

UniVG (Apple, ICCV 2025)

Apple's UniVG demonstrates unified generation and editing with a single weight set:

Task routing mechanism:

  • Uses special task tokens (e.g., <t2i> for text-to-image)
  • Input image masks control which regions to generate/preserve
  • A single 3.7B parameter MM-DiT handles generation, inpainting, instruction-based editing, identity-preserving generation, and more

Key finding:

Text-to-image generation and editing tasks can coexist without performance trade-offs—auxiliary tasks like depth estimation actually enhance editing quality.

GenArtist (NeurIPS 2024)

GenArtist takes an agent-based approach:

How it decides:

  1. MLLM agent analyzes user requirements
  2. Decomposes complex problems into sub-tasks
  3. Plans specific solutions (generation vs editing vs hybrid)
  4. Invokes appropriate external tools from a library
  5. Verifies correctness through multimodal perception

This represents an alternative architecture where the decision is explicit and interpretable, rather than implicit in a unified model.

Technical Deep Dive: The Complete Routing Pipeline

Let's trace how different queries flow through the entire system:

Simple Query → Fast Response

User: "What's the capital of France?"
 
1. Router Analysis: Simple factual query
2. Complexity Score: Low
3. Tool Needs: None
4. Decision: Route to gpt-5-main
5. Output: "Paris" (instant response)

Complex Query → Thinking Mode

User: "Prove that there are infinitely many prime numbers"
 
1. Router Analysis: Mathematical proof request
2. Complexity Score: High
3. Tool Needs: None (pure reasoning)
4. Decision: Route to gpt-5-thinking
5. Thinking Phase: Chain-of-thought generation
6. Output: Step-by-step Euclid's proof

Data Analysis → Code Interpreter

User: "Analyze this CSV and show me the trends"
[Uploads sales_data.csv]
 
1. Router Analysis: Data analysis with file
2. Complexity Score: Medium
3. Tool Needs: Python, visualization
4. Decision: Activate Code Interpreter
5. Execution: pandas + matplotlib code
6. Output: Analysis text + generated chart

Writing Task → Canvas

User: "Write a comprehensive guide to React hooks"
 
1. Router Analysis: Long-form writing request
2. Complexity Score: Medium
3. Tool Needs: Collaborative editing interface
4. Decision: Trigger Canvas
5. Canvas Opens: Side panel with editable document
6. Iterative: User can request targeted edits

Current Events → Web Search

User: "What happened in the stock market today?"
 
1. Router Analysis: Current information request
2. Complexity Score: Low
3. Tool Needs: Web search (time-sensitive)
4. Decision: Trigger web search
5. Search Execution: Query financial news
6. Output: Synthesized market summary with citations

Image Generation Path

User: "Create a cyberpunk cat in a neon-lit alley"
 
1. Router Analysis: Image creation request
2. Tool Needs: Image generation
3. Decision: Route to multimodal generation
4. Tokenization: Text → token sequence
5. Decision Point: No image in context → generation mode
6. BOI Token Generation: Model outputs <BOI>
7. Noise Initialization: Random latent patches appended
8. Iterative Denoising: ~20-50 diffusion steps
9. VAE Decoding: Latent → pixels
10. EOI Emission: Marks completion
11. Output: New image returned

Image Editing Path

User: "Make the cat orange instead of gray"
 
1. Router Analysis: Image modification request
2. Context Check: Previous image ID found
3. Tool Needs: Image editing
4. Decision: Route to multimodal editing
5. BOI Token Generation: Model outputs <BOI>
6. Guided Initialization: Previous image latents + targeted noise
7. Selective Denoising: Preserves structure, modifies color
8. VAE Decoding: Latent → pixels
9. EOI Emission: Marks completion
10. Output: Modified image returned

Complex Research → Deep Research

User: "Research the current state of quantum computing
       and identify the top 5 companies to watch"
 
1. Router Analysis: Multi-step research task
2. Complexity Score: Very high
3. Tool Needs: Extended browsing, synthesis
4. Decision: Launch Deep Research agent
5. Multi-Step:
   - Clarify scope (GPT-4o/4.1)
   - Web grounding (100+ sources)
   - o3-deep-research synthesis
6. Output: Comprehensive report (may take 5-30 minutes)

Practical Implications and Best Practices

When Systems Get It Wrong

Despite sophisticated mechanisms, these systems can misinterpret intent:

Common failure modes:

  1. Prompt drift: Asking for "a minor change" sometimes triggers full regeneration
  2. Context window overflow: Very long conversations may lose image/tool context
  3. Ambiguous language: "Make it better" doesn't clearly signal edit vs regenerate
  4. Missing signatures: In API usage, forgetting to pass thought signatures breaks editing chains
  5. Wrong tool selection: Ambiguous prompts may route to suboptimal capabilities

Best Practices for Reliable Control

For image generation:

  • Use explicit creation language: "create," "generate," "make a new"
  • Start fresh conversations for unrelated images
  • Provide detailed descriptions upfront

For image editing:

  • Use modification language: "change," "modify," "edit," "adjust"
  • Place edit instructions immediately after image references
  • Be specific about what to preserve: "keep the composition but change the colors"
  • In APIs, always pass thought signatures/image IDs back

For tool routing:

  • Use explicit tool cues when needed: "search for," "calculate," "write code to"
  • Phrase complexity appropriately: "think carefully about" vs "quickly tell me"
  • Start fresh conversations when switching contexts dramatically

Economic and Performance Implications

Why Routing Matters

The router architecture reflects economic reality:

  • Serving small models is 10-30x cheaper in latency, energy, and compute
  • Early tests show GPT-5 is almost 100x cheaper than alternatives for certain tasks
  • Tool-calling errors dropped nearly 50% compared to GPT-4

The Pareto Frontier

GPT-5's router represents a pursuit of the Pareto frontier—the optimal tradeoff between cost and quality. Rather than a single monolithic model, the system dynamically scales intelligence based on need.

This creates a "heterogeneous agentic system" where specialized models handle specific tasks rather than one system attempting everything at maximum compute cost.

The Future: Towards Fully Autonomous Agents

Current Trajectory

The evolution from explicit plugins to invisible orchestration points toward increasingly autonomous AI systems:

  1. 2023: Manual plugin selection
  2. 2024: Automatic tool calling with GPT-4o
  3. 2025: Full router system with GPT-5
  4. 2026+: Extended autonomous operation

Modular AI Design

GPT-5's multi-agent architecture (router + models) hints at how we might design modular AI systems that overcome single-model limitations. Rather than training ever-larger monolithic models, the future may be intelligent orchestration of specialized capabilities.

Unified Multimodal Discrete Diffusion (UniDisc, 2025) enables joint understanding and generation across text and images using a single discrete diffusion formulation—including the ability to do multimodal inpainting across both domains simultaneously.

MMaDA (2025) represents the first unified multimodal diffusion model with semi-autoregressive text sampling and non-autoregressive image diffusion.

These advances suggest future systems will have even more seamless, interpretable decision-making across all modalities and tools.

Conclusion

ChatGPT's ability to seamlessly switch between Canvas, Deep Research, Code Interpreter, Thinking modes, Web Search, and Image Generation/Editing represents a fundamental architectural innovation:

  1. The Router: A sophisticated classifier evaluating conversation type, complexity, tool needs, and explicit intent
  2. Mixture of Experts: Sparse activation enabling efficient specialized processing across 52.5T parameters
  3. Trained Triggers: Synthetic data and reinforcement learning teaching models when to invoke specific capabilities
  4. BOI/EOI Tokens: Special markers enabling seamless text-to-image transitions within unified models
  5. Thought Signatures: Stateful context preservation for multi-turn editing workflows
  6. Continuous Learning: Live signals improving routing decisions in production
  7. Multi-Agent Coordination: Specialized sub-systems working together under unified orchestration

Understanding this architecture isn't just academic—it helps users write better prompts (knowing that phrasing influences routing), helps developers build better applications (understanding tool selection mechanics), and illuminates where AI is heading (toward increasingly autonomous, efficiently-orchestrated systems).

The invisible complexity behind a simple chat interface represents one of the most impressive engineering achievements in modern AI—a system that just works, routing trillions of parameters across specialized models to deliver exactly what you need.
