vLLM Responses

FastAPI gateway for the OpenAI-style Responses API.

vLLM Responses sits in front of a vLLM server and transforms its standard Chat Completions output into the rich, stateful Responses API format. It gives you advanced capabilities like server-side tool execution and conversation state management without modifying your inference backend.
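Because the gateway speaks the Responses API format, a client addresses it the same way it would address OpenAI's Responses endpoint, just with a different base URL. A minimal sketch of a request body (the gateway address and model name below are assumptions for illustration, not values from this project):

```python
import json

# Hypothetical gateway address; in practice, wherever vLLM Responses is
# deployed in front of your vLLM server.
GATEWAY_URL = "http://localhost:8080/v1/responses"  # assumed URL

# A Responses API request body: `input` takes the place of the
# Chat Completions `messages` array.
payload = {
    "model": "my-vllm-model",  # assumed model name served by the backend
    "input": "Summarize the benefits of a stateful gateway.",
}

body = json.dumps(payload)
# Sending it is a plain HTTP POST with a JSON content type, e.g. via
# urllib.request or httpx; the gateway forwards to vLLM and returns a
# Responses-format object.
print(body)
```

The gateway transforms the backend's Chat Completions output into the Responses shape, so the client never needs to know vLLM is behind it.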


Why use this gateway?

  • Stateful Conversations: Maintain conversation history automatically using previous_response_id, backed by a persistent store (SQLite/Postgres).
  • Built-in Code Interpreter: Let the model write and execute code in a sandboxed environment on the gateway.
  • MCP Integration: Call built-in MCP servers configured on the gateway, or remote MCP servers declared per request, with Responses-style streaming events.
  • Correct SSE Streaming: Receive spec-compliant Server-Sent Events with precise ordering and shape guarantees.
  • Drop-in Compatibility: Works with any standard vLLM OpenAI-compatible endpoint.
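The first two bullets work together on the client side: read the SSE stream, take the response id from the final event, and pass it back as previous_response_id on the next turn. A rough sketch under assumed event shapes (the event names below are modeled on Responses-style streaming; the gateway's exact schemas are documented under Reference):

```python
import json

def parse_sse(raw: str):
    """Split a raw SSE stream into (event, data) pairs.

    Each SSE message is an `event:` line followed by a `data:` line,
    with messages separated by blank lines.
    """
    events = []
    event_name = None
    for line in raw.splitlines():
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            events.append((event_name, json.loads(line[len("data:"):].strip())))
    return events

# Illustrative stream, not captured gateway output.
raw_stream = (
    'event: response.output_text.delta\n'
    'data: {"delta": "Hello"}\n'
    '\n'
    'event: response.completed\n'
    'data: {"response": {"id": "resp_123"}}\n'
    '\n'
)

events = parse_sse(raw_stream)
final = dict(events)["response.completed"]

# The next turn chains the conversation server-side: the gateway looks up
# the stored history for this id in its persistent store.
next_turn = {
    "model": "my-vllm-model",  # assumed model name
    "input": "And in one sentence?",
    "previous_response_id": final["response"]["id"],
}
print(next_turn["previous_response_id"])  # → resp_123
```

Because history is resolved on the gateway, the client never resends prior messages; each request carries only the new input and the previous response's id.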

Getting Started

Documentation Map

  • Usage: CLI reference and streaming guide.
  • Features: Deep dive into statefulness, built-in tools, and MCP integration.
  • Reference: API endpoint details, configuration variables, and event schemas.
  • Deployment: Production configuration.
  • Examples: Code snippets for code interpreter, MCP usage, and tool loops.

New to the Responses API?

Start with the Quickstart to see the API in action, then check out Architecture to understand the concepts.