vLLM Responses

FastAPI gateway for the OpenAI-style Responses API.

vLLM Responses sits in front of a vLLM server and transforms its standard Chat Completions output into the rich, stateful Responses API format. It gives you advanced capabilities like server-side tool execution and conversation state management without modifying your inference backend.
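Because the gateway speaks the Responses API format, a client addresses it the same way it would address OpenAI's Responses endpoint, just with a different base URL. A minimal sketch of a request body (the gateway address and model name below are assumptions for illustration, not values from this project):

```python
import json

# Hypothetical gateway address; in practice, wherever vLLM Responses is
# deployed in front of your vLLM server.
GATEWAY_URL = "http://localhost:8080/v1/responses"  # assumed URL

# A Responses API request body: `input` takes the place of the
# Chat Completions `messages` array.
payload = {
    "model": "my-vllm-model",  # assumed model name served by the backend
    "input": "Summarize the benefits of a stateful gateway.",
}

body = json.dumps(payload)
# Sending it is a plain HTTP POST with a JSON content type, e.g. via
# urllib.request or httpx; the gateway forwards to vLLM and returns a
# Responses-format object.
print(body)
```

The gateway transforms the backend's Chat Completions output into the Responses shape, so the client never needs to know vLLM is behind it.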


Why use this gateway?

  • Stateful Conversations: Maintain conversation history automatically using previous_response_id, backed by a persistent store (SQLite/Postgres).
  • Built-in Code Interpreter: Let the model write and execute code in a sandboxed environment on the gateway.
  • MCP Integration: Call built-in MCP servers configured on the gateway, or remote MCP servers declared per request, with Responses-style streaming events.
  • Correct SSE Streaming: Receive spec-compliant Server-Sent Events with precise ordering and shape guarantees.
  • Drop-in Compatibility: Works with any standard vLLM OpenAI-compatible endpoint.
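The first two bullets work together on the client side: read the SSE stream, take the response id from the final event, and pass it back as previous_response_id on the next turn. A rough sketch under assumed event shapes (the event names below are modeled on Responses-style streaming; the gateway's exact schemas are documented under Reference):

```python
import json

def parse_sse(raw: str):
    """Split a raw SSE stream into (event, data) pairs.

    Each SSE message is an `event:` line followed by a `data:` line,
    with messages separated by blank lines.
    """
    events = []
    event_name = None
    for line in raw.splitlines():
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            events.append((event_name, json.loads(line[len("data:"):].strip())))
    return events

# Illustrative stream, not captured gateway output.
raw_stream = (
    'event: response.output_text.delta\n'
    'data: {"delta": "Hello"}\n'
    '\n'
    'event: response.completed\n'
    'data: {"response": {"id": "resp_123"}}\n'
    '\n'
)

events = parse_sse(raw_stream)
final = dict(events)["response.completed"]

# The next turn chains the conversation server-side: the gateway looks up
# the stored history for this id in its persistent store.
next_turn = {
    "model": "my-vllm-model",  # assumed model name
    "input": "And in one sentence?",
    "previous_response_id": final["response"]["id"],
}
print(next_turn["previous_response_id"])  # → resp_123
```

Because history is resolved on the gateway, the client never resends prior messages; each request carries only the new input and the previous response's id.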

Getting Started

Documentation Map

  • Usage: CLI reference and streaming guide.
  • Features: Deep dive into statefulness, built-in tools, and MCP integration.
  • Reference: API endpoint details, configuration variables, and event schemas.
  • Deployment: Production configuration.
  • Examples: Code snippets for code interpreter, MCP usage, and tool loops.

New to the Responses API?

Start with the Quickstart to see the API in action, then check out Architecture to understand the concepts.