# Running the Gateway
This guide covers the different ways to run vLLM Responses in various environments.
For a task-first reading order, start here, then use the Command Reference for the full option-by-option CLI surface.
## Supported Entrypoints (Important)

The repo has two first-class runtime modes:

- `vllm-responses serve`
- `vllm serve --responses`
Direct startup via `uvicorn` or `gunicorn` remains useful for development and tests, but it is not the recommended product entrypoint.
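For orientation, the two modes look like this on the command line; the upstream URL and model name below are placeholders, not defaults from this guide:

```shell
# Remote-upstream gateway mode: attach to an already-running model server
vllm-responses serve --upstream http://127.0.0.1:8000/v1

# Integrated colocated mode: one command runs vLLM and the gateway together
vllm serve Qwen/Qwen2.5-7B-Instruct --responses
```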
## Operational Modes
### 1. Remote-Upstream Gateway Mode

Use `vllm-responses serve` when the upstream model server is already managed elsewhere. This is ideal for:
- External vLLM deployments.
- Cloud-hosted OpenAI-compatible endpoints.
- Restarting the gateway independently from the model server.
- Multi-worker gateway deployments.
By default this mode uses the upstream Chat Completions transport. To target an upstream native Responses endpoint instead, add `--upstream-api-kind responses`.
This changes only the gateway-to-upstream transport. The public gateway contract stays the same: `POST /v1/responses`, local `previous_response_id`, local retrieval, local `store`, local `include`, and gateway-owned tool/MCP behavior.
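Putting the flags from the Command Reference together, a remote-upstream invocation targeting a native Responses upstream might look like this (the upstream URL is a placeholder; 5969 is the default gateway port):

```shell
vllm-responses serve \
  --upstream http://models.internal:8000/v1 \
  --upstream-api-kind responses \
  --gateway-host 0.0.0.0 \
  --gateway-port 5969 \
  --gateway-workers 4
```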
### 2. Integrated Colocated Mode

Use `vllm serve --responses` for the single-command local vLLM + gateway experience. This is ideal for:
- Local development.
- Demos and experimentation.
- Operators who want vLLM and the gateway on one public API server.
By default this mode also uses the upstream Chat Completions transport. To target the integrated native Responses transport instead, add `--responses-upstream-api-kind responses`.
This changes only the gateway-to-upstream transport. The public contract stays the same: `POST /v1/responses`, `GET /v1/responses/{response_id}`, local `previous_response_id`, local retrieval, local `store`, local `include`, and gateway-owned tool/MCP behavior.
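A minimal colocated invocation might look like the following; the model name is a placeholder and the exact positional arguments follow your vLLM version's `vllm serve` syntax:

```shell
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --responses \
  --port 8000 \
  --responses-upstream-api-kind responses
```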
## Configuration
On the supported entrypoints, CLI flags own runtime topology and helper wiring. Environment variables remain available for deployment-scoped settings such as storage, metrics, tracing, auth, and cache.
For `vllm-responses serve`, the gateway-owned CLI surface is:

| CLI Flag | Description |
|---|---|
| `--upstream` | Exact upstream API base URL |
| `--upstream-api-kind` | Upstream transport family |
| `--upstream-ready-timeout` | Upstream readiness timeout |
| `--upstream-ready-interval` | Upstream readiness polling interval |
| `--gateway-host` | Bind host |
| `--gateway-port` | Bind port |
| `--gateway-workers` | Number of workers |
| `--web-search-profile` | Enable a shipped `web_search` profile |
| `--code-interpreter` | Code Interpreter runtime policy |
| `--code-interpreter-port` | Code Interpreter port |
| `--code-interpreter-workers` | Code Interpreter worker count |
| `--code-interpreter-startup-timeout` | Code Interpreter readiness timeout |
| `--mcp-config` | Built-in MCP runtime config path |
| `--mcp-port` | Built-in MCP runtime loopback port |
When `--mcp-config` is set, `vllm-responses serve` starts a singleton Built-in MCP runtime process shared by all gateway workers. `--mcp-port` overrides the loopback runtime port; if `--mcp-port` is absent, `serve` uses `http://127.0.0.1:5981`.
`--upstream-ready-timeout` and `--upstream-ready-interval` control how long the supervisor waits for the external upstream to become ready.
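The readiness wait can be sketched as a generic polling loop; `wait_for_upstream` and `fake_probe` below are illustrative names, not part of the gateway's API:

```python
import time

def wait_for_upstream(probe, timeout_s=60.0, interval_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it returns True or `timeout_s` elapses.

    Mirrors the semantics of --upstream-ready-timeout / --upstream-ready-interval:
    retry every `interval_s` seconds, give up after `timeout_s`.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)

# Example with a probe that succeeds on its third attempt:
attempts = {"n": 0}

def fake_probe():
    attempts["n"] += 1
    return attempts["n"] >= 3

ready = wait_for_upstream(fake_probe, timeout_s=5.0, interval_s=0.01)
print(ready)  # True after three polls
```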
Shipped `web_search` profiles that require Built-in MCP helper servers provision their default helper entries automatically, so `--mcp-config` is not required just to enable a shipped profile.

For the shipped `exa_mcp` profile, setting `EXA_API_KEY` in the gateway environment appends the operator key to the default Exa MCP URL automatically.
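As a sketch, enabling the shipped `exa_mcp` profile with an operator key could look like this (the key value and upstream URL are placeholders):

```shell
# Key is read from the environment; value shown is a placeholder
export EXA_API_KEY="<your-exa-key>"
vllm-responses serve \
  --upstream http://127.0.0.1:8000/v1 \
  --web-search-profile exa_mcp
```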
For `vllm serve --responses`, the bind host and public port are owned by native vLLM flags such as `--host` and `--port`. Gateway-owned helper/runtime flags use the namespaced `--responses-*` family, for example:

- `--responses-upstream-api-kind`
- `--responses-code-interpreter`
- `--responses-code-interpreter-port`
- `--responses-code-interpreter-workers`
- `--responses-code-interpreter-startup-timeout`
- `--responses-web-search-profile`
- `--responses-mcp-config`
- `--responses-mcp-port`
Integrated mode defaults to `chat_completions`. If you set `--responses-upstream-api-kind responses`, the gateway uses a guarded internal native Responses path instead of the `/v1/chat/completions` loopback transport. Public Responses ownership remains unchanged in either case.
See Configuration Reference for env-only deployment settings.
## Health Checks
The gateway exposes a health check endpoint useful for load balancers (AWS ALB, Kubernetes probes).
- Endpoint: `GET /health`
- Response: `200 OK` (JSON body: `{}`)
Base URL by mode:

- `vllm-responses serve`: `http://127.0.0.1:5969/health` by default
- `vllm serve --responses`: same host/port as `vllm serve` (default `http://127.0.0.1:8000/health`)
The gateway also exposes a `/metrics` endpoint for Prometheus scraping. See Observability for monitoring setup instructions.
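A probe compatible with this contract can be sketched in Python; `check_health` is an illustrative helper (not shipped with the gateway), demonstrated here against a stand-in server that answers `/health` the same way:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_health(base_url, timeout=2.0):
    """Return True if GET {base_url}/health answers 200 with an empty JSON object."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200 and json.loads(resp.read() or b"{}") == {}
    except OSError:
        return False

# Demo against a stand-in server mimicking the gateway's health endpoint.
class _Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b"{}"
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):  # keep demo output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), _Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
ok = check_health(f"http://127.0.0.1:{server.server_port}")
server.shutdown()
print(ok)  # True
```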
## Compatibility Passthrough Endpoints

When the gateway is configured with an upstream OpenAI-compatible base URL via `--upstream`, you can also call:
- `GET /v1/models`
- `POST /v1/chat/completions`
This allows older Chat Completions clients to point at the gateway base URL directly while `POST /v1/responses` remains available.
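Assuming the default gateway port of 5969 and a placeholder model name, older clients could call the passthrough endpoints like this:

```shell
curl -s http://127.0.0.1:5969/v1/models

curl -s http://127.0.0.1:5969/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'
```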
## Graceful Shutdown

The gateway handles `SIGINT` (Ctrl+C) and `SIGTERM` gracefully:
- It stops accepting new connections.
- It waits for active requests to complete (within a timeout).
- It terminates the Code Interpreter subprocess (if spawned).
- It terminates the Built-in MCP runtime subprocess (if started).