# Running the Gateway
This guide covers the different ways to run vLLM Responses in various environments.
For a task-first reading order, start here, then use the Command Reference for the full option-by-option CLI surface.
## Supported Entrypoints (Important)

The repo has two first-class runtime modes:

- `vllm-responses serve`
- `vllm serve --responses`
Direct startup via `uvicorn` or `gunicorn` remains useful for development and tests, but it is not the recommended product entrypoint.
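For orientation, the two modes look like this on the command line; the upstream URL and model name below are placeholders, not defaults from this guide:

```shell
# Remote-upstream gateway mode: attach to an already-running model server
vllm-responses serve --upstream http://127.0.0.1:8000/v1

# Integrated colocated mode: one command runs vLLM and the gateway together
vllm serve Qwen/Qwen2.5-7B-Instruct --responses
```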
## Operational Modes
### 1. Remote-Upstream Gateway Mode

Use `vllm-responses serve` when the upstream model server is already managed elsewhere. This is ideal for:
- External vLLM deployments.
- Cloud-hosted OpenAI-compatible endpoints.
- Restarting the gateway independently from the model server.
- Multi-worker gateway deployments.
By default this mode uses the upstream Chat Completions transport. To target an upstream native Responses endpoint instead, add `--upstream-api-kind responses`.
This changes only the gateway-to-upstream transport. The public gateway contract stays the same: `POST /v1/responses`, local `previous_response_id`, local retrieval, local `store`, local `include`, and gateway-owned tool/MCP behavior.
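Putting the flags from the Command Reference together, a remote-upstream invocation targeting a native Responses upstream might look like this (the upstream URL is a placeholder; 5969 is the default gateway port):

```shell
vllm-responses serve \
  --upstream http://models.internal:8000/v1 \
  --upstream-api-kind responses \
  --gateway-host 0.0.0.0 \
  --gateway-port 5969 \
  --gateway-workers 4
```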
### 2. Integrated Colocated Mode

Use `vllm serve --responses` for the single-command local vLLM + gateway experience. This is ideal for:
- Local development.
- Demos and experimentation.
- Operators who want vLLM and the gateway on one public API server.
By default this mode also uses the upstream Chat Completions transport. To target the integrated native Responses transport instead, add `--responses-upstream-api-kind responses`.
This changes only the gateway-to-upstream transport. The public contract stays the same: `POST /v1/responses`, `GET /v1/responses/{response_id}`, local `previous_response_id`, local retrieval, local `store`, local `include`, and gateway-owned tool/MCP behavior.
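A minimal colocated invocation might look like the following; the model name is a placeholder and the exact positional arguments follow your vLLM version's `vllm serve` syntax:

```shell
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --responses \
  --port 8000 \
  --responses-upstream-api-kind responses
```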
## Configuration
On the supported entrypoints, CLI flags own runtime topology and helper wiring. Environment variables remain available for deployment-scoped settings such as storage, metrics, tracing, auth, and cache.
For `vllm-responses serve`, the gateway-owned CLI surface is:

| CLI Flag | Description |
|---|---|
| `--upstream` | Exact upstream API base URL |
| `--upstream-api-kind` | Upstream transport family |
| `--upstream-ready-timeout` | Upstream readiness timeout |
| `--upstream-ready-interval` | Upstream readiness polling interval |
| `--gateway-host` | Bind host |
| `--gateway-port` | Bind port |
| `--gateway-workers` | Number of workers |
| `--web-search-profile` | Enable a shipped `web_search` profile |
| `--code-interpreter` | Code Interpreter runtime policy |
| `--code-interpreter-port` | Code Interpreter port |
| `--code-interpreter-workers` | Code Interpreter worker count |
| `--code-interpreter-startup-timeout` | Code Interpreter readiness timeout |
| `--mcp-config` | Built-in MCP runtime config path |
| `--mcp-port` | Built-in MCP runtime loopback port |
When `--mcp-config` is set, `vllm-responses serve` starts a singleton Built-in MCP runtime process shared by all gateway workers. `--mcp-port` overrides the loopback runtime port; if `--mcp-port` is absent, `serve` uses `http://127.0.0.1:5981`.
`--upstream-ready-timeout` and `--upstream-ready-interval` control how long the supervisor waits for the external upstream to become ready.
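The readiness wait can be sketched as a generic polling loop; `wait_for_upstream` and `fake_probe` below are illustrative names, not part of the gateway's API:

```python
import time

def wait_for_upstream(probe, timeout_s=60.0, interval_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it returns True or `timeout_s` elapses.

    Mirrors the semantics of --upstream-ready-timeout / --upstream-ready-interval:
    retry every `interval_s` seconds, give up after `timeout_s`.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)

# Example with a probe that succeeds on its third attempt:
attempts = {"n": 0}

def fake_probe():
    attempts["n"] += 1
    return attempts["n"] >= 3

ready = wait_for_upstream(fake_probe, timeout_s=5.0, interval_s=0.01)
print(ready)  # True after three polls
```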
Shipped `web_search` profiles that require Built-in MCP helper servers provision their default helper entries automatically, so `--mcp-config` is not required just to enable a shipped profile.

For the shipped `exa_mcp` profile, setting `EXA_API_KEY` in the gateway environment appends the operator key to the default Exa MCP URL automatically.
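As a sketch, enabling the shipped `exa_mcp` profile with an operator key could look like this (the key value and upstream URL are placeholders):

```shell
# Key is read from the environment; value shown is a placeholder
export EXA_API_KEY="<your-exa-key>"
vllm-responses serve \
  --upstream http://127.0.0.1:8000/v1 \
  --web-search-profile exa_mcp
```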
For `vllm serve --responses`, the bind host and public port are owned by native vLLM flags such as `--host` and `--port`. Gateway-owned helper/runtime flags use the namespaced `--responses-*` family, for example:

- `--responses-upstream-api-kind`
- `--responses-code-interpreter`
- `--responses-code-interpreter-port`
- `--responses-code-interpreter-workers`
- `--responses-code-interpreter-startup-timeout`
- `--responses-web-search-profile`
- `--responses-mcp-config`
- `--responses-mcp-port`
Integrated mode defaults to `chat_completions`. If you set `--responses-upstream-api-kind responses`, the gateway uses a guarded internal native Responses path instead of the `/v1/chat/completions` loopback transport. Public Responses ownership remains unchanged in either case.
See Configuration Reference for env-only deployment settings.
## Health Checks
The gateway exposes a health check endpoint useful for load balancers (AWS ALB, Kubernetes probes).
- Endpoint: `GET /health`
- Response: `200 OK` (JSON body: `{}`)
Base URL by mode:

- `vllm-responses serve`: `http://127.0.0.1:5969/health` by default
- `vllm serve --responses`: same host/port as `vllm serve` (default `http://127.0.0.1:8000/health`)
The gateway also exposes a `/metrics` endpoint for Prometheus scraping. See Observability for monitoring setup instructions.
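A probe compatible with this contract can be sketched in Python; `check_health` is an illustrative helper (not shipped with the gateway), demonstrated here against a stand-in server that answers `/health` the same way:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_health(base_url, timeout=2.0):
    """Return True if GET {base_url}/health answers 200 with an empty JSON object."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200 and json.loads(resp.read() or b"{}") == {}
    except OSError:
        return False

# Demo against a stand-in server mimicking the gateway's health endpoint.
class _Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b"{}"
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):  # keep demo output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), _Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
ok = check_health(f"http://127.0.0.1:{server.server_port}")
server.shutdown()
print(ok)  # True
```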
## Compatibility Passthrough Endpoints

When the gateway is configured with an upstream OpenAI-compatible base URL via `--upstream`, you can also call:
- `GET /v1/models`
- `POST /v1/chat/completions`
This allows older Chat Completions clients to point at the gateway base URL directly while `POST /v1/responses` remains available.
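Assuming the default gateway port of 5969 and a placeholder model name, older clients could call the passthrough endpoints like this:

```shell
curl -s http://127.0.0.1:5969/v1/models

curl -s http://127.0.0.1:5969/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'
```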
## Graceful Shutdown

The gateway handles `SIGINT` (Ctrl+C) and `SIGTERM` gracefully:
- It stops accepting new connections.
- It waits for active requests to complete (within a timeout).
- It terminates the Code Interpreter subprocess (if spawned).
- It terminates the Built-in MCP runtime subprocess (if started).