Configuration Guide
Configure the gateway's database, caching, workers, and service architecture for your deployment needs.
Overview
This guide covers configuration options for:
- Storage backend (SQLite vs PostgreSQL)
- Worker processes (single vs multiple)
- Response caching optimization (optional Redis integration)
- Service architecture (single-command runtime vs disaggregated)
For complete environment variable reference, see Configuration Reference.
Storage Backend
The gateway stores conversation state for previous_response_id functionality. Choose the storage backend that fits your deployment model.
Stored continuation anchors include terminal responses with status="completed" and status="incomplete" (when store=true).
SQLite (Default)
Zero-configuration storage using a local SQLite database file.
Characteristics:
- Zero setup required
- Single file database (vllm_responses.db)
- Works with multiple workers on the same machine (uses WAL mode)
- Does NOT work across multiple machines
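The zero-configuration default can be sketched as a bare launch; the gateway creates and manages the vllm_responses.db file itself:

```shell
# No storage configuration needed: SQLite is the default backend
# and the vllm_responses.db file is created automatically.
vllm-responses serve --upstream http://127.0.0.1:8000/v1
```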
PostgreSQL
Required for multi-machine deployments and high-availability scenarios.
export VR_DB_PATH="postgresql+asyncpg://user:password@db-host:5432/vllm_responses"
vllm-responses serve --upstream http://127.0.0.1:8000/v1
Migration notes: When moving from SQLite to PostgreSQL:
- Set VR_DB_PATH to your PostgreSQL connection string
- Restart the gateway; tables will be created automatically
- Existing SQLite data will NOT be migrated
Worker Configuration
Control gateway throughput by adjusting the number of worker processes.
Single Worker (Default)
The default configuration runs one worker process.
When this is sufficient:
- Local development
- Low to moderate traffic (<100 concurrent requests)
- Testing and experimentation
Multiple Workers
Increase concurrency by running multiple worker processes.
What this does:
- Handles more concurrent requests
- Utilizes multiple CPU cores
- Each worker shares the same database
Compatibility notes:
- SQLite: Works fine with multiple workers on the same machine (uses WAL mode for concurrent access)
- PostgreSQL: Required for multiple workers across multiple machines (Kubernetes, multi-VM setups)
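Multiple workers are enabled with the --gateway-workers flag (see the quick reference table); for example:

```shell
# Four gateway workers sharing one database.
# SQLite (WAL) is fine on a single machine; use PostgreSQL
# when workers run on multiple machines.
vllm-responses serve \
  --upstream http://127.0.0.1:8000/v1 \
  --gateway-workers 4
```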
Upstream Readiness Controls
Tune how long the supervisor waits for an external upstream to become ready.
vllm-responses serve \
--upstream http://127.0.0.1:8000/v1 \
--upstream-ready-timeout 900 \
--upstream-ready-interval 2
Use these when the upstream has a slow cold start or when you want faster failure detection during rollout.
Response Caching Optimization (Optional)
Add Redis caching to reduce database load for previous_response_id lookups.
Configuration
export VR_RESPONSE_STORE_CACHE=1
export VR_REDIS_HOST=localhost
export VR_REDIS_PORT=6379
export VR_RESPONSE_STORE_CACHE_TTL_SECONDS=3600 # 1 hour
vllm-responses serve --upstream http://127.0.0.1:8000/v1
How It Works
Recent responses are cached in Redis. When a request includes previous_response_id, the gateway checks Redis first before querying the database. This significantly reduces database load and latency for active conversations.
Performance impact:
- Cache hits return the stored response without a database query
- Reduces database connection pool pressure
- Especially beneficial with PostgreSQL over network
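To illustrate the lookup path, here is a hedged sketch of a cached continuation; $GATEWAY_URL, the model name, and the response id are placeholders, and the request shape follows the standard Responses API:

```shell
# First turn: store=true makes the response a continuation anchor.
curl -s "$GATEWAY_URL/v1/responses" \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "input": "Hello", "store": true}'

# Follow-up turn: the gateway resolves previous_response_id from
# Redis first and falls back to the database only on a cache miss.
curl -s "$GATEWAY_URL/v1/responses" \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "input": "Continue", "previous_response_id": "resp_..."}'
```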
MCP Configuration (Optional)
Enable Built-in MCP by providing a runtime config file on the active entrypoint.
Minimal Setup
vllm-responses serve \
--upstream http://127.0.0.1:8000/v1 \
--mcp-config /etc/vllm-responses/mcp.json
For mcp.json examples (URL + stdio styles), see MCP Examples -> Built-in MCP Runtime Config.
Operational Notes
- If --mcp-config is omitted, Built-in MCP is disabled.
- With vllm-responses serve, Built-in MCP runs in a singleton internal runtime process shared by all gateway workers.
- The supervisor injects VR_MCP_BUILTIN_RUNTIME_URL for gateway workers automatically.
- Built-in MCP startup and call timeouts are configured globally: VR_MCP_HOSTED_STARTUP_TIMEOUT_SEC and VR_MCP_HOSTED_TOOL_TIMEOUT_SEC
- Runtime discovery endpoints: GET /v1/mcp/servers and GET /v1/mcp/servers/{server_label}/tools
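The discovery endpoints can be queried directly; $GATEWAY_URL and the my_server label below are placeholders:

```shell
# List Built-in MCP servers known to the runtime
curl -s "$GATEWAY_URL/v1/mcp/servers"

# List tools for one server (replace my_server with a real server_label)
curl -s "$GATEWAY_URL/v1/mcp/servers/my_server/tools"
```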
Remote MCP Gate
Remote MCP declarations (tools[].type="mcp" with server_url) are enabled by default. Set VR_MCP_REQUEST_REMOTE_ENABLED=false to disable them.
When disabled, any Remote MCP declaration is rejected as a request-level policy error. Built-in MCP mode is unaffected.
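Using the VR_MCP_REQUEST_REMOTE_ENABLED variable from the quick reference table, disabling the gate looks like:

```shell
# Reject any request that declares a Remote MCP tool
export VR_MCP_REQUEST_REMOTE_ENABLED=false
vllm-responses serve --upstream http://127.0.0.1:8000/v1
```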
Remote MCP URL Policy Checks
Gateway URL policy checks for Remote MCP are enabled by default.
Set VR_MCP_REQUEST_REMOTE_URL_CHECKS=false to bypass gateway-side URL validation.
Warning: disabling URL checks increases SSRF and unsafe-endpoint risk and should only be used in tightly controlled environments.
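If you must bypass the checks, the relevant variable from the quick reference table is:

```shell
# Only for tightly controlled environments: skips gateway-side
# URL validation and increases SSRF risk.
export VR_MCP_REQUEST_REMOTE_URL_CHECKS=false
vllm-responses serve --upstream http://127.0.0.1:8000/v1
```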
Service Architecture Patterns
The gateway can run in different architectural configurations depending on your scaling and operational needs.
Integrated Single-Command Runtime
Use vllm serve --responses when you want the colocated local stack on one public API server.
Components:
- vLLM API server
- Gateway routes mounted into the same FastAPI app
- Code interpreter helper runtime (optional)
- web_search built-in tool support (optional, when --responses-web-search-profile is set)
- Built-in MCP integration (optional, when --responses-mcp-config is set)
  - runs as a loopback helper runtime when enabled
  - shipped web_search profiles can also cause this helper runtime to be started automatically
Integrated mode with web_search: if the shipped exa_mcp profile should use an operator Exa key instead of the anonymous default, set EXA_API_KEY in the gateway environment before startup.
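A sketch of an integrated launch with web_search via the shipped exa_mcp profile (the profile name and key handling follow the notes above; verify the available profile list for your install):

```shell
# Optional: operator Exa key for the shipped exa_mcp profile
export EXA_API_KEY="your-exa-key"

vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --responses \
  --responses-web-search-profile exa_mcp
```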
Integrated mode example with explicit Built-in MCP config:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
--responses \
--responses-mcp-config /etc/vllm-responses/mcp.json
Remote-Upstream Gateway Mode
Use vllm-responses serve when inference and gateway should remain separate.
When to use:
- Separate scaling of inference and gateway
- Using existing vLLM infrastructure
- Avoiding model reload when restarting gateway
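A minimal remote-upstream launch, pointing the gateway at an existing vLLM server (the hostname is a placeholder):

```shell
# Gateway and inference scale independently; restarting the
# gateway does not reload the model.
vllm-responses serve --upstream http://vllm:8000/v1
```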
Configuration Quick Reference
| Configuration | Command/Environment |
|---|---|
| Database (PostgreSQL) | export VR_DB_PATH="postgresql+asyncpg://..." |
| Multiple workers | --gateway-workers 4 |
| Redis cache | export VR_RESPONSE_STORE_CACHE=1 |
| Built-in MCP config | --mcp-config /path/mcp.json or --responses-mcp-config /path/mcp.json |
| Remote MCP | export VR_MCP_REQUEST_REMOTE_ENABLED=false |
| Remote URL checks | export VR_MCP_REQUEST_REMOTE_URL_CHECKS=false |
| External vLLM | --upstream http://vllm:8000/v1 |
Next Steps
- For complete environment variables: See Configuration Reference