Configuration Guide
Configure the gateway's database, caching, workers, and service architecture for your deployment needs.
Overview
This guide covers configuration options for:
- Storage backend (SQLite vs PostgreSQL)
- Worker processes (single vs multiple)
- Response caching optimization (optional Redis integration)
- Service architecture (single-command runtime vs disaggregated)
For complete environment variable reference, see Configuration Reference.
Storage Backend
The gateway stores conversation state for previous_response_id functionality. Choose the storage backend that fits your deployment model.
Stored continuation anchors include terminal responses with status="completed" and status="incomplete" (when store=true).
SQLite (Default)
Zero-configuration storage using a local SQLite database file.
Characteristics:
- Zero setup required
- Single file database (vllm_responses.db)
- Works with multiple workers on the same machine (uses WAL mode)
- Does NOT work across multiple machines
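To confirm that the SQLite file is actually in WAL mode, you can inspect it with the sqlite3 CLI. This is a sketch; it assumes the sqlite3 tool is installed and the gateway has already created vllm_responses.db in the working directory:

```shell
# Inspect the journal mode of the gateway's SQLite database.
# Prints "wal" once the gateway has initialized the file.
sqlite3 vllm_responses.db "PRAGMA journal_mode;"
```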
PostgreSQL
Required for multi-machine deployments and high-availability scenarios.
export VR_DB_PATH="postgresql+asyncpg://user:password@db-host:5432/vllm_responses"
vllm-responses serve --upstream http://127.0.0.1:8457
Migration notes: When moving from SQLite to PostgreSQL:
- Set VR_DB_PATH to your PostgreSQL connection string
- Restart the gateway; tables will be created automatically
- Existing SQLite data will NOT be migrated
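The gateway creates its tables automatically, but the PostgreSQL database itself must already exist. A sketch using the standard PostgreSQL client tools; the host, user, and database name are assumptions that must match your VR_DB_PATH connection string:

```shell
# Create the target database on the PostgreSQL host (run once).
createdb -h db-host -U user vllm_responses

# Point the gateway at it and restart; tables are created on startup.
export VR_DB_PATH="postgresql+asyncpg://user:password@db-host:5432/vllm_responses"
vllm-responses serve --upstream http://127.0.0.1:8457
```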
Worker Configuration
Control gateway throughput by adjusting the number of worker processes.
Single Worker (Default)
The default configuration runs one worker process.
When this is sufficient:
- Local development
- Low to moderate traffic (<100 concurrent requests)
- Testing and experimentation
Multiple Workers
Increase concurrency by running multiple worker processes.
What this does:
- Handles more concurrent requests
- Utilizes multiple CPU cores
- Each worker shares the same database
Compatibility notes:
- SQLite: Works fine with multiple workers on the same machine (uses WAL mode for concurrent access)
- PostgreSQL: Required for multiple workers across multiple machines (Kubernetes, multi-VM setups)
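Using the --gateway-workers flag from the Configuration Quick Reference table, a multi-worker launch looks like:

```shell
# Run four gateway worker processes against the same upstream.
# All workers share one database (SQLite with WAL on one machine,
# or PostgreSQL for multi-machine deployments).
vllm-responses serve --upstream http://127.0.0.1:8457 --gateway-workers 4
```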
Response Caching Optimization (Optional)
Add Redis caching to reduce database load for previous_response_id lookups.
Configuration
export VR_RESPONSE_STORE_CACHE=1
export VR_REDIS_HOST=localhost
export VR_REDIS_PORT=6379
export VR_RESPONSE_STORE_CACHE_TTL_SECONDS=3600 # 1 hour
vllm-responses serve --upstream http://127.0.0.1:8457
How It Works
Recent responses are cached in Redis. When a request includes previous_response_id, the gateway checks Redis first before querying the database. This significantly reduces database load and latency for active conversations.
Performance impact:
- Cache hits: fast retrieval
- Reduces database connection pool pressure
- Especially beneficial with PostgreSQL over network
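Before enabling the cache, it is worth verifying that the gateway host can actually reach Redis. A quick check with redis-cli, assuming the tool is installed and Redis is running at the configured host and port:

```shell
# Should print PONG if Redis is reachable at the configured address.
redis-cli -h localhost -p 6379 ping
```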
MCP Configuration (Optional)
Enable Built-in MCP by providing a runtime config file and setting VR_MCP_CONFIG_PATH.
Minimal Setup
export VR_MCP_CONFIG_PATH="/etc/vllm-responses/mcp.json"
vllm-responses serve --upstream http://127.0.0.1:8457
For mcp.json examples (URL + stdio styles), see MCP Examples -> Built-in MCP Runtime Config.
Operational Notes
- If VR_MCP_CONFIG_PATH is unset, Built-in MCP is disabled.
- With vllm-responses serve, Built-in MCP runs in a singleton internal runtime process shared by all gateway workers.
- The supervisor injects VR_MCP_BUILTIN_RUNTIME_URL for gateway workers automatically.
- Built-in MCP startup and call timeouts are configured globally: VR_MCP_HOSTED_STARTUP_TIMEOUT_SEC, VR_MCP_HOSTED_TOOL_TIMEOUT_SEC
- Runtime discovery endpoints: GET /v1/mcp/servers, GET /v1/mcp/servers/{server_label}/tools
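The discovery endpoints can be exercised with curl once the gateway is up. A sketch; the gateway address (localhost:8000 here) and the my-tools label are placeholders for your deployment:

```shell
# List all configured Built-in MCP servers.
curl http://localhost:8000/v1/mcp/servers

# List the tools exposed by one server (replace my-tools with a
# server_label from your mcp.json).
curl http://localhost:8000/v1/mcp/servers/my-tools/tools
```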
Remote MCP Gate
Remote MCP declarations (tools[].type="mcp" with server_url) are enabled by default. Set VR_MCP_REQUEST_REMOTE_ENABLED=false to disable them.
When disabled, any Remote MCP declaration is rejected as a request-level policy error. Built-in MCP mode is unaffected.
Remote MCP URL Policy Checks
Gateway URL policy checks for Remote MCP are enabled by default.
Set VR_MCP_REQUEST_REMOTE_URL_CHECKS=false to bypass gateway-side URL validation checks.
Warning: disabling URL checks increases SSRF and unsafe-endpoint risk and should only be used in tightly controlled environments.
Service Architecture Patterns
The gateway can run in different architectural configurations depending on your scaling and operational needs.
Single-Command Runtime (Default)
The serve command runs the gateway with managed local components by default.
Components:
- vLLM subprocess
- Gateway (1+ workers)
- Code interpreter subprocess
- Built-in MCP integration (optional, when VR_MCP_CONFIG_PATH is set); runs as a singleton loopback runtime process shared by all gateway workers
Disaggregated
Run each component separately for flexibility and independent scaling.
Gateway + External vLLM
Use an existing vLLM deployment or scale inference separately from the gateway.
# Somewhere else: vLLM is already running
vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8457
# Gateway points to external vLLM
vllm-responses serve --upstream http://127.0.0.1:8457
When to use:
- Scaling inference separately from the gateway
- Using existing vLLM infrastructure
- Avoiding model reload when restarting gateway
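To verify that the external vLLM instance is reachable before starting the gateway, you can query its OpenAI-compatible model listing endpoint. A sketch, assuming vLLM's default OpenAI-compatible server:

```shell
# Confirm vLLM is serving and shows the expected model.
curl http://127.0.0.1:8457/v1/models
```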
Configuration Quick Reference
| Configuration | Command/Environment |
|---|---|
| Database (PostgreSQL) | export VR_DB_PATH="postgresql+asyncpg://..." |
| Multiple workers | --gateway-workers 4 |
| Redis cache | export VR_RESPONSE_STORE_CACHE=1 |
| Built-in MCP config | export VR_MCP_CONFIG_PATH="/path/mcp.json" |
| Remote MCP | export VR_MCP_REQUEST_REMOTE_ENABLED=false |
| Remote URL checks | export VR_MCP_REQUEST_REMOTE_URL_CHECKS=false |
| External vLLM | --upstream http://vllm:8000 |
Next Steps
- For complete environment variables: See Configuration Reference