# Quickstart
Get your Responses API gateway running in under 5 minutes.
## Prerequisites
- Completed Installation
## 1. Start the Gateway
Base URL by mode:
- `vllm-responses serve`: `http://127.0.0.1:5969` by default
- `vllm serve --responses`: same host/port as `vllm serve` (default `http://127.0.0.1:8000`)
## 2. Send a Request
Now, send a request to the Responses API endpoint (/v1/responses).
The cURL examples below use the default vllm-responses serve URL (http://127.0.0.1:5969).
If you started integrated mode with vllm serve --responses, replace that base URL with your
vLLM bind address (default http://127.0.0.1:8000).
Streaming:

```bash
curl -X POST http://127.0.0.1:5969/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "input": [{"role": "user", "content": "Calculate the factorial of 5"}],
    "stream": true,
    "tools": [{"type": "code_interpreter"}],
    "include": ["code_interpreter_call.outputs"]
  }'
```
Non-streaming:

```bash
curl -X POST http://127.0.0.1:5969/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "input": [{"role": "user", "content": "Calculate the factorial of 5"}],
    "tools": [{"type": "code_interpreter"}],
    "include": ["code_interpreter_call.outputs"]
  }'
```
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("VLLM_RESPONSES_BASE_URL", "http://127.0.0.1:5969/v1"),
    api_key="dummy",
)

# For integrated mode, export:
#   VLLM_RESPONSES_BASE_URL=http://127.0.0.1:8000/v1
#
# For `vllm-responses serve`, the default base URL above already matches:
#   http://127.0.0.1:5969/v1

with client.responses.stream(
    model="meta-llama/Llama-3.2-3B-Instruct",
    input=[{"role": "user", "content": "Calculate the factorial of 5"}],
    tools=[{"type": "code_interpreter"}],
    include=["code_interpreter_call.outputs"],
) as stream:
    for event in stream:
        print(event)
```
## 3. Observe the Response
If you used `"stream": true`, you will see Server-Sent Events (SSE). Unlike standard Chat Completions, the Responses API provides rich lifecycle events:

```text
event: response.created
data: {"response":{...}}

event: response.output_item.added
data: {"output_item":{"type":"message", ...}}

event: response.content_part.added
data: {"part":{"type":"text", "text":""}, ...}

event: response.output_text.delta
data: {"delta":"I am a large language model...", ...}

...

event: response.completed
data: {"response":{...}}
```
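To make the lifecycle concrete, here is a minimal sketch of how a client might fold such a stream back into the final output text. The dict shape (`event`/`data` keys) is an assumption for illustration, not the gateway's wire format; a real client would parse each SSE frame first.

```python
# Sketch: fold parsed streaming events into the final output text.
# Assumes each SSE frame has been parsed into a dict with "event" and
# "data" keys; event names follow the lifecycle events shown above.

def accumulate_output_text(events):
    """Concatenate every response.output_text.delta payload, in order."""
    parts = []
    for event in events:
        if event["event"] == "response.output_text.delta":
            parts.append(event["data"]["delta"])
    return "".join(parts)

# Mock events shaped like the sample stream above:
mock_events = [
    {"event": "response.created", "data": {"response": {}}},
    {"event": "response.output_text.delta", "data": {"delta": "I am a "}},
    {"event": "response.output_text.delta", "data": {"delta": "large language model..."}},
    {"event": "response.completed", "data": {"response": {}}},
]
print(accumulate_output_text(mock_events))  # I am a large language model...
```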
## 4. Optional: MCP Smoke Test (Built-in MCP)
If you enabled Built-in MCP on your active entrypoint, you can run a minimal forced tool call:
- `vllm-responses serve ... --mcp-config /path/to/mcp.json`
- `vllm serve ... --responses --responses-mcp-config /path/to/mcp.json`
Need the Built-in MCP mcp.json format first? See:
```bash
curl -X POST http://127.0.0.1:5969/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "stream": true,
    "input": [{"role":"user","content":"Use the MCP docs tool to search for migration notes."}],
    "tools": [{"type":"mcp","server_label":"github_docs"}],
    "tool_choice": {"type":"mcp","server_label":"github_docs","name":"search_docs"}
  }'
```
If you are running integrated mode, replace http://127.0.0.1:5969 with your vllm serve
base URL (default http://127.0.0.1:8000).
In the stream, you should see MCP lifecycle events such as:
- `response.mcp_call.in_progress`
- `response.mcp_call_arguments.done`
- `response.mcp_call.completed` (or `response.mcp_call.failed`)
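A smoke test can be checked mechanically by scanning the event names seen in the stream. This is a minimal sketch over a plain list of event-name strings, taken from the lifecycle events listed above; how you collect those names from the SSE stream is up to your client.

```python
# Sketch: decide whether a forced MCP tool call succeeded, given the
# lifecycle event names observed in the stream (see the list above).

def mcp_call_succeeded(event_names):
    """True if the stream completed the MCP call rather than failing it."""
    return ("response.mcp_call.completed" in event_names
            and "response.mcp_call.failed" not in event_names)

seen = [
    "response.mcp_call.in_progress",
    "response.mcp_call_arguments.done",
    "response.mcp_call.completed",
]
print(mcp_call_succeeded(seen))  # a healthy smoke test prints True
```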
## Next Steps
Now that you have the basic loop working, try the advanced features:
- Code Interpreter: Ask the model to write and execute code.
- Web Search: Let the model search the web with a shipped gateway profile.
- Stateful Conversations: Use `previous_response_id` to continue a chat.
- MCP Integration: Use Built-in MCP or Remote MCP declarations.
- Architecture: Learn how the gateway processes your request.
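As a taste of the stateful-conversation feature, a follow-up turn reuses the `id` of an earlier response via `previous_response_id`. Below is a hedged sketch that only builds the request body; the payload shape mirrors the Responses API requests used earlier in this page, and `resp_123` is a hypothetical response id, not a real one.

```python
# Sketch: build the request body for a turn that continues a stored
# response via previous_response_id. Illustration only, not the
# gateway's authoritative schema.

def follow_up_payload(previous_id, user_text,
                      model="meta-llama/Llama-3.2-3B-Instruct"):
    """Request body for a chat turn that continues an earlier response."""
    return {
        "model": model,
        "previous_response_id": previous_id,
        "input": [{"role": "user", "content": user_text}],
    }

# "resp_123" stands in for the id returned by a real /v1/responses call.
payload = follow_up_payload("resp_123", "Now compute the factorial of 6")
print(payload["previous_response_id"])  # resp_123
```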