
Quickstart

Get your Responses API gateway running in under 5 minutes.

Prerequisites

A working vLLM installation. The gateway either launches vLLM itself (integrated mode) or attaches to a vLLM server you already run (standalone mode).

1. Start the Gateway

Start vLLM and the Responses gateway together on one public API server:

vllm serve meta-llama/Llama-3.2-3B-Instruct --responses

If you already have vLLM running on its default port (8000), attach the standalone gateway to it:

vllm-responses serve --upstream http://127.0.0.1:8000/v1

Base URL by mode:

  • vllm-responses serve: http://127.0.0.1:5969 by default
  • vllm serve --responses: same host/port as vllm serve (default http://127.0.0.1:8000)
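The mapping above can be captured in a small helper when scripts need to target either mode. A minimal sketch; the defaults are the ones listed above, and the boolean flag is illustrative, not a CLI option:

```python
def responses_base_url(integrated: bool) -> str:
    """Default Responses API base URL for each launch mode (see list above)."""
    # Integrated mode shares the vLLM bind address; standalone uses its own port.
    return "http://127.0.0.1:8000" if integrated else "http://127.0.0.1:5969"
```

Override the returned value if you bind vLLM or the gateway to a non-default host or port.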

2. Send a Request

Now, send a request to the Responses API endpoint (/v1/responses).

The cURL examples below use the default vllm-responses serve URL (http://127.0.0.1:5969). If you started integrated mode with vllm serve --responses, replace that base URL with your vLLM bind address (default http://127.0.0.1:8000).

curl -X POST http://127.0.0.1:5969/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "input": [{"role": "user", "content": "Calculate the factorial of 5"}],
    "stream": true,
    "tools": [{"type": "code_interpreter"}],
    "include": ["code_interpreter_call.outputs"]
  }'
The same request without streaming (the full response arrives as a single JSON object):

curl -X POST http://127.0.0.1:5969/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "input": [{"role": "user", "content": "Calculate the factorial of 5"}],
    "tools": [{"type": "code_interpreter"}],
    "include": ["code_interpreter_call.outputs"]
  }'
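If you prefer scripting the raw HTTP call instead of curl, the same JSON body can be built in Python. A minimal sketch of the payload only; no request is sent here:

```python
import json

# Mirrors the -d body of the curl examples above.
payload = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "input": [{"role": "user", "content": "Calculate the factorial of 5"}],
    "stream": True,  # drop this key for the non-streaming variant
    "tools": [{"type": "code_interpreter"}],
    "include": ["code_interpreter_call.outputs"],
}
body = json.dumps(payload)
```

POST `body` to /v1/responses with a `Content-Type: application/json` header, exactly as the curl commands do.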
import os

from openai import OpenAI

# For integrated mode (`vllm serve --responses`), point the snippet at vLLM:
#   export VLLM_RESPONSES_BASE_URL=http://127.0.0.1:8000/v1
# For `vllm-responses serve`, the default below already matches:
#   http://127.0.0.1:5969/v1
client = OpenAI(
    base_url=os.environ.get("VLLM_RESPONSES_BASE_URL", "http://127.0.0.1:5969/v1"),
    api_key="dummy",
)

with client.responses.stream(
    model="meta-llama/Llama-3.2-3B-Instruct",
    input=[{"role": "user", "content": "Calculate the factorial of 5"}],
    tools=[{"type": "code_interpreter"}],
    include=["code_interpreter_call.outputs"],
) as stream:
    for event in stream:
        print(event)

3. Observe the Response

If you used stream=true, you will see Server-Sent Events (SSE). Unlike standard Chat Completions, the Responses API provides rich lifecycle events:

event: response.created
data: {"response":{...}}

event: response.output_item.added
data: {"output_item":{"type":"message", ...}}

event: response.content_part.added
data: {"part":{"type":"text", "text":""}, ...}

event: response.output_text.delta
data: {"delta":"I am a large language model...", ...}

...

event: response.completed
data: {"response":{...}}
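The frames above follow the standard SSE `event:`/`data:` line format, so they can be split apart with a few lines of Python. A minimal sketch; real streams may also contain comment lines and multi-line data payloads, which this ignores:

```python
def parse_sse(raw: str):
    """Split a raw SSE stream into (event, data) pairs.

    Frames are separated by blank lines; each frame carries `event:` and
    `data:` lines, matching the lifecycle events shown above.
    """
    frames = []
    for chunk in raw.strip().split("\n\n"):
        event, data = None, []
        for line in chunk.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data.append(line[len("data:"):].strip())
        if event:
            frames.append((event, "\n".join(data)))
    return frames
```

In practice the OpenAI SDK handles this for you: the `client.responses.stream(...)` context manager shown earlier yields parsed event objects directly.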

4. Optional: MCP Smoke Test (Built-in MCP)

If you enabled Built-in MCP on your active entrypoint, you can run a minimal forced tool call:

  • vllm-responses serve ... --mcp-config /path/to/mcp.json
  • vllm serve ... --responses --responses-mcp-config /path/to/mcp.json

If you need the Built-in MCP mcp.json format first, see the Built-in MCP configuration documentation.

curl -X POST http://127.0.0.1:5969/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "stream": true,
    "input": [{"role":"user","content":"Use the MCP docs tool to search for migration notes."}],
    "tools": [{"type":"mcp","server_label":"github_docs"}],
    "tool_choice": {"type":"mcp","server_label":"github_docs","name":"search_docs"}
  }'

If you are running integrated mode, replace http://127.0.0.1:5969 with your vllm serve base URL (default http://127.0.0.1:8000).

In the stream, you should see MCP lifecycle events such as:

  • response.mcp_call.in_progress
  • response.mcp_call_arguments.done
  • response.mcp_call.completed (or response.mcp_call.failed)
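When consuming the stream programmatically, the terminal events above tell you how the forced call ended. A minimal sketch over event type strings; the helper name is illustrative:

```python
def mcp_call_outcome(event_types):
    """Return 'completed', 'failed', or 'in_progress' from MCP lifecycle events."""
    for t in event_types:
        if t == "response.mcp_call.completed":
            return "completed"
        if t == "response.mcp_call.failed":
            return "failed"
    # No terminal event seen yet, e.g. the stream is still open.
    return "in_progress"
```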

Next Steps

Now that you have the basic loop working, try the advanced features: