
Quickstart

Get your Responses API gateway running in under 5 minutes.

Prerequisites

  • The vllm-responses package installed
  • vLLM installed (only needed if the gateway should start vLLM for you; see step 1)

1. Start the Gateway

Let the gateway start vLLM for you (requires vllm installed):

vllm-responses serve --gateway-workers 1 -- \
  meta-llama/Llama-3.2-3B-Instruct \
  --port 8457

If you already have vLLM running on port 8457:

vllm-responses serve --upstream http://127.0.0.1:8457

You should see output indicating the server is running at http://127.0.0.1:5969.
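
Before moving on, you can verify the gateway is actually up. A minimal readiness probe, sketched in Python, assuming the gateway exposes the standard OpenAI-compatible GET /v1/models endpoint (adjust the path if your deployment differs):

```python
import time
import urllib.error
import urllib.request


def wait_for_ready(base_url: str, timeout: float = 30.0) -> bool:
    """Poll GET {base_url}/v1/models until it answers or the timeout expires."""
    deadline = time.monotonic() + timeout
    req = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": "Bearer dummy"},
    )
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(req, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (or connection refused); retry
        time.sleep(0.5)
    return False


# Usage: wait_for_ready("http://127.0.0.1:5969")
```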


2. Send a Request

Now, send a request to the Responses API endpoint (/v1/responses).

Streaming request (curl):

curl -X POST http://127.0.0.1:5969/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "input": [{"role": "user", "content": "Calculate the factorial of 5"}],
    "stream": true,
    "tools": [{"type": "code_interpreter"}],
    "include": ["code_interpreter_call.outputs"]
  }'

The same request without streaming (omit "stream": true to receive a single JSON response):

curl -X POST http://127.0.0.1:5969/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "input": [{"role": "user", "content": "Calculate the factorial of 5"}],
    "tools": [{"type": "code_interpreter"}],
    "include": ["code_interpreter_call.outputs"]
  }'
Or, using the OpenAI Python SDK (streaming):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5969/v1", api_key="dummy")


with client.responses.stream(
    model="meta-llama/Llama-3.2-3B-Instruct",
    input=[{"role": "user", "content": "Calculate the factorial of 5"}],
    tools=[{"type": "code_interpreter"}],
    include=["code_interpreter_call.outputs"],
) as stream:
    for event in stream:
        print(event)
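
Rather than printing every event, you can branch on event.type. A minimal sketch of collecting the streamed text, using simplified stand-in event objects (the real SDK event classes carry more fields) and the event type names shown in step 3:

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Simplified stand-in for an SDK stream event."""
    type: str
    delta: str = ""


def collect_text(events) -> str:
    """Concatenate the text deltas from a stream of Responses API events."""
    chunks = []
    for event in events:
        if event.type == "response.output_text.delta":
            chunks.append(event.delta)
        elif event.type == "response.completed":
            break
    return "".join(chunks)
```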

3. Observe the Response

If you set "stream": true, you will see Server-Sent Events (SSE). Unlike standard Chat Completions, the Responses API provides rich lifecycle events:

event: response.created
data: {"response":{...}}

event: response.output_item.added
data: {"output_item":{"type":"message", ...}}

event: response.content_part.added
data: {"part":{"type":"text", "text":""}, ...}

event: response.output_text.delta
data: {"delta":"I am a large language model...", ...}

...

event: response.completed
data: {"response":{...}}
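
If you consume the stream without an SDK, frames like the ones above can be parsed in a few lines of Python. A minimal sketch that assumes one event: line and one data: line per frame, separated by blank lines (it ignores SSE features such as comments and multi-line data fields):

```python
import json


def parse_sse(raw: str):
    """Yield (event, data) pairs from a raw SSE stream."""
    event = None
    for line in raw.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data = json.loads(line[len("data:"):].strip())
            yield event, data
        elif not line.strip():
            event = None  # blank line ends the frame
```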

4. Optional: MCP Smoke Test (Built-in MCP)

If you enabled Built-in MCP (configured VR_MCP_CONFIG_PATH and a server label/tool), you can run a minimal forced tool call.

Need the Built-in MCP mcp.json format first? See:

curl -X POST http://127.0.0.1:5969/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "stream": true,
    "input": [{"role":"user","content":"Use the MCP docs tool to search for migration notes."}],
    "tools": [{"type":"mcp","server_label":"github_docs"}],
    "tool_choice": {"type":"mcp","server_label":"github_docs","name":"search_docs"}
  }'

In the stream, you should see MCP lifecycle events such as:

  • response.mcp_call.in_progress
  • response.mcp_call_arguments.done
  • response.mcp_call.completed (or response.mcp_call.failed)
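
To confirm the outcome of the forced call programmatically, you can scan the stream for the terminal MCP events. A small sketch, operating on the event type names listed above:

```python
def mcp_call_outcome(event_types):
    """Return "completed", "failed", or None if no terminal MCP event was seen."""
    for name in event_types:
        if name == "response.mcp_call.completed":
            return "completed"
        if name == "response.mcp_call.failed":
            return "failed"
    return None
```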

Next Steps

Now that you have the basic loop working, try the advanced features: