# Relay Server
The relay server is a TypeScript / Node.js WebSocket server that sits between clients and AI providers. It translates the relay protocol into provider-specific formats (Gemini Live, OpenAI Realtime), handles server-side tool calls, manages session lifecycle, and provides observability via Langfuse.
## Running

```sh
cd relay-server
cp .env.example .env   # add your API keys
yarn install
yarn dev               # starts with tsx watch (auto-reload)
yarn build
yarn start             # runs dist/index.js
```

The server starts on port 8080 by default. It exposes:
- `GET /health` — health check endpoint (returns `{"status": "ok"}`)
- `GET /test` — browser-based test page for quick WebSocket testing
- `ws://localhost:8080/ws` — WebSocket endpoint for clients
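A quick smoke test against a locally running relay can be sketched with Node 18+'s built-in `fetch`; the `isHealthy` and `checkHealth` helpers here are illustrative, not part of the relay codebase.

```typescript
// Validate the /health response body shape documented above.
function isHealthy(body: unknown): boolean {
  return (
    typeof body === "object" &&
    body !== null &&
    (body as { status?: string }).status === "ok"
  );
}

// Hit the health endpoint (Node 18+ has fetch built in).
async function checkHealth(baseUrl: string): Promise<boolean> {
  try {
    const res = await fetch(`${baseUrl}/health`);
    return res.ok && isHealthy(await res.json());
  } catch {
    return false; // server not reachable
  }
}

// checkHealth("http://localhost:8080").then((up) => console.log(up ? "relay up" : "relay down"));
```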
## Configuration

Environment variables (see `.env.example`):
| Variable | Required | Description |
|---|---|---|
| `GEMINI_API_KEY` | Yes (for Gemini) | Google AI API key |
| `OPENAI_API_KEY` | Yes (for OpenAI) | OpenAI API key |
| `RELAY_API_KEY` | Recommended | Shared secret clients must send in `session.config.apiKey` |
| `PORT` | No | Server port (default: 8080) |
| `BRAIN_GATEWAY_URL` | No | Brain agent gateway URL (default: `http://localhost:18789`) |
| `BRAIN_GATEWAY_AUTH_TOKEN` | No | Auth token for the brain agent gateway |
| `ROTATION_INTERVAL_MS` | No | OpenAI session rotation interval in ms (default: 3000000, i.e. 50 min) |
| `LANGFUSE_BASE_URL` | No | Langfuse endpoint for tracing |
| `LANGFUSE_PUBLIC_KEY` | No | Langfuse public key |
| `LANGFUSE_SECRET_KEY` | No | Langfuse secret key |
## WebSocket Protocol

All messages are JSON objects with a `type` field. The protocol is bidirectional: clients send `ClientEvent` messages and the server sends `RelayEvent` messages.
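The two directions can be sketched as TypeScript discriminated unions. The shapes below are transcribed from the message examples in this section; the union names mirror the `ClientEvent`/`RelayEvent` terminology but are not the server's actual source.

```typescript
// Client-to-relay messages (shapes transcribed from the reference below).
type ClientEvent =
  | { type: "session.config"; provider: "gemini" | "openai"; [k: string]: unknown }
  | { type: "audio.append"; data: string }
  | { type: "audio.commit" }
  | { type: "frame.append"; data: string; mimeType: string }
  | { type: "response.create" }
  | { type: "response.cancel" }
  | { type: "tool.result"; callId: string; output: string }
  | { type: "client.timing"; phase: string; ms: number; turnId?: string };

// Relay-to-client messages.
type RelayEvent =
  | { type: "session.ready"; sessionId: string }
  | { type: "audio.delta"; data: string }
  | { type: "transcript.delta"; text: string; role: "user" | "assistant" }
  | { type: "transcript.done"; text: string; role: "user" | "assistant" }
  | { type: "tool.call"; callId: string; name: string; arguments: string }
  | { type: "tool.progress"; callId: string; summary: string }
  | { type: "tool.cancelled"; callIds: string[] }
  | { type: "turn.started"; turnId?: string }
  | { type: "turn.ended" }
  | { type: "session.rotating" }
  | { type: "session.rotated"; sessionId: string }
  | { type: "session.ended"; summary: string; durationSec: number; turnCount: number }
  | { type: "error"; message: string; code: number };

// Minimal runtime check before trusting an incoming frame.
function parseRelayEvent(raw: string): RelayEvent {
  const msg = JSON.parse(raw) as { type?: unknown };
  if (typeof msg.type !== "string") throw new Error("message missing type field");
  return msg as RelayEvent;
}
```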
### Client to Relay

#### session.config

Must be the first message after connecting. Configures the session.
```json
{
  "type": "session.config",
  "provider": "gemini",
  "voice": "Zephyr",
  "model": "gemini-3.1-flash-live-preview",
  "brainAgent": "enabled",
  "apiKey": "your-relay-api-key",
  "sessionKey": "optional-session-identifier",
  "deviceContext": {
    "timezone": "America/Los_Angeles",
    "locale": "en-US",
    "deviceModel": "iPhone 15 Pro",
    "location": "San Francisco, CA"
  },
  "instructionsOverride": "Optional user-specific context",
  "conversationHistory": [
    { "role": "user", "text": "Previous message" },
    { "role": "assistant", "text": "Previous response" }
  ]
}
```

Fields:
- `provider` — `"gemini"` or `"openai"`
- `voice` — voice name (Gemini: Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, Zephyr; OpenAI: any supported voice, default "marin")
- `model` — model identifier (default: `gemini-3.1-flash-live-preview` for Gemini, `gpt-realtime-mini` for OpenAI)
- `brainAgent` — `"enabled"` or `"none"`
- `apiKey` — must match `RELAY_API_KEY` if set on the server
- `conversationHistory` — prior messages to inject as context (for continuing conversations)
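Opening a session can be sketched as below. The field names and defaults come from the documentation above; the `buildSessionConfig` helper itself is illustrative, not part of the relay.

```typescript
// Subset of session.config fields used by this sketch.
interface SessionConfigEvent {
  type: "session.config";
  provider: "gemini" | "openai";
  voice?: string;
  model?: string;
  brainAgent?: "enabled" | "none";
  apiKey?: string;
}

// Build the mandatory first message, filling in the documented defaults.
function buildSessionConfig(
  provider: "gemini" | "openai",
  apiKey?: string,
): SessionConfigEvent {
  return {
    type: "session.config",
    provider,
    voice: provider === "gemini" ? "Zephyr" : "marin",
    model:
      provider === "gemini"
        ? "gemini-3.1-flash-live-preview"
        : "gpt-realtime-mini",
    brainAgent: "enabled",
    apiKey,
  };
}

// Usage with a browser (or ws-package) WebSocket client:
// const ws = new WebSocket("ws://localhost:8080/ws");
// ws.onopen = () => ws.send(JSON.stringify(buildSessionConfig("gemini", "your-relay-api-key")));
```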
#### audio.append

Stream microphone audio to the provider.

```json
{ "type": "audio.append", "data": "<base64-encoded PCM16 audio>" }
```

#### audio.commit
Signal that the current audio buffer is complete (used by OpenAI’s manual VAD mode). Gemini uses automatic VAD and ignores this.

```json
{ "type": "audio.commit" }
```

#### frame.append
Send a screen capture frame (desktop screen sharing).

```json
{ "type": "frame.append", "data": "<base64-encoded JPEG>", "mimeType": "image/jpeg" }
```

#### response.create
Manually trigger a response from the provider. Gemini auto-responds based on VAD, so this is mainly for OpenAI.

```json
{ "type": "response.create" }
```

#### response.cancel
Cancel an in-progress response.

```json
{ "type": "response.cancel" }
```

#### tool.result
Send the result of a client-side tool call back to the relay.

```json
{ "type": "tool.result", "callId": "call_abc123", "output": "{\"result\": \"some data\"}" }
```

#### client.timing
Report client-side latency measurements for tracing.

```json
{ "type": "client.timing", "phase": "ttft_audio", "ms": 342, "turnId": "turn-uuid" }
```

### Relay to Client
#### session.ready

Sent after successful connection to the AI provider.

```json
{ "type": "session.ready", "sessionId": "uuid" }
```

#### audio.delta
Streaming audio from the AI model.

```json
{ "type": "audio.delta", "data": "<base64-encoded PCM16 audio>" }
```

#### transcript.delta
Streaming transcript text (partial, for live display).

```json
{ "type": "transcript.delta", "text": "partial text", "role": "assistant" }
```

#### transcript.done
Final transcript for a complete utterance.

```json
{ "type": "transcript.done", "text": "complete utterance text", "role": "user" }
```

#### tool.call
The AI model wants to call a tool. Server-side tools (`echo_tool`, `ask_brain`) are handled by the relay automatically and not forwarded to the client.

```json
{ "type": "tool.call", "callId": "call_abc123", "name": "some_tool", "arguments": "{\"param\": \"value\"}" }
```

#### tool.progress
Progress update from an async tool (e.g., brain agent search steps).

```json
{ "type": "tool.progress", "callId": "call_abc123", "summary": "Searching the web for..." }
```

#### tool.cancelled
The AI model cancelled a tool call (e.g., user changed topic mid-search).

```json
{ "type": "tool.cancelled", "callIds": ["call_abc123"] }
```

#### turn.started
The user started speaking (detected by VAD). Clients should stop audio playback for barge-in.

```json
{ "type": "turn.started", "turnId": "optional-trace-id" }
```

#### turn.ended
The AI finished its response.

```json
{ "type": "turn.ended" }
```

#### session.rotating / session.rotated
Emitted during session rotation (long-running session refresh). Clients should clear audio playback buffers.

```json
{ "type": "session.rotating" }
{ "type": "session.rotated", "sessionId": "new-session-id" }
```

#### session.ended
Session summary on clean shutdown.

```json
{ "type": "session.ended", "summary": "Conversation summary", "durationSec": 300, "turnCount": 12 }
```

#### error

Error from the relay or upstream provider.

```json
{ "type": "error", "message": "description of the error", "code": 502 }
```

Common error codes:
- `400` — invalid message format
- `401` — unauthorized (bad API key)
- `500` — adapter connection failed
- `502` — upstream provider error
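A client's receive loop over the events above can be sketched as a single dispatch function. The handler names (`onAudio`, `onBargeIn`, and so on) are placeholders for your app's callbacks, not part of the relay.

```typescript
// Callbacks a client app would supply; names are illustrative.
type Handlers = {
  onAudio(b64: string): void;
  onTranscript(text: string, role: string, final: boolean): void;
  onBargeIn(): void;
  onError(message: string, code: number): void;
};

// Route one incoming relay frame to the right callback.
function dispatch(raw: string, h: Handlers): void {
  const msg = JSON.parse(raw) as { type: string; [k: string]: any };
  switch (msg.type) {
    case "audio.delta":
      h.onAudio(msg.data);
      break;
    case "transcript.delta":
      h.onTranscript(msg.text, msg.role, false);
      break;
    case "transcript.done":
      h.onTranscript(msg.text, msg.role, true);
      break;
    case "turn.started":     // user barge-in: stop speaker output
    case "session.rotating": // rotation: clear buffered audio too
      h.onBargeIn();
      break;
    case "error":
      h.onError(msg.message, msg.code);
      break;
    default:
      break; // ignore events this client doesn't use
  }
}
```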
## Supported Providers

### Gemini Live

- Model: `gemini-3.1-flash-live-preview` (default)
- Audio: 16 kHz PCM16 (relay downsamples from 24 kHz)
- VAD: automatic activity detection (server-side)
- Video: supported (JPEG frames via `realtimeInput.video`)
- Session resumption: transparent reconnection using resumption handles
- Rotation: proactive on `goAway`, deferred if tool calls are in flight
- Context window compression: enabled with sliding window (10k-token trigger)
- Voices: Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, Zephyr
### OpenAI Realtime

- Model: `gpt-realtime-mini` (default)
- Audio: 24 kHz PCM16 (native)
- VAD: server-side with configurable threshold
- Video: not supported
- Rotation: timer-based (default 50 minutes), transcript summary injected into new session
- Transcription: `gpt-4o-mini-transcribe` for input audio
- Voices: any OpenAI-supported voice (default: "marin")
## Brain Agent

The brain agent (`ask_brain` tool) connects to any OpenAI-compatible chat completions endpoint (e.g. OpenClaw) to give the voice AI extended capabilities:
- Web search and browsing
- Calendar management
- Task tracking
- Long-term memory across sessions
- File operations
The brain runs asynchronously — the relay returns `{"status": "searching"}` immediately so the AI can give a verbal bridge (“Let me check…”), then injects the result into the conversation when it arrives.
On disconnect, the relay syncs the full conversation transcript to the brain for long-term memory, with retry logic (backoff at 5s, 30s, 2min).
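The retry schedule described above can be sketched as follows. `syncFn` stands in for the actual HTTP call to the brain gateway, and the injectable `sleep` exists only so the sketch can be exercised without real delays.

```typescript
// Documented backoff schedule: 5 s, 30 s, 2 min.
const BACKOFF_MS = [5_000, 30_000, 120_000];

// Try syncFn once, then retry after each delay in the schedule.
// Returns true on success, false once the schedule is exhausted.
async function syncWithRetry(
  syncFn: () => Promise<void>,
  delays: number[] = BACKOFF_MS,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<boolean> {
  for (let attempt = 0; ; attempt++) {
    try {
      await syncFn();
      return true; // transcript synced
    } catch {
      if (attempt >= delays.length) return false; // out of retries
      await sleep(delays[attempt]);
    }
  }
}
```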
## Server-Side Tools

Two tools are handled server-side (never forwarded to the client):

- `echo_tool` — test tool that echoes back input (useful for verifying the tool pipeline)
- `ask_brain` — async brain agent call (described above)
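Every other tool reaches the client as a `tool.call` event and must be answered with a `tool.result` carrying the same `callId`. A minimal client-side round-trip might look like this; the tool registry and the `get_battery_level` entry in the test are made up for illustration.

```typescript
// A client-side tool takes parsed arguments and returns a JSON string.
type ToolFn = (args: Record<string, unknown>) => Promise<string>;

// Answer one tool.call by running the named tool and sending tool.result.
async function handleToolCall(
  msg: { callId: string; name: string; arguments: string },
  tools: Record<string, ToolFn>,
  send: (event: object) => void,
): Promise<void> {
  const fn = tools[msg.name];
  const output = fn
    ? await fn(JSON.parse(msg.arguments))
    : JSON.stringify({ error: `unknown tool: ${msg.name}` });
  // The result must echo the callId so the relay can match it up.
  send({ type: "tool.result", callId: msg.callId, output });
}
```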
## Watchdog

If no audio is received for 20 seconds, the relay injects a gentle prompt asking the AI to check if the user is still there. The watchdog pauses during tool calls to avoid interrupting work in progress.