Voice
Real-time speech-to-text in the chat composer. The user speaks, the runtime transcribes, the agent runs the resulting prompt.
"use client";import { CopilotKit } from "@copilotkit/react-core/v2";import { VoiceChat } from "./voice-chat";export default function VoiceDemoPage() { return ( <CopilotKit runtimeUrl="/api/copilotkit-voice" agent="voice-demo" useSingleEndpoint={false} // The dev-only `<cpk-web-inspector>` overlay (auto-enabled on // localhost via shouldShowDevConsole) intercepts pointer events // on top of the voice sample-audio button, so dev/D5 probe runs // can't click it through Playwright. Production isn't localhost // so the inspector never mounts there — voice is D5 in prod and // D4 locally for this reason alone. Disable explicitly here so // the demo behaves the same in both environments. enableInspector={false} > <VoiceChat /> </CopilotKit> );}You have a working chat surface and you want users to be able to speak instead of type. By the end of this guide, the chat composer will sprout a mic button, recorded audio will be transcribed by the runtime, and the transcript will auto-send to the agent like any other message.
When to use this
- Hands-free or accessibility flows where typing isn't the right input modality.
- Mobile or kiosk surfaces where a long voice query is faster than thumb-typing.
- Demo and test loops where you want canned audio to drive the chat without a microphone.
If you only need file uploads (audio, images, video, documents), use Multimodal Attachments instead. Voice is specifically about live transcription of recorded speech into chat input.
Frontend
<CopilotChat /> renders the mic button automatically when the runtime advertises audioFileTranscriptionEnabled: true on its /info endpoint. There's nothing to wire up on the chat surface itself:
    import { CopilotKit } from "@copilotkit/react-core/v2";
    import { VoiceChat } from "./voice-chat";

    export default function VoiceDemoPage() {
      return (
        <CopilotKit
          runtimeUrl="/api/copilotkit-voice"
          agent="voice-demo"
          useSingleEndpoint={false}
          // The dev-only `<cpk-web-inspector>` overlay (auto-enabled on
          // localhost via shouldShowDevConsole) intercepts pointer events
          // on top of the voice sample-audio button, so dev/D5 probe runs
          // can't click it through Playwright. Production isn't localhost
          // so the inspector never mounts there; voice is D5 in prod and
          // D4 locally for this reason alone. Disable explicitly here so
          // the demo behaves the same in both environments.
          enableInspector={false}
        >
          <VoiceChat />
        </CopilotKit>
      );
    }

When the user clicks the mic, the chat captures audio, POSTs it to the runtime's /transcribe endpoint, drops the resulting transcript into the composer, and submits.
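The click-to-submit sequence described above can be sketched as a small orchestration function. This is an illustrative outline only, not CopilotKit's internal implementation: `VoiceDeps`, `handleMicClick`, and the four injected callbacks are hypothetical names, and skipping submission when the transcript is empty is an assumption made for the sketch.

```typescript
// Hypothetical dependencies, injected so the sketch does not rely on any
// browser or CopilotKit API. `A` is whatever the recorder produces
// (for example a Blob in the browser).
interface VoiceDeps<A> {
  recordAudio: () => Promise<A>;
  transcribe: (audio: A) => Promise<string>;
  setComposerText: (text: string) => void;
  submit: () => void;
}

// Runs the four steps in order: capture, transcribe, fill composer, submit.
async function handleMicClick<A>(deps: VoiceDeps<A>): Promise<string> {
  const audio = await deps.recordAudio();
  const transcript = (await deps.transcribe(audio)).trim();
  deps.setComposerText(transcript);
  // Assumption: an empty transcript is not worth sending to the agent.
  if (transcript.length > 0) {
    deps.submit();
  }
  return transcript;
}
```

Keeping the steps behind injected callbacks makes the sequence easy to exercise in a unit test without a microphone or a running runtime.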
Driving the demo without a mic
For Playwright runs, screenshots, or any flow where prompting for mic permissions is awkward, ship a button that bypasses the microphone. The demo's button skips recording and /transcribe entirely and hands a canned transcript straight to its onTranscribed callback:
    // Props inferred from how the component uses them.
    interface SampleAudioButtonProps {
      onTranscribed: (text: string) => void;
      sampleText: string;
    }

    export function SampleAudioButton({
      onTranscribed,
      sampleText,
    }: SampleAudioButtonProps) {
      return (
        <button
          type="button"
          data-testid="voice-sample-audio-button"
          onClick={() => onTranscribed(sampleText)}
          title={`Inserts: "${sampleText}"`}
          className="inline-flex w-fit items-center gap-2 rounded-md border border-black/10 bg-white px-3 py-1.5 text-xs font-medium hover:bg-black/5 dark:border-white/10 dark:bg-black/30 dark:hover:bg-white/10"
        >
          <span aria-hidden>🎙</span>
          <span>Try a sample audio</span>
        </button>
      );
    }

The caller can drop the resulting text into the composer's textarea (matched via data-testid="copilot-chat-textarea") using the native value setter and a synthetic input event so React's managed state updates correctly.
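The native-setter technique mentioned above can be sketched as follows. The helper name `setComposerValue` and the `ValueElement` type are hypothetical; the approach itself is the common workaround for controlled React inputs, where assigning `el.value` directly does not trigger React's change handling.

```typescript
// Minimal shape the helper needs; a real HTMLTextAreaElement satisfies it.
type ValueElement = EventTarget & { value: string };

function setComposerValue(el: ValueElement, text: string): void {
  // React shadows `value` on the element instance to track changes, so look
  // up the original setter on the prototype and call that one instead.
  const proto = Object.getPrototypeOf(el) as object | null;
  const nativeSetter = proto
    ? Object.getOwnPropertyDescriptor(proto, "value")?.set
    : undefined;
  if (nativeSetter) {
    nativeSetter.call(el, text);
  } else {
    el.value = text; // fallback when no prototype setter exists
  }
  // A bubbling input event lets React's onChange handler see the new value.
  el.dispatchEvent(new Event("input", { bubbles: true }));
}

// Usage in the browser (not executed here):
//   const textarea = document.querySelector<HTMLTextAreaElement>(
//     '[data-testid="copilot-chat-textarea"]',
//   );
//   if (textarea) setComposerValue(textarea, sampleText);
```

Whether the chat then submits automatically or waits for the user to press send depends on how the caller wires `onTranscribed`; the helper only updates the textarea.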
Backend
Wire up the V2 runtime with a TranscriptionService. The V1 wrapper drops the transcriptionService option, so use createCopilotRuntimeHandler from @copilotkit/runtime/v2 directly:
    import type { NextRequest } from "next/server";
    import {
      CopilotRuntime,
      TranscriptionService,
      createCopilotRuntimeHandler,
    } from "@copilotkit/runtime/v2";
    import type { TranscribeFileOptions } from "@copilotkit/runtime/v2";
    import { HttpAgent } from "@ag-ui/client";
    import { TranscriptionServiceOpenAI } from "@copilotkit/voice";
    import OpenAI from "openai";

    const AGENT_URL = process.env.AGENT_URL || "http://localhost:8000";

    const voiceDemoAgent = new HttpAgent({ url: `${AGENT_URL}/voice` });

    /**
     * Transcription service wrapper that pins `baseURL` to real OpenAI (or
     * `OPENAI_TRANSCRIPTION_BASE_URL` when explicitly set) instead of falling
     * through to `OPENAI_BASE_URL`. In local docker / Railway preview
     * environments `OPENAI_BASE_URL` points at aimock so LLM completions stay
     * deterministic, but aimock's proxy mode mangles multipart audio bodies on
     * forward: Whisper rejects with `502 Invalid file format` even when the
     * recorded webm/opus bytes are valid. Bypassing aimock for transcription
     * lets real Whisper see the original bytes and the demo's mic round-trip
     * actually works. Mirrors what langgraph-python does in its voice route.
     *
     * The sample-audio button is the deterministic affordance (synchronous
     * text injection); the mic is the only path that should exercise real
     * Whisper.
     */
    class GuardedOpenAITranscriptionService extends TranscriptionService {
      private delegate: TranscriptionServiceOpenAI | null;

      constructor() {
        super();
        const apiKey = process.env.OPENAI_API_KEY;
        const baseURL =
          process.env.OPENAI_TRANSCRIPTION_BASE_URL ?? "https://api.openai.com/v1";
        this.delegate = apiKey
          ? new TranscriptionServiceOpenAI({
              openai: new OpenAI({ apiKey, baseURL }),
            })
          : null;
      }

      async transcribeFile(options: TranscribeFileOptions): Promise<string> {
        if (!this.delegate) {
          throw new Error(
            "OPENAI_API_KEY not configured for this deployment (api key missing). " +
              "Set OPENAI_API_KEY to enable voice transcription.",
          );
        }
        return this.delegate.transcribeFile(options);
      }
    }

    let cachedHandler: ((req: Request) => Promise<Response>) | null = null;

    function getHandler(): (req: Request) => Promise<Response> {
      if (cachedHandler) return cachedHandler;
      const runtime = new CopilotRuntime({
        // @ts-ignore -- see main route.ts; published agents type generic mismatch
        agents: {
          "voice-demo": voiceDemoAgent,
          default: voiceDemoAgent,
        },
        transcriptionService: new GuardedOpenAITranscriptionService(),
      });
      cachedHandler = createCopilotRuntimeHandler({
        runtime,
        basePath: "/api/copilotkit-voice",
      });
      return cachedHandler;
    }

    export const POST = (req: NextRequest) => getHandler()(req);
    export const GET = (req: NextRequest) => getHandler()(req);
    export const PUT = (req: NextRequest) => getHandler()(req);
    export const DELETE = (req: NextRequest) => getHandler()(req);

With transcriptionService set, the runtime advertises audioFileTranscriptionEnabled: true on /info (which is what tells the chat to render the mic button) and routes POST /transcribe to the service.
Custom transcription backends
TranscriptionService from @copilotkit/runtime/v2 is an abstract class. Subclass it to plug in any transcription provider: Whisper, AssemblyAI, Deepgram, or your own model. The @copilotkit/voice package ships TranscriptionServiceOpenAI as the canonical reference implementation.
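As a minimal illustration of subclassing, the sketch below defines a service that always returns a fixed transcript, which can be handy for deterministic test runs. To keep it self-contained, `TranscriptionService` and `TranscribeFileOptions` are declared here as local stand-ins; in a real route you would import them from @copilotkit/runtime/v2, and the actual fields of the options type are not reproduced. `CannedTranscriptionService` is a hypothetical name.

```typescript
// Local stand-ins so the sketch compiles on its own. In a real route, import
// both from "@copilotkit/runtime/v2" instead of declaring them.
type TranscribeFileOptions = Record<string, unknown>; // real shape not shown
abstract class TranscriptionService {
  abstract transcribeFile(options: TranscribeFileOptions): Promise<string>;
}

// Hypothetical provider that ignores the audio and returns a fixed string.
class CannedTranscriptionService extends TranscriptionService {
  private readonly transcript: string;

  constructor(transcript: string) {
    super();
    this.transcript = transcript;
  }

  async transcribeFile(_options: TranscribeFileOptions): Promise<string> {
    return this.transcript;
  }
}
```

A real provider would follow the same outline, replacing the body of `transcribeFile` with a call to its SDK and returning the recognized text.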
A useful pattern is wrapping your service in a guard that returns a clean 4xx when credentials aren't configured, instead of an opaque 5xx from the underlying SDK:
    import type { NextRequest } from "next/server";
    import {
      CopilotRuntime,
      TranscriptionService,
      createCopilotRuntimeHandler,
    } from "@copilotkit/runtime/v2";
    import type { TranscribeFileOptions } from "@copilotkit/runtime/v2";
    import { HttpAgent } from "@ag-ui/client";
    import { TranscriptionServiceOpenAI } from "@copilotkit/voice";
    import OpenAI from "openai";

    const AGENT_URL = process.env.AGENT_URL || "http://localhost:8000";

    const voiceDemoAgent = new HttpAgent({ url: `${AGENT_URL}/voice` });

    /**
     * Transcription service wrapper that pins `baseURL` to real OpenAI (or
     * `OPENAI_TRANSCRIPTION_BASE_URL` when explicitly set) instead of falling
     * through to `OPENAI_BASE_URL`. In local docker / Railway preview
     * environments `OPENAI_BASE_URL` points at aimock so LLM completions stay
     * deterministic, but aimock's proxy mode mangles multipart audio bodies on
     * forward: Whisper rejects with `502 Invalid file format` even when the
     * recorded webm/opus bytes are valid. Bypassing aimock for transcription
     * lets real Whisper see the original bytes and the demo's mic round-trip
     * actually works. Mirrors what langgraph-python does in its voice route.
     *
     * The sample-audio button is the deterministic affordance (synchronous
     * text injection); the mic is the only path that should exercise real
     * Whisper.
     */
    class GuardedOpenAITranscriptionService extends TranscriptionService {
      private delegate: TranscriptionServiceOpenAI | null;

      constructor() {
        super();
        const apiKey = process.env.OPENAI_API_KEY;
        const baseURL =
          process.env.OPENAI_TRANSCRIPTION_BASE_URL ?? "https://api.openai.com/v1";
        this.delegate = apiKey
          ? new TranscriptionServiceOpenAI({
              openai: new OpenAI({ apiKey, baseURL }),
            })
          : null;
      }

      async transcribeFile(options: TranscribeFileOptions): Promise<string> {
        if (!this.delegate) {
          throw new Error(
            "OPENAI_API_KEY not configured for this deployment (api key missing). " +
              "Set OPENAI_API_KEY to enable voice transcription.",
          );
        }
        return this.delegate.transcribeFile(options);
      }
    }