Enable Voice Mode
Upgrade your agent from text-only to real-time voice interaction.
Overview
Voice mode lets users talk to your agent naturally using their microphone. HUMA handles speech recognition, AI processing, and text-to-speech; your client only needs to join a Daily.co audio room and listen for events.
Listen
User speech is transcribed in real time. Only finalized transcripts trigger the agent, so partial words don't cause premature responses.
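To illustrate why gating on finalized transcripts matters, here is a minimal sketch of the idea. HUMA does this for you on the server side; the `{ text, isFinal }` shape and the function name are hypothetical and used only for illustration, not HUMA's actual transcript format.

```javascript
// Hypothetical transcript shape: { text: string, isFinal: boolean }.
// This is an assumption for illustration, not HUMA's real format.
function finalizedUtterances(transcripts) {
  return transcripts
    // Drop partial results and empty strings
    .filter((t) => t.isFinal && t.text.trim().length > 0)
    .map((t) => t.text.trim());
}

const stream = [
  { text: 'what is', isFinal: false },
  { text: 'what is the wea', isFinal: false },
  { text: 'What is the weather today?', isFinal: true },
];

// Only the finalized sentence would be forwarded to the agent
console.log(finalizedUtterances(stream));
```

The two partial results are discarded, so the agent would see one complete utterance instead of three overlapping fragments.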
Speak
The agent responds with natural speech played directly into the audio room. You can customize the voice using ElevenLabs voice IDs.
What Changes from Text Mode
| Aspect | Text Mode | Voice Mode |
|---|---|---|
| Router Type | conversational | turn-taking |
| Communication Tool | Messaging addon | speak (automatic) |
| Input | Client sends context-update events | Transcribed speech (automatic) |
| Audio Transport | None | Daily.co room |
| Metadata | Standard | + voice.enabled: true |
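The config-level differences in the table can be sketched as a small helper that derives a voice-mode config from a text-mode one. The helper name `toVoiceConfig` is hypothetical. It only touches `routerType` and `voice`; how the Messaging addon appears in a text-mode config is not shown here, so the sketch leaves `tools` unchanged.

```javascript
// Returns a new config with the voice-mode settings from the table applied.
// The input object is not mutated. `voiceId` is optional.
function toVoiceConfig(textConfig, voiceId) {
  const voice = { enabled: true };
  if (voiceId) {
    voice.voiceId = voiceId;
  }
  return { ...textConfig, routerType: 'turn-taking', voice };
}

const textAgent = {
  className: 'Assistant',
  personality: 'Friendly assistant.',
  instructions: 'Respond concisely in 1-2 sentences.',
  routerType: 'conversational',
  tools: [],
};

const voiceAgent = toVoiceConfig(textAgent, 'EXAVITQu4vr4xnSDxMaL');
console.log(voiceAgent.routerType, voiceAgent.voice);
```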
Setup
1. Create a voice-enabled agent
Add a voice block to the agent config and set routerType to turn-taking.
{
  className: 'Assistant',
  personality: 'Friendly voice assistant.',
  instructions: 'Respond concisely in 1-2 sentences.',
  routerType: 'turn-taking',
  tools: [],
  voice: {
    enabled: true,
    voiceId: 'EXAVITQu4vr4xnSDxMaL' // Optional ElevenLabs voice ID
  }
}
2. Join a Daily.co room (client-side)
Use the Daily.co SDK to join an audio room from the browser. This publishes the user's microphone audio into the room.
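import DailyIframe from '@daily-co/daily-js';

// Audio-only call object: microphone on, camera off
const callFrame = DailyIframe.createCallObject({
  audioSource: true,
  videoSource: false,
});

// roomUrl is the URL of the Daily.co room for this session
await callFrame.join({ url: roomUrl });
3. Tell HUMA to join the same room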
Send a join-daily-room message over your WebSocket connection, using the same room URL the client joined.
socket.emit('message', {
  type: 'join-daily-room',
  roomUrl: 'https://your-domain.daily.co/room-name'
});
Automatic speak tool
You don't need to define the speak tool in your tools array. HUMA adds it automatically when routerType is turn-taking and voice is enabled.
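For step 3, it can help to build the join-daily-room payload in one place and reject obviously bad URLs before they reach HUMA. The following is a minimal sketch; the helper name `buildJoinDailyRoomMessage` is hypothetical, and the https check is an assumption on my part rather than a documented HUMA requirement.

```javascript
// Builds the payload for the join-daily-room message from step 3.
// Assumption: room URLs are https links; adjust if your setup differs.
function buildJoinDailyRoomMessage(roomUrl) {
  const parsed = new URL(roomUrl); // throws a TypeError on malformed input
  if (parsed.protocol !== 'https:') {
    throw new Error(`Room URL must use https, got "${parsed.protocol}"`);
  }
  return { type: 'join-daily-room', roomUrl };
}

const message = buildJoinDailyRoomMessage('https://your-domain.daily.co/room-name');
console.log(message.type, message.roomUrl);

// With a connected socket, the message would be sent as in step 3:
// socket.emit('message', message);
```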
Common Pitfalls
Missing voice.enabled
Without voice: { enabled: true } in metadata, the join-daily-room command will return an error.
Wrong router type
Using conversational instead of turn-taking means the agent won't have the speak tool and can't respond with voice.
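Both configuration pitfalls above can be caught before the agent is created. The sketch below is one possible pre-flight check; the function name `findVoiceConfigProblems` is hypothetical and not part of HUMA.

```javascript
// Returns a list of human-readable problems; an empty list means the
// config satisfies the two voice-mode requirements described above.
function findVoiceConfigProblems(config) {
  const problems = [];
  if (!config.voice || config.voice.enabled !== true) {
    problems.push('voice.enabled must be true, otherwise join-daily-room returns an error');
  }
  if (config.routerType !== 'turn-taking') {
    problems.push("routerType must be 'turn-taking', otherwise the speak tool is not added");
  }
  return problems;
}

const problems = findVoiceConfigProblems({
  className: 'Assistant',
  routerType: 'conversational',
  tools: [],
});
problems.forEach((p) => console.warn(p));
```

Running the check in development and failing fast on a non-empty result is usually easier to debug than a silent agent in the audio room.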
Echo / Feedback Loops
Use headphones during development. If the microphone picks up the agent's speech, that audio is fed back in as user input and creates a loop. Daily.co has echo cancellation enabled by default in production.
Latency perception
There's typically ~1-2s between the user finishing their sentence and the agent responding. Use visual indicators ("thinking" animations) to keep the experience feeling responsive.
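One way to drive a "thinking" indicator is a small state function that maps events to a UI state. This is a sketch only: the event names below are placeholders I made up for illustration, not HUMA's voice lifecycle events, which are covered on the next page.

```javascript
// UI states: 'listening' | 'thinking' | 'speaking'.
// Event names are hypothetical placeholders, not HUMA's real event names.
function nextIndicator(state, event) {
  switch (event) {
    case 'user-finished-speaking':
      return 'thinking'; // show the animation during the ~1-2s gap
    case 'agent-started-speaking':
      return 'speaking';
    case 'agent-finished-speaking':
      return 'listening';
    default:
      return state; // ignore unrelated events
  }
}

let indicator = 'listening';
for (const event of ['user-finished-speaking', 'agent-started-speaking', 'agent-finished-speaking']) {
  indicator = nextIndicator(indicator, event);
  console.log(event, '->', indicator);
}
```

Keeping the mapping in a pure function makes it easy to swap in the real event names once they are known and to test the transitions without a live audio room.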
Next Steps
Now that voice is enabled, learn about the voice lifecycle events to build a responsive UI.
Next: Voice Lifecycle