Integration Guide · Phase 7

Enable Voice Mode

Upgrade your agent from text-only to real-time voice interaction.

Overview

Voice mode lets users talk to your agent naturally using their microphone. HUMA handles speech recognition, AI processing, and text-to-speech; your client only needs to join a Daily.co audio room and listen for events.

Listen

User speech is transcribed in real time. Only finalized transcripts trigger the agent, so partial words don't cause premature responses.

Speak

The agent responds with natural speech played directly into the audio room. You can customize the voice using ElevenLabs voice IDs.

What Changes from Text Mode

Aspect | Text Mode | Voice Mode
Router Type | conversational | turn-taking
Communication Tool | Messaging addon | speak (automatic)
Input | Client sends context-update events | Transcribed speech (automatic)
Audio Transport | None | Daily.co room
Metadata | Standard | + voice.enabled: true
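
The metadata differences in the table can be applied to an existing text-mode configuration in one step. The following is a minimal sketch, assuming your agent metadata is a plain object; toVoiceMetadata is a hypothetical helper written for illustration, not part of HUMA.

```javascript
// Hypothetical helper (not part of HUMA): derive voice-mode metadata
// from existing text-mode metadata without mutating the input.
function toVoiceMetadata(textMetadata, voiceId) {
  const voice = { enabled: true };
  // voiceId is optional; include it only when the caller supplies one.
  if (voiceId) voice.voiceId = voiceId;
  return { ...textMetadata, routerType: 'turn-taking', voice };
}

const textMetadata = {
  className: 'Assistant',
  personality: 'Friendly voice assistant.',
  instructions: 'Respond concisely in 1-2 sentences.',
  routerType: 'conversational',
  tools: [],
};

const voiceMetadata = toVoiceMetadata(textMetadata, 'EXAVITQu4vr4xnSDxMaL');
```

The sketch leaves the tools array untouched, since the speak tool is added automatically in voice mode.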

Setup

1. Create a voice-enabled agent

Add a voice block to the agent metadata and set routerType to 'turn-taking'.

Agent Metadata
{
  className: 'Assistant',
  personality: 'Friendly voice assistant.',
  instructions: 'Respond concisely in 1-2 sentences.',
  routerType: 'turn-taking',
  tools: [],
  voice: {
    enabled: true,
    voiceId: 'EXAVITQu4vr4xnSDxMaL'  // Optional ElevenLabs voice ID
  }
}

2. Join a Daily.co room (client-side)

Use the Daily.co SDK to join an audio room from the browser. This gives the user a microphone in the room.

Frontend Code
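import DailyIframe from '@daily-co/daily-js';

// Audio-only call object: microphone on, camera off.
const callFrame = DailyIframe.createCallObject({
  audioSource: true,
  videoSource: false,
});

// roomUrl is the URL of your Daily.co room, the same one you send to HUMA in step 3.
await callFrame.join({ url: roomUrl });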

3. Tell HUMA to join the same room

Send a join-daily-room message via your WebSocket connection.

WebSocket Message
socket.emit('message', {
  type: 'join-daily-room',
  roomUrl: 'https://your-domain.daily.co/room-name'
});
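
It can help to validate the room URL before sending it, so a malformed value fails on the client rather than on the server. A minimal sketch follows; buildJoinDailyRoomMessage is a hypothetical helper, and the https-only check is an assumption about your room URLs, not a documented HUMA requirement.

```javascript
// Hypothetical helper: validate the room URL and build the message payload.
function buildJoinDailyRoomMessage(roomUrl) {
  const parsed = new URL(roomUrl); // throws TypeError on a malformed URL
  // Assumption: room URLs are always served over https.
  if (parsed.protocol !== 'https:') {
    throw new Error(`Expected an https room URL, got: ${roomUrl}`);
  }
  return { type: 'join-daily-room', roomUrl };
}

const joinMessage = buildJoinDailyRoomMessage(
  'https://your-domain.daily.co/room-name'
);
// Then send it over your existing connection:
// socket.emit('message', joinMessage);
```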

Automatic speak tool

You don't need to define the speak tool in your tools array. HUMA adds it automatically when routerType is turn-taking and voice is enabled.

Common Pitfalls

Missing voice.enabled

Without voice: { enabled: true } in the agent metadata, the join-daily-room message returns an error.

Wrong router type

Using conversational instead of turn-taking means the agent won't have the speak tool and can't respond with voice.
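
Both metadata pitfalls above can be caught before the agent is created. The following is a minimal sketch of a pre-flight check; findVoiceConfigProblems and its message strings are hypothetical and not part of HUMA.

```javascript
// Hypothetical pre-flight check for the two metadata pitfalls above.
// Returns a list of human-readable problems; an empty list means OK.
function findVoiceConfigProblems(metadata) {
  const problems = [];
  if (!metadata.voice || metadata.voice.enabled !== true) {
    problems.push('voice.enabled must be true');
  }
  if (metadata.routerType !== 'turn-taking') {
    problems.push("routerType must be 'turn-taking'");
  }
  return problems;
}

const textOnlyProblems = findVoiceConfigProblems({
  routerType: 'conversational',
  tools: [],
});
const voiceReadyProblems = findVoiceConfigProblems({
  routerType: 'turn-taking',
  tools: [],
  voice: { enabled: true },
});
```

Here textOnlyProblems lists both issues and voiceReadyProblems is empty.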

Echo / Feedback Loops

Use headphones during development. If the microphone picks up the agent's speech, it creates a loop. Daily.co has echo cancellation enabled by default in production.

Latency perception

Expect roughly 1-2 seconds between the end of the user's sentence and the start of the agent's response. Show a visual indicator, such as a "thinking" animation, so the experience still feels responsive.
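
One way to drive such an indicator is a small state machine that moves between idle, listening, thinking, and speaking. This is only a sketch: the event names below are hypothetical placeholders, so substitute whatever events your client actually receives.

```javascript
// Hypothetical event names; replace them with the events your client receives.
const TRANSITIONS = {
  idle: { 'user-started': 'listening' },
  listening: { 'user-finished': 'thinking' },
  thinking: { 'agent-started': 'speaking' },
  speaking: { 'agent-finished': 'idle' },
};

// Returns the next indicator state, or the current one if the event
// does not apply in this state.
function nextIndicatorState(state, event) {
  const forState = TRANSITIONS[state];
  const next = forState ? forState[event] : undefined;
  return next || state;
}

// Walk through one full turn of the conversation.
let indicator = 'idle';
for (const event of ['user-started', 'user-finished', 'agent-started']) {
  indicator = nextIndicatorState(indicator, event);
}
```

The "thinking" state covers the gap between the user finishing and the agent starting to speak, which is where the animation matters most.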

Next Steps

Now that voice is enabled, learn about the voice lifecycle events to build a responsive UI.
