Overview

Smart Turn Detection uses an advanced machine learning model to determine when a user has finished speaking and your bot should respond. Unlike basic Voice Activity Detection (VAD), which only distinguishes speech from non-speech, Smart Turn Detection recognizes conversational cues such as intonation patterns and linguistic signals, enabling more natural conversations.
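
To make the contrast concrete, the two decision rules differ roughly as sketched below. The function names, the turn_model object, and its predicts_complete method are hypothetical stand-ins for illustration, not Pipecat APIs:

def vad_only_endpoint(silence_secs: float, timeout_secs: float = 0.8) -> bool:
    # Basic VAD endpointing: respond after a fixed stretch of silence,
    # regardless of how the utterance actually sounded.
    return silence_secs >= timeout_secs

def smart_turn_endpoint(silence_secs: float, speech_audio: bytes, turn_model) -> bool:
    # Smart Turn: after a short pause, a model judges from intonation and
    # wording whether the utterance sounds finished before the bot replies.
    # predicts_complete() is a hypothetical stand-in for the classifier.
    return silence_secs >= 0.2 and turn_model.predicts_complete(speech_audio)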

Key Benefits

  • Natural conversations: More human-like turn-taking patterns
  • Free to use: The model is fully open source
  • Scalable: Smart Turn v3 supports fast CPU inference directly inside your Pipecat Cloud instance

Quick Start

To enable Smart Turn Detection in your Pipecat Cloud bot, add the LocalSmartTurnAnalyzerV3 analyzer to your transport configuration. The model weights are bundled with Pipecat, so there’s no need to download them separately.

import aiohttp
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams, DailyTransport

async def main(room_url: str, token: str):
    async with aiohttp.ClientSession() as session:
        transport = DailyTransport(
            room_url,
            token,
            "Voice AI Bot",
            DailyParams(
                audio_in_enabled=True,
                audio_out_enabled=True,
                # Set VAD to 0.2 seconds for optimal Smart Turn performance
                vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
                # Enable local Smart Turn inference using the weights bundled with Pipecat
                turn_analyzer=LocalSmartTurnAnalyzerV3(),
            ),
        )

        # Continue with your pipeline setup...

Smart Turn Detection requires VAD to be enabled with stop_secs=0.2. This value matches how the model's training data was segmented and allows Smart Turn to dynamically adjust response timing based on the model's predictions.

How It Works

  1. Audio Analysis: The system continuously analyzes incoming audio for speech patterns
  2. VAD Processing: Voice Activity Detection segments audio into speech and silence
  3. Turn Classification: When VAD detects a pause, the ML model analyzes the speech segment for natural completion cues
  4. Smart Response: The model determines if the turn is complete or if the user is likely to continue speaking
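
A minimal sketch of how these steps might fit together is shown below. The callables passed in (audio source, VAD check, turn classifier, response handler) are hypothetical placeholders standing in for Pipecat's internal pipeline, not actual Pipecat APIs; only the stop_secs value mirrors the configuration shown earlier.

import time
from typing import Callable, List, Optional

def run_turn_loop(
    get_audio_frame: Callable[[], bytes],
    is_speech: Callable[[bytes], bool],
    is_turn_complete: Callable[[List[bytes]], bool],
    respond: Callable[[List[bytes]], None],
    stop_secs: float = 0.2,
) -> None:
    # Toy event loop mirroring the four steps above.
    speech_buffer: List[bytes] = []
    silence_started: Optional[float] = None

    while True:
        frame = get_audio_frame()                  # 1. Audio Analysis
        if is_speech(frame):                       # 2. VAD Processing
            speech_buffer.append(frame)
            silence_started = None
            continue

        if silence_started is None:
            silence_started = time.monotonic()

        # 3. Turn Classification: after a short pause, the model inspects
        # the buffered speech for natural completion cues.
        if speech_buffer and time.monotonic() - silence_started >= stop_secs:
            if is_turn_complete(speech_buffer):    # 4. Smart Response
                respond(speech_buffer)
                speech_buffer = []
            # If the turn looks incomplete, reset the pause timer and keep
            # listening instead of interrupting the user.
            silence_started = None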

Training Data Collection

The smart-turn model is trained on real conversational data collected through community data-collection applications. You can help improve the model by contributing your own data or classifying existing data.

More information

For more details on Smart Turn, see the following links: