Overview

UltravoxSTTService provides real-time speech-to-text using the Ultravox multimodal model, which runs locally. Ultravox encodes audio directly into the LLM’s embedding space, so no separate ASR stage is needed. This makes transcription faster and more efficient, and gives it built-in conversational understanding.

Installation

To use Ultravox services, install the required dependency:
pip install "pipecat-ai[ultravox]"

Prerequisites

Ultravox Model Setup

Before using Ultravox STT services, prepare the following:
  1. Hugging Face Account: Sign up at Hugging Face
  2. HF Token: Generate a Hugging Face token for model access
  3. GPU Resources: Recommended for optimal performance with local model inference

Required Environment Variables

  • HF_TOKEN: Your Hugging Face token for model access
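
As a minimal sketch of how an application might check this variable before loading the model, the snippet below reads HF_TOKEN from the environment and fails early with a clear message if it is missing. The helper name `load_hf_token` and its optional `env` parameter are hypothetical and used only for illustration. They are not part of Pipecat or Ultravox.

```python
import os


def load_hf_token(env=None):
    """Return the Hugging Face token, or raise if HF_TOKEN is unset or empty.

    `env` is a hypothetical hook for passing a custom mapping (handy in tests);
    by default the process environment is used.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it before loading the Ultravox model."
        )
    return token
```

Calling `load_hf_token()` once at startup surfaces a missing token immediately, rather than partway through a model download.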