Overview

GeminiMultimodalLiveLLMService enables natural, real-time conversations with Google’s Gemini model. It provides built-in audio transcription, voice activity detection, and context management for creating interactive AI experiences with multimodal capabilities including audio, video, and text processing.
Want to start building? Check out our Gemini Multimodal Live Guide.

Installation

To use Gemini Multimodal Live services, install the required dependencies:
pip install "pipecat-ai[google]"

Prerequisites

Google AI Setup

Before using Gemini Multimodal Live services, you need:
  1. Google Account: Set up at Google AI Studio
  2. API Key: Generate a Gemini API key from AI Studio
  3. Model Access: Ensure access to Gemini Live models
  4. Multimodal Configuration: Decide which modalities (audio, video, text) your application will send and receive

Required Environment Variables

  • GOOGLE_API_KEY: Your Google Gemini API key for authentication
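A minimal sketch of reading this variable at startup and failing fast when it is missing (the helper name is illustrative, not part of Pipecat):

```python
import os


def get_gemini_api_key() -> str:
    """Read GOOGLE_API_KEY from the environment, failing fast if unset."""
    key = os.getenv("GOOGLE_API_KEY")
    if not key:
        raise RuntimeError(
            "GOOGLE_API_KEY is not set; generate a key in Google AI Studio "
            "and export it before starting your bot."
        )
    return key
```

Validating the key up front gives a clear error at launch rather than an authentication failure mid-conversation.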

Key Features

  • Multimodal Processing: Handle audio, video, and text inputs simultaneously
  • Real-time Streaming: Low-latency audio and video processing
  • Voice Activity Detection: Automatic speech detection and turn management
  • Function Calling: Advanced tool integration and API calling capabilities
  • Context Management: Intelligent conversation history and system instruction handling
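Putting the pieces together, a hedged sketch of the configuration you would typically pass when constructing the service (the parameter names `api_key`, `voice_id`, and `system_instruction` follow common Pipecat conventions but are assumptions here; check the current API reference before relying on them):

```python
import os

# Assumed constructor arguments for GeminiMultimodalLiveLLMService;
# verify names and accepted values against the Pipecat API reference.
service_kwargs = {
    "api_key": os.getenv("GOOGLE_API_KEY", ""),
    "voice_id": "Puck",  # assumption: one of the Gemini Live prebuilt voices
    "system_instruction": "You are a helpful voice assistant.",
}

# Actual construction (requires pipecat-ai[google] to be installed):
# from pipecat.services.gemini_multimodal_live.gemini import (
#     GeminiMultimodalLiveLLMService,
# )
# llm = GeminiMultimodalLiveLLMService(**service_kwargs)
```

The service would then be placed in a Pipecat pipeline like any other LLM service; the Gemini Multimodal Live Guide linked above walks through a full pipeline.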