Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

604 write gemini realtime client blogpost #686

Merged
merged 19 commits into from
Jan 30, 2025
Merged
Show file tree
Hide file tree
Changes from 17 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
51 changes: 51 additions & 0 deletions website/_blogs/2025-01-29-RealtimeAgent-with-gemini/index.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
---
title: RealtimeAgent with Gemini API
authors:
- stellaxiang
- marklysze
- sternakt
- davorrunje
tags: [Realtime API, Voice Agents, Gemini]

---

![Realtime agent communication with Gemini live API](img/RealtimeAgent_gemini.png)

**TL;DR:**

- RealtimeAgent now supports [Gemini Multimodal Live API](https://ai.google.dev/api/multimodal-live)

**Why is this important?**

We previously supported a Realtime Agent powered by OpenAI. In December 2024, Google rolled out [Gemini 2.0](https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/), which includes the multi-modal live APIs. These APIs enable advanced capabilities such as real-time processing of multimodal inputs (text, images, and audio) in live conversational settings. To ensure developers can fully leverage the capabilities of the latest LLMs, we now also support a Realtime Agent powered by Gemini.

**How to Use?**

To ensure a seamless experience for developers, we aim to minimize the required changes. The key step is to properly configure your LLM settings, including credentials, LLM setup, and tags. Once this is done, switching between different LLMs becomes straightforward.

For a practical walkthrough, refer to this [Example project](https://github.com/ag2ai/realtime-agent-over-websockets), which demonstrates how to instantiate a Gemini client and configure it within AG2. This includes an overview of WebSocket integration, a key feature for live-streaming scenarios.

For the highlights and demos of Gemini 2.0, check out their [official tech blog](https://developers.googleblog.com/en/the-next-chapter-of-the-gemini-era-for-developers/).

**Key Features of Gemini Integration**

1. **Optimized for Realtime**
Low-latency processing makes Gemini a strong choice for live applications. Combined with AG2's orchestration capabilities, developers can build robust systems that respond in near real time.

**Considerations**

During the implementation of this agent, we observed that audio truncation is not currently natively supported by Gemini. For instance, if the server generates a 10-second audio clip, and only the first 5 seconds are played while the rest is truncated, the server may remain unaware that the remaining 5 seconds were not played.

This limitation highlights an important consideration for applications that rely on fine-grained control over audio playback, such as interactive storytelling or call center solutions.

However, the APIs and models are evolving rapidly, and things could change very fast. With AG2, switching between models is streamlined, making it easier to adapt to different use cases and overcome such challenges.

**Future Potential**

Our decision to support Gemini opens up exciting possibilities for developers:

1. **Customizable LLM Pipelines**
AG2’s architecture allows developers to orchestrate workflows involving Gemini and other LLMs (e.g., OpenAI or Cohere), enabling the creation of highly tailored solutions.

3. **Expanding Model Ecosystem**
By supporting both OpenAI and Gemini, AG2 ensures flexibility and scalability for developers. This reduces vendor lock-in and allows teams to experiment with the strengths of different models to meet their specific needs.
6 changes: 6 additions & 0 deletions website/_blogs/authors.yml
Original file line number Diff line number Diff line change
Expand Up @@ -201,3 +201,9 @@ rjambrecic:
title: Machine Learning Engineer at Airt
url: https://github.com/rjambrecic
image_url: https://github.com/rjambrecic.png

stellaxiang:
name: Stella Xiang
title: Developer
url: https://www.linkedin.com/in/xingyu-xiang/
image_url: https://github.com/stellaxiang.png
Loading