Bông - A Gemini-o Web Application

Overview

In keeping with Google for Developers' mission to build for the community, this repository shares the complete workflow of the web application presented at the AI booth at Google I/O Extended Hanoi 2024. The application, named Bông, is a real-time vision-language model (VLM) web app with both voice input and voice output.

[Figure: Architecture. Speak, See, and Interact with Bông]

Features

  • Real-time VLM Web App: Supports both voice input and output for interactive experiences.
  • Multimodal Model Integration: Uses Gemini 1.5 Flash to handle audio, image, video, and text inputs.
  • Google Ecosystem Utilization: Uses Google APIs and Google Cloud WaveNet text-to-speech (TTS) for spoken output in Vietnamese.
  • RAG Workflow: Uses Retrieval-Augmented Generation (RAG) to keep responses current with event information and GDG Hanoi news.
  • Natural and Humorous Responses: Designed to engage attendees with real-time, context-aware interactions.

Technical Details

Model and Processing

  • Gemini 1.5 Flash: A lightweight model optimized for speed and efficiency at scale, with a context window of up to 1 million tokens.
  • Multimodal Input: Accepts webcam video, speech captured from the microphone and transcribed by speech recognition, and other media types.
  • Google Cloud's WaveNet TTS: Synthesizes natural-sounding Vietnamese speech for the app's voice output.
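
To make the WaveNet TTS step concrete, here is a minimal sketch of the JSON body that the Cloud Text-to-Speech REST method `text:synthesize` accepts. The helper name `buildVietnameseTtsRequest` and the default voice `vi-VN-Wavenet-A` are assumptions made for illustration; the repository may select a different voice or call the service through a client library instead.

```typescript
// Shape of a Cloud Text-to-Speech `text:synthesize` request body
// (only the fields this sketch needs).
interface SynthesizeRequest {
  input: { text: string };
  voice: { languageCode: string; name: string };
  audioConfig: { audioEncoding: "MP3" | "LINEAR16" | "OGG_OPUS" };
}

// Hypothetical helper: builds a request for Vietnamese speech.
// Assumption: "vi-VN-Wavenet-A" is an illustrative default, not
// necessarily the voice used by this project.
function buildVietnameseTtsRequest(
  text: string,
  voiceName: string = "vi-VN-Wavenet-A",
): SynthesizeRequest {
  return {
    input: { text },
    voice: { languageCode: "vi-VN", name: voiceName },
    audioConfig: { audioEncoding: "MP3" },
  };
}
```

A server-side handler could POST this body to `https://texttospeech.googleapis.com/v1/text:synthesize` with valid credentials; the response carries the audio as a base64 string in `audioContent`.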

Workflow

  1. Embedding Extraction: Uses the Google Text Embedding API to compute embeddings for text collected from URLs.
  2. Chain Construction with LangChain: Builds a system prompt that incorporates the conversation history, which serves as the app's memory.
  3. Real-time Response: The web application responds in real time, even in noisy environments and with multiple people in the frame.
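
The retrieval and prompt-building steps above can be sketched in plain TypeScript. This is not the repository's LangChain chain: it assumes the embeddings have already been computed (the vectors in the usage example would come from the embedding API), ranks chunks by cosine similarity, and keeps only the most recent turns of history. All names (`Chunk`, `Turn`, `topK`, `buildPrompt`, `maxTurns`) are hypothetical.

```typescript
// A text chunk with its precomputed embedding (assumed to come from
// an embedding API; this sketch does not call one).
interface Chunk { text: string; embedding: number[] }
interface Turn { role: "user" | "assistant"; content: string }

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors match nothing
}

// Return the k chunks most similar to the query embedding.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((l, r) => r.score - l.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Assemble one prompt string from the system text, the retrieved
// context, the last `maxTurns` turns of history, and the new question.
function buildPrompt(
  system: string,
  context: Chunk[],
  history: Turn[],
  question: string,
  maxTurns: number = 6,
): string {
  const recent = maxTurns > 0 ? history.slice(-maxTurns) : [];
  return [
    system,
    "Context:",
    ...context.map((c) => `- ${c.text}`),
    "Conversation so far:",
    ...recent.map((t) => `${t.role}: ${t.content}`),
    `user: ${question}`,
  ].join("\n");
}
```

Capping the history at a fixed number of turns is one simple way to bound prompt size; a real deployment might summarize older turns instead.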

Installation

  1. Clone the repository:

    git clone https://github.com/tuanlda78202/geminio.git
  2. Navigate to the project directory:

    cd geminio
  3. Install dependencies:

    npm install
  4. Download the .google-cloud-credentials file from Google Cloud, then set VITE_GEMINI_KEY and GOOGLE_APPLICATION_CREDENTIALS in .env.

  5. Run the application:

    npm run dev
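
Because a missing key is an easy mistake in step 4, a small startup check can help. The sketch below is an assumption, not part of the repository: `missingEnvKeys` is a hypothetical helper that reports which of the two variables named above are absent or blank in a given environment object.

```typescript
// The two variables the installation steps ask you to set in .env.
const REQUIRED_KEYS = ["VITE_GEMINI_KEY", "GOOGLE_APPLICATION_CREDENTIALS"] as const;

// Hypothetical helper: returns the names of required keys that are
// undefined or contain only whitespace.
function missingEnvKeys(env: Record<string, string | undefined>): string[] {
  return REQUIRED_KEYS.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}
```

In a Node script this could be called as `missingEnvKeys(process.env)` once the .env file has been loaded, failing fast with a clear message if the returned list is not empty.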

Usage

  1. Make sure port 3001 is open; the app uses it for Google Cloud TTS.
  2. Open your browser and navigate to http://localhost:3000.
  3. Allow access to your microphone and webcam.
  4. Interact with the application using voice commands and visual inputs.

Contributing

We welcome contributions from the community. Please follow these steps to contribute:

  1. Fork the repository.

  2. Create a new branch for your feature or bug fix:

    git checkout -b feature-name
  3. Commit your changes:

    git commit -m "Description of feature or fix"
  4. Push to the branch:

    git push origin feature-name
  5. Create a pull request on GitHub.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions or feedback, please contact me.
