Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data for both text and vision.
This model enables multi-frame image understanding, image comparison, multi-image summarization/storytelling, and video summarization, which have broad applications in office scenarios.
Follow these steps to set up and run the project:
i. Download and Install NVIDIA CUDA
Visit the NVIDIA CUDA Toolkit Downloads page and follow the instructions to install CUDA compatible with your system.
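After installing the toolkit, it can help to confirm that the CUDA command-line tools are visible from the shell you will run the project in. The following is a minimal sketch using only the Python standard library; it assumes that `nvcc` and `nvidia-smi` are the tools you expect on your `PATH`, which depends on how CUDA was installed on your system.

```python
import shutil
import subprocess


def find_cuda_tools():
    """Return the resolved path of each CUDA tool, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in ("nvcc", "nvidia-smi")}


def nvcc_version():
    """Return the text printed by `nvcc --version`, or None if it cannot be run."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    try:
        result = subprocess.run(
            [nvcc, "--version"],
            capture_output=True,
            text=True,
            check=False,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() or None


# Print a short report of what was found.
for tool, path in find_cuda_tools().items():
    print(f"{tool}: {path or 'not found on PATH'}")
print(nvcc_version() or "nvcc version unavailable")
```

If either tool is reported as not found, the CUDA `bin` directory is probably missing from `PATH`, or the installation did not complete.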
ii. Install Required Python Packages
Ensure you have all the necessary dependencies installed by running the following commands:
pip install -r requirements.txt
pip install flash_attn
If you encounter any issues while installing flash_attn, refer to the FlashAttention Installation Guide for troubleshooting tips and additional setup details.
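A quick way to check that the installation succeeded is to ask Python whether the packages can be located, without actually importing them. This is a small sketch, not part of the project: the package list below is an assumption based on the tools named in this README (`flash_attn`, `litserve`, `streamlit`), and the real contents of `requirements.txt` may differ.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package list; adjust it to match requirements.txt.
expected = ["flash_attn", "litserve", "streamlit"]

missing = missing_packages(expected)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All expected packages were found.")
```

Because `find_spec` only locates a package, this does not prove that `flash_attn` was compiled against a compatible CUDA version; it only catches the case where the install did not happen at all.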
Launch the API server powered by LitServe:
python server.py
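Once the server is running, you can also call it directly instead of going through the Streamlit app. The sketch below uses only the standard library and assumes LitServe's default address (`http://localhost:8000/predict`). The payload fields (`prompt`, `image`) are hypothetical, since the request format is defined in `server.py` and is not described here.

```python
import base64
import json
import urllib.request

# Assumed address: LitServe's default port and endpoint path.
SERVER_URL = "http://localhost:8000/predict"


def build_payload(prompt, image_bytes):
    """Build a JSON-serialisable payload (field names are hypothetical)."""
    return {
        "prompt": prompt,
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }


def build_request(payload, url=SERVER_URL):
    """Wrap the payload in a JSON POST request."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(payload, url=SERVER_URL, timeout=120):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(payload, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example usage (requires the server to be running):
# with open("example.jpg", "rb") as f:
#     print(send(build_payload("Describe this image.", f.read())))
```

Check the request decoding logic in `server.py` and rename the payload fields to match before using this against the real server.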
Start the Streamlit application with the following command:
streamlit run app.py
This project is developed and maintained with ❤️ by Bhimraj Yadav.