This project focuses on embedding and retrieval of a large-scale fashion product dataset collected from major brands such as Aarong, Allen Solly, Bata, Apex, and Infinity. The dataset consists of over 20,000 products, covering a wide variety of categories and styles. The notebook leverages powerful models and tools to create embeddings for both text and images, and then stores these embeddings in a vector database using Qdrant. This setup enables efficient and accurate retrieval of fashion products based on semantic similarity.
The dataset, hosted on Hugging Face, includes over 20,000 fashion products scraped from multiple sources, with details like product category, company, name, description, specifications, image links, and more. You can explore the dataset here.
- Text Embeddings: The notebook uses OpenAI's `text-embedding-3-large` model to create high-dimensional embeddings for the product descriptions and summaries.
- Image Embeddings: CLIP (`clip-ViT-B-32`) from the SentenceTransformer library is employed to generate image embeddings. This model captures visual features that can be used to find similar products based on their appearance.
For each product, a summary string is generated, capturing key details such as category, company, name, and specifications. This string is then embedded using the text model. Simultaneously, the primary product image is downloaded, processed, and encoded to generate an image embedding. Both embeddings are stored in the Qdrant collection for efficient vector search.
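As a rough sketch of this step (not the notebook's exact code), assuming the column names follow the dataset description above (`Category`, `Company`, `Name`, `Specifications`, `Image_link`), both embeddings for a product could be produced like this:

```python
from io import BytesIO

import requests
from openai import OpenAI
from PIL import Image
from sentence_transformers import SentenceTransformer

openai_client = OpenAI()                           # expects OPENAI_API_KEY in the environment
clip_model = SentenceTransformer("clip-ViT-B-32")  # CLIP model for image vectors


def embed_product(row):
    # Summary string capturing the key product details
    summary = (f"Category: {row['Category']}, Company: {row['Company']}, "
               f"Name: {row['Name']}, Specifications: {row['Specifications']}")

    # Text embedding with OpenAI's text-embedding-3-large
    text_vec = openai_client.embeddings.create(
        model="text-embedding-3-large", input=summary
    ).data[0].embedding

    # Download the primary product image and encode it with CLIP
    image = Image.open(BytesIO(requests.get(row["Image_link"]).content)).convert("RGB")
    image_vec = clip_model.encode(image).tolist()

    return summary, text_vec, image_vec
```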
The Qdrant database is employed as a vector store for these embeddings, supporting real-time similarity searches based on both text and image queries. The notebook creates a collection that accommodates both summary and image vectors using cosine similarity.
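A minimal sketch of such a collection, assuming a local Qdrant instance, a collection name of `fashion_products`, named vectors `summary` and `image`, and the default output dimensions of the two models (3072 for `text-embedding-3-large`, 512 for `clip-ViT-B-32`):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

client.recreate_collection(
    collection_name="fashion_products",
    vectors_config={
        # One named vector per modality, both compared with cosine similarity
        "summary": VectorParams(size=3072, distance=Distance.COSINE),  # text-embedding-3-large
        "image": VectorParams(size=512, distance=Distance.COSINE),     # clip-ViT-B-32
    },
)
```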
The notebook iterates over the dataset and:
- Generates unique document IDs.
- Prepares summary strings for text embedding.
- Downloads and processes product images.
- Computes embeddings for both text and images.
- Stores the embeddings and related metadata (like product ID, links, and descriptions) into Qdrant, as sketched below.
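A hedged sketch of that loop, reusing the `embed_product` helper and Qdrant `client` from the sketches above, and assuming the dataset has been loaded into a pandas DataFrame `product_df`; the payload field names are illustrative and may differ from the notebook's schema:

```python
import uuid

from qdrant_client.models import PointStruct

points = []
for _, row in product_df.iterrows():
    summary, text_vec, image_vec = embed_product(row)

    points.append(
        PointStruct(
            id=str(uuid.uuid4()),  # unique document ID
            vector={"summary": text_vec, "image": image_vec},
            payload={
                "product_id": row.get("Product_id"),
                "name": row.get("Name"),
                "image_link": row.get("Image_link"),
                "description": row.get("Description"),
                "summary": summary,
            },
        )
    )

# Upsert in small batches to keep individual requests manageable
for i in range(0, len(points), 64):
    client.upsert(collection_name="fashion_products", points=points[i : i + 64])
```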
This setup allows seamless integration into any system requiring fashion product recommendations or search functionality based on multi-modal data.
The image above showcases the number of vector points stored in the Qdrant collection, visualizing the scale of the dataset and the embeddings stored.
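Once points are stored, the collection can be queried by semantic similarity. Below is a small sketch of a text query against the `summary` vector, reusing the clients from the earlier sketches; an image query works the same way against the `image` vector with a CLIP embedding:

```python
def search_products(query, limit=5):
    # Embed the query with the same text model used for the product summaries
    query_vec = openai_client.embeddings.create(
        model="text-embedding-3-large", input=query
    ).data[0].embedding

    # Search the named "summary" vector; swap in a CLIP embedding and the
    # "image" vector name to search by appearance instead
    return client.search(
        collection_name="fashion_products",
        query_vector=("summary", query_vec),
        limit=limit,
    )


for hit in search_products("red cotton kurta for women"):
    print(hit.score, hit.payload.get("name"))
```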
- Clone the repository and install the necessary dependencies (an example install command follows this list).
- Load the dataset from Hugging Face.
- Run the notebook to start embedding and storing fashion products in Qdrant.
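The exact dependency list lives in the repository, but in a fresh environment the core Python packages inferred from the tools above can be installed roughly like this:

```
!pip install -qU datasets sentence-transformers qdrant-client openai pillow requests
```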
The project is an excellent resource for anyone looking to explore multi-modal embeddings, vector databases, and fashion data at scale.
This project utilizes the LLaVA (Language and Vision Assistant) model to generate product descriptions and specifications from images. The model is based on a conversational AI architecture that can interact with both text and visual inputs.
Before running the code, ensure you have the following dependencies installed:
- Python 3.7+
- Google Colab or a local environment with GPU support
- Hugging Face's `transformers` and `datasets` libraries
- `torch` for PyTorch support
- `PIL` for image processing
- Install the LLaVA package:

  ```
  !pip install git+https://github.com/haotian-liu/LLaVA.git@786aa6a19ea10edc6f574ad2e16276974e9aaa3a
  ```
- Install additional dependencies:

  ```
  !pip install -qU datasets
  ```
- Initialize the LLaVA ChatBot:

  ```python
  from transformers import AutoTokenizer, BitsAndBytesConfig
  from llava.model import LlavaLlamaForCausalLM
  from llava.utils import disable_torch_init
  from llava.constants import (IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN,
                               DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN)
  from llava.mm_utils import tokenizer_image_token, KeywordsStoppingCriteria
  from llava.conversation import conv_templates, SeparatorStyle
  import torch
  from PIL import Image
  import requests
  from io import BytesIO

  # LLaVAChatBot is the helper class defined in the notebook; it wraps model
  # loading and multi-turn chat over an image, and is initialized here with
  # 8-bit quantization settings.
  chatbot = LLaVAChatBot(load_in_8bit=True,
                         bnb_8bit_compute_dtype=torch.float16,
                         bnb_8bit_use_double_quant=True,
                         bnb_8bit_quant_type='nf8')
  ```
- Load the dataset:

  ```python
  from datasets import load_dataset

  fashion = load_dataset("thegreyhound/demo2", split="train")
  product_df = fashion.to_pandas()
  ```
- Generate product descriptions and specifications:

  ```python
  cnt = 1
  for index, row in product_df.iterrows():
      str1 = "Given Image detail was: " + row['Description'] + " Now generate a brief high level description for the product shown in the image"
      str2 = "Given Image detail was: " + row['Description'] + " Now generate a detailed specifications for the product shown in the image including the fabric, color, design, style etc"
      ans1 = chatbot.start_new_chat(img_path=row['Image_link'], prompt=str1)
      ans2 = chatbot.start_new_chat(img_path=row['Image_link'], prompt=str2)
      product_df.loc[index, 'Description'] = ans1
      product_df.loc[index, 'Specifications'] = ans2
      print(cnt)
      cnt += 1
  ```
The script processes images and generates high-level product descriptions and detailed specifications. The final output is saved in a JSON file containing an array of product information.
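As a minimal sketch of that final step (the filename here is hypothetical; the notebook may use a different name or layout), the updated DataFrame can be written out as a JSON array of records:

```python
# One JSON object per product, collected into a single array
product_df.to_json("products_with_descriptions.json", orient="records", indent=2)
```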
This project is licensed under the MIT License - see the LICENSE file for details.
You can find more details and access the dataset at Hugging Face.