Generating fashion product descriptions by fine-tuning a Vision-Language Model (VLM) with Amazon SageMaker and Amazon Bedrock
This repository implements a machine learning training and inference pipeline that uses Generative AI (GenAI) to answer questions about product images. Pre-trained models can perform such tasks out of the box, but a) they do not adapt well to domain-specific scenarios, which is why we fine-tune, and b) they are not readily deployable to production environments without additional work.
To solve this problem, this post shows how to extract domain-specific product attributes from product images by fine-tuning a VLM on a fashion dataset using Amazon SageMaker, and then how to use Amazon Bedrock to generate product descriptions from the extracted attributes.
For a detailed walkthrough of this repository, please refer to our blog post.
The data used in this repository comes from the Kaggle Fashion Images Dataset. The use case we address is generating captions for these fashion products for an e-commerce website, a task that has historically been very time-consuming. High-quality product descriptions improve searchability through Search Engine Optimization (SEO) and increase customer satisfaction by helping customers make informed decisions.
The model fine-tuned in this repository is BLIP-2, specifically the variant that uses Flan-T5-XL as its language model.
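As a quick, illustrative sketch (not code from this repository), the pre-trained checkpoint can be loaded locally with the Hugging Face `transformers` library and asked a question about a product image; the image URL and question below are placeholders:

```python
# Minimal sketch: load the pre-trained BLIP-2 (Flan-T5-XL) checkpoint and ask a
# question about a product image. Illustrative only, not the repository's code.
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.float16
).to("cuda")

# Placeholder image URL; any product image works here.
image = Image.open(requests.get("https://example.com/shirt.jpg", stream=True).raw)
question = "Question: What is the colour of this product? Answer:"

inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```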
The following diagram provides an overview of BLIP-2:
The solution can be broken down into two sections, marked green and blue in the architecture below: a) fine-tuning (green) and b) inference (blue).
- The data is downloaded into an S3 bucket
- A subset of the data is used to fine-tune the model with a SageMaker Training Job (see the sketch after this list)
- The fine-tuned model artifacts are then stored in an S3 bucket to be used for inference
- The model artifacts in S3 are deployed behind a SageMaker endpoint
- The endpoint is then invoked with a question and returns a JSON response containing the relevant product attributes
- The response is then passed to Amazon Bedrock along with a pre-defined prompt, which generates the final product description (see the Bedrock sketch after this list)
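The following sketch shows roughly how the fine-tuning, deployment, and invocation steps above could be wired up with the SageMaker Python SDK. The entry-point scripts, framework versions, hyperparameters, instance types, and S3 URIs are illustrative assumptions, not the exact values used in this repository:

```python
# Sketch of the fine-tuning, deployment, and invocation steps with the SageMaker Python SDK.
# Entry points, versions, hyperparameters, instance types, and S3 URIs are placeholders.
import sagemaker
from sagemaker.huggingface import HuggingFace, HuggingFaceModel

role = sagemaker.get_execution_role()

# a) Fine-tune BLIP-2 on the fashion subset with a SageMaker Training Job.
estimator = HuggingFace(
    entry_point="train.py",              # hypothetical training script
    source_dir="scripts",
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters={"epochs": 3, "model_id": "Salesforce/blip2-flan-t5-xl"},
)
estimator.fit({"training": "s3://<bucket>/fashion-dataset/train/"})

# b) Deploy the resulting model artifacts from S3 as a real-time endpoint.
model = HuggingFaceModel(
    model_data=estimator.model_data,     # s3://.../model.tar.gz produced by the training job
    role=role,
    entry_point="inference.py",          # hypothetical inference handler
    source_dir="scripts",
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")

# c) Ask the endpoint for product attributes; the handler is assumed to return JSON.
attributes = predictor.predict({
    "image_url": "s3://<bucket>/images/12345.jpg",
    "question": "What are the colour, gender and season of this product?",
})
print(attributes)
```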
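For the final step, a minimal sketch of the Amazon Bedrock call is shown below, assuming the `bedrock-runtime` client from `boto3` and Anthropic Claude v2 as the text model; the prompt wording and attribute payload are placeholders, and any Bedrock text model could be substituted:

```python
# Sketch of the Amazon Bedrock step: turn the extracted attributes into a
# product description. Model choice and prompt wording are illustrative.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Example attributes as returned by the SageMaker endpoint (placeholder values).
attributes = {"colour": "navy blue", "gender": "men", "season": "summer", "type": "t-shirt"}

prompt = (
    "\n\nHuman: Write a short, SEO-friendly e-commerce product description "
    f"for a fashion item with these attributes: {json.dumps(attributes)}"
    "\n\nAssistant:"
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-v2",  # example model ID
    body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 300}),
)
print(json.loads(response["body"].read())["completion"])
```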
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.