
Image Captioning Project

About

This project aims to create an automated image captioning system that generates natural language descriptions for input images by integrating techniques from computer vision and natural language processing. We employ various techniques ranging from CNN-RNN to more advanced transformer-based methods. Training is conducted on datasets of images paired with descriptive captions, and model performance is evaluated using established metrics such as BLEU, METEOR, and CIDEr. The project also involves experimentation with advanced attention mechanisms, comparisons of different architectural choices, and hyperparameter optimization to refine captioning accuracy and overall system effectiveness.

You can access the report here.

Models

We designed five models, experimenting with different building blocks: ViT, InceptionV3, and YOLO for the image encoder, and LSTM and Transformer decoders for the caption decoder (a minimal sketch of the simplest variant follows the list):

  • CNN-RNN
  • CNN-Attn
  • ViT-Attn
  • ViTCNN-Attn
  • YOLOCNN-Attn
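
As a rough reference for how an encoder and decoder fit together in the simplest of these variants, below is a minimal CNN-RNN sketch in PyTorch. The class names (EncoderCNN, DecoderRNN) and arguments (embed_size, hidden_size) are illustrative assumptions, not the exact modules used in this repository.

# Minimal CNN-RNN captioning sketch (illustrative only; the repo's actual
# module names and wiring may differ).
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """InceptionV3 backbone that maps an image to a fixed-size embedding."""

    def __init__(self, embed_size):
        super().__init__()
        self.backbone = models.inception_v3(weights="DEFAULT")
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, embed_size)

    def forward(self, images):
        # In training mode, inception_v3 returns (logits, aux_logits).
        out = self.backbone(images)
        return out[0] if isinstance(out, tuple) else out


class DecoderRNN(nn.Module):
    """LSTM decoder that generates a caption conditioned on the image embedding."""

    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first "token" of the sequence.
        embeddings = self.embed(captions)
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)

A full implementation would also handle caption padding, teacher forcing, and greedy or beam-search decoding at inference time.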

Sample Captions Generated

Environment

conda env create -f environment.yml
source activate SC4001

Datasets

For this experiment, we utilized two datasets:

  • MSCOCO
  • Flickr30k
sh download-datasets.sh
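
After the download script finishes, MSCOCO image-caption pairs can be loaded, for example, with torchvision's CocoCaptions wrapper. The directory layout below is an assumption about where download-datasets.sh places the files; adjust the paths as needed.

# Hedged example: loading MSCOCO image-caption pairs with torchvision.
import torchvision.transforms as T
from torchvision.datasets import CocoCaptions

transform = T.Compose([
    T.Resize((299, 299)),          # InceptionV3 expects 299x299 inputs
    T.ToTensor(),
])

train_set = CocoCaptions(
    root="data/mscoco/train2014",                                # assumed image folder
    annFile="data/mscoco/annotations/captions_train2014.json",   # assumed annotation file
    transform=transform,
)

image, captions = train_set[0]     # one image tensor, a list of reference captions
print(len(train_set), captions[:2])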

Metrics

We used BLEU, METEOR, and CIDEr to evaluate the captions generated by the models.
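
As a rough illustration of how these scores work, the snippet below scores a single generated caption against its references with NLTK's BLEU and METEOR implementations (CIDEr typically comes from the pycocoevalcap package and is omitted here). The captions are made-up examples, not outputs of our models.

# Hedged example of scoring one caption; not the repo's evaluation pipeline.
# Requires: pip install nltk, plus nltk.download("wordnet") for METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

references = [
    "a dog runs across the grassy field".split(),
    "a brown dog is running on grass".split(),
]
candidate = "a dog is running on the grass".split()

bleu = sentence_bleu(references, candidate,
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score(references, candidate)

print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")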

Training

Training is done on the training and validation splits.

To train with the config script, create a YAML config file inside the config folder and run train.py. Refer to sample_train_command.sh for more examples.

python train.py --config_file config-mscoco-cnnrnn.yaml --embed_size={embed_size} \
                --batch_size={batch_size} --learning_rate={learning_rate}

For hyperparameter tuning, define the parameter search space in grid_train_script.py and run:

python grid_train_script.py
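
Conceptually, the grid is the cross product of the hyperparameter lists. The sketch below illustrates the idea with assumed parameter names and a subprocess call into train.py; grid_train_script.py in this repo may structure its search space differently.

# Illustrative grid search over hyperparameters (assumed parameter names).
import itertools
import subprocess

search_space = {
    "embed_size": [256, 512],
    "batch_size": [32, 64],
    "learning_rate": [1e-3, 3e-4],
}

keys = list(search_space)
for values in itertools.product(*(search_space[k] for k in keys)):
    args = [f"--{k}={v}" for k, v in zip(keys, values)]
    subprocess.run(
        ["python", "train.py", "--config_file", "config-mscoco-cnnrnn.yaml", *args],
        check=True,
    )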

Evaluation

Evaluation is done on the test split to provide the final model score on data not seen during training.

For evaluation, run eval.py with the same arguments as in training; there is no need to specify a config file. Example:

python eval.py --batch_size={batch_size} --learning_rate={learning_rate} --embed_size={embed_size} \
               --num_layers={num_layers} --model_arch={model} --dataset={dataset} \
               --checkpoint_dir={checkpoint_dir}

If you tuned the hyperparameters, you can run select_and_eval_model.py to evaluate the best model based on the training logs. It generates graphs and sample captions for the best models in the eval folder.

python select_and_eval_model.py

Experiment Results

MSCOCO

Flickr30k
