Skip to content

Latest commit

 

History

History
165 lines (130 loc) · 6.23 KB

README.md

File metadata and controls

165 lines (130 loc) · 6.23 KB

Densecap-tensorflow

Implementation of CVPR2017 paper: Dense captioning with joint inference and visual context by Linjie Yang, Kevin Tang, Jianchao Yang, Li-Jia Li

WITH CHANGES:

  1. Borrow the idea of Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling, and tied word vectors and word classfiers during captioning.
  2. Initialize Word Vectors and Word Classifers with pre-trained glove word vectors with dimensions of 300.
  3. Change the backbone of the framework to ResNet-50.
  4. Add Beam Search and Length Normalization in test mode.
  5. Add "Limit_RAM" mode when praparing training date since my computer only has RAM with 8G.

Special thanks to valohai for offering computing resource.

Note

Update 2017.12.31

  • After 500k iterations of training with configurations of original paper (except for the weights tying of wordvec and classifiers), it achieves mAP 8.296.

Update 2017.12.20

  • After 1 epoch(80000 iters) of training with randomly initialized word vectors(512d), it achieves mAP 6.509.
  • After 1 epoch(75000) of training with pre-trianed glove word vectors(300d), it got mAP 5.5 nearly.
  • The complete training process will take almost 10 days with the computation I have access to, and I just trained 1 epoch to varify the framework for now.
  • The scripts should be compatible with both python 2.X and 3.X. Although I built it under python 2.7.
  • Tested on Ubuntu 16.04, tensorflow 1.4, CUDA 8.0 and cudnn 6.0, with GPU Nvidia gtx 1060(LOL...).

Dependencies

To install required python modules by:

pip install -r lib/requirements.txt

For evaluation, one also need:

To install java runtime by:

sudo apt-get install openjdk-8-jre

Preparing data

Download

Website of Visual Genome Dataset

  • Make a new directory VG wherever you like.
  • Download images Part1 and Part2, extract all (two parts) to directory VG/images
  • Download image meta data, extract to directory VG/1.2 or VG/1.0 according to the version you download.
  • Download region descriptions, extract to directory VG/1.2 or VG/1.0 accordingly.
  • For the following process, we will refer the absolute path of directory VG as raw_data_path, e.g. /home/user/git/VG.

Unlimit RAM

If one has RAM more than 16G, then you can preprocessing dataset with following command.

$ cd $ROOT/lib
$ python preprocess.py --version [version] --path [raw_data_path] \
        --output_dir [dir] --max_words [max_len]

Limit RAM (Less than 16G)

If one has RAM less than 16G.

  • Firstly, setting up the data path in info/read_regions.py accordingly, and run the script with python. Then it will dump regions in REGION_JSON directory. It will take time to process more than 100k images, so be patient.
$ cd $ROOT/info
$ python read_regions --version [version] --vg_path [raw_data_path]
  • In lib/preprocess.py, set up data path accordingly. After running the file, it will dump gt_regions of every image respectively to OUTPUT_DIR as directory.
$ cd $ROOT/lib
$ python preprocess.py --version [version] --path [raw_data_path] \
        --output_dir [dir] --max_words [max_len] --limit_ram

Compile local libs

$ cd root/lib
$ make

Train

Add or modify configurations in root/scripts/dense_cap_config.yml, refer to 'lib/config.py' for more configuration details.

$ cd $ROOT
$ bash scripts/dense_cap_train.sh [dataset] [net] [ckpt_to_init] [data_dir] [step]

Parameters:

  • dataset: visual_genome_1.2 or visual_genome_1.0.
  • net: res50, res101
  • ckpt_to_init: pretrained model to be initialized with. Refer to tf_faster_rcnn for more init weight details.
  • data_dir: the data directory where you save the outputs after prepare data.
  • step: for continue training.
    • step 1: fix convnet weights
    • stpe 2: finetune convnets weights
    • step 3: add context fusion, but fix convnets weights
    • step 4: finetune the whole model.

Demo

Create a directory data/demo

$ mkdir $ROOT/data/demo

Then put the images to be tested in the directory.

Download pretrained model (iters 500k) by Google Drive or Jbox. Then create a "output" directory under $ROOT

$ mkdir $ROOT/output

Extract the downloaded "ckpt.zip" to directory $ROOT/output. And run

$ cd $ROOT
$ bash scripts/dense_cap_demo.sh ./output/ckpt ./output/ckpt/vocabulary.txt

or run

$ bash scripts/dense_cap_demo.sh [ckpt_path] [vocab_path]

for your customized checkpoint directory.

It will create html files in $ROOT/demo, just click it. Or you can use the web-based visualizer created by karpathy by

$ cd $ROOT/vis
$ python -m SimpleHTTPServer 8181

Then point your web brower to http://localhost:8181/view_results.html.

TODO:

  • preprocessing dataset.
  • roi_data_layer & get data well prepared for feeding.
  • proposal layer
  • sentense data layer
  • embedding layer
  • get loc loss and caption loss
  • overfit a mini-batch
  • context fusion
  • add experiment result.

References