Showing 63 changed files with 5,631 additions and 12 deletions.
@@ -0,0 +1,9 @@
*.pyc
.vscode
__pycache__
output
.ipynb_checkpoints
notebooks
tcp-checker
checkpoints/
data/
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Kakao Brain Corp.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -1,40 +1,217 @@
# TCL: Text-grounded Contrastive Learning (CVPR'23)

Official PyTorch implementation of [**Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs**](https://arxiv.org/abs/2212.00785), *Junbum Cha, Jonghwan Mun, Byungseok Roh*, CVPR 2023.

**T**ext-grounded **C**ontrastive **L**earning (TCL) is an open-world semantic segmentation framework that uses only image-text pairs. TCL enables a model to learn region-text alignment without a train-test discrepancy.

We will release a demo soon.

<div align="center">
<figure>
<img alt="" src="./assets/radar_chart.jpg" width="480">
<img alt="" src="./assets/method.jpg">
</figure>
</div>

## Results

TCL can perform segmentation on both (a, c) existing segmentation benchmarks and (b) arbitrary concepts, such as proper nouns and free-form text, in images in the wild.
<div align="center">
<figure>
<img alt="" src="./assets/main.jpg">
</figure>
</div>

<br/>

<details>
<summary> Additional examples in PASCAL VOC </summary>
<p align="center">
<img src="./assets/examples-voc.jpg" width="800" />
</p>
</details>

<details>
<summary> Additional examples in the wild </summary>
<p align="center">
<img src="./assets/examples-in-the-wild.jpg" width="800" />
</p>
</details>


## Dependencies

We used PyTorch 1.12.1 and torchvision 0.13.1.

```bash
pip install -U openmim
mim install mmcv-full==1.6.2 mmsegmentation==0.27.0
pip install -r requirements.txt
```

Note that the order of packages in the requirements file roughly reflects how important each pinned version is. We recommend using the same versions at least for `webdataset`, `mmsegmentation`, and `timm`.
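To double-check that your environment matches, a quick sketch like the following prints the installed versions (the package names are the PyPI names used above):

```python
# Print installed versions of the pinned packages to verify the environment.
from importlib.metadata import version

for pkg in ("torch", "torchvision", "mmcv-full", "mmsegmentation", "timm", "webdataset"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except Exception:
        print(f"{pkg}: not installed")
```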

## Datasets

Much of this section is adapted from the [data preparation section of the GroupViT README](https://github.com/NVlabs/GroupViT#data-preparation).

We use [webdataset](https://webdataset.github.io/webdataset/) as a scalable data format for training and [mmsegmentation](https://github.com/open-mmlab/mmsegmentation) for semantic segmentation evaluation.

The overall file structure is as follows:
```shell
TCL
├── data
│   ├── gcc3m
│   │   ├── gcc-train-000000.tar
│   │   ├── ...
│   ├── gcc12m
│   │   ├── cc-000000.tar
│   │   ├── ...
│   ├── cityscapes
│   │   ├── leftImg8bit
│   │   │   ├── train
│   │   │   ├── val
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val
│   ├── VOCdevkit
│   │   ├── VOC2012
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClass
│   │   │   ├── ImageSets
│   │   │   │   ├── Segmentation
│   │   ├── VOC2010
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClassContext
│   │   │   ├── ImageSets
│   │   │   │   ├── SegmentationContext
│   │   │   │   │   ├── train.txt
│   │   │   │   │   ├── val.txt
│   │   │   ├── trainval_merged.json
│   │   ├── VOCaug
│   │   │   ├── dataset
│   │   │   │   ├── cls
│   ├── ade
│   │   ├── ADEChallengeData2016
│   │   │   ├── annotations
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   │   │   ├── images
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   ├── coco_stuff164k
│   │   ├── images
│   │   │   ├── train2017
│   │   │   ├── val2017
│   │   ├── annotations
│   │   │   ├── train2017
│   │   │   ├── val2017
```
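As a quick sanity check of the training shards, the following sketch iterates over one tar file. It assumes img2dataset's default webdataset keys (`jpg` for the image, `txt` for the caption), which is how the download commands below store the pairs:

```python
# Minimal sketch: peek at one image-text shard in webdataset format.
# Assumes img2dataset's default keys: "jpg" (image) and "txt" (caption).
import webdataset as wds

dataset = (
    wds.WebDataset("data/gcc3m/gcc-train-000000.tar")
    .decode("pil")              # decode image bytes to PIL images
    .to_tuple("jpg", "txt")     # yield (image, caption) pairs
)

for image, caption in dataset:
    print(image.size, repr(caption[:60]))
    break  # inspect only the first sample
```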

The instructions for preparing each dataset are as follows.

### Training datasets

For training, we use Conceptual Captions 3M and 12M, downloaded and preprocessed with the [img2dataset](https://github.com/rom1504/img2dataset) tool.

#### GCC3M

Please download the training split annotation file from [Conceptual Captions 3M](https://ai.google.com/research/ConceptualCaptions/download) and name it `gcc3m.tsv`.

Then run `img2dataset` to download the image-text pairs and save them in the webdataset format:
```bash
# Prepend the header row expected by img2dataset.
sed -i '1s/^/caption\turl\n/' gcc3m.tsv
img2dataset --url_list gcc3m.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder data/gcc3m \
    --processes_count 16 --thread_count 64 \
    --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
    --enable_wandb True --save_metadata False --oom_shard_count 6
# Rename the shards to match the `gcc-train-` prefix used in the config.
rename -d 's/^/gcc-train-/' data/gcc3m/*
```

Please refer to the [img2dataset CC3M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md) for more details.

#### GCC12M

Please download the annotation file from [Conceptual Captions 12M](https://github.com/google-research-datasets/conceptual-12m) and name it `gcc12m.tsv`.

Then run `img2dataset` to download the image-text pairs and save them in the webdataset format:
```bash
# Prepend the header row expected by img2dataset.
sed -i '1s/^/caption\turl\n/' gcc12m.tsv
img2dataset --url_list gcc12m.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder data/gcc12m \
    --processes_count 16 --thread_count 64 \
    --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
    --enable_wandb True --save_metadata False --oom_shard_count 6
# Rename the shards to match the `cc-` prefix used in the config.
rename -d 's/^/cc-/' data/gcc12m/*
```

Please refer to the [img2dataset CC12M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc12m.md) for more details.
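After both downloads finish, a small sketch like this can sanity-check the shards (shard names as produced by the `rename` commands above; counting every sample is slow, so only the first shard is iterated):

```python
# Count shards, and samples in the first shard, for each downloaded dataset.
import glob
import webdataset as wds

for name, pattern in [("gcc3m", "data/gcc3m/gcc-train-*.tar"),
                      ("gcc12m", "data/gcc12m/cc-*.tar")]:
    shards = sorted(glob.glob(pattern))
    n = sum(1 for _ in wds.WebDataset(shards[:1]))  # drop [:1] to count everything
    print(f"{name}: {len(shards)} shards, {n} samples in the first shard")
```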

### Evaluation datasets

In the paper, we use 8 benchmarks: (i) with a background class: PASCAL VOC, PASCAL Context, and COCO-Object; and (ii) without a background class: PASCAL VOC20, PASCAL Context59, COCO-Stuff, Cityscapes, and ADE20k.
Since some benchmarks share data sources (e.g., VOC20 and VOC), only 5 datasets need to be prepared: PASCAL VOC, PASCAL Context, COCO-Stuff164k, Cityscapes, and ADE20k.

Please download and set up the [PASCAL VOC](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-voc), [PASCAL Context](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-context), [COCO-Stuff164k](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#coco-stuff-164k), [Cityscapes](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#cityscapes), and [ADE20k](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#ade20k) datasets following the [MMSegmentation data preparation document](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md).

#### COCO Object

The COCO-Object dataset uses only the object classes of COCO-Stuff164k, collected from the instance segmentation annotations.
Run the following command to convert the instance segmentation annotations to semantic segmentation annotations:

```shell
python convert_dataset/convert_coco.py data/coco_stuff164k/ -o data/coco_stuff164k/
```
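For reference, the core idea of the conversion is to rasterize every instance mask into a single per-pixel category map. A rough sketch of that idea (not the repository's `convert_coco.py`; the annotation and output paths here are hypothetical):

```python
# Sketch: turn COCO instance annotations into semantic segmentation maps.
# Each pixel gets the category id of an instance covering it; 0 stays background.
import numpy as np
from PIL import Image
from pycocotools.coco import COCO

coco = COCO("data/coco_stuff164k/annotations/instances_val2017.json")  # hypothetical path
for img_id in coco.getImgIds():
    info = coco.loadImgs(img_id)[0]
    sem = np.zeros((info["height"], info["width"]), dtype=np.uint8)
    for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
        sem[coco.annToMask(ann) == 1] = ann["category_id"]
    out_name = info["file_name"].replace(".jpg", ".png")
    Image.fromarray(sem).save(f"output/coco_object/{out_name}")  # hypothetical path
```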

## Training

We use 16 NVIDIA V100 GPUs for the main experiments and 8 for the ablation experiments.

### Single node

```bash
torchrun --rdzv_endpoint=localhost:5 --nproc_per_node=auto main.py --cfg ./configs/tcl.yml
```

### Multi node

```bash
torchrun --rdzv_endpoint=$HOST:$PORT --nproc_per_node=auto --nnodes=$NNODES --node_rank=$RANK main.py --cfg ./configs/tcl.yml
```
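Both commands rely on `torchrun` to spawn one process per GPU and export the rendezvous environment variables. As a generic sketch (not necessarily how `main.py` is structured), a launched process typically initializes distributed training like this:

```python
# Generic sketch of per-process setup under torchrun (not the repo's main.py).
import os
import torch
import torch.distributed as dist

def init_distributed() -> int:
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun per process
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")     # RANK/WORLD_SIZE also come from torchrun
    return local_rank

if __name__ == "__main__":
    local_rank = init_distributed()
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} on GPU {local_rank}")
```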

## Evaluation

Zero-shot transfer to semantic segmentation:

```bash
torchrun --rdzv_endpoint=localhost:5 --nproc_per_node=auto main.py --resume checkpoints/tcl.pth --eval
```
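The benchmarks are evaluated with mmsegmentation, which reports mean IoU (mIoU). For intuition, a minimal sketch of the metric from a class confusion matrix (a generic illustration, not the evaluation code used here):

```python
# Minimal sketch: mean IoU from a (num_classes x num_classes) confusion matrix,
# where rows are ground-truth classes and columns are predicted classes.
import numpy as np

def mean_iou(confusion: np.ndarray) -> float:
    intersection = np.diag(confusion).astype(np.float64)
    union = confusion.sum(0) + confusion.sum(1) - intersection
    iou = np.where(union > 0, intersection / np.maximum(union, 1), np.nan)
    return float(np.nanmean(iou))  # classes absent from GT and prediction are ignored
```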

## Citation

```bibtex
@inproceedings{cha2022tcl,
  title={Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs},
  author={Cha, Junbum and Mun, Jonghwan and Roh, Byungseok},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023}
}
```

## License

This project is released under the [MIT license](./LICENSE).
@@ -0,0 +1,90 @@
_base_: "eval.yml"

data:
  batch_size: 256
  pin_memory: true
  num_workers: 6
  seed: ${train.seed}
  dataset:
    meta:
      gcc3m:
        type: img_txt_pair
        path: ./data/gcc3m
        prefix: gcc-train-{000000..00347}.tar
        length: 2881393
      gcc12m:
        type: img_txt_pair
        path: ./data/gcc12m
        prefix: cc-{000000..001175}.tar
        length: 11286526
    train:
      - gcc3m
      - gcc12m

  img_aug:
    deit_aug: true
    img_size: 224
    img_scale: [0.08, 1.0]
    interpolation: bilinear
    color_jitter: 0.4
    auto_augment: 'rand-m9-mstd0.5-inc1'
    re_prob: 0.25
    re_mode: 'pixel'
    re_count: 1
  text_aug: null

train:
  start_step: 0
  total_steps: 50000
  warmup_steps: 20000
  ust_steps: 0
  base_lr: 1.6e-3
  weight_decay: 0.05
  min_lr: 4e-5
  clip_grad: 5.0
  fp16: true
  fp16_comm: true  # use fp16 grad compression for multi-node training
  seed: 0

  lr_scheduler:
    name: cosine

  optimizer:
    name: adamw
    eps: 1e-8
    betas: [0.9, 0.999]

evaluate:
  pamr: false
  kp_w: 0.0
  bg_thresh: 0.5

  save_logits: null

  eval_only: false
  eval_freq: 5000
  template: simple
  task:
    - voc
    - voc20
    - context
    - context59
    - coco_stuff
    - coco_object
    - cityscapes
    - ade20k

checkpoint:
  resume: ''
  save_topk: 0
  save_all: false  # if true, save every evaluation step

model_name: "default"  # display name in the logger
output: ???
tag: default
print_freq: 20
seed: 0
wandb: false
@@ -0,0 +1,30 @@
evaluate:
  pamr: true
  bg_thresh: 0.4
  kp_w: 0.3

  eval_only: true
  template: custom
  task:
    - voc
    - voc20
    - context
    - context59
    - coco_stuff
    - coco_object
    - cityscapes
    - ade20k

  # training splits
  t_voc20: segmentation/configs/_base_/datasets/t_pascal_voc12_20.py
  t_context59: segmentation/configs/_base_/datasets/t_pascal_context59.py

  # evaluation
  voc: segmentation/configs/_base_/datasets/pascal_voc12.py
  voc20: segmentation/configs/_base_/datasets/pascal_voc12_20.py
  context: segmentation/configs/_base_/datasets/pascal_context.py
  context59: segmentation/configs/_base_/datasets/pascal_context59.py
  coco_stuff: segmentation/configs/_base_/datasets/stuff.py
  coco_object: segmentation/configs/_base_/datasets/coco.py
  cityscapes: segmentation/configs/_base_/datasets/cityscapes.py
  ade20k: segmentation/configs/_base_/datasets/ade20k.py
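The `${train.seed}` interpolation and the `???` missing-value marker in the first config follow OmegaConf syntax, so loading a single file can be sketched as below. This is an assumption from the syntax alone; the repo's own loader presumably also merges the `_base_` file, which this sketch skips:

```python
# Sketch: load one config file with OmegaConf (assumed from the ${...}/??? syntax).
# Note: this does NOT resolve the `_base_: "eval.yml"` inheritance.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/tcl.yml")   # path from the training command above
cfg.output = "output/tcl-run"             # `output: ???` must be filled in before use
print(cfg.data.seed)                      # `${train.seed}` resolves to 0 on access
print(OmegaConf.to_yaml(cfg, resolve=True))
```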