Showing 63 changed files with 5,631 additions and 12 deletions.
@@ -0,0 +1,9 @@
*.pyc
.vscode
__pycache__
output
.ipynb_checkpoints
notebooks
tcp-checker
checkpoints/
data/
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Kakao Brain Corp.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -1,40 +1,217 @@
# TCL: Text-grounded Contrastive Learning (CVPR'23)

Official PyTorch implementation of [**Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs**](https://arxiv.org/abs/2212.00785), *Junbum Cha, Jonghwan Mun, Byungseok Roh*, CVPR 2023.

**T**ext-grounded **C**ontrastive **L**earning (TCL) is an open-world semantic segmentation framework that uses only image-text pairs. TCL enables a model to learn region-text alignment without a train-test discrepancy.

We will release a demo soon.

<div align="center">
<figure>
<img alt="" src="./assets/radar_chart.jpg" width="480">
<img alt="" src="./assets/method.jpg">
</figure>
</div>

## Results

TCL can perform segmentation on both (a, c) existing segmentation benchmarks and (b) arbitrary concepts, such as proper nouns and free-form text, in images in the wild.
<div align="center">
<figure>
<img alt="" src="./assets/main.jpg">
</figure>
</div>

<br/>

<details>
<summary> Additional examples in PASCAL VOC </summary>
<p align="center">
<img src="./assets/examples-voc.jpg" width="800" />
</p>
</details>

<details>
<summary> Additional examples in the wild </summary>
<p align="center">
<img src="./assets/examples-in-the-wild.jpg" width="800" />
</p>
</details>


## Dependencies

We used PyTorch 1.12.1 and torchvision 0.13.1.

```bash
pip install -U openmim
mim install mmcv-full==1.6.2 mmsegmentation==0.27.0
pip install -r requirements.txt
```

Note that the order of packages in the requirements file roughly reflects how important each pinned version is. We recommend using the same versions at least for `webdataset`, `mmsegmentation`, and `timm`.
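To double-check that your environment matches, a quick sketch like the following prints the installed versions (the package names are the PyPI names used above):

```python
# Print installed versions of the pinned packages to verify the environment.
from importlib.metadata import version

for pkg in ("torch", "torchvision", "mmcv-full", "mmsegmentation", "timm", "webdataset"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except Exception:
        print(f"{pkg}: not installed")
```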

## Datasets

Much of this section is adapted from the [data preparation section of the GroupViT README](https://github.com/NVlabs/GroupViT#data-preparation).

We use [webdataset](https://webdataset.github.io/webdataset/) as a scalable data format for training and [mmsegmentation](https://github.com/open-mmlab/mmsegmentation) for semantic segmentation evaluation.

The overall file structure is as follows:
```shell
TCL
├── data
│   ├── gcc3m
│   │   ├── gcc-train-000000.tar
│   │   ├── ...
│   ├── gcc12m
│   │   ├── cc-000000.tar
│   │   ├── ...
│   ├── cityscapes
│   │   ├── leftImg8bit
│   │   │   ├── train
│   │   │   ├── val
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val
│   ├── VOCdevkit
│   │   ├── VOC2012
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClass
│   │   │   ├── ImageSets
│   │   │   │   ├── Segmentation
│   │   ├── VOC2010
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClassContext
│   │   │   ├── ImageSets
│   │   │   │   ├── SegmentationContext
│   │   │   │   │   ├── train.txt
│   │   │   │   │   ├── val.txt
│   │   │   ├── trainval_merged.json
│   │   ├── VOCaug
│   │   │   ├── dataset
│   │   │   │   ├── cls
│   ├── ade
│   │   ├── ADEChallengeData2016
│   │   │   ├── annotations
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   │   │   ├── images
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   ├── coco_stuff164k
│   │   ├── images
│   │   │   ├── train2017
│   │   │   ├── val2017
│   │   ├── annotations
│   │   │   ├── train2017
│   │   │   ├── val2017
```
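As a quick sanity check of the training shards, the following sketch iterates over one tar file. It assumes img2dataset's default webdataset keys (`jpg` for the image, `txt` for the caption), which is how the download commands below store the pairs:

```python
# Minimal sketch: peek at one image-text shard in webdataset format.
# Assumes img2dataset's default keys: "jpg" (image) and "txt" (caption).
import webdataset as wds

dataset = (
    wds.WebDataset("data/gcc3m/gcc-train-000000.tar")
    .decode("pil")              # decode image bytes to PIL images
    .to_tuple("jpg", "txt")     # yield (image, caption) pairs
)

for image, caption in dataset:
    print(image.size, repr(caption[:60]))
    break  # inspect only the first sample
```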

The instructions for preparing each dataset are as follows.

### Training datasets

For training, we use Conceptual Captions 3M and 12M, downloaded and preprocessed with the [img2dataset](https://github.com/rom1504/img2dataset) tool.

#### GCC3M

Please download the training split annotation file from [Conceptual Captions 3M](https://ai.google.com/research/ConceptualCaptions/download) and name it `gcc3m.tsv`.

Then run `img2dataset` to download the image-text pairs and save them in the webdataset format:
```bash
# Prepend the header row expected by img2dataset.
sed -i '1s/^/caption\turl\n/' gcc3m.tsv
img2dataset --url_list gcc3m.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder data/gcc3m \
    --processes_count 16 --thread_count 64 \
    --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
    --enable_wandb True --save_metadata False --oom_shard_count 6
# Rename the shards to match the `gcc-train-` prefix used in the config.
rename -d 's/^/gcc-train-/' data/gcc3m/*
```

Please refer to the [img2dataset CC3M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md) for more details.

#### GCC12M

Please download the annotation file from [Conceptual Captions 12M](https://github.com/google-research-datasets/conceptual-12m) and name it `gcc12m.tsv`.

Then run `img2dataset` to download the image-text pairs and save them in the webdataset format:
```bash
# Prepend the header row expected by img2dataset.
sed -i '1s/^/caption\turl\n/' gcc12m.tsv
img2dataset --url_list gcc12m.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder data/gcc12m \
    --processes_count 16 --thread_count 64 \
    --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
    --enable_wandb True --save_metadata False --oom_shard_count 6
# Rename the shards to match the `cc-` prefix used in the config.
rename -d 's/^/cc-/' data/gcc12m/*
```

Please refer to the [img2dataset CC12M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc12m.md) for more details.
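After both downloads finish, a small sketch like this can sanity-check the shards (shard names as produced by the `rename` commands above; counting every sample is slow, so only the first shard is iterated):

```python
# Count shards, and samples in the first shard, for each downloaded dataset.
import glob
import webdataset as wds

for name, pattern in [("gcc3m", "data/gcc3m/gcc-train-*.tar"),
                      ("gcc12m", "data/gcc12m/cc-*.tar")]:
    shards = sorted(glob.glob(pattern))
    n = sum(1 for _ in wds.WebDataset(shards[:1]))  # drop [:1] to count everything
    print(f"{name}: {len(shards)} shards, {n} samples in the first shard")
```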

### Evaluation datasets

In the paper, we use 8 benchmarks: (i) with a background class: PASCAL VOC, PASCAL Context, and COCO-Object; and (ii) without a background class: PASCAL VOC20, PASCAL Context59, COCO-Stuff, Cityscapes, and ADE20k.
Since some benchmarks share data sources (e.g., VOC20 and VOC), only 5 datasets need to be prepared: PASCAL VOC, PASCAL Context, COCO-Stuff164k, Cityscapes, and ADE20k.

Please download and set up the [PASCAL VOC](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-voc), [PASCAL Context](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-context), [COCO-Stuff164k](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#coco-stuff-164k), [Cityscapes](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#cityscapes), and [ADE20k](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#ade20k) datasets following the [MMSegmentation data preparation document](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md).

#### COCO Object

The COCO-Object dataset uses only the object classes of COCO-Stuff164k, collected from the instance segmentation annotations.
Run the following command to convert the instance segmentation annotations to semantic segmentation annotations:

```shell
python convert_dataset/convert_coco.py data/coco_stuff164k/ -o data/coco_stuff164k/
```
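For reference, the core idea of the conversion is to rasterize every instance mask into a single per-pixel category map. A rough sketch of that idea (not the repository's `convert_coco.py`; the annotation and output paths here are hypothetical):

```python
# Sketch: turn COCO instance annotations into semantic segmentation maps.
# Each pixel gets the category id of an instance covering it; 0 stays background.
import numpy as np
from PIL import Image
from pycocotools.coco import COCO

coco = COCO("data/coco_stuff164k/annotations/instances_val2017.json")  # hypothetical path
for img_id in coco.getImgIds():
    info = coco.loadImgs(img_id)[0]
    sem = np.zeros((info["height"], info["width"]), dtype=np.uint8)
    for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
        sem[coco.annToMask(ann) == 1] = ann["category_id"]
    out_name = info["file_name"].replace(".jpg", ".png")
    Image.fromarray(sem).save(f"output/coco_object/{out_name}")  # hypothetical path
```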

## Training

We use 16 NVIDIA V100 GPUs for the main experiments and 8 for the ablation experiments.

### Single node

```bash
torchrun --rdzv_endpoint=localhost:5 --nproc_per_node=auto main.py --cfg ./configs/tcl.yml
```

### Multi node

```bash
torchrun --rdzv_endpoint=$HOST:$PORT --nproc_per_node=auto --nnodes=$NNODES --node_rank=$RANK main.py --cfg ./configs/tcl.yml
```
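Both commands rely on `torchrun` to spawn one process per GPU and export the rendezvous environment variables. As a generic sketch (not necessarily how `main.py` is structured), a launched process typically initializes distributed training like this:

```python
# Generic sketch of per-process setup under torchrun (not the repo's main.py).
import os
import torch
import torch.distributed as dist

def init_distributed() -> int:
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun per process
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")     # RANK/WORLD_SIZE also come from torchrun
    return local_rank

if __name__ == "__main__":
    local_rank = init_distributed()
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} on GPU {local_rank}")
```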

## Evaluation

Zero-shot transfer to semantic segmentation:

```bash
torchrun --rdzv_endpoint=localhost:5 --nproc_per_node=auto main.py --resume checkpoints/tcl.pth --eval
```
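The benchmarks are evaluated with mmsegmentation, which reports mean IoU (mIoU). For intuition, a minimal sketch of the metric from a class confusion matrix (a generic illustration, not the evaluation code used here):

```python
# Minimal sketch: mean IoU from a (num_classes x num_classes) confusion matrix,
# where rows are ground-truth classes and columns are predicted classes.
import numpy as np

def mean_iou(confusion: np.ndarray) -> float:
    intersection = np.diag(confusion).astype(np.float64)
    union = confusion.sum(0) + confusion.sum(1) - intersection
    iou = np.where(union > 0, intersection / np.maximum(union, 1), np.nan)
    return float(np.nanmean(iou))  # classes absent from GT and prediction are ignored
```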

## Citation

```bibtex
@inproceedings{cha2022tcl,
  title={Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs},
  author={Cha, Junbum and Mun, Jonghwan and Roh, Byungseok},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023}
}
```

## License

This project is released under the [MIT license](./LICENSE).
@@ -0,0 +1,90 @@
_base_: "eval.yml"

data:
  batch_size: 256
  pin_memory: true
  num_workers: 6
  seed: ${train.seed}
  dataset:
    meta:
      gcc3m:
        type: img_txt_pair
        path: ./data/gcc3m
        prefix: gcc-train-{000000..00347}.tar
        length: 2881393
      gcc12m:
        type: img_txt_pair
        path: ./data/gcc12m
        prefix: cc-{000000..001175}.tar
        length: 11286526
    train:
      - gcc3m
      - gcc12m

  img_aug:
    deit_aug: true
    img_size: 224
    img_scale: [0.08, 1.0]
    interpolation: bilinear
    color_jitter: 0.4
    auto_augment: 'rand-m9-mstd0.5-inc1'
    re_prob: 0.25
    re_mode: 'pixel'
    re_count: 1
  text_aug: null

train:
  start_step: 0
  total_steps: 50000
  warmup_steps: 20000
  ust_steps: 0
  base_lr: 1.6e-3
  weight_decay: 0.05
  min_lr: 4e-5
  clip_grad: 5.0
  fp16: true
  fp16_comm: true  # use fp16 grad compression for multi-node training
  seed: 0

  lr_scheduler:
    name: cosine

  optimizer:
    name: adamw
    eps: 1e-8
    betas: [0.9, 0.999]

evaluate:
  pamr: false
  kp_w: 0.0
  bg_thresh: 0.5

  save_logits: null

  eval_only: false
  eval_freq: 5000
  template: simple
  task:
    - voc
    - voc20
    - context
    - context59
    - coco_stuff
    - coco_object
    - cityscapes
    - ade20k

checkpoint:
  resume: ''
  save_topk: 0
  save_all: false  # if true, save every evaluation step

model_name: "default"  # display name in the logger
output: ???
tag: default
print_freq: 20
seed: 0
wandb: false
@@ -0,0 +1,30 @@
evaluate:
  pamr: true
  bg_thresh: 0.4
  kp_w: 0.3

  eval_only: true
  template: custom
  task:
    - voc
    - voc20
    - context
    - context59
    - coco_stuff
    - coco_object
    - cityscapes
    - ade20k

  # training splits
  t_voc20: segmentation/configs/_base_/datasets/t_pascal_voc12_20.py
  t_context59: segmentation/configs/_base_/datasets/t_pascal_context59.py

  # evaluation
  voc: segmentation/configs/_base_/datasets/pascal_voc12.py
  voc20: segmentation/configs/_base_/datasets/pascal_voc12_20.py
  context: segmentation/configs/_base_/datasets/pascal_context.py
  context59: segmentation/configs/_base_/datasets/pascal_context59.py
  coco_stuff: segmentation/configs/_base_/datasets/stuff.py
  coco_object: segmentation/configs/_base_/datasets/coco.py
  cityscapes: segmentation/configs/_base_/datasets/cityscapes.py
  ade20k: segmentation/configs/_base_/datasets/ade20k.py
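The `${train.seed}` interpolation and the `???` missing-value marker in the first config follow OmegaConf syntax, so loading a single file can be sketched as below. This is an assumption from the syntax alone; the repo's own loader presumably also merges the `_base_` file, which this sketch skips:

```python
# Sketch: load one config file with OmegaConf (assumed from the ${...}/??? syntax).
# Note: this does NOT resolve the `_base_: "eval.yml"` inheritance.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/tcl.yml")   # path from the training command above
cfg.output = "output/tcl-run"             # `output: ???` must be filled in before use
print(cfg.data.seed)                      # `${train.seed}` resolves to 0 on access
print(OmegaConf.to_yaml(cfg, resolve=True))
```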