
linhuixiao/Awesome-Visual-Grounding


Towards Visual Grounding: A Survey

TPAMI under review, 2024
Linhui Xiao · Xiaoshan Yang · Xiangyuan Lan · Yaowei Wang · Changsheng Xu

arXiv PDF

An Illustration of Visual Grounding

A Decade of Visual Grounding

This repo records, tracks, and benchmarks recent visual grounding methods to supplement our grounding survey.

🔥 Add Your Paper to Our Repo and Survey!

  • If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add the missing papers to this repo as soon as possible.

  • You are welcome to open an issue or a PR (pull request) for your visual grounding related work!

  • Note: due to the huge number of papers on arXiv, we are sorry that we cannot cover all of them in our survey. You can directly open a PR to this repo, and we will record the work in the next version update of our survey.

🔥 New

  • The next version of our survey is expected to be updated on June 1, 2025.

  • 🔥 We made our survey paper public and created this repository on December 28, 2024.

  • Our grounding work OneRef (Paper, Code) was accepted by the top conference NeurIPS 2024 in October 2024!

  • Our grounding work HiVG (Paper, Code) was accepted by the top conference ACM MM 2024 in July 2024!

  • Our grounding work CLIP-VG (Paper, Code) was accepted by the top journal TMM in September 2023!

🔥 Highlight!!

  • A comprehensive survey for Visual Grounding, including Referring Expression Comprehension and Phrase Grounding.

  • It covers newly emerging concepts, such as grounding multimodal LLMs, generalized visual grounding, and VLP-based grounding transfer works.

  • We list detailed results for the most representative works and provide a fairer and clearer comparison of different approaches (see the metric sketch after this list).

  • We provide a list of future research insights.
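
For context on the result tables below: REC performance is conventionally reported as [email protected], i.e., a prediction counts as correct when the intersection-over-union (IoU) between the predicted box and the ground-truth box exceeds 0.5. The sketch below is our own illustration of this metric; the box format and helper names are not from any particular codebase.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) tuples."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """[email protected]: fraction of expressions whose predicted box matches the ground truth."""
    hits = sum(box_iou(p, g) > threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```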

Introduction

To our knowledge, this is the first survey in the past five years to systematically track and summarize the development of visual grounding over the last decade. By extracting common technical details, this review encompasses the most representative works in each subtopic.

This survey is also currently the most comprehensive review in the field of visual grounding. We aim for this article to serve as a valuable resource not only for beginners seeking an introduction to grounding but also for researchers with an established foundation, enabling them to navigate and stay up-to-date with the latest advancements.
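
For beginners, the quickest way to get a feel for the task is to run an off-the-shelf open-set grounding model on an image and a referring expression. Below is a minimal sketch using Grounding DINO (covered in the tables below) through the Hugging Face transformers library; the checkpoint name, thresholds, and image path are illustrative, and the exact post-processing API may differ across transformers versions.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("example.jpg")      # any local image
text = "the dog on the left."          # lowercase, '.'-terminated phrases

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes to image-space boxes and filter by confidence.
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]
print(results["boxes"], results["scores"])
```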

A Decade of Visual Grounding

Mainstream Settings in Visual Grounding

Typical Framework Architectures for Visual Grounding

Our Paper Structure

Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@misc{xiao2024visualgroundingsurvey,
      title={Towards Visual Grounding: A Survey}, 
      author={Linhui Xiao and Xiaoshan Yang and Xiangyuan Lan and Yaowei Wang and Changsheng Xu},
      year={2024},
      eprint={2412.20206},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.20206}, 
}

Note that, due to the journal's typesetting restrictions, there are small differences in layout between the arXiv version and the review version.

The following sections list the relevant grounding papers covered in the survey, together with their associated code links:

Summary of Contents

This content corresponds to the main text.

1. Methods: A Survey

1.1 Fully Supervised Setting

A. Traditional CNN-based Methods

Year Venue Work Name Paper Title / Paper Link Code / Project
2016 CVPR NMI Generation and Comprehension of Unambiguous Object Descriptions Code
2016 ECCV SNLE Segmentation from Natural Language Expressions N/A
2018 TPAMI Similarity Network Learning Two-Branch Neural Networks for Image-Text Matching Tasks N/A
2018 ECCV CITE Conditional Image-Text Embedding Networks Code
2018 IJCAI DDPN Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding Code
2014 EMNLP Referitgame Referitgame: Referring to objects in photographs of natural scenes Code
2015 CVPR DMSM From captions to visual concepts and back Project
2016 CVPR SCRC Natural language object retrieval Code
2018 ACCV PIRC Pirc net: Using proposal indexing, relationships and context for phrase grounding N/A
2016 ECCV Visdif Modeling context in referring expressions Data
2018 CVPR Mattnet Mattnet: Modular attention network for referring expression comprehension Code
2020 AAAI CMCC Learning cross-modal context graph for visual grounding code
2016 CVPR YOLO You only look once: Unified, real-time object detection Project
2018 CVPR YOLOv3 Yolov3: An incremental improvement Project
2017 ICCV Attribute Referring Expression Generation and Comprehension via Attributes N/A
2017 CVPR CG Comprehension-guided referring expressions N/A
2017 CVPR CMN Modeling relationships in referential expressions with compositional modular networks Code
2018 CVPR PLAN Parallel attention: A unified framework for visual object discovery through dialogs and queries N/A
2018 CVPR VC Grounding Referring Expressions in Images by Variational Context code
2018 ArXiv SSG Real-time referring expression comprehension by single-stage grounding network N/A
2018 CVPR A-ATT Visual grounding via accumulated attention N/A
2019 ICCV DGA Dynamic Graph Attention for Referring Expression Comprehension N/A
2020 CVPR RCCF A real-time cross-modality correlation filtering method for referring expression comprehension N/A
2021 CVPR LBYL Look before you leap: Learning landmark features for one-stage visual grounding code
2019 CVPR CM-Att-E Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing N/A
2019 ICCV FAOA A Fast and Accurate One-Stage Approach to Visual Grounding N/A
2016 ECCV Neg Bag Modeling context between objects for referring expression understanding N/A
2020 ECCV ReSC Improving one-stage visual grounding by recursive sub-query construction Code

B. Transformer-based Methods

Year Venue Work Name Paper Title / Paper Link Code / Project
2021 ICCV TransVG Transvg: End-to-end Visual Grounding with Transformers Code
2023 TPAMI TransVG++ TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer N/A
2022 CVPR QRNet Shifting More Attention to Visual Backbone: Query-Modulated Refinement Networks for End-to-End Visual Grounding Code
2024 ACM MM MMCA Visual grounding with multimodal conditional adaptation Code

C. VLP-based Methods

Year Venue Name Paper Title / Paper Link Code / Project
2023 TMM CLIP-VG CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding Code
2023 TPAMI D-MDETR Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding Code
2022 TNNLS Word2Pix Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding Code
2023 AAAI LADS Referring Expression Comprehension Using Language Adaptive Inference N/A
2023 TIM JMRI Visual Grounding With Joint Multimodal Representation and Interaction N/A
2024 ACM MM HiVG HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding Code
2023 AAAI DQ-DETR DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding Code
2022 NeurIPS FIBER Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone Code
2022 EMNLP mPLUG mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections Code
2022 CVPR Cris Cris: Clip driven referring image segmentation Code
2024 NAACL RISCLIP Extending clip’s image-text alignment to referring image segmentation N/A

D. Grounding-oriented Pre-training

Year Venue Name Paper Title / Paper Link Code / Project
2022 ICML OFA OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework Code
2022 ECCV UniTAB UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling Code
2024 ECCV GVC Llava-grounding: Grounded visual chat with large multimodal models N/A
2022 CVPR GLIP Grounded language-image pretraining Code
2021 CVPR OVR-CNN Open-vocabulary object detection using captions Code
2021 CVPR MDETR MDETR - Modulated Detection for End-to-End Multi-Modal Understanding Code
2024 NeurIPS OneRef OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling Code
2020 ECCV UNITER UNITER: UNiversal Image-TExt Representation Learning Code
2020 NeurIPS VILLA Large-Scale Adversarial Training for Vision-and-Language Representation Learning Code
2022 NeurIPS Glipv2 Glipv2: Unifying localization and vision-language understanding Code
2023 NeurIPS HIPIE Hierarchical open-vocabulary universal image segmentation Code
2023 CVPR UNINEXT Universal instance perception as object discovery and retrieval Code
2019 NeurIPS Vilbert Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks Code
2020 ICLR Vl-bert Vl-bert: Pre-training of generic visual-linguistic representations Code Project
2023 arXiv ONE-PEACE One-peace: Exploring one general representation model toward unlimited modalities Code
2022 FTCGV N/A Vision-language pre-training: Basics, recent advances, and future trends N/A
2023 MIR N/A Large-scale multi-modal pre-trained models: A comprehensive survey N/A

E. Grounding Multimodal LLMs

Year Venue Name Paper Title / Paper Link Code / Project
2023 Arxiv Shikra Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic Code
2022 NeurIPS Chinchilla Training Compute-Optimal Large Language Models N/A
2019 OpenAI GPT-2 Language Models are Unsupervised Multitask Learners N/A
2020 NeurIPS GPT-3 Language Models are Few-Shot Learners N/A
2024 ICLR Ferret Ferret: Refer And Ground Anything Anywhere At Any Granularity Code
2024 CVPR LION LION: Empowering Multimodal Large Language Model With Dual-Level Visual Knowledge Code
2022 ECCV YORO YORO - Lightweight End to End Visual Grounding Code
2022 NeurIPS Adaptformer Adaptformer: Adapting vision transformers for scalable visual recognition Code
2023 ICML Blip-2 Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models Code
2024 CVPR Glamm Glamm: Pixel grounding large multimodal model Code
2024 CVPR Lisa Lisa: Reasoning segmentation via large language model GitHub
2024 CVPR GSVA GSVA: Generalized segmentation via multimodal large language models GitHub
2024 CoRR UnifiedMLLM UnifiedMLLM: Enabling unified representation for multi-modal multi-tasks with large language model GitHub
2024 arXiv F-LMM F-LMM: Grounding frozen large multimodal models GitHub
2024 arXiv Vigor Vigor: Improving visual grounding of large vision language models with fine-grained reward modeling GitHub
2023 arXiv BuboGPT BuboGPT: Enabling visual grounding in multi-modal LLMs GitHub
2024 ICLR MiniGPT-4 MiniGPT-4: Enhancing vision-language understanding with advanced large language models GitHub
2024 CVPR RegionGPT RegionGPT: Towards region understanding vision language model GitHub
2024 arXiv TextHawk TextHawk: Exploring efficient fine-grained perception of multimodal large language models GitHub
2024 ACM TMM PEAR Multimodal PEAR: Chain-of-thought reasoning for multimodal sentiment analysis GitHub
2024 ECCV Grounding DINO Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection GitHub
2023 CVPR Polyformer Polyformer: Referring image segmentation as sequential polygon generation GitHub
2024 ACM TMM UniQRNet UniQRNet: Unifying referring expression grounding and segmentation with QRNet GitHub
2022 CVPR LAVT LAVT: Language-aware vision transformer for referring image segmentation GitHub
2024 NeurIPS SimVG SimVG: A simple framework for visual grounding with decoupled multi-modal fusion GitHub
2024 ICLR KOSMOS-2 Kosmos-2: Grounding Multimodal Large Language Models to the World Code
2023 Arxiv Qwen-VL Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond Code
2023 Arxiv Lenna Lenna: Language enhanced reasoning detection assistant Code
2023 Arxiv u-LLaVA u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model Code
2024 Arxiv Cogvlm Cogvlm: Visual expert for pretrained language models Code
2024 CVPR VistaLLM Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model N/A
2024 CoRR VisCoT Visual cot: Unleashing chain-of-thought reasoning in multimodal language models Code
2024 COLM Ferret-v2 Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models N/A
2023 arXiv NExT-Chat NExT-Chat: An LMM for Chat, Detection and Segmentation Code
2023 arXiv MiniGPT-v2 MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning Code
2024 ACL G-GPT GroundingGPT: Language Enhanced Multi-modal Grounding Model Code
2024 ECCV Groma Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models Code
2023 NeurIPS VisionLLM Visionllm: Large language model is also an open-ended decoder for vision-centric tasks Code
2022 NeurIPS InstructGPT Training language models to follow instructions with human feedback Code
2023 arXiv GPT-4 Gpt-4 technical report Code
2023 arXiv Llama Llama: Open and efficient foundation language models Code
2023 JMLR Palm Palm: Scaling language modeling with pathways Code
2023 N/A Alpaca Stanford alpaca: An instruction-following llama model Code Project
2023 arXiv N/A Instruction tuning with gpt-4 Code Project
2023 NeurIPS KOSMOS-1 Language is not all you need: Aligning perception with language models Code
2024 TMLR Dinov2 Dinov2: Learning robust visual features without supervision Code

1.2 Weakly Supervised Setting

Year Venue Name Paper Title / Paper Link Code / Project
2016 ECCV GroundR Grounding of Textual Phrases in Images by Reconstruction N/A
2017 CVPR N/A Weakly-supervised Visual Grounding of Phrases with Linguistic Structures N/A
2014 EMNLP Glove GloVe: Global Vectors for Word Representation Project
2015 CVPR N/A Deep Visual-Semantic Alignments for Generating Image Descriptions Project Code
2017 ICCV Mask R-CNN Mask R-CNN Code
2017 ICCV Grad-CAM Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Code
2018 CVPR KAC Knowledge Aided Consistency for Weakly Supervised Phrase Grounding Code
2018 arXiv CPC Representation learning with contrastive predictive coding Code
2019 ACM MM KPRN Knowledge-guided pairwise reconstruction network for weakly supervised referring expression grounding Code
2021 ICCV GbS Detector-free weakly supervised grounding by separation Code
2021 TPAMI DTWREG Discriminative Triad Matching and Reconstruction for Weakly Referring Expression Grounding Code
2021 CVPR ReIR Relation-aware Instance Refinement for Weakly Supervised Visual Grounding Code
2022 ICML BLIP BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation Code
2022 CVPR Mask2Former Masked-attention Mask Transformer for Universal Image Segmentation Project Code
2023 ACM MM CACMRN Client-adaptive cross-model reconstruction network for modality-incomplete multimodal federated learning N/A
2023 CVPR g++ Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding Code
2023 CVPR RefCLIP RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension Code
2024 TOMM UR Universal Relocalizer for Weakly Supervised Referring Expression Grounding N/A
2024 ICASSP VPT-WSVG Visual prompt tuning for weakly supervised phrase grounding N/A
2024 MMM PPT Part-Aware Prompt Tuning for Weakly Supervised Referring Expression Grounding N/A
2018 CVPR MATN Weakly Supervised Phrase Localization With Multi-Scale Anchored Transformer Network N/A
2019 ICCV ARN Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding Code
2019 ICCV Align2Ground Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment N/A
2020 ECCV info-ground Contrastive Learning for Weakly Supervised Phrase Grounding Project
2020 EMNLP MAF MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding Code
2020 NeurIPS CCL Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding N/A
2021 CVPR NCE-Distill Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation N/A
2022 TPAMI EARN Entity-Enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding Code
2022 ICML X-VLM Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts Code
2023 TMM DRLF A Dual Reinforcement Learning Framework for Weakly Supervised Phrase Grounding N/A
2023 TIP Cycle Cycle-Consistent Weakly Supervised Visual Grounding With Individual and Contextual Representations Code
2023 ICRA TGKD Weakly Supervised Referring Expression Grounding via Target-Guided Knowledge Distillation Code
2023 ICCV CPL Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding Code
2024 CVPR RSMPL Regressor-Segmenter Mutual Prompt Learning for Crowd Counting Code
2024 TCSVT PSRN Progressive Semantic Reconstruction Network for Weakly Supervised Referring Expression Grounding Code
2024 ACM MM QueryMatch QueryMatch: A Query-based Contrastive Learning Framework for Weakly Supervised Visual Grounding Code

1.3 Semi-supervised Setting

Year Venue Name Paper Title / Paper Link Code / Project
2023 ICASSP PQG-Distil Pseudo-Query Generation For Semi-Supervised Visual Grounding With Knowledge Distillation N/A
2021 WACV LSEP Utilizing Every Image Object for Semi-supervised Phrase Grounding N/A
2022 CRV SS-Ground Semi-supervised Grounding Alignment for Multi-modal Feature Learning N/A
2021 AAAI Curriculum Labeling Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning Code
2024 CoRR ACTRESS Actress: Active retraining for semi-supervised visual grounding N/A

1.4 Unsupervised Setting

Year Venue Name Paper Title / Paper Link Code / Project
2022 CVPR Pseudo-Q Pseudo-Q: Generating pseudo language queries for visual grounding Code
2018 CVPR N/A Unsupervised Textual Grounding: Linking Words to Image Concepts N/A
2023 TMM CLIP-VG CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding Code
2024 ICME VG-annotator VG-Annotator: Vision-Language Models as Query Annotators for Unsupervised Visual Grounding N/A
2023 TMM CLIPREC CLIPREC: Graph-Based Domain Adaptive Network for Zero-Shot Referring Expression Comprehension N/A
2019 IJCAI N/A Learning unsupervised visual grounding through semantic self-supervision N/A
2019 ICCV N/A Phrase Localization Without Paired Training Examples N/A
2023 Neurocomputing BiCM Unpaired referring expression grounding via bidirectional cross-modal matching N/A
2024 Neurocomputing N/A Self-training: A survey N/A
2024 CVPR Omni-q Omni-q: Omni-directional scene understanding for unsupervised visual grounding N/A

1.5 Zero-shot Setting

Year Venue Name Paper Title / Paper Link Code / Project
2019 ICCV ZSGNet Zero-shot Grounding of Objects from Natural Language Queries Code
2022 ACL ReCLIP ReCLIP: A Strong Zero-shot Baseline for Referring Expression Comprehension Code
2024 Neurocomputing OV-VG OV-VG: A Benchmark for Open-Vocabulary Visual Grounding Code
2023 TMM CLIPREC CLIPREC: Graph-Based Domain Adaptive Network for Zero-Shot Referring Expression Comprehension N/A
2024 Neurocomputing N/A Zero-shot visual grounding via coarse-to-fine representation learning Code
2022 Arxiv adapting-CLIP Adapting CLIP For Phrase Localization Without Further Training Code
2023 ICLR ChatRef Language models can do zero-shot visual referring expression comprehension Code
2024 AI Open Cpt CPT: Colorful Prompt Tuning for pre-trained vision-language models Code
2021 CVPR VinVL VinVL: Revisiting Visual Representations in Vision-Language Models Code
2024 CVPR VR-VLA Zero-shot referring expression comprehension via structural similarity between images and captions Code
2024 AAAI GroundVLP GroundVLP: Harnessing zero-shot visual grounding from vision-language pre-training and open-vocabulary object detection Code
2024 TCSVT MCCE-REC MCCE-REC: MLLM-driven Cross-modal Contrastive Entropy Model for Zero-shot Referring Expression Comprehension N/A
2024 ECCV CRG Contrastive Region Guidance: Improving Grounding in Vision-Language Models Without Training Code
2024 IJCNN PSAIR Psair: A neurosymbolic approach to zero-shot visual grounding N/A
2024 TPAMI TransCP Context disentangling and prototype inheriting for robust visual grounding Code
2024 TPAMI N/A Towards Open Vocabulary Learning: A Survey Code
2024 CVPR GEM Grounding everything: Emerging localization properties in vision-language transformers Code
2023 Arxiv GRILL Grill: Grounded vision-language pre-training via aligning text and image regions N/A
2017 ICCV Grad-CAM Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Code
2022 CVPR GLIP Grounded language-image pretraining Code
2022 AAAI MMKG Improving Zero-Shot Phrase Grounding via Reasoning on External Knowledge and Spatial Relations N/A
2021 CVPR OVR-CNN Open-vocabulary object detection using captions Code
2024 ICLR KOSMOS-2 Kosmos-2: Grounding Multimodal Large Language Models to the World Code
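
Many of the training-free approaches above (e.g., ReCLIP and adapting-CLIP) share a crop-and-score backbone: enumerate candidate regions, score each cropped region against the expression with a pretrained vision-language model, and return the best match. The toy sketch below illustrates only this shared idea with Hugging Face CLIP; it is not the method of any specific paper (ReCLIP, for instance, adds isolated-proposal scoring and relation handling), and the proposal boxes are assumed to come from an external detector.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_ground(image: Image.Image, expression: str, proposals):
    """Return the proposal (x1, y1, x2, y2) whose crop best matches the expression."""
    crops = [image.crop(box) for box in proposals]
    inputs = processor(text=[expression], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text has shape (1, num_crops): similarity of each crop to the text.
    best = out.logits_per_text.argmax(dim=-1).item()
    return proposals[best]
```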

1.6 Multi-task Setting

A. REC with REG Multi-task Setting

Year Venue Name Paper Title / Paper Link Code / Project
2024 Arxiv VLM-VG Learning visual grounding from generative vision and language model N/A
2024 Arxiv EEVG An efficient and effective transformer decoder-based framework for multi-task visual grounding Code
2006 INLGC N/A Building a Semantically Transparent Corpus for the Generation of Referring Expressions Project
2010 ACL N/A Natural reference to objects in a visual domain Code
2012 CL Survey Computational generation of referring expressions: A survey N/A
2013 NAACL N/A Generating expressions that refer to visible objects Code
2016 CVPR NMI Generation and comprehension of unambiguous object descriptions Code
2017 ICCV Attribute Referring Expression Generation and Comprehension via Attributes N/A
2017 CVPR SLR A Joint Speaker-Listener-Reinforcer Model for Referring Expressions N/A
2017 CVPR CG Comprehension-guided referring expressions N/A
2024 AAAI CyCo Cycle-Consistency Learning for Captioning and Grounding N/A

B. REC with RES Multi-task Setting

Year Venue Name Paper Title / Paper Link Code / Project
2020 CVPR MCN Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation code
2021 NeurIPS RefTR Referring Transformer: A One-step Approach to Multi-task Visual Grounding code
2022 ECCV SeqTR SeqTR: A Simple yet Universal Network for Visual Grounding code
2023 CVPR VG-LAW Language Adaptive Weight Generation for Multi-task Visual Grounding code
2024 Neurocomputing M2IF Improving visual grounding with multi-modal interaction and auto-regressive vertex generation Code

C. Other Multi-task Setting

Year Venue Name Paper Title / Paper Link Code / Project
2016 EMNLP MCB Multimodal compact bilinear pooling for visual question answering and visual grounding Code
2024 CVPR RefCount Referring expression counting Code
2022 CVPR VizWiz-VQA-Grounding Grounding Answers for Visual Questions Asked by Visually Impaired People Project
2022 ECCV N/A Weakly supervised grounding for VQA in vision-language transformers Code
2020 ACL N/A A Negative Case Analysis of Visual Grounding Methods for VQA Code
2024 Arxiv TrueVG Uncovering the Full Potential of Visual Grounding Methods in VQA Code
2020 IVC N/A Explaining VQA predictions using visual grounding and a knowledge base N/A
2019 CVPR N/A Multi-task Learning of Hierarchical Vision-Language Representation N/A

1.7 Generalized Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2021 CVPR OVR-CNN Open-Vocabulary Object Detection Using Captions Code
2021 ICCV VLT Vision-Language Transformer and Query Generation for Referring Segmentation Code
2023 Arxiv GREC GREC:Generalized Referring Expression Comprehension Code
2024 EMNLP RECANTFormer Recantformer: Referring expression comprehension with varying numbers of targets N/A
2023 CVPR gRefCOCO GRES: Generalized Referring Expression Segmentation Code
2023 ICCV Ref-ZOM Beyond One-to-One: Rethinking the Referring Image Segmentation Code

2. Advanced Topics

2.1 NLP Language Structure Parsing in Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2019 ICCV NMTree Learning to assemble neural module tree networks for visual grounding N/A
2017 CVPR CMN Modeling relationships in referential expressions with compositional modular networks Code
2015 EMNLP N/A An improved non-monotonic transition system for dependency parsing N/A
2014 EMNLP N/A A fast and accurate dependency parser using neural networks N/A
2020 NSP NLPPython Natural language processing with Python and spaCy: A practical introduction N/A
2020 Arxiv Stanza Stanza: A Python Natural Language Processing Toolkit for Many Human Languages Project
2016 ECCV N/A Structured matching for phrase localization N/A
2017 ICCV N/A Phrase localization and visual relationship detection with comprehensive image-language cues Code
2022 CVPR GLIP Grounded language-image pretraining Code
2017 ICCV QRC Net Query-guided regression network with context policy for phrase grounding N/A
2006 ACL NLTK Nltk: the natural language toolkit Code
2019 SNAMS OpenNLP A Replicable Comparison Study of NER Software: StanfordNLP, NLTK, OpenNLP, SpaCy, Gate N/A
2018 Packt Gensim Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras N/A
2013 ACL CVG Parsing with compositional vector grammars N/A
2018 AAAI GroundNet Using Syntax to Ground Referring Expressions in Natural Images Code
2019 TPAMI RVGTree Learning to Compose and Reason with Language Tree Structures for Visual Grounding N/A
2024 CVPR ARPGrounding Investigating Compositional Challenges in Vision-Language Models for Visual Grounding N/A
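
A common thread in the parsing-based methods above is to decompose an expression into a head noun (the referent) plus attribute and relation modifiers before grounding each part. Below is a minimal sketch with spaCy (listed in the table), assuming the en_core_web_sm model is installed; the decomposition is illustrative rather than any paper's exact recipe.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # install: python -m spacy download en_core_web_sm
doc = nlp("the small dog to the left of the red car")

# For a noun-phrase fragment, the dependency root is the head noun (the referent).
head = next(tok for tok in doc if tok.dep_ == "ROOT")
attributes = [tok.text for tok in head.children if tok.dep_ == "amod"]
relations = [list(tok.subtree) for tok in head.children if tok.dep_ == "prep"]
print(head.text, attributes, relations)  # e.g. dog ['small'] [[to, the, left, of, the, red, car]]
```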

2.2 Spatial Relation and Graph Networks

Year Venue Name Paper Title / Paper Link Code / Project
2023 TMM CLIPREC CLIPREC: Graph-Based Domain Adaptive Network for Zero-Shot Referring Expression Comprehension N/A
2024 ACM MM ResVG ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding code
2023 Arxiv Shikra Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic Code
2023 ACM MM TAGRL Towards adaptable graph representation learning: An adaptive multi-graph contrastive transformer N/A
2020 AAAI CMCC Learning cross-modal context graph for visual grounding code
2019 CVPR LGRANs Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks N/A
2019 CVPR CMRIN Cross-Modal Relationship Inference for Grounding Referring Expressions N/A
2019 ICCV DGA Dynamic Graph Attention for Referring Expression Comprehension N/A
2024 TPAMI N/A A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective N/A

2.3 Modular Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2018 CVPR Mattnet Mattnet: Modular attention network for referring expression comprehension Code
2017 CVPR CMN Modeling relationships in referential expressions with compositional modular networks Code
2016 CVPR NMN Neural Module Networks code
2019 CVPR MTGCR Modularized Textual Grounding for Counterfactual Resilience N/A

3. Applications

Year Venue Name Paper Title / Paper Link Code / Project
2019 CVPR CAGDC Context and Attribute Grounded Dense Captioning N/A

3.1 Grounded Object Detection

Year Venue Name Paper Title / Paper Link Code / Project
2024 NeurIPS MQ-Det Multi-modal queried object detection in the wild code
2023 Arxiv Shikra Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic Code
2022 CVPR GLIP Grounded language-image pretraining Code
2024 CVPR ScanFormer ScanFormer: Referring Expression Comprehension by Iteratively Scanning N/A
2024 Arxiv Ref-L4 Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models code

3.2 Referring Counting

Year Venue Name Paper Title / Paper Link Code / Project
2024 CVPR RefCount Referring expression counting Code

3.3 Remote Sensing Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2024 TGRS RRSIS RRSIS: Referring remote sensing image segmentation code
2024 TGRS LQVG Language query based transformer with multi-scale cross-modal alignment for visual grounding on remote sensing images code
2024 TGRS RINet A regionally indicated visual grounding network for remote sensing images code
2024 GRSL MSAM Multi-stage synergistic aggregation network for remote sensing visual grounding code
2024 GRSL VSMR Visual selection and multi-stage reasoning for rsvg N/A
2024 TGRS LPVA Language-guided progressive attention for visual grounding in remote sensing images code
2024 Arxiv GeoGround GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding code
2023 TGRS RSVG RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data N/A
2022 ACM MM RSVG Visual grounding in remote sensing images code

3.4 Medical Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2023 MICCAI MedRPG Medical Grounding with Region-Phrase Context Contrastive Alignment N/A
2024 Arxiv PFMVG Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding N/A
2022 ECCV CXR-BERT Making the most of text semantics to improve biomedical vision–language processing code
2017 CVPR ChestX-ray8 Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases N/A
2019 Arxiv MIMIC-CXR-JPG MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs Code
2024 Arxiv MedRG MedRG: Medical Report Grounding with Multi-modal Large Language Model N/A
2024 Arxiv VividMed VividMed: Vision Language Model with Versatile Visual Grounding for Medicine Code
2023 Arxiv ViLaM ViLaM: A Vision-Language Model with Enhanced Visual Grounding and Generalization Capability Code

3.5 3D Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2022 CVPR 3D-SPS 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection Code
2021 ACM MM TransRefer3D TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding Code
2020 ECCV Scanrefer Scanrefer: 3d object localization in rgb-d scans using natural language Code
2020 ECCV ReferIt3D ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes Code
2024 Arxiv - A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions N/A

3.6 Video Object Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2020 CVPR VOGNet Video object grounding using semantic roles in language description Code
2024 Arxiv - Described Spatial-Temporal Video Detection N/A
2023 TOMM - A survey on temporal sentence grounding in videos N/A
2023 TPAMI - Temporal sentence grounding in videos: A survey and future directions N/A
2024 CVPR MC-TTA Modality-Collaborative Test-Time Adaptation for Action Recognition N/A
2023 CVPR TransRMOT Referring multi-object tracking code

3.7 Robotic and Multimodal Agent Applications

Year Venue Name Paper Title / Paper Link Code / Project
2018 CVPR VLN Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments Data
2019 RAS Dynamic-SLAM Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment Code
2019 WCSP N/A Integrated Wearable Indoor Positioning System Based On Visible Light Positioning And Inertial Navigation Using Unscented Kalman Filter N/A
2019 ICRA Ground then Navigate Ground then Navigate: Language-guided Navigation in Dynamic Scenes Code
2023 MEAS SCI TECHNOL FDO-Calibr FDO-Calibr: visual-aided IMU calibration based on frequency-domain optimization N/A
2024 arxiv HiFi-CS Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models N/A
2024 ECCV Ferret-UI Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs N/A

4. Datasets and Benchmarks

4.1 The Five Datasets for Classical Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2010 CVIU N/A The segmented and annotated IAPR TC-12 benchmark N/A
2014 ECCV MS COCO Microsoft COCO: Common Objects in Context Project
2014 TACL N/A From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions N/A
2015 ICCV Flickr30k Entities Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models Code
2016 ECCV RefCOCOg-umd Modeling context between objects for referring expression understanding N/A
2016 CVPR RefCOCOg-g Generation and comprehension of unambiguous object descriptions Code
2016 ECCV RefCOCO/+ Modeling context in referring expressions Data
2017 IJCV Visual genome Visual genome: Connecting language and vision using crowdsourced dense image annotations N/A
2019 CVPR TD-SDR Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments Code
2019 CVPR CLEVR CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning Code
2020 CVPR REVERIE REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments Code
2020 CVPR PANDA PANDA: A Gigapixel-level Human-centric Video Dataset Code
2024 arxiv DINO-X DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding Code
2024 arxiv MC-Bench MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs Code
2024 arxiv T-Rex2 T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy Code

4.2 The Other Datasets for Classical Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2024 Arxiv VLM-VG Learning visual grounding from generative vision and language model N/A
2011 NeurIPS SBU Im2text: Describing images using 1 million captioned photographs N/A
2016 CVPR Visual7W Visual7W: Grounded Question Answering in Images Code
2017 CVPR GuessWhat?! GuessWhat?! Visual object discovery through multi-modal dialogue
2018 ACL CC3M Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning Code
2019 CVPR Clevr-ref+ Clevr-ref+: Diagnosing visual reasoning with referring expressions Code
2019 ICCV Objects365 Objects365: A Large-scale, High-quality Dataset for Object Detection Code
2020 IJCV Open Image The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale Code
2020 CVPR Cops-ref Cops-ref: A new dataset and task on compositional referring expression comprehension Code
2020 ACL Refer360° Refer360°: A referring expression recognition dataset in 360° images Code
2021 CVPR CC12M Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts Code
2023 ICCV SAM Segment Anything Code

4.3 Datasets for the Newly Curated Scenarios

Year Venue Name Paper Title / Paper Link Code / Project
2023 NeurIPS D³ Described Object Detection: Liberating Object Detection with Flexible Expressions Code

A. Datasets for Generalized Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2023 CVPR gRefCOCO GRES: Generalized Referring Expression Segmentation Code
2023 ICCV Ref-ZOM Beyond One-to-One: Rethinking the Referring Image Segmentation Code

B. Datasets and Benchmarks for GMLLMs

Year Venue Name Paper Title / Paper Link Code / Project
2024 NeurIPS HC-RefLoCo A Large-Scale Human-Centric Benchmark for Referring Expression Comprehension in the LMM Era Code
2024 ECCV GVC Llava-grounding: Grounded visual chat with large multimodal models N/A
2024 ICLR KOSMOS-2 GROUNDING MULTIMODAL LARGE LANGUAGE MODELS TO THE WORLD Code

C. Datasets for Other Newly Curated Scenarios

Year Venue Name Paper Title / Paper Link Code / Project
2024 CVPR GigaGround When visual grounding meets gigapixel-level large-scale scenes: Benchmark and approach Code

5. Challenges And Outlook

Year Venue Name Paper Title / Paper Link Code / Project
2024 Arxiv - AI Models Collapse When Trained on Recursively Generated Data N/A
2024 CVPR RefCount Referring expression counting Code
2024 CVPR GigaGround When visual grounding meets gigapixel-level large-scale scenes: Benchmark and approach Code
2022 CVPR GLIP Grounded language-image pretraining Code

6. Other Valuable Survey and Project

Year Venue Name Paper Title / Paper Link Code / Project
2018 TPAMI N/A Multimodal machine learning: A survey and taxonomy N/A
2020 TMM N/A Referring expression comprehension: A survey of methods and datasets N/A
2021 Github awesome-grounding N/A Project
2023 TPAMI Awesome-Open-Vocabulary Towards Open Vocabulary Learning: A Survey Project
2023 TPAMI N/A Multimodal learning with transformers: A survey N/A
2024 Github awesome-described-object-detection N/A awesome-described-object-detection

Acknowledgement

This survey took half a year to complete, and the process was laborious and burdensome.

Building up this GitHub repository also required significant effort. We would like to thank the following individuals for their contributions to completing this project: Baochen Xiong, Yifan Xu, Yaguang Song, Menghao Hu, Han Jiang, Hao Liu, Chenlin Zhao, Fang Peng, Xudong Yao, Zibo Shao, Kaichen Li, Jianhao Huang, Xianbing Yang, Shuaitong Li, Jisheng Yin, Yupeng Wu, Shaobo Xie, etc.

Contact

Email: [email protected]. Any kind of discussion is welcome!

