
linhuixiao/Awesome-Visual-Grounding


Towards Visual Grounding: A Survey

TPAMI under review, 2024
Linhui Xiao · Xiaoshan Yang · Xiangyuan Lan · Yaowei Wang · Changsheng Xu

arXiv PDF

An Illustration of Visual Grounding

A Decade of Visual Grounding

This repo records, tracks, and benchmarks recent visual grounding methods to supplement our grounding survey.

🔥 Add Your Paper to Our Repo and Survey!

  • If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add the missing papers to this repo as soon as possible.

  • You are welcome to open an issue or a PR (pull request) for your visual grounding related work!

  • Note: due to the huge number of papers on arXiv, we are sorry that we cannot cover all of them in our survey. You can directly open a PR to this repo, and we will record the work in the next version update of our survey.

🔥 New

  • The next version of our survey is expected to be updated on June 1, 2025.

  • 🔥 We made our survey paper public and created this repository on December 28, 2024.

  • Our grounding work OneRef (Paper, Code) was accepted by the top conference NeurIPS 2024 in October 2024!

  • Our grounding work HiVG (Paper, Code) was accepted by the top conference ACM MM 2024 in July 2024!

  • Our grounding work CLIP-VG (Paper, Code) was accepted by the top journal TMM in September 2023!

🔥 Highlight!!

  • A comprehensive survey for Visual Grounding, including Referring Expression Comprehension and Phrase Grounding.

  • It covers newly emerging concepts, such as grounding multimodal LLMs, generalized visual grounding, and VLP-based grounding transfer works.

  • We list detailed results for the most representative works and provide a fairer and clearer comparison of different approaches (see the metric sketch after this list).

  • We provide a list of future research insights.
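
For context on the result tables below: REC performance is conventionally reported as [email protected], i.e., a prediction counts as correct when the intersection-over-union (IoU) between the predicted box and the ground-truth box exceeds 0.5. The sketch below is our own illustration of this metric; the box format and helper names are not from any particular codebase.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) tuples."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """[email protected]: fraction of expressions whose predicted box matches the ground truth."""
    hits = sum(box_iou(p, g) > threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```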

Introduction

To our knowledge, this is the first survey in the past five years to systematically track and summarize the development of visual grounding over the last decade. By extracting common technical details, this review encompasses the most representative works in each subtopic.

This survey is also currently the most comprehensive review in the field of visual grounding. We aim for this article to serve as a valuable resource not only for beginners seeking an introduction to grounding but also for researchers with an established foundation, enabling them to navigate and stay up-to-date with the latest advancements.
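
For beginners, the quickest way to get a feel for the task is to run an off-the-shelf open-set grounding model on an image and a referring expression. Below is a minimal sketch using Grounding DINO (covered in the tables below) through the Hugging Face transformers library; the checkpoint name, thresholds, and image path are illustrative, and the exact post-processing API may differ across transformers versions.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("example.jpg")      # any local image
text = "the dog on the left."          # lowercase, '.'-terminated phrases

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes to image-space boxes and filter by confidence.
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]
print(results["boxes"], results["scores"])
```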

A Decade of Visual Grounding

Mainstream Settings in Visual Grounding

Typical Framework Architectures for Visual Grounding

Our Paper Structure

Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@misc{xiao2024visualgroundingsurvey,
      title={Towards Visual Grounding: A Survey}, 
      author={Linhui Xiao and Xiaoshan Yang and Xiangyuan Lan and Yaowei Wang and Changsheng Xu},
      year={2024},
      eprint={2412.20206},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.20206}, 
}

Note that, due to the journal's typesetting restrictions, there are small differences in layout between the arXiv version and the review version.

The following sections list the relevant grounding papers covered in the survey, together with their associated code links:

Summary of Contents

This content corresponds to the main text.

1. Methods: A Survey

1.1 Fully Supervised Setting

A. Traditional CNN-based Methods

Year Venue Work Name Paper Title / Paper Link Code / Project
2016 CVPR NMI Generation and Comprehension of Unambiguous Object Descriptions Code
2016 ECCV SNLE Segmentation from Natural Language Expressions N/A
2018 TPAMI Similarity Network Learning Two-Branch Neural Networks for Image-Text Matching Tasks N/A
2018 ECCV CITE Conditional Image-Text Embedding Networks Code
2018 IJCAI DDPN Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding Code
2014 EMNLP Referitgame Referitgame: Referring to objects in photographs of natural scenes Code
2015 CVPR DMSM From captions to visual concepts and back Project
2016 CVPR SCRC Natural language object retrieval Code
2018 ACCV PIRC Pirc net: Using proposal indexing, relationships and context for phrase grounding N/A
2016 ECCV Visdif Modeling context in referring expressions Data
2018 CVPR Mattnet Mattnet: Modular attention network for referring expression comprehension Code
2020 AAAI CMCC Learning cross-modal context graph for visual grounding code
2016 CVPR YOLO You only look once: Unified, real-time object detection Project
2018 CVPR YOLOv3 Yolov3: An incremental improvement Project
2017 ICCV Attribute Referring Expression Generation and Comprehension via Attributes N/A
2017 CVPR CG Comprehension-guided referring expressions N/A
2017 CVPR CMN Modeling relationships in referential expressions with compositional modular networks Code
2018 CVPR PLAN Parallel attention: A unified framework for visual object discovery through dialogs and queries N/A
2018 CVPR VC Grounding Referring Expressions in Images by Variational Context code
2018 ArXiv SSG Real-time referring expression comprehension by single-stage grounding network N/A
2018 CVPR A-ATT Visual grounding via accumulated attention N/A
2019 ICCV DGA Dynamic Graph Attention for Referring Expression Comprehension N/A
2020 CVPR RCCF A real-time cross-modality correlation filtering method for referring expression comprehension N/A
2021 CVPR LBYL Look before you leap: Learning landmark features for one-stage visual grounding code
2019 CVPR CM-Att-E Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing N/A
2019 ICCV FAOA A Fast and Accurate One-Stage Approach to Visual Grounding N/A
2016 ECCV Neg Bag Modeling context between objects for referring expression understanding N/A
2020 ECCV ReSC Improving one-stage visual grounding by recursive sub-query construction Code

B. Transformer-based Methods

Year Venue Work Name Paper Title / Paper Link Code / Project
2021 ICCV TransVG Transvg: End-to-end Visual Grounding with Transformers Code
2023 TPAMI TransVG++ TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer N/A
2022 CVPR QRNet Shifting More Attention to Visual Backbone: Query-Modulated Refinement Networks for End-to-End Visual Grounding Code
2024 ACM MM MMCA Visual grounding with multimodal conditional adaptation Code

C. VLP-based Methods

Year Venue Name Paper Title / Paper Link Code / Project
2023 TMM CLIP-VG CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding Code
2023 TPAMI D-MDETR Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding Code
2022 TNNLS Word2Pix Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding Code
2023 AAAI LADS Referring Expression Comprehension Using Language Adaptive Inference N/A
2023 TIM JMRI Visual Grounding With Joint Multimodal Representation and Interaction N/A
2024 ACM MM HiVG HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding Code
2023 AAAI DQ-DETR DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding Code
2022 NeurIPS FIBER Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone Code
2022 EMNLP mPLUG mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections Code
2022 CVPR Cris Cris: Clip driven referring image segmentation Code
2024 NAACL RISCLIP Extending clip’s image-text alignment to referring image segmentation N/A

D. Grounding-oriented Pre-training

Year Venue Name Paper Title / Paper Link Code / Project
2022 ICML OFA OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework Code
2022 ECCV UniTAB UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling Code
2024 ECCV GVC Llava-grounding: Grounded visual chat with large multimodal models N/A
2022 CVPR GLIP Grounded language-image pretraining Code
2021 CVPR OVR-CNN Open-vocabulary object detection using captions Code
2021 CVPR MDETR MDETR - Modulated Detection for End-to-End Multi-Modal Understanding Code
2024 NeurIPS OneRef OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling Code
2020 ECCV UNITER UNITER: UNiversal Image-TExt Representation Learning Code
2020 NeurIPS VILLA Large-Scale Adversarial Training for Vision-and-Language Representation Learning Code
2022 NeurIPS Glipv2 Glipv2: Unifying localization and vision-language understanding Code
2023 NeurIPS HIPIE Hierarchical open-vocabulary universal image segmentation Code
2023 CVPR UNINEXT Universal instance perception as object discovery and retrieval Code
2019 NeurIPS Vilbert Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks Code
2020 ICLR Vl-bert Vl-bert: Pre-training of generic visual-linguistic representations Code Project
2023 arXiv ONE-PEACE One-peace: Exploring one general representation model toward unlimited modalities Code
2022 FTCGV N/A Vision-language pre-training: Basics, recent advances, and future trends N/A
2023 MIR N/A Large-scale multi-modal pre-trained models: A comprehensive survey N/A

E. Grounding Multimodal LLMs

Year Venue Name Paper Title / Paper Link Code / Project
2023 Arxiv Shikra Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic Code
2022 NeurIPS Chinchilla Training Compute-Optimal Large Language Models N/A
2019 OpenAI GPT-2 Language Models are Unsupervised Multitask Learners N/A
2020 NeurIPS GPT-3 Language Models are Few-Shot Learners N/A
2024 ICLR Ferret Ferret: Refer And Ground Anything Anywhere At Any Granularity Code
2024 CVPR LION LION: Empowering Multimodal Large Language Model With Dual-Level Visual Knowledge Code
2022 ECCV YORO YORO - Lightweight End to End Visual Grounding Code
2022 NeurIPS Adaptformer Adaptformer: Adapting vision transformers for scalable visual recognition Code
2023 ICML Blip-2 Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models Code
2024 CVPR Glamm Glamm: Pixel grounding large multimodal model Code
2024 CVPR Lisa Lisa: Reasoning segmentation via large language model GitHub
2024 CVPR GSVA GSVA: Generalized segmentation via multimodal large language models GitHub
2024 CoRR UnifiedMLLM UnifiedMLLM: Enabling unified representation for multi-modal multi-tasks with large language model GitHub
2024 arXiv F-LMM F-LMM: Grounding frozen large multimodal models GitHub
2024 arXiv Vigor Vigor: Improving visual grounding of large vision language models with fine-grained reward modeling GitHub
2023 arXiv BuboGPT BuboGPT: Enabling visual grounding in multi-modal LLMs GitHub
2024 ICLR MiniGPT-4 MiniGPT-4: Enhancing vision-language understanding with advanced large language models GitHub
2024 CVPR RegionGPT RegionGPT: Towards region understanding vision language model GitHub
2024 arXiv TextHawk TextHawk: Exploring efficient fine-grained perception of multimodal large language models GitHub
2024 ACM TMM PEAR Multimodal PEAR: Chain-of-thought reasoning for multimodal sentiment analysis GitHub
2024 ECCV Grounding DINO Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection GitHub
2023 CVPR Polyformer Polyformer: Referring image segmentation as sequential polygon generation GitHub
2024 ACM TMM UniQRNet UniQRNet: Unifying referring expression grounding and segmentation with QRNet GitHub
2022 CVPR LAVT LAVT: Language-aware vision transformer for referring image segmentation GitHub
2024 NeurIPS SimVG SimVG: A simple framework for visual grounding with decoupled multi-modal fusion GitHub
2024 ICLR KOSMOS-2 Kosmos-2: Grounding Multimodal Large Language Models to the World Code
2023 Arxiv Qwen-VL Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond Code
2023 Arxiv Lenna Lenna: Language enhanced reasoning detection assistant Code
2023 Arxiv u-LLaVA u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model Code
2024 Arxiv Cogvlm Cogvlm: Visual expert for pretrained language models Code
2024 CVPR VistaLLM Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model N/A
2024 CoRR VisCoT Visual cot: Unleashing chain-of-thought reasoning in multimodal language models Code
2024 COLM Ferret-v2 Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models N/A
2023 arXiv NExT-Chat NExT-Chat: An LMM for Chat, Detection and Segmentation Code
2023 arXiv MiniGPT-v2 MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning Code
2024 ACL G-GPT GroundingGPT: Language Enhanced Multi-modal Grounding Model Code
2024 ECCV Groma Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models Code
2023 NeurIPS VisionLLM Visionllm: Large language model is also an open-ended decoder for vision-centric tasks Code
2022 NeurIPS InstructGPT Training language models to follow instructions with human feedback Code
2023 arXiv GPT-4 Gpt-4 technical report Code
2023 arXiv Llama Llama: Open and efficient foundation language models Code
2023 JMLR Palm Palm: Scaling language modeling with pathways Code
2023 N/A Alpaca Stanford alpaca: An instruction-following llama model Code Project
2023 arXiv N/A Instruction tuning with gpt-4 Code Project
2023 NeurIPS KOSMOS-1 Language is not all you need: Aligning perception with language models Code
2024 TMLR Dinov2 Dinov2: Learning robust visual features without supervision Code

1.2 Weakly Supervised Setting

Year Venue Name Paper Title / Paper Link Code / Project
2016 ECCV GroundR Grounding of Textual Phrases in Images by Reconstruction N/A
2017 CVPR N/A Weakly-supervised Visual Grounding of Phrases with Linguistic Structures N/A
2014 EMNLP Glove GloVe: Global Vectors for Word Representation Project
2015 CVPR N/A Deep Visual-Semantic Alignments for Generating Image Descriptions Project Code
2017 ICCV Mask R-CNN Mask R-CNN Code
2017 ICCV Grad-CAM Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Code
2018 CVPR KAC Knowledge Aided Consistency for Weakly Supervised Phrase Grounding Code
2018 arXiv CPC Representation learning with contrastive predictive coding Code
2019 ACM MM KPRN Knowledge-guided pairwise reconstruction network for weakly supervised referring expression grounding Code
2021 ICCV GbS Detector-free weakly supervised grounding by separation Code
2021 TPAMI DTWREG Discriminative Triad Matching and Reconstruction for Weakly Referring Expression Grounding Code
2021 CVPR ReIR Relation-aware Instance Refinement for Weakly Supervised Visual Grounding Code
2022 ICML BLIP BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation Code
2022 CVPR Mask2Former Masked-attention Mask Transformer for Universal Image Segmentation Project Code
2023 ACM MM CACMRN Client-adaptive cross-model reconstruction network for modality-incomplete multimodal federated learning N/A
2023 CVPR g++ Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding Code
2023 CVPR RefCLIP RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension Code
2024 TOMM UR Universal Relocalizer for Weakly Supervised Referring Expression Grounding N/A
2024 ICASSP VPT-WSVG Visual prompt tuning for weakly supervised phrase grounding N/A
2024 MMM PPT Part-Aware Prompt Tuning for Weakly Supervised Referring Expression Grounding N/A
2018 CVPR MATN Weakly Supervised Phrase Localization With Multi-Scale Anchored Transformer Network N/A
2019 ICCV ARN Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding Code
2019 ICCV Align2Ground Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment N/A
2020 ECCV info-ground Contrastive Learning for Weakly Supervised Phrase Grounding Project
2020 EMNLP MAF MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding Code
2020 NeurIPS CCL Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding N/A
2021 CVPR NCE-Distill Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation N/A
2022 TPAMI EARN Entity-Enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding Code
2022 ICML X-VLM Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts Code
2023 TMM DRLF A Dual Reinforcement Learning Framework for Weakly Supervised Phrase Grounding N/A
2023 TIP Cycle Cycle-Consistent Weakly Supervised Visual Grounding With Individual and Contextual Representations Code
2023 ICRA TGKD Weakly Supervised Referring Expression Grounding via Target-Guided Knowledge Distillation Code
2023 ICCV CPL Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding Code
2024 CVPR RSMPL Regressor-Segmenter Mutual Prompt Learning for Crowd Counting Code
2024 TCSVT PSRN Progressive Semantic Reconstruction Network for Weakly Supervised Referring Expression Grounding Code
2024 ACM MM QueryMatch QueryMatch: A Query-based Contrastive Learning Framework for Weakly Supervised Visual Grounding Code

1.3 Semi-supervised Setting

Year Venue Name Paper Title / Paper Link Code / Project
2023 ICASSP PQG-Distil Pseudo-Query Generation For Semi-Supervised Visual Grounding With Knowledge Distillation N/A
2021 WACV LSEP Utilizing Every Image Object for Semi-supervised Phrase Grounding N/A
2022 CRV SS-Ground Semi-supervised Grounding Alignment for Multi-modal Feature Learning N/A
2021 AAAI Curriculum Labeling Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning Code
2024 CoRR ACTRESS Actress: Active retraining for semi-supervised visual grounding N/A

1.4 Unsupervised Setting

Year Venue Name Paper Title / Paper Link Code / Project
2022 CVPR Pseudo-Q Pseudo-Q: Generating pseudo language queries for visual grounding Code
2018 CVPR N/A Unsupervised Textual Grounding: Linking Words to Image Concepts N/A
2023 TMM CLIP-VG CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding Code
2024 ICME VG-annotator VG-Annotator: Vision-Language Models as Query Annotators for Unsupervised Visual Grounding N/A
2023 TMM CLIPREC CLIPREC: Graph-Based Domain Adaptive Network for Zero-Shot Referring Expression Comprehension N/A
2019 IJCAI N/A Learning unsupervised visual grounding through semantic self-supervision N/A
2019 ICCV N/A Phrase Localization Without Paired Training Examples N/A
2023 Neurocomputing BiCM Unpaired referring expression grounding via bidirectional cross-modal matching N/A
2024 Neurocomputing N/A Self-training: A survey N/A
2024 CVPR Omni-q Omni-q: Omni-directional scene understanding for unsupervised visual grounding N/A

1.5 Zero-shot Setting

Year Venue Name Paper Title / Paper Link Code / Project
2019 ICCV ZSGNet Zero-shot Grounding of Objects from Natural Language Queries Code
2022 ACL ReCLIP ReCLIP: A Strong Zero-shot Baseline for Referring Expression Comprehension Code
2024 Neurocomputing OV-VG OV-VG: A Benchmark for Open-Vocabulary Visual Grounding Code
2023 TMM CLIPREC CLIPREC: Graph-Based Domain Adaptive Network for Zero-Shot Referring Expression Comprehension N/A
2024 Neurocomputing N/A Zero-shot visual grounding via coarse-to-fine representation learning Code
2022 Arxiv adapting-CLIP Adapting CLIP For Phrase Localization Without Further Training Code
2023 ICLR ChatRef Language models can do zero-shot visual referring expression comprehension Code
2024 AI Open Cpt CPT: Colorful Prompt Tuning for pre-trained vision-language models Code
2021 CVPR VinVL VinVL: Revisiting Visual Representations in Vision-Language Models Code
2024 CVPR VR-VLA Zero-shot referring expression comprehension via structural similarity between images and captions Code
2024 AAAI GroundVLP GroundVLP: Harnessing zero-shot visual grounding from vision-language pre-training and open-vocabulary object detection Code
2024 TCSVT MCCE-REC MCCE-REC: MLLM-driven Cross-modal Contrastive Entropy Model for Zero-shot Referring Expression Comprehension N/A
2024 ECCV CRG Contrastive Region Guidance: Improving Grounding in Vision-Language Models Without Training Code
2024 IJCNN PSAIR Psair: A neurosymbolic approach to zero-shot visual grounding N/A
2024 TPAMI TransCP Context disentangling and prototype inheriting for robust visual grounding Code
2024 TPAMI N/A Towards Open Vocabulary Learning: A Survey Code
2024 CVPR GEM Grounding everything: Emerging localization properties in vision-language transformers Code
2023 Arxiv GRILL Grill: Grounded vision-language pre-training via aligning text and image regions N/A
2017 ICCV Grad-CAM Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Code
2022 CVPR GLIP Grounded language-image pretraining Code
2022 AAAI MMKG Improving Zero-Shot Phrase Grounding via Reasoning on External Knowledge and Spatial Relations N/A
2021 CVPR OVR-CNN Open-vocabulary object detection using captions Code
2024 ICLR KOSMOS-2 Kosmos-2: Grounding Multimodal Large Language Models to the World Code
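
Many of the training-free approaches above (e.g., ReCLIP and adapting-CLIP) share a crop-and-score backbone: enumerate candidate regions, score each cropped region against the expression with a pretrained vision-language model, and return the best match. The toy sketch below illustrates only this shared idea with Hugging Face CLIP; it is not the method of any specific paper (ReCLIP, for instance, adds isolated-proposal scoring and relation handling), and the proposal boxes are assumed to come from an external detector.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_ground(image: Image.Image, expression: str, proposals):
    """Return the proposal (x1, y1, x2, y2) whose crop best matches the expression."""
    crops = [image.crop(box) for box in proposals]
    inputs = processor(text=[expression], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text has shape (1, num_crops): similarity of each crop to the text.
    best = out.logits_per_text.argmax(dim=-1).item()
    return proposals[best]
```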

1.6 Multi-task Setting

A. REC with REG Multi-task Setting

Year Venue Name Paper Title / Paper Link Code / Project
2024 Arxiv VLM-VG Learning visual grounding from generative vision and language model N/A
2024 Arxiv EEVG An efficient and effective transformer decoder-based framework for multi-task visual grounding Code
2006 INLGC N/A Building a Semantically Transparent Corpus for the Generation of Referring Expressions Project
2010 ACL N/A Natural reference to objects in a visual domain Code
2012 CL Survey Computational generation of referring expressions: A survey N/A
2013 NAACL N/A Generating expressions that refer to visible objects Code
2016 CVPR NMI Generation and comprehension of unambiguous object descriptions Code
2017 ICCV Attribute Referring Expression Generation and Comprehension via Attributes N/A
2017 CVPR SLR A Joint Speaker-Listener-Reinforcer Model for Referring Expressions N/A
2017 CVPR CG Comprehension-guided referring expressions N/A
2024 AAAI CyCo Cycle-Consistency Learning for Captioning and Grounding N/A

B. REC with RES Multi-task Setting

Year Venue Name Paper Title / Paper Link Code / Project
2020 CVPR MCN Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation code
2021 NeurIPS RefTR Referring Transformer: A One-step Approach to Multi-task Visual Grounding code
2022 ECCV SeqTR SeqTR: A Simple yet Universal Network for Visual Grounding code
2023 CVPR VG-LAW Language Adaptive Weight Generation for Multi-task Visual Grounding code
2024 Neurocomputing M2IF Improving visual grounding with multi-modal interaction and auto-regressive vertex generation Code

C. Other Multi-task Setting

Year Venue Name Paper Title / Paper Link Code / Project
2016 EMNLP MCB Multimodal compact bilinear pooling for visual question answering and visual grounding Code
2024 CVPR RefCount Referring expression counting Code
2022 CVPR VizWiz-VQA-Grounding Grounding Answers for Visual Questions Asked by Visually Impaired People Project
2022 ECCV N/A Weakly supervised grounding for VQA in vision-language transformers Code
2020 ACL N/A A Negative Case Analysis of Visual Grounding Methods for VQA Code
2024 Arxiv TrueVG Uncovering the Full Potential of Visual Grounding Methods in VQA Code
2020 IVC N/A Explaining VQA predictions using visual grounding and a knowledge base N/A
2019 CVPR N/A Multi-task Learning of Hierarchical Vision-Language Representation N/A

1.7 Generalized Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2021 CVPR OVR-CNN Open-Vocabulary Object Detection Using Captions Code
2021 ICCV VLT Vision-Language Transformer and Query Generation for Referring Segmentation Code
2023 Arxiv GREC GREC:Generalized Referring Expression Comprehension Code
2024 EMNLP RECANTFormer Recantformer: Referring expression comprehension with varying numbers of targets N/A
2023 CVPR gRefCOCO GRES: Generalized Referring Expression Segmentation Code
2023 ICCV Ref-ZOM Beyond One-to-One: Rethinking the Referring Image Segmentation Code

2. Advanced Topics

2.1 NLP Language Structure Parsing in Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2019 ICCV NMTree Learning to assemble neural module tree networks for visual grounding N/A
2017 CVPR CMN Modeling relationships in referential expressions with compositional modular networks Code
2015 EMNLP N/A An improved non-monotonic transition system for dependency parsing N/A
2014 EMNLP N/A A fast and accurate dependency parser using neural networks N/A
2020 NSP NLPPython Natural language processing with Python and spaCy: A practical introduction N/A
2020 Arxiv Stanza Stanza: A Python Natural Language Processing Toolkit for Many Human Languages Project
2016 ECCV N/A Structured matching for phrase localization N/A
2017 ICCV N/A Phrase localization and visual relationship detection with comprehensive image-language cues Code
2022 CVPR GLIP Grounded language-image pretraining Code
2017 ICCV QRC Net Query-guided regression network with context policy for phrase grounding N/A
2006 ACL NLTK Nltk: the natural language toolkit Code
2019 SNAMS OpenNLP A Replicable Comparison Study of NER Software: StanfordNLP, NLTK, OpenNLP, SpaCy, Gate N/A
2018 Packt Gensim Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras N/A
2013 ACL CVG Parsing with compositional vector grammars N/A
2018 AAAI GroundNet Using Syntax to Ground Referring Expressions in Natural Images Code
2019 TPAMI RVGTree Learning to Compose and Reason with Language Tree Structures for Visual Grounding N/A
2024 CVPR ARPGrounding Investigating Compositional Challenges in Vision-Language Models for Visual Grounding N/A
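
A common thread in the parsing-based methods above is to decompose an expression into a head noun (the referent) plus attribute and relation modifiers before grounding each part. Below is a minimal sketch with spaCy (listed in the table), assuming the en_core_web_sm model is installed; the decomposition is illustrative rather than any paper's exact recipe.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # install: python -m spacy download en_core_web_sm
doc = nlp("the small dog to the left of the red car")

# For a noun-phrase fragment, the dependency root is the head noun (the referent).
head = next(tok for tok in doc if tok.dep_ == "ROOT")
attributes = [tok.text for tok in head.children if tok.dep_ == "amod"]
relations = [list(tok.subtree) for tok in head.children if tok.dep_ == "prep"]
print(head.text, attributes, relations)  # e.g. dog ['small'] [[to, the, left, of, the, red, car]]
```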

2.2 Spatial Relation and Graph Networks

Year Venue Name Paper Title / Paper Link Code / Project
2023 TMM CLIPREC CLIPREC: Graph-Based Domain Adaptive Network for Zero-Shot Referring Expression Comprehension N/A
2024 ACM MM ResVG ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding code
2023 Arxiv Shikra Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic Code
2023 ACM MM TAGRL Towards adaptable graph representation learning: An adaptive multi-graph contrastive transformer N/A
2020 AAAI CMCC Learning cross-modal context graph for visual grounding code
2019 CVPR LGRANs Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks N/A
2019 CVPR CMRIN Cross-Modal Relationship Inference for Grounding Referring Expressions N/A
2019 ICCV DGA Dynamic Graph Attention for Referring Expression Comprehension N/A
2024 TPAMI N/A A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective N/A

2.3 Modular Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2018 CVPR Mattnet Mattnet: Modular attention network for referring expression comprehension Code
2017 CVPR CMN Modeling relationships in referential expressions with compositional modular networks Code
2016 CVPR NMN Neural Module Networks code
2019 CVPR MTGCR Modularized Textual Grounding for Counterfactual Resilience N/A

3. Applications

Year Venue Name Paper Title / Paper Link Code / Project
2019 CVPR CAGDC Context and Attribute Grounded Dense Captioning N/A

3.1 Grounded Object Detection

Year Venue Name Paper Title / Paper Link Code / Project
2024 NeurIPS MQ-Det Multi-modal queried object detection in the wild code
2023 Arxiv Shikra Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic Code
2022 CVPR GLIP Grounded language-image pretraining Code
2024 CVPR ScanFormer ScanFormer: Referring Expression Comprehension by Iteratively Scanning N/A
2024 Arxiv Ref-L4 Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models code

3.2 Referring Counting

Year Venue Name Paper Title / Paper Link Code / Project
2024 CVPR RefCount Referring expression counting Code

3.3 Remote Sensing Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2024 TGRS RRSIS RRSIS: Referring remote sensing image segmentation code
2024 TGRS LQVG Language query based transformer with multi-scale cross-modal alignment for visual grounding on remote sensing images code
2024 TGRS RINet A regionally indicated visual grounding network for remote sensing images code
2024 GRSL MSAM Multi-stage synergistic aggregation network for remote sensing visual grounding code
2024 GRSL VSMR Visual selection and multi-stage reasoning for rsvg N/A
2024 TGRS LPVA Language-guided progressive attention for visual grounding in remote sensing images code
2024 Arxiv GeoGround GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding code
2023 TGRS RSVG RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data N/A
2022 ACM MM RSVG Visual grounding in remote sensing images code

3.4 Medical Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2023 MICCAI MedRPG Medical Grounding with Region-Phrase Context Contrastive Alignment N/A
2024 Arxiv PFMVG Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding N/A
2022 ECCV CXR-BERT Making the most of text semantics to improve biomedical vision–language processing code
2017 CVPR ChestX-ray8 Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases N/A
2019 Arxiv MIMIC-CXR-JPG MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs Code
2024 Arxiv MedRG MedRG: Medical Report Grounding with Multi-modal Large Language Model N/A
2024 Arxiv VividMed VividMed: Vision Language Model with Versatile Visual Grounding for Medicine Code
2023 Arxiv ViLaM ViLaM: A Vision-Language Model with Enhanced Visual Grounding and Generalization Capability Code

3.5 3D Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2022 CVPR 3D-SPS 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection Code
2021 ACM MM TransRefer3D TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding Code
2020 ECCV Scanrefer Scanrefer: 3d object localization in rgb-d scans using natural language Code
2020 ECCV ReferIt3D ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes Code
2024 Arxiv - A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions N/A

3.6 Video Object Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2020 CVPR VOGNet Video object grounding using semantic roles in language description Code
2024 Arxiv - Described Spatial-Temporal Video Detection N/A
2023 TOMM - A survey on temporal sentence grounding in videos N/A
2023 TPAMI - Temporal sentence grounding in videos: A survey and future directions N/A
2024 CVPR MC-TTA Modality-Collaborative Test-Time Adaptation for Action Recognition N/A
2023 CVPR TransRMOT Referring multi-object tracking code

3.7 Robotic and Multimodal Agent Applications

Year Venue Name Paper Title / Paper Link Code / Project
2018 CVPR VLN Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments Data
2019 RAS Dynamic-SLAM Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment Code
2019 WCSP N/A Integrated Wearable Indoor Positioning System Based On Visible Light Positioning And Inertial Navigation Using Unscented Kalman Filter N/A
2019 ICRA Ground then Navigate Ground then Navigate: Language-guided Navigation in Dynamic Scenes Code
2023 MEAS SCI TECHNOL FDO-Calibr FDO-Calibr: visual-aided IMU calibration based on frequency-domain optimization N/A
2024 arxiv HiFi-CS Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models N/A
2024 ECCV Ferret-UI Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs N/A

4. Datasets and Benchmarks

4.1 The Five Datasets for Classical Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2010 CVIU N/A The segmented and annotated IAPR TC-12 benchmark N/A
2014 ECCV MS COCO Microsoft COCO: Common Objects in Context Project
2014 TACL N/A From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions N/A
2015 ICCV Flickr30k Entities Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models Code
2016 ECCV RefCOCOg-umd Modeling context between objects for referring expression understanding N/A
2016 CVPR RefCOCOg-g Generation and comprehension of unambiguous object descriptions Code
2016 ECCV RefCOCO/+ Modeling context in referring expressions Data
2017 IJCV Visual genome Visual genome: Connecting language and vision using crowdsourced dense image annotations N/A
2019 CVPR TD-SDR Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments Code
2019 CVPR CLEVR CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning Code
2020 CVPR REVERIE REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments Code
2020 CVPR PANDA PANDA: A Gigapixel-level Human-centric Video Dataset Code
2024 arxiv DINO-X DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding Code
2024 arxiv MC-Bench MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs Code
2024 arxiv T-Rex2 T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy Code

4.2 The Other Datasets for Classical Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2024 Arxiv VLM-VG Learning visual grounding from generative vision and language model N/A
2011 NeurIPS SBU Im2text: Describing images using 1 million captioned photographs N/A
2016 CVPR Visual7W Visual7W: Grounded Question Answering in Images Code
2017 CVPR GuessWhat?! GuessWhat?! Visual object discovery through multi-modal dialogue
2018 ACL CC3M Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning Code
2019 CVPR Clevr-ref+ Clevr-ref+: Diagnosing visual reasoning with referring expressions Code
2019 ICCV Objects365 Objects365: A Large-scale, High-quality Dataset for Object Detection Code
2020 IJCV Open Image The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale Code
2020 CVPR Cops-ref Cops-ref: A new dataset and task on compositional referring expression comprehension Code
2020 ACL Refer360° Refer360°: A referring expression recognition dataset in 360° images Code
2021 CVPR CC12M Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts Code
2023 ICCV SAM Segment Anything Code

4.3 Datasets for the Newly Curated Scenarios

Year Venue Name Paper Title / Paper Link Code / Project
2023 NeurIPS D³ Described Object Detection: Liberating Object Detection with Flexible Expressions Code

A. Datasets for Generalized Visual Grounding

Year Venue Name Paper Title / Paper Link Code / Project
2023 CVPR gRefCOCO GRES: Generalized Referring Expression Segmentation Code
2023 ICCV Ref-ZOM Beyond One-to-One: Rethinking the Referring Image Segmentation Code

B. Datasets and Benchmarks for GMLLMs

Year Venue Name Paper Title / Paper Link Code / Project
2024 NeurIPS HC-RefLoCo A Large-Scale Human-Centric Benchmark for Referring Expression Comprehension in the LMM Era Code
2024 ECCV GVC Llava-grounding: Grounded visual chat with large multimodal models N/A
2024 ICLR KOSMOS-2 GROUNDING MULTIMODAL LARGE LANGUAGE MODELS TO THE WORLD Code

C. Datasets for Other Newly Curated Scenarios

Year Venue Name Paper Title / Paper Link Code / Project
2024 CVPR GigaGround When visual grounding meets gigapixel-level large-scale scenes: Benchmark and approach Code

5. Challenges And Outlook

Year Venue Name Paper Title / Paper Link Code / Project
2024 Arxiv - AI Models Collapse When Trained on Recursively Generated Data N/A
2024 CVPR RefCount Referring expression counting Code
2024 CVPR GigaGround When visual grounding meets gigapixel-level large-scale scenes: Benchmark and approach Code
2022 CVPR GLIP Grounded language-image pretraining Code

6. Other Valuable Survey and Project

Year Venue Name Paper Title / Paper Link Code / Project
2018 TPAMI N/A Multimodal machine learning: A survey and taxonomy N/A
2020 TMM N/A Referring expression comprehension: A survey of methods and datasets N/A
2021 Github awesome-grounding N/A Project
2023 TPAMI Awesome-Open-Vocabulary Towards Open Vocabulary Learning: A Survey Project
2023 TPAMI N/A Multimodal learning with transformers: A survey N/A
2024 Github awesome-described-object-detection N/A awesome-described-object-detection

Acknowledgement

This survey took half a year to complete, and the process was laborious and burdensome.

Building up this GitHub repository also required significant effort. We would like to thank the following individuals for their contributions to completing this project: Baochen Xiong, Yifan Xu, Yaguang Song, Menghao Hu, Han Jiang, Hao Liu, Chenlin Zhao, Fang Peng, Xudong Yao, Zibo Shao, Kaichen Li, Jianhao Huang, Xianbing Yang, Shuaitong Li, Jisheng Yin, Yupeng Wu, Shaobo Xie, etc.

Contact

Email: [email protected]. Any kind of discussion is welcome!

