OV-VG: A Benchmark for Open-Vocabulary Visual Grounding
- URL: http://arxiv.org/abs/2310.14374v1
- Date: Sun, 22 Oct 2023 17:54:53 GMT
- Title: OV-VG: A Benchmark for Open-Vocabulary Visual Grounding
- Authors: Chunlei Wang, Wenquan Feng, Xiangtai Li, Guangliang Cheng, Shuchang
Lyu, Binghao Liu, Lijiang Chen and Qi Zhao
- Abstract summary: This research endeavor introduces novel and challenging open-vocabulary visual tasks.
The overarching aim is to establish connections between language descriptions and the localization of novel objects.
We have curated a benchmark, encompassing 7,272 OV-VG images and 1,000 OV-PL images.
- Score: 33.02137080950678
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary learning has emerged as a cutting-edge research area,
particularly in light of the widespread adoption of vision-based foundational
models. Its primary objective is to comprehend novel concepts that are not
encompassed within a predefined vocabulary. One key facet of this endeavor is
Visual Grounding, which entails locating a specific region within an image
based on a corresponding language description. While current foundational
models excel at various visual language tasks, there's a noticeable absence of
models specifically tailored for open-vocabulary visual grounding. This
research endeavor introduces novel and challenging OV tasks, namely
Open-Vocabulary Visual Grounding and Open-Vocabulary Phrase Localization. The
overarching aim is to establish connections between language descriptions and
the localization of novel objects. To facilitate this, we have curated a
comprehensive annotated benchmark, encompassing 7,272 OV-VG images and 1,000
OV-PL images. In our pursuit of addressing these challenges, we delved into
various baseline methodologies rooted in existing open-vocabulary object
detection, VG, and phrase localization frameworks. Surprisingly, we discovered
that state-of-the-art methods often falter in diverse scenarios. Consequently,
we developed a novel framework that integrates two critical components:
Text-Image Query Selection and Language-Guided Feature Attention. These modules
are designed to bolster the recognition of novel categories and enhance the
alignment between visual and linguistic information. Extensive experiments
demonstrate the efficacy of our proposed framework, which consistently attains
SOTA performance on the OV-VG task. Additionally, ablation studies provide
further evidence of the effectiveness of our innovative models. Code and
datasets will be made publicly available at https://github.com/cv516Buaa/OV-VG.
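The abstract names two components, Text-Image Query Selection and Language-Guided Feature Attention, without giving implementation details on this page. The sketch below is only an illustration of how such modules are commonly realized: text-conditioned selection of visual tokens to serve as decoder queries, followed by cross-attention in which language tokens re-weight visual features. All class and function names, tensor shapes, and hyperparameters here are assumptions, not the authors' released code; the linked repository is the authoritative source.

```python
# Illustrative sketch only -- names and shapes are assumptions inferred from the
# abstract, not the OV-VG reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def text_image_query_selection(visual_feats, text_feats, num_queries=100):
    """Pick the visual tokens most similar to the text as decoder queries
    (an assumed reading of 'Text-Image Query Selection')."""
    # visual_feats: (B, N_patches, D), text_feats: (B, N_tokens, D)
    sim = torch.einsum("bnd,bmd->bnm",
                       F.normalize(visual_feats, dim=-1),
                       F.normalize(text_feats, dim=-1))   # (B, N_patches, N_tokens)
    scores = sim.max(dim=-1).values                        # best text match per patch
    top_idx = scores.topk(num_queries, dim=1).indices      # (B, num_queries)
    return torch.gather(
        visual_feats, 1,
        top_idx.unsqueeze(-1).expand(-1, -1, visual_feats.size(-1)))


class LanguageGuidedFeatureAttention(nn.Module):
    """Cross-attention in which language tokens re-weight visual features,
    tightening vision-language alignment (an assumed reading of the abstract)."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats, text_feats):
        attended, _ = self.attn(query=visual_feats, key=text_feats, value=text_feats)
        return self.norm(visual_feats + attended)           # residual fusion


if __name__ == "__main__":
    v, t = torch.randn(2, 196, 256), torch.randn(2, 12, 256)
    fused = LanguageGuidedFeatureAttention()(v, t)
    queries = text_image_query_selection(fused, t, num_queries=32)
    print(queries.shape)  # torch.Size([2, 32, 256])
```

In a full grounding pipeline, the selected queries would feed a detection decoder that regresses the box for the described object; this sketch stops at query selection and feature fusion.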
Related papers
- Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image [70.02187124865627]
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes.
We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z)
- Hyperbolic Learning with Synthetic Captions for Open-World Detection [26.77840603264043]
We propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically.
Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions on different regions in images.
We also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings.
arXiv Detail & Related papers (2024-04-07T17:06:22Z)
- LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [58.75140338866403]
DVDet is a Descriptor-Enhanced Open Vocabulary Detector.
It transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training.
Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.
arXiv Detail & Related papers (2024-02-07T07:26:49Z)
- VLLaVO: Mitigating Visual Gap through LLMs [7.352822795984628]
Cross-domain learning aims at extracting domain-invariant knowledge to reduce the domain shift between training and testing data.
We propose VLLaVO, combining Vision language models and Large Language models as Visual cross-dOmain learners.
arXiv Detail & Related papers (2024-01-06T16:33:39Z)
- Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects [11.117055725415446]
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios.
The absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors.
We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z)
- Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation [45.02052030837188]
We study open-vocabulary domain adaptation (OVDA), a new unsupervised domain adaptation framework.
We design a Prompt Ensemble Self-training (PEST) technique that exploits the synergy between vision and language.
PEST outperforms the state-of-the-art consistently across 10 image recognition tasks.
arXiv Detail & Related papers (2023-06-29T03:39:35Z)
- Towards Open Vocabulary Learning: A Survey [146.90188069113213]
Deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection.
Recently, open vocabulary settings were proposed due to the rapid progress of vision language pre-training.
This paper provides a thorough review of open vocabulary learning, summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2023-06-28T02:33:06Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts [14.808701042367401]
We argue that the use of object detection may not be suitable for vision language pre-training.
This paper proposes a new method called X-VLM to perform multi-grained vision language pre-training.
arXiv Detail & Related papers (2021-11-16T07:55:26Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)