Utilizing Every Image Object for Semi-supervised Phrase Grounding
- URL: http://arxiv.org/abs/2011.02655v1
- Date: Thu, 5 Nov 2020 04:25:25 GMT
- Title: Utilizing Every Image Object for Semi-supervised Phrase Grounding
- Authors: Haidong Zhu, Arka Sadhu, Zhaoheng Zheng, Ram Nevatia
- Abstract summary: Phrase grounding models localize an object in the image given a referring expression.
In this paper, we study the case of applying objects without labeled queries to train semi-supervised phrase grounding.
We show that our predictors allow the grounding system to learn from objects without labeled queries and improve accuracy by a relative 34.9% with the detection results.
- Score: 25.36231298036066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Phrase grounding models localize an object in the image given a referring
expression. The annotated language queries available during training are
limited, which also limits the variations of language combinations that a model
can see during training. In this paper, we study the case of applying objects
without labeled queries to train semi-supervised phrase grounding. We
propose to use learned location and subject embedding predictors (LSEP) to
generate the corresponding language embeddings for objects lacking annotated
queries in the training set. With the assistance of the detector, we also apply
LSEP to train a grounding model on images without any annotation. We evaluate
our method based on MAttNet on three public datasets: RefCOCO, RefCOCO+, and
RefCOCOg. We show that our predictors allow the grounding system to learn from
the objects without labeled queries and improve accuracy by a relative 34.9%
with the detection results.
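For a concrete picture of the LSEP idea described above, here is a minimal sketch (not the authors' released code): two small predictors map an unlabeled object's pooled visual feature and normalized box geometry to pseudo "subject" and "location" phrase embeddings, which could stand in for the embeddings of an annotated query in a MAttNet-style matching loss. Module names, dimensions, and the loss pairing are illustrative assumptions.

```python
# Hedged sketch of LSEP-style pseudo language embedding prediction.
# All class names, dimensions, and the way outputs are consumed are assumptions.
import torch
import torch.nn as nn

class SubjectEmbeddingPredictor(nn.Module):
    """Predicts a pseudo subject-phrase embedding from an object's visual feature."""
    def __init__(self, vis_dim=2048, lang_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim, 1024), nn.ReLU(),
            nn.Linear(1024, lang_dim),
        )

    def forward(self, vis_feat):          # (B, vis_dim) pooled region features
        return self.mlp(vis_feat)         # (B, lang_dim) pseudo subject embedding

class LocationEmbeddingPredictor(nn.Module):
    """Predicts a pseudo location-phrase embedding from normalized box geometry."""
    def __init__(self, loc_dim=5, lang_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(loc_dim, 256), nn.ReLU(),
            nn.Linear(256, lang_dim),
        )

    def forward(self, boxes):             # (B, 5): normalized x1, y1, x2, y2, area
        return self.mlp(boxes)            # (B, lang_dim) pseudo location embedding

# Usage sketch: for detector boxes lacking annotated queries, the predicted
# embeddings replace the subject/location phrase embeddings that a
# MAttNet-style grounding loss would normally take from a labeled query.
subj_pred = SubjectEmbeddingPredictor()
loc_pred = LocationEmbeddingPredictor()
vis_feat = torch.randn(4, 2048)           # pooled region features for 4 boxes
boxes = torch.rand(4, 5)                  # normalized box geometry
pseudo_subj, pseudo_loc = subj_pred(vis_feat), loc_pred(boxes)
```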
Related papers
- Context-Informed Grounding Supervision [102.11698329887226]
Context-INformed Grounding Supervision (CINGS) is a post-training supervision method in which the model is trained with relevant context prepended to the response.
Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains.
arXiv Detail & Related papers (2025-06-18T14:13:56Z) - Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding [29.035369822597218]
Embodied 3D grounding aims to localize target objects described in human instructions from an ego-centric viewpoint.
Most methods typically follow a two-stage paradigm in which a trained 3D detector's optimized backbone parameters are used to initialize a grounding model.
In this study, we assess the grounding performance of detection models using predicted boxes filtered by the target category.
arXiv Detail & Related papers (2025-06-05T16:11:57Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z) - Learning Visual Grounding from Generative Vision and Language Model [29.2712567454021]
Visual grounding tasks aim to localize image regions based on natural language references.
We find that grounding knowledge already exists in generative VLM and can be elicited by proper prompting.
Our results demonstrate the promise of generative VLM to scale up visual grounding in the real world.
arXiv Detail & Related papers (2024-07-18T20:29:49Z) - GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection [24.48128633414131]
We propose a zero-shot method that harnesses visual grounding ability from existing models trained from image-text pairs and pure object detection data.
We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets.
arXiv Detail & Related papers (2023-12-22T20:14:55Z) - Grounding Everything: Emerging Localization Properties in Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
arXiv Detail & Related papers (2023-12-01T19:06:12Z) - A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models [28.746370086515977]
Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions.
We propose a framework to jointly study task performance and phrase grounding.
We show how this can be addressed through brute-force training on phrase grounding annotations.
arXiv Detail & Related papers (2023-09-06T03:54:57Z) - Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z) - Robust Object Detection in Remote Sensing Imagery with Noisy and Sparse Geo-Annotations (Full Version) [4.493174773769076]
In this paper, we present a novel approach for training object detectors with extremely noisy and incomplete annotations.
Our method is based on a teacher-student learning framework and a correction module accounting for imprecise and missing annotations.
We demonstrate that our approach improves standard detectors by 37.1% $AP_50$ on a noisy real-world remote-sensing dataset.
arXiv Detail & Related papers (2022-10-24T07:25:31Z) - Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding [35.01174511816063]
We present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training.
Our method leverages an off-the-shelf object detector to identify visual objects from unlabeled images.
We develop a visual-language model equipped with a multi-level cross-modality attention mechanism (a minimal sketch of the pseudo-query idea appears after this list).
arXiv Detail & Related papers (2022-03-16T09:17:41Z) - Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching [53.27673119360868]
Referring expression grounding is an important and challenging task in computer vision.
We propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges.
Our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
arXiv Detail & Related papers (2022-01-18T01:13:19Z) - Aligning Pretraining for Detection via Object-Level Contrastive Learning [57.845286545603415]
Image-level contrastive representation learning has proven to be highly effective as a generic model for transfer learning.
We argue that this could be sub-optimal and thus advocate a design principle which encourages alignment between the self-supervised pretext task and the downstream task.
Our method, called Selective Object COntrastive learning (SoCo), achieves state-of-the-art results for transfer performance on COCO detection.
arXiv Detail & Related papers (2021-06-04T17:59:52Z) - Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z)
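The Pseudo-Q entry above describes generating pseudo language queries from an off-the-shelf detector's outputs on unlabeled images. Below is a minimal, hypothetical sketch of that template idea under assumed fields (category, optional attribute, coarse position word); it is not the paper's implementation, whose templates and attribute extraction are richer.

```python
# Hedged sketch of template-based pseudo-query generation in the spirit of
# Pseudo-Q. Field names and the template are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Detection:
    category: str                        # e.g. "dog", from the detector
    box: Tuple[int, int, int, int]       # (x1, y1, x2, y2) in pixels
    attribute: Optional[str] = None      # e.g. "brown", if an attribute head exists

def spatial_word(box, image_w):
    """Map a box center to a coarse horizontal position word."""
    cx = (box[0] + box[2]) / 2.0
    if cx < image_w / 3:
        return "left"
    if cx > 2 * image_w / 3:
        return "right"
    return "middle"

def pseudo_query(det: Detection, image_w: int) -> str:
    """Fill a simple '<attribute> <category> on the <position>' template."""
    parts = [det.attribute, det.category, "on the", spatial_word(det.box, image_w)]
    return " ".join(p for p in parts if p)

# Example: an unlabeled image with one detection yields a trainable
# (pseudo query, box) pair for a grounding model.
print(pseudo_query(Detection("dog", (40, 80, 200, 300), "brown"), image_w=640))
# -> "brown dog on the left"
```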