Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
- URL: http://arxiv.org/abs/2203.08481v1
- Date: Wed, 16 Mar 2022 09:17:41 GMT
- Title: Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
- Authors: Haojun Jiang, Yuanze Lin, Dongchen Han, Shiji Song, Gao Huang
- Abstract summary: We present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training.
Our method leverages an off-the-shelf object detector to identify visual objects from unlabeled images.
We develop a visual-language model equipped with a multi-level cross-modality attention mechanism.
- Score: 35.01174511816063
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visual grounding, i.e., localizing objects in images according to natural
language queries, is an important topic in visual language understanding. The
most effective approaches for this task are based on deep learning and
generally require expensive, manually labeled image-query or patch-query pairs.
To eliminate the heavy dependence on human annotations, we present a novel
method, named Pseudo-Q, to automatically generate pseudo language queries for
supervised training. Our method leverages an off-the-shelf object detector to
identify visual objects from unlabeled images, and then language queries for
these objects are obtained in an unsupervised fashion with a pseudo-query
generation module. Then, we design a task-related query prompt module to
specifically tailor generated pseudo language queries for visual grounding
tasks. Further, in order to fully capture the contextual relationships between
images and language queries, we develop a visual-language model equipped with
a multi-level cross-modality attention mechanism. Extensive experimental results
demonstrate that our method has two notable benefits: (1) it can reduce human
annotation costs significantly, e.g., by 31% on RefCOCO, without degrading the
original model's performance under the fully supervised setting, and (2) without
bells and whistles, it achieves superior or comparable performance compared to
state-of-the-art weakly-supervised visual grounding methods on all five
datasets we experimented on. Code is available at
https://github.com/LeapLabTHU/Pseudo-Q.
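
The pipeline described in the abstract (detect objects, derive a query for each one without labels, then wrap the query in a task prompt) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the Pseudo-Q code: the `Detection` schema, the left/middle/right rule for telling same-category objects apart, and the wording of `PROMPT_TEMPLATE` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    """One object reported by an off-the-shelf detector (hypothetical schema)."""
    category: str
    box: Box
    attribute: Optional[str] = None


def center_x(det: Detection) -> float:
    """Horizontal center of the detection's box."""
    return (det.box[0] + det.box[2]) / 2.0


def spatial_word(det: Detection, peers: List[Detection]) -> Optional[str]:
    """Rank same-category objects by horizontal center (illustrative rule).

    Returns None when the object is the only one of its category,
    because no spatial word is needed to disambiguate it.
    """
    if len(peers) < 2:
        return None
    ordered = sorted(peers, key=center_x)
    if det is ordered[0]:
        return "left"
    if det is ordered[-1]:
        return "right"
    return "middle"


def generate_pseudo_queries(detections: List[Detection]) -> List[Tuple[str, Box]]:
    """Build (pseudo-query, box) training pairs without human annotation."""
    pairs = []
    for det in detections:
        peers = [d for d in detections if d.category == det.category]
        words = [spatial_word(det, peers), det.attribute, det.category]
        pairs.append((" ".join(w for w in words if w), det.box))
    return pairs


# Hypothetical prompt wording; the source does not state the actual template.
PROMPT_TEMPLATE = "which region does the text '{query}' describe?"


def apply_prompt(query: str) -> str:
    """Wrap a pseudo-query in a grounding-specific prompt."""
    return PROMPT_TEMPLATE.format(query=query)


detections = [
    Detection("dog", (10, 20, 110, 220), attribute="brown"),
    Detection("dog", (300, 40, 420, 230)),
    Detection("frisbee", (150, 60, 200, 100), attribute="red"),
]
for query, box in generate_pseudo_queries(detections):
    print(apply_prompt(query), box)
```

The resulting (query, box) pairs play the role that manually labeled image-query pairs would play in supervised training; the grounding model itself is out of scope for this sketch.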

Related papers
- CTRL-O: Language-Controllable Object-Centric Visual Representation Learning [30.218743514199016]
 Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files".
Current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented.
We propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions.
 arXiv  Detail & Related papers  (2025-03-27T17:53:50Z)
- ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding [42.10086029931937]
 Visual grounding aims to localize the object referred to in an image based on a natural language query.
Existing methods demonstrate a significant performance drop when there are multiple distractions in an image.
We propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue.
 arXiv  Detail & Related papers  (2024-08-29T07:32:01Z)
- LanGWM: Language Grounded World Model [24.86620763902546]
 We focus on learning language-grounded visual features to enhance the world model learning.
Our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction.
 arXiv  Detail & Related papers  (2023-11-29T12:41:55Z)
- DesCo: Learning Object Recognition with Rich Language Descriptions [93.8177229428617]
 Recent development in vision-language approaches has instigated a paradigm shift in learning visual recognition models from language supervision.
We propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions.
 arXiv  Detail & Related papers  (2023-06-24T21:05:02Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
 We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
 arXiv  Detail & Related papers  (2022-11-27T14:47:31Z)
- Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
 We propose a unified framework that learns different tasks in a single architecture.
Our models learn to generate labels in text based on the visual and textual inputs.
Our generative approach shows better generalization ability on answering questions that have rare answers.
 arXiv  Detail & Related papers  (2021-02-04T17:59:30Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
 We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
 arXiv  Detail & Related papers  (2020-10-14T02:11:51Z)
- Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
 We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
 arXiv  Detail & Related papers  (2020-05-04T17:09:15Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
 We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
 arXiv  Detail & Related papers  (2020-05-01T21:28:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.