Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding
- URL: http://arxiv.org/abs/2108.00205v1
- Date: Sat, 31 Jul 2021 10:20:15 GMT
- Title: Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding
- Authors: Heng Zhao, Joey Tianyi Zhou and Yew-Soon Ong
- Abstract summary: We propose Word2Pix, a one-stage visual grounding network based on an encoder-decoder transformer architecture.
Each word embedding from the query sentence is treated alike, attending to visual pixels individually.
The proposed Word2Pix outperforms existing one-stage methods by a notable margin.
- Score: 59.8167502322261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current one-stage methods for visual grounding encode the language query as
one holistic sentence embedding before fusing it with visual features. Such a
formulation does not treat each word of the query sentence on a par when modeling
language-to-vision attention, and is therefore prone to neglecting words that are less
important for the sentence embedding but critical for visual grounding. In this
paper we propose Word2Pix: a one-stage visual grounding network based on an
encoder-decoder transformer architecture that learns textual-to-visual feature
correspondence via word-to-pixel attention. Each word embedding from the query
sentence is treated alike, attending to visual pixels individually rather than
being pooled into a single holistic sentence embedding. In this way, each word is
given an equal opportunity to steer the language-to-vision attention towards the
referent target through multiple stacks of transformer decoder layers. We conduct
experiments on the RefCOCO, RefCOCO+ and RefCOCOg datasets, and the proposed
Word2Pix outperforms existing one-stage methods by a notable margin. The results
also show that Word2Pix surpasses two-stage visual grounding models while keeping
the merits of the one-stage paradigm, namely end-to-end training and real-time
inference speed, intact.
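To make the word-to-pixel attention concrete, the sketch below shows one decoder layer in which every word embedding individually queries the flattened pixel features, rather than a pooled sentence vector. This is a minimal PyTorch illustration based only on the abstract, not the authors' released code; the module names and dimensions are assumptions.

```python
# Hedged sketch (not the authors' implementation): word-to-pixel cross attention,
# where each word embedding queries pixel-level visual features separately
# instead of a single holistic sentence embedding. Dimensions are illustrative.
import torch
import torch.nn as nn

class WordToPixelDecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, words, pixels):
        # words:  (B, L, d)  -- one embedding per query word, kept separate
        # pixels: (B, HW, d) -- flattened visual feature map from the encoder
        w, _ = self.self_attn(words, words, words)
        words = self.norm1(words + w)
        w, attn = self.cross_attn(words, pixels, pixels)  # word-to-pixel attention
        words = self.norm2(words + w)
        words = self.norm3(words + self.ffn(words))
        return words, attn  # attn: (B, L, HW) per-word attention over pixels

if __name__ == "__main__":
    layer = WordToPixelDecoderLayer()
    words = torch.randn(2, 10, 256)    # a 10-word referring expression
    pixels = torch.randn(2, 400, 256)  # a 20x20 visual feature map, flattened
    out, attn = layer(words, pixels)
    print(out.shape, attn.shape)  # torch.Size([2, 10, 256]) torch.Size([2, 10, 400])
```

Stacking several such layers gives each word repeated opportunities to adjust where the model looks, which is the key difference from fusing a single sentence embedding with the visual features.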
Related papers
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts [48.28000728061778]
We propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene.
Our methodology achieves up to a 30% reduction in the goal object distance metric compared to the prior state-of-the-art baseline model.
arXiv Detail & Related papers (2024-04-08T18:24:12Z)
- VideoDistill: Language-aware Vision Distillation for Video Question Answering [24.675876324457747]
We propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both vision perception and answer generation process.
VideoDistill generates answers only from question-related visual embeddings.
We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-04-01T07:44:24Z)
- Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects [11.117055725415446]
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios.
The absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors.
We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z)
- CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes [93.71909293023663]
Cross-modality Aligned Prototypes (CAPro) is a unified contrastive learning framework to learn visual representations with correct semantics.
CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition.
arXiv Detail & Related papers (2023-10-15T07:20:22Z)
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- I2C2W: Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition [68.95544645458882]
This paper presents I2C2W, a novel scene text recognizer that is accurate and tolerant to various noises in scenes.
I2C2W consists of an image-to-character module (I2C) and a character-to-word module (C2W) which are complementary and can be trained end-to-end.
arXiv Detail & Related papers (2021-05-18T09:20:58Z)
- Learning word-referent mappings and concepts from raw inputs [18.681222155879656]
We present a neural network model trained from scratch via self-supervision that takes in raw images and words as inputs.
The model generalizes to novel word instances, locates referents of words in a scene, and shows a preference for mutual exclusivity.
arXiv Detail & Related papers (2020-03-12T02:18:19Z)