VLT: Vision-Language Transformer and Query Generation for Referring
Segmentation
- URL: http://arxiv.org/abs/2210.15871v1
- Date: Fri, 28 Oct 2022 03:36:07 GMT
- Title: VLT: Vision-Language Transformer and Query Generation for Referring
Segmentation
- Authors: Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang
- Abstract summary: We propose a framework for referring segmentation to facilitate deep interactions among multi-modal information.
We introduce masked contrastive learning to narrow down the features of different expressions for the same target object.
The proposed approach is lightweight and achieves new state-of-the-art referring segmentation results consistently on five datasets.
- Score: 31.051579752237746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a Vision-Language Transformer (VLT) framework for referring
segmentation to facilitate deep interactions among multi-modal information and
enhance the holistic understanding of vision-language features. There are
different ways to understand the dynamic emphasis of a language expression,
especially when interacting with the image. However, the learned queries in
existing transformer works are fixed after training, which cannot cope with the
randomness and huge diversity of the language expressions. To address this
issue, we propose a Query Generation Module, which dynamically produces
multiple sets of input-specific queries to represent the diverse comprehensions
of language expression. To find the best among these diverse comprehensions, so
as to generate a better mask, we propose a Query Balance Module to selectively
fuse the corresponding responses of the set of queries. Furthermore, to enhance
the model's ability in dealing with diverse language expressions, we consider
inter-sample learning to explicitly endow the model with knowledge of
understanding different language expressions that refer to the same object. We introduce
masked contrastive learning to narrow down the features of different
expressions for the same target object while distinguishing the features of
different objects. The proposed approach is lightweight and achieves new
state-of-the-art referring segmentation results consistently on five datasets.
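The abstract does not include code, but the described pipeline can be sketched roughly as follows. This is a minimal PyTorch sketch, not the authors' implementation: the module names mirror the abstract (Query Generation Module, Query Balance Module), while all tensor shapes, layer choices, and hyper-parameters (e.g. dim=256, num_queries=16) are illustrative assumptions.

```python
# Minimal sketch of the Query Generation / Query Balance idea (assumptions
# throughout; not the authors' code).
import torch
import torch.nn as nn


class QueryGenerationModule(nn.Module):
    """Produce multiple input-specific queries by letting language
    features attend over vision features (assumed formulation)."""

    def __init__(self, dim=256, num_queries=16):
        super().__init__()
        self.num_queries = num_queries
        self.lang_proj = nn.Linear(dim, dim)
        self.vis_proj = nn.Linear(dim, dim)
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, lang_feat, vis_feat):
        # lang_feat: (B, L, C) word features; vis_feat: (B, HW, C) pixel features
        attn = torch.einsum('blc,bnc->bln',
                            self.lang_proj(lang_feat),
                            self.vis_proj(vis_feat)) / lang_feat.size(-1) ** 0.5
        attn = attn.softmax(dim=-1)                       # language attends to vision
        lang_aware = torch.einsum('bln,bnc->blc', attn, vis_feat)
        # Take the first num_queries vision-aware word positions as queries
        # (assumption; requires L >= num_queries, e.g. padded word features).
        queries = self.query_proj(lang_aware[:, :self.num_queries])
        return queries                                    # (B, Nq, C)


class QueryBalanceModule(nn.Module):
    """Selectively fuse the decoder responses of the query set with
    learned confidence weights (assumed formulation)."""

    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, queries, responses):
        # queries, responses: (B, Nq, C); responses come from a transformer decoder
        weights = self.score(queries).softmax(dim=1)      # (B, Nq, 1)
        return (weights * responses).sum(dim=1)           # (B, C) fused mask embedding
```

The inter-sample masked contrastive learning can likewise be sketched as a supervised contrastive objective over expression features, where different expressions referring to the same object are treated as positives and expressions of different objects as negatives; the masking strategy and exact loss form below are assumptions.

```python
# Sketch of an inter-sample contrastive objective (assumed loss form).
import torch
import torch.nn.functional as F


def masked_contrastive_loss(feats, object_ids, temperature=0.1):
    """feats: (N, C), one feature per expression; object_ids: (N,) target-object labels.
    Anchors without any positive (no other expression of the same object) are skipped."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature                 # (N, N) similarities
    same = object_ids.unsqueeze(0) == object_ids.unsqueeze(1)
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos = same & ~eye                                     # other expressions of the same object
    logits = sim.masked_fill(eye, float('-inf'))          # mask self-similarity
    log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
    pos_log_prob = torch.where(pos, log_prob, torch.zeros_like(log_prob))
    has_pos = pos.any(dim=1)
    loss = -pos_log_prob[has_pos].sum(dim=1) / pos[has_pos].sum(dim=1)
    return loss.mean()
```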
Related papers
- Enhancing Multimodal Query Representation via Visual Dialogues for End-to-End Knowledge Retrieval [26.585985828583304]
We propose an end-to-end multimodal retrieval system, Ret-XKnow, to endow a text retriever with the ability to understand multimodal queries.
To effectively learn multimodal interaction, we also introduce the Visual Dialogue-to-Retrieval dataset automatically constructed from visual dialogue datasets.
We demonstrate that our approach not only significantly improves retrieval performance in zero-shot settings but also achieves substantial improvements in fine-tuning scenarios.
arXiv Detail & Related papers (2024-11-13T04:32:58Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - Morphosyntactic probing of multilingual BERT models [41.83131308999425]
We introduce an extensive dataset for multilingual probing of morphological information in language models.
We find that pre-trained Transformer models (mBERT and XLM-RoBERTa) learn features that attain strong performance across these tasks.
arXiv Detail & Related papers (2023-06-09T19:15:20Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z) - Vision-Language Transformer and Query Generation for Referring
Segmentation [39.01244764840372]
We reformulate referring segmentation as a direct attention problem.
We build a network with an encoder-decoder attention mechanism architecture that "queries" the given image with the language expression.
Our approach is light-weight and achieves new state-of-the-art performance consistently on three referring segmentation datasets.
arXiv Detail & Related papers (2021-08-12T07:24:35Z) - Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z) - VECO: Variable and Flexible Cross-lingual Pre-training for Language
Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.