VLT: Vision-Language Transformer and Query Generation for Referring
Segmentation
- URL: http://arxiv.org/abs/2210.15871v1
- Date: Fri, 28 Oct 2022 03:36:07 GMT
- Title: VLT: Vision-Language Transformer and Query Generation for Referring
Segmentation
- Authors: Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang
- Abstract summary: We propose a framework for referring segmentation to facilitate deep interactions among multi-modal information.
We introduce masked contrastive learning to narrow down the features of different expressions for the same target object.
The proposed approach is lightweight and achieves new state-of-the-art referring segmentation results consistently on five datasets.
- Score: 31.051579752237746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a Vision-Language Transformer (VLT) framework for referring
segmentation to facilitate deep interactions among multi-modal information and
enhance the holistic understanding of vision-language features. There are
different ways to understand the dynamic emphasis of a language expression,
especially when interacting with the image. However, the learned queries in
existing transformer works are fixed after training, which cannot cope with the
randomness and huge diversity of the language expressions. To address this
issue, we propose a Query Generation Module, which dynamically produces
multiple sets of input-specific queries to represent the diverse comprehensions
of language expression. To find the best among these diverse comprehensions, so
as to generate a better mask, we propose a Query Balance Module to selectively
fuse the corresponding responses of the set of queries. Furthermore, to enhance
the model's ability in dealing with diverse language expressions, we consider
inter-sample learning to explicitly endow the model with knowledge of
understanding different language expressions that refer to the same object. We introduce
masked contrastive learning to narrow down the features of different
expressions for the same target object while distinguishing the features of
different objects. The proposed approach is lightweight and achieves new
state-of-the-art referring segmentation results consistently on five datasets.
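The abstract does not include code, but the described pipeline can be sketched roughly as follows. This is a minimal PyTorch sketch, not the authors' implementation: the module names mirror the abstract (Query Generation Module, Query Balance Module), while all tensor shapes, layer choices, and hyper-parameters (e.g. dim=256, num_queries=16) are illustrative assumptions.

```python
# Minimal sketch of the Query Generation / Query Balance idea (assumptions
# throughout; not the authors' code).
import torch
import torch.nn as nn


class QueryGenerationModule(nn.Module):
    """Produce multiple input-specific queries by letting language
    features attend over vision features (assumed formulation)."""

    def __init__(self, dim=256, num_queries=16):
        super().__init__()
        self.num_queries = num_queries
        self.lang_proj = nn.Linear(dim, dim)
        self.vis_proj = nn.Linear(dim, dim)
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, lang_feat, vis_feat):
        # lang_feat: (B, L, C) word features; vis_feat: (B, HW, C) pixel features
        attn = torch.einsum('blc,bnc->bln',
                            self.lang_proj(lang_feat),
                            self.vis_proj(vis_feat)) / lang_feat.size(-1) ** 0.5
        attn = attn.softmax(dim=-1)                       # language attends to vision
        lang_aware = torch.einsum('bln,bnc->blc', attn, vis_feat)
        # Take the first num_queries vision-aware word positions as queries
        # (assumption; requires L >= num_queries, e.g. padded word features).
        queries = self.query_proj(lang_aware[:, :self.num_queries])
        return queries                                    # (B, Nq, C)


class QueryBalanceModule(nn.Module):
    """Selectively fuse the decoder responses of the query set with
    learned confidence weights (assumed formulation)."""

    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, queries, responses):
        # queries, responses: (B, Nq, C); responses come from a transformer decoder
        weights = self.score(queries).softmax(dim=1)      # (B, Nq, 1)
        return (weights * responses).sum(dim=1)           # (B, C) fused mask embedding
```

The inter-sample masked contrastive learning can likewise be sketched as a supervised contrastive objective over expression features, where different expressions referring to the same object are treated as positives and expressions of different objects as negatives; the masking strategy and exact loss form below are assumptions.

```python
# Sketch of an inter-sample contrastive objective (assumed loss form).
import torch
import torch.nn.functional as F


def masked_contrastive_loss(feats, object_ids, temperature=0.1):
    """feats: (N, C), one feature per expression; object_ids: (N,) target-object labels.
    Anchors without any positive (no other expression of the same object) are skipped."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature                 # (N, N) similarities
    same = object_ids.unsqueeze(0) == object_ids.unsqueeze(1)
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos = same & ~eye                                     # other expressions of the same object
    logits = sim.masked_fill(eye, float('-inf'))          # mask self-similarity
    log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
    pos_log_prob = torch.where(pos, log_prob, torch.zeros_like(log_prob))
    has_pos = pos.any(dim=1)
    loss = -pos_log_prob[has_pos].sum(dim=1) / pos[has_pos].sum(dim=1)
    return loss.mean()
```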
Related papers
- Enhancing Multimodal Query Representation via Visual Dialogues for End-to-End Knowledge Retrieval [26.585985828583304]
We propose an end-to-end multimodal retrieval system, Ret-XKnow, to endow a text retriever with the ability to understand multimodal queries.
To effectively learn multimodal interaction, we also introduce the Visual Dialogue-to-Retrieval dataset automatically constructed from visual dialogue datasets.
We demonstrate that our approach not only significantly improves retrieval performance in zero-shot settings but also achieves substantial improvements in fine-tuning scenarios.
arXiv Detail & Related papers (2024-11-13T04:32:58Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - Morphosyntactic probing of multilingual BERT models [41.83131308999425]
We introduce an extensive dataset for multilingual probing of morphological information in language models.
We find that pre-trained Transformer models (mBERT and XLM-RoBERTa) learn features that attain strong performance across these tasks.
arXiv Detail & Related papers (2023-06-09T19:15:20Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z) - Vision-Language Transformer and Query Generation for Referring
Segmentation [39.01244764840372]
We reformulate referring segmentation as a direct attention problem.
We build a network with an encoder-decoder attention mechanism architecture that "queries" the given image with the language expression.
Our approach is light-weight and achieves new state-of-the-art performance consistently on three referring segmentation datasets.
arXiv Detail & Related papers (2021-08-12T07:24:35Z) - Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z) - VECO: Variable and Flexible Cross-lingual Pre-training for Language
Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.