LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
- URL: http://arxiv.org/abs/2112.02244v1
- Date: Sat, 4 Dec 2021 04:53:35 GMT
- Title: LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
- Authors: Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H.S. Torr
- Abstract summary: We show that better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in a vision Transformer encoder network.
Our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
- Score: 80.54244087314025
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Referring image segmentation is a fundamental vision-language task that aims
to segment out an object referred to by a natural language expression from an
image. One of the key challenges behind this task is leveraging the referring
expression for highlighting relevant positions in the image. A paradigm for
tackling this problem is to leverage a powerful vision-language ("cross-modal")
decoder to fuse features independently extracted from a vision encoder and a
language encoder. Recent methods have made remarkable advancements in this
paradigm by exploiting Transformers as cross-modal decoders, concurrent to the
Transformer's overwhelming success in many other vision-language tasks.
Adopting a different approach in this work, we show that significantly better
cross-modal alignments can be achieved through the early fusion of linguistic
and visual features in intermediate layers of a vision Transformer encoder
network. By conducting cross-modal feature fusion in the visual feature
encoding stage, we can leverage the well-proven correlation modeling power of a
Transformer encoder for excavating helpful multi-modal context. This way,
accurate segmentation results are readily harvested with a light-weight mask
predictor. Without bells and whistles, our method surpasses the previous
state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
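To make the early-fusion idea concrete, the following is a minimal PyTorch sketch of injecting word features into the intermediate stages of a vision Transformer encoder through gated cross-attention, followed by a light-weight mask head. The module names, gating form, and shapes are illustrative assumptions, not the paper's exact pixel-word fusion design.

```python
# Minimal sketch of stage-wise language-vision fusion in a vision Transformer
# encoder (illustrative only; the paper's actual fusion module differs in detail).
import torch
import torch.nn as nn

class LanguageVisionFusion(nn.Module):
    """Cross-attention from visual tokens (queries) to word features (keys/values),
    added back through a learnable gate."""
    def __init__(self, vis_dim, lang_dim, num_heads=8):
        super().__init__()
        self.proj_lang = nn.Linear(lang_dim, vis_dim)
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(vis_dim, vis_dim), nn.Tanh())

    def forward(self, vis_tokens, word_feats):
        # vis_tokens: (B, N, vis_dim); word_feats: (B, L, lang_dim)
        lang = self.proj_lang(word_feats)
        fused, _ = self.attn(vis_tokens, lang, lang)    # pixels attend to words
        return vis_tokens + self.gate(fused) * fused    # gated residual fusion

class EarlyFusionEncoder(nn.Module):
    """Interleaves single-modal visual stages with language fusion and ends with a
    light-weight mask head (stage modules and dimensions are placeholders)."""
    def __init__(self, stages, stage_dims, lang_dim):
        super().__init__()
        self.stages = nn.ModuleList(stages)   # e.g. hierarchical Transformer stages
        self.fusions = nn.ModuleList(
            LanguageVisionFusion(d, lang_dim) for d in stage_dims)
        self.mask_head = nn.Conv2d(stage_dims[-1], 1, kernel_size=1)

    def forward(self, vis_tokens, word_feats, out_hw):
        for stage, fuse in zip(self.stages, self.fusions):
            vis_tokens = stage(vis_tokens)              # visual feature encoding
            vis_tokens = fuse(vis_tokens, word_feats)   # early cross-modal fusion
        b, n, c = vis_tokens.shape
        feat = vis_tokens.transpose(1, 2).reshape(b, c, *out_hw)
        return self.mask_head(feat)                     # (B, 1, H, W) mask logits
```

Each stage in this sketch is assumed to keep tokens in a (batch, tokens, channels) layout; a hierarchical backbone that changes resolution between stages would need per-stage dimensions, as reflected by the hypothetical stage_dims argument.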
Related papers
- Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation [15.676384275867965]
Referring segmentation aims to segment a target object related to a natural language expression.
Recent models have focused on the early fusion with the language features at the intermediate stage of the vision encoder.
This paper proposes a novel architecture, Cross-aware early fusion with stage-divided Vision and Language Transformer encoders.
arXiv Detail & Related papers (2024-08-14T13:17:41Z)
- An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding [17.855998090452058]
We propose an efficient and effective multi-task visual grounding framework based on a Transformer decoder.
On the language side, we employ the Transformer decoder to fuse visual and linguistic features, with linguistic features input as memory and visual features as queries (a sketch of this query-memory arrangement appears after the list).
On the visual side, we introduce a parameter-free approach that reduces computation by eliminating background visual tokens based on their attention scores.
arXiv Detail & Related papers (2024-08-02T09:01:05Z)
- MDS-ViTNet: Improving saliency prediction for Eye-Tracking with Vision Transformer [0.0]
We present MDS-ViTNet (Multi Decoder Saliency by Vision Transformer Network) for enhancing visual saliency prediction or eye-tracking.
This approach holds significant potential for diverse fields, including marketing, medicine, robotics, and retail.
Our trained model achieves state-of-the-art results across several benchmarks.
arXiv Detail & Related papers (2024-05-29T20:28:04Z)
- Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation [8.383431263616105]
We introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles.
Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information.
We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they understand the context of the input sentence.
arXiv Detail & Related papers (2024-05-18T07:21:12Z)
- Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z)
- Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes them with projected and aggregated inter-modal features (a sketch of this substitution appears after the list).
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact.
arXiv Detail & Related papers (2022-04-19T07:47:50Z)
- Fully Transformer Networks for Semantic Image Segmentation [26.037770622551882]
We explore a novel framework for semantic image segmentation: encoder-decoder based Fully Transformer Networks (FTN).
We propose a Pyramid Group Transformer (PGT) as the encoder for progressively learning hierarchical features, while reducing the computation complexity of the standard visual transformer (ViT).
Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation.
arXiv Detail & Related papers (2021-06-08T05:15:28Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments conducted on the MS-COCO dataset demonstrate the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
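For the Transformer-decoder-based grounding entry above, the query-memory arrangement it describes can be illustrated with the standard PyTorch decoder; the dimensions and layer counts below are arbitrary placeholders, not the paper's configuration.

```python
# Illustrative only: fuse modalities with a stock Transformer decoder,
# feeding linguistic features as memory and visual tokens as queries.
import torch
import torch.nn as nn

layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=3)

vis_tokens = torch.randn(2, 400, 256)   # (batch, visual tokens, channels) -> queries
lang_feats = torch.randn(2, 20, 256)    # (batch, word tokens, channels)   -> memory
fused = decoder(tgt=vis_tokens, memory=lang_feats)   # visual queries cross-attend to language
print(fused.shape)   # torch.Size([2, 400, 256])
```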
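Similarly, the token substitution described in the Multimodal Token Fusion entry can be sketched as below; the scoring network, threshold, and the assumption of aligned token sequences are simplifications of the method as summarized, not the official implementation.

```python
# Sketch of substituting uninformative tokens with projected inter-modal features.
import torch
import torch.nn as nn

class TokenSubstitution(nn.Module):
    def __init__(self, dim, threshold=0.02):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # informativeness score
        self.proj = nn.Linear(dim, dim)      # project the other modality's tokens
        self.threshold = threshold

    def forward(self, tokens_a, tokens_b):
        # tokens_a, tokens_b: spatially aligned token sequences, each (B, N, dim)
        keep = (self.score(tokens_a) > self.threshold).float()   # (B, N, 1) keep mask
        return keep * tokens_a + (1 - keep) * self.proj(tokens_b)

# Example: replace low-scoring tokens of modality A with projected modality-B features.
fuse = TokenSubstitution(dim=256)
out = fuse(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```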
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.