A Simple Baseline with Single-encoder for Referring Image Segmentation
- URL: http://arxiv.org/abs/2408.15521v1
- Date: Wed, 28 Aug 2024 04:14:01 GMT
- Title: A Simple Baseline with Single-encoder for Referring Image Segmentation
- Authors: Seonghoon Yu, Ilchae Jung, Byeongju Han, Taeoh Kim, Yunho Kim, Dongyoon Wee, Jeany Son,
- Abstract summary: We present a novel RIS method with a single-encoder, i.e., BEiT-3, maximizing the potential of shared self-attention.
Our simple baseline with a single encoder achieves outstanding performances on the RIS benchmark datasets.
- Score: 14.461024566536478
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Referring image segmentation (RIS) requires dense vision-language interactions between visual pixels and textual words to segment objects based on a given description. However, commonly adapted dual-encoders in RIS, e.g., Swin transformer and BERT (uni-modal encoders) or CLIP (a multi-modal dual-encoder), lack dense multi-modal interactions during pre-training, leading to a gap with a pixel-level RIS task. To bridge this gap, existing RIS methods often rely on multi-modal fusion modules that interact two encoders, but this approach leads to high computational costs. In this paper, we present a novel RIS method with a single-encoder, i.e., BEiT-3, maximizing the potential of shared self-attention across all framework components. This enables seamless interactions of two modalities from input to final prediction, producing granularly aligned multi-modal features. Furthermore, we propose lightweight yet effective decoder modules, a Shared FPN and a Shared Mask Decoder, which contribute to the high efficiency of our model. Our simple baseline with a single encoder achieves outstanding performances on the RIS benchmark datasets while maintaining computational efficiency, compared to the most recent SoTA methods based on dual-encoders.
Related papers
- MacFormer: Semantic Segmentation with Fine Object Boundaries [38.430631361558426]
We introduce a new semantic segmentation architecture, MacFormer'', which features two key components.
Firstly, using learnable agent tokens, a Mutual Agent Cross-Attention (MACA) mechanism effectively facilitates the bidirectional integration of features across encoder and decoder layers.
Secondly, a Frequency Enhancement Module (FEM) in the decoder leverages high-frequency and low-frequency components to boost features in the frequency domain.
MacFormer is demonstrated to be compatible with various network architectures and outperforms existing methods in both accuracy and efficiency on datasets benchmark ADE20K and Cityscapes.
arXiv Detail & Related papers (2024-08-11T05:36:10Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - Triple-View Knowledge Distillation for Semi-Supervised Semantic
Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z) - Multi-interactive Feature Learning and a Full-time Multi-modality
Benchmark for Image Fusion and Segmentation [66.15246197473897]
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation.
We propose a textbfMulti-textbfinteractive textbfFeature learning architecture for image fusion and textbfSegmentation.
arXiv Detail & Related papers (2023-08-04T01:03:58Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D
Object Detection [26.03582038710992]
Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities.
In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world.
We propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects.
arXiv Detail & Related papers (2023-03-14T17:58:03Z) - Multi-Attention Network for Compressed Video Referring Object
Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require compressed video bitstream to be decoded to RGB frames before being segmented.
This may hamper its application in real-world computing resource limited scenarios, such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z) - LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text
Retrieval [117.15862403330121]
We propose LoopITR, which combines dual encoders and cross encoders in the same network for joint learning.
Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder.
arXiv Detail & Related papers (2022-03-10T16:41:12Z) - Attention-based Dual Supervised Decoder for RGBD Semantic Segmentation [16.721758280029302]
We propose a novel attention-based dual supervised decoder for RGBD semantic segmentation.
In the encoder, we design a simple yet effective attention-based multimodal fusion module to extract and fuse deeply multi-level paired complementary information.
Our method achieves superior performance against the state-of-the-art methods.
arXiv Detail & Related papers (2022-01-05T03:12:27Z) - MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image
Segmentation [13.311777431243296]
MaIL is a more concise encoder-decoder pipeline with a Mask-Image-Language trimodal encoder.
MaIL unifies uni-modal feature extractors and their fusion model into a deep modality interaction encoder.
For the first time, we propose to introduce instance masks as an additional modality, which explicitly intensifies instance-level features.
arXiv Detail & Related papers (2021-11-21T05:54:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.