Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation
- URL: http://arxiv.org/abs/2408.07539v1
- Date: Wed, 14 Aug 2024 13:17:41 GMT
- Title: Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation
- Authors: Yubin Cho, Hyunwoo Yu, Suk-ju Kang,
- Abstract summary: Referring segmentation aims to segment a target object related to a natural language expression.
Recent models have focused on the early fusion with the language features at the intermediate stage of the vision encoder.
This paper proposes a novel architecture, Cross-aware early fusion with stage-divided Vision and Language Transformer encoders.
- Score: 15.676384275867965
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Referring segmentation aims to segment a target object related to a natural language expression. Key challenges of this task are understanding the meaning of complex and ambiguous language expressions and determining the relevant regions in the image with multiple objects by referring to the expression. Recent models have focused on the early fusion with the language features at the intermediate stage of the vision encoder, but these approaches have a limitation that the language features cannot refer to the visual information. To address this issue, this paper proposes a novel architecture, Cross-aware early fusion with stage-divided Vision and Language Transformer encoders (CrossVLT), which allows both language and vision encoders to perform the early fusion for improving the ability of the cross-modal context modeling. Unlike previous methods, our method enables the vision and language features to refer to each other's information at each stage to mutually enhance the robustness of both encoders. Furthermore, unlike the conventional scheme that relies solely on the high-level features for the cross-modal alignment, we introduce a feature-based alignment scheme that enables the low-level to high-level features of the vision and language encoders to engage in the cross-modal alignment. By aligning the intermediate cross-modal features in all encoder stages, this scheme leads to effective cross-modal fusion. In this way, the proposed approach is simple but effective for referring image segmentation, and it outperforms the previous state-of-the-art methods on three public benchmarks.
Related papers
- Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation [8.383431263616105]
We introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles.
Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information.
We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they understand the context of the input sentence.
arXiv Detail & Related papers (2024-05-18T07:21:12Z) - Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model [3.3772986620114387]
We introduce ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features.
Our method achieves competitive results on the HICO-DET and V-COCO benchmarks with much fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.
arXiv Detail & Related papers (2024-04-19T07:24:32Z) - Synchronizing Vision and Language: Bidirectional Token-Masking
AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE)
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z) - Modeling Motion with Multi-Modal Features for Text-Based Video
Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z) - Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z) - LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [80.54244087314025]
We show that better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in vision Transformer encoder network.
Our method surpasses the previous state-of-the-art methods on RefCOCO, RefCO+, and G-Ref by large margins.
arXiv Detail & Related papers (2021-12-04T04:53:35Z) - Encoder Fusion Network with Co-Attention Embedding for Referring Image
Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.