AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation
- URL: http://arxiv.org/abs/2602.22740v1
- Date: Thu, 26 Feb 2026 08:29:04 GMT
- Title: AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation
- Authors: Tongfei Chen, Shuo Yang, Yuguang Yang, Linlin Yang, Runtang Guo, Changbai Li, He Long, Chunyu Xie, Dawei Leng, Baochang Zhang,
- Abstract summary: Referring Image Segmentation (RIS) aims to segment an object in an image identified by a natural language expression. This paper introduces Alignment-Aware Masked Learning (AML), a training strategy to enhance RIS by explicitly estimating pixel-level vision-language alignment. This approach results in state-of-the-art performance on RefCOCO datasets and also enhances robustness to diverse descriptions and scenarios.
- Score: 28.871630416634883
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Referring Image Segmentation (RIS) aims to segment an object in an image identified by a natural language expression. The paper introduces Alignment-Aware Masked Learning (AML), a training strategy to enhance RIS by explicitly estimating pixel-level vision-language alignment, filtering out poorly aligned regions during optimization, and focusing on trustworthy cues. This approach results in state-of-the-art performance on RefCOCO datasets and also enhances robustness to diverse descriptions and scenarios.
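As a rough illustration of the idea in the abstract, and not the authors' actual implementation, alignment-aware masking can be sketched as: score each pixel's vision-language alignment against the sentence embedding, then drop poorly aligned pixels from the loss so optimization focuses on trustworthy cues. The function name, the cosine-similarity scoring, and the threshold `tau` are all assumptions for this sketch.

```python
import numpy as np

def masked_alignment_loss(pixel_feats, text_feat, target_mask, tau=0.3):
    """Illustrative sketch of alignment-aware masked learning (not the
    paper's exact formulation). Pixels whose vision-language alignment
    score falls below `tau` are excluded from the segmentation loss.

    pixel_feats: (H, W, D) per-pixel visual features
    text_feat:   (D,) sentence embedding
    target_mask: (H, W) binary ground-truth mask
    """
    # Cosine similarity between each pixel feature and the text embedding.
    pf = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + 1e-8)
    tf = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    align = pf @ tf  # (H, W), values in [-1, 1]

    # Keep only "trustworthy" pixels: well-aligned foreground and
    # clearly misaligned background.
    keep = np.where(target_mask > 0.5, align >= tau, align < tau)

    # Per-pixel binary cross-entropy, using the alignment score as a
    # stand-in logit for the real segmentation head.
    prob = 1.0 / (1.0 + np.exp(-align))
    bce = -(target_mask * np.log(prob + 1e-8)
            + (1 - target_mask) * np.log(1 - prob + 1e-8))
    return float(bce[keep].mean()) if keep.any() else 0.0
```

A real system would derive the alignment score from a learned cross-modal head rather than raw cosine similarity, but the filtering step works the same way.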
Related papers
- Text4Seg++: Advancing Image Segmentation via Generative Language Modeling [52.07442359419673]
We propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem. The key innovation is semantic descriptors, a new textual representation of segmentation masks. Experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models.
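One way to picture the text-as-mask paradigm (a hypothetical sketch, not Text4Seg++'s actual semantic descriptors) is to serialize a grid of per-patch class labels into a compact token sequence that a language model could generate:

```python
def mask_to_text(patch_labels):
    """Hypothetical sketch of the 'text-as-mask' idea: serialize a grid
    of per-patch semantic labels into a token sequence with run-length
    compression, so a language model can emit the mask as text."""
    flat = [lbl for row in patch_labels for lbl in row]
    tokens, i = [], 0
    while i < len(flat):
        j = i
        # Extend the run while the label repeats.
        while j < len(flat) and flat[j] == flat[i]:
            j += 1
        tokens.append(f"{flat[i]}*{j - i}")
        i = j
    return " ".join(tokens)
```

The `label*count` encoding here is an assumption chosen for brevity; the point is only that a dense mask can be rewritten as a short text sequence.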
arXiv Detail & Related papers (2025-09-08T04:07:14Z) - DGL-RSIS: Decoupling Global Spatial Context and Local Class Semantics for Training-Free Remote Sensing Image Segmentation [23.33389872430993]
We propose a training-free framework, DGL-RSIS, that decouples visual and textual inputs. The framework performs vision-language alignment at both the local semantic and global contextual levels. By matching the enhanced text features with mask-guided visual features, we enable mask classification.
arXiv Detail & Related papers (2025-08-30T19:45:25Z) - Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model [0.8747606955991707]
We propose a vision-language model that incorporates architectural improvements to enhance segmentation accuracy and cross-modal alignment. SegVLM shows strong generalization across diverse datasets and referring expression scenarios.
arXiv Detail & Related papers (2025-05-25T17:42:53Z) - FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation [63.31007867379312]
Open-vocabulary segmentation aims to identify and segment specific regions and objects based on text-based descriptions. A common solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between vision and text information. In contrast, segmentation tasks require fine-grained pixel-level alignment and detailed category boundary information. We propose FGAseg, a model designed for fine-grained pixel-text alignment and category boundary supplementation.
arXiv Detail & Related papers (2025-01-01T15:47:04Z) - MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation [38.3201448852059]
Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image. We propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning to fully exploit the benefits of the masking strategy.
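The text-masking half of such an augmentation strategy can be sketched in a few lines (a hypothetical helper in the spirit of MaskRIS, not the authors' code): randomly replace a fraction of the words in the referring expression with a mask token before training.

```python
import random

def mask_text(expression, ratio=0.15, mask_token="[MASK]", rng=None):
    """Illustrative text-masking augmentation (hypothetical helper, not
    MaskRIS's implementation): randomly replace a fraction of the words
    in a referring expression with a mask token."""
    rng = rng or random.Random(0)
    words = expression.split()
    # Always mask at least one word so the augmentation is non-trivial.
    n_mask = max(1, int(len(words) * ratio))
    for i in rng.sample(range(len(words)), n_mask):
        words[i] = mask_token
    return " ".join(words)
```

Image masking would analogously blank out random patches; the masking ratio and token are assumptions here, not values from the paper.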
arXiv Detail & Related papers (2024-11-28T11:27:56Z) - Subobject-level Image Tokenization [60.80949852899857]
Patch-based image tokenization ignores the morphology of the visual world. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation. We show that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
arXiv Detail & Related papers (2024-02-22T06:47:44Z) - EAVL: Explicitly Align Vision and Language for Referring Image Segmentation [27.351940191216343]
We introduce a Vision-Language Aligner that aligns features in the segmentation stage using dynamic convolution kernels based on the input image and sentence.
Our method harnesses the potential of the multi-modal features in the segmentation stage and aligns language features of different emphases with image features to achieve fine-grained text-to-pixel correlation.
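A minimal sketch of the dynamic-kernel idea (illustrative only, not EAVL's actual architecture): project the sentence embedding into a 1x1 convolution kernel and correlate it with the visual feature map, so the language directly parameterizes the text-to-pixel response. The projection matrix `proj` is a hypothetical learned parameter.

```python
import numpy as np

def language_conditioned_mask(feat_map, text_feat, proj):
    """Sketch of a language-conditioned dynamic kernel (illustrative,
    not EAVL's implementation): the sentence embedding is projected into
    a 1x1 convolution kernel and correlated with the visual features.

    feat_map:  (H, W, D) visual features
    text_feat: (T,) sentence embedding
    proj:      (T, D) hypothetical learned projection
    """
    kernel = text_feat @ proj             # (D,) dynamic 1x1 kernel
    logits = feat_map @ kernel            # (H, W) text-to-pixel response
    return 1.0 / (1.0 + np.exp(-logits))  # per-pixel probability
```

Because the kernel is recomputed per expression, different descriptions of the same image yield different response maps from the same visual features.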
arXiv Detail & Related papers (2023-08-18T18:59:27Z) - Linguistic Query-Guided Mask Generation for Referring Image Segmentation [10.130530501400079]
Referring image segmentation aims to segment the image region of interest according to the given language expression.
We propose an end-to-end framework built on transformer to perform Linguistic query-Guided mask generation.
arXiv Detail & Related papers (2023-01-16T13:38:22Z) - Weakly-supervised segmentation of referring expressions [81.73850439141374]
Text grounded semantic SEGmentation learns segmentation masks directly from image-level referring expressions without pixel-level annotations.
Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
arXiv Detail & Related papers (2022-05-10T07:52:24Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
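Text-to-pixel contrastive learning can be sketched roughly as follows (an illustrative simplification, not the loss from the CRIS paper): treat ground-truth foreground pixels as positives for the sentence embedding and background pixels as negatives, pushing their similarities apart.

```python
import numpy as np

def text_pixel_contrastive(pixel_feats, text_feat, target_mask, temp=0.07):
    """Rough sketch of text-to-pixel contrastive learning in the spirit
    of CRIS (not the paper's exact loss): foreground pixels are pulled
    toward the sentence embedding, background pixels pushed away.

    pixel_feats: (H, W, D) per-pixel visual features
    text_feat:   (D,) sentence embedding
    target_mask: (H, W) binary ground-truth mask
    """
    pf = pixel_feats.reshape(-1, pixel_feats.shape[-1])
    pf = pf / (np.linalg.norm(pf, axis=1, keepdims=True) + 1e-8)
    tf = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    sim = pf @ tf / temp                       # (H*W,) scaled similarities
    y = target_mask.reshape(-1).astype(float)  # 1 = foreground positive
    prob = 1.0 / (1.0 + np.exp(-sim))
    # Binary cross-entropy: raise similarity for positives, lower it
    # for negatives.
    return float(np.mean(-(y * np.log(prob + 1e-8)
                           + (1 - y) * np.log(1 - prob + 1e-8))))
```

The temperature value and the binary formulation are assumptions; the point is that the loss is lower when well-aligned pixels coincide with the target mask.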
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.