DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy
- URL: http://arxiv.org/abs/2507.01738v1
- Date: Wed, 02 Jul 2025 14:14:35 GMT
- Title: DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy
- Authors: Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Sen Yang, Wenxiao Cai, Yanpeng Sun, Wankou Yang,
- Abstract summary: We propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition.<n>Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models.<n>We introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement.
- Score: 15.729826041347144
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks remains underexplored. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously improving robust image-text comprehension. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, DeRIS demonstrates inherent adaptability to both non- and multi-referents scenarios without requiring specialized architectural modifications, enhancing its general applicability. The codes and models are available at https://github.com/Dmmm1997/DeRIS.
Related papers
- AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation [113.75682363364004]
AURORA is a framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation.<n>AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
arXiv Detail & Related papers (2025-08-04T07:47:38Z) - RIS-LAD: A Benchmark and Model for Referring Low-Altitude Drone Image Segmentation [14.203806360052567]
Referring ImageHide (RIS) aims to segment specific objects based on natural language descriptions.<n>Existing datasets and methods are typically designed for high-altitude and static-view imagery.<n>We present RIS-LAD, the first fine-grained RIS benchmark tailored for Low-Altitude Drone (LAD) scenarios.
arXiv Detail & Related papers (2025-07-28T15:21:03Z) - DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-trained Text-to-Image Diffusion Models [9.109484087832058]
DiffRIS is a novel framework that harnesses the semantic understanding capabilities of pre-trained text-to-image diffusion models for RRSIS tasks.<n>Our framework introduces two key innovations: a context perception adapter (CP-adapter) and a cross-modal reasoning decoder (PCMRD)
arXiv Detail & Related papers (2025-06-23T02:38:56Z) - A Lightweight Deep Exclusion Unfolding Network for Single Image Reflection Removal [68.0573194557999]
Single Image Reflection Removal (SIRR) is a canonical blind source separation problem.<n>We propose a novel Deep Exclusion unfolding Network (DExNet) for SIRR.<n>DExNet is constructed by unfolding and parameterizing a simple iterative Sparse and Auxiliary Feature Update (i-SAFU) algorithm.
arXiv Detail & Related papers (2025-03-03T07:54:27Z) - Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints [15.541287957548771]
We propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture.<n>It integrates implicit and explicit modeling approaches within a two-stage framework.<n>It significantly outperforms state-of-the-art REC and RIS methods by a substantial margin.
arXiv Detail & Related papers (2025-01-12T04:30:13Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression.<n>To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM)<n>To further forster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual
Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z) - Towards to Robust and Generalized Medical Image Segmentation Framework [17.24628770042803]
We propose a novel two-stage framework for robust generalized segmentation.
In particular, an unsupervised Tile-wise AutoEncoder (T-AE) pretraining architecture is coined to learn meaningful representation.
Experiments of lung segmentation on multi chest X-ray datasets are conducted.
arXiv Detail & Related papers (2021-08-09T05:58:49Z) - Learning Domain Invariant Representations for Generalizable Person
Re-Identification [71.35292121563491]
Generalizable person Re-Identification (ReID) has attracted growing attention in recent computer vision community.
We introduce causality into person ReID and propose a novel generalizable framework, named Domain Invariant Representations for generalizable person Re-Identification (DIR-ReID)
arXiv Detail & Related papers (2021-03-29T18:59:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.