Related papers: LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance

URL: http://arxiv.org/abs/2507.06272v2
Date: Mon, 14 Jul 2025 09:49:47 GMT
Title: LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
Authors: Zhang Li, Biao Yang, Qiang Liu, Shuo Zhang, Zhiyin Ma, Shuo Zhang, Liang Yin, Linger Deng, Yabo Sun, Yuliang Liu, Xiang Bai
Abstract summary: Large multi-modal models (LMMs) struggle with inaccurate segmentation and hallucinated comprehension. We propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation. LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
Score: 56.474856189865946
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from constraints in weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic-Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the <seg> token. To quantify this relationship and the model's potential semantic inferring ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.

Related papers

LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation [12.192429756057132]
Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories. LoGoSeg integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context.
arXiv Detail & Related papers (2026-02-05T12:03:11Z)
FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
FineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects. We present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z)
Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
Open-Vocabulary Camouflaged Object Segmentation with Cascaded Vision Language Models [35.947354809849166]
Open-Vocabulary Camouflaged Object Segmentation (OVCOS) seeks to segment and classify camouflaged objects from arbitrary categories. Recent approaches typically adopt a two-stage paradigm: first segmenting objects, then classifying the segmented regions. This paper introduces a novel VLM-guided cascaded framework to address these issues in OVCOS.
arXiv Detail & Related papers (2025-06-24T04:16:41Z)
Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation [12.67400143793047]
We propose a framework named prompt-generated semantic localization guiding Segment Anything Model (PSLG-SAM). PSLG-SAM decomposes the Reference Remote Sensing Image Segmentation (RRSIS) task into two stages: coarse localization and fine segmentation. Notably, the second stage can be training-free, significantly reducing the annotation data burden for the RRSIS task.
arXiv Detail & Related papers (2025-06-12T09:04:07Z)
SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection [16.89965584177711]
Recent open-vocabulary human-object interaction (OV-HOI) detection methods rely on large language models (LLMs) to generate auxiliary descriptions and leverage knowledge distilled from CLIP to detect unseen interaction categories. Despite their effectiveness, these methods face two challenges: (1) feature granularity deficiency, due to reliance on last-layer visual features for text alignment, which neglects crucial object-level details from intermediate layers; (2) semantic similarity confusion, resulting from CLIP's inherent biases toward certain classes, while LLM-generated descriptions based solely on labels fail to adequately capture inter-class similarities.
arXiv Detail & Related papers (2025-03-01T09:26:05Z)
CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models [2.331828779757202]
We present CALICO, the first Large Vision-Language Model (LVLM) designed for multi-image part-level reasoning segmentation. CALICO features two key components: a novel Correspondence Extraction Module that identifies semantic part-level correspondences, and Adaptation Correspondence Modules that embed this information into the LVLM. We show that CALICO, with just 0.3% of its parameters finetuned, achieves strong performance on this challenging task.
arXiv Detail & Related papers (2024-12-26T18:59:37Z)
Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps. We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
arXiv Detail & Related papers (2024-03-02T10:03:21Z)
Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos [63.94040814459116]
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
arXiv Detail & Related papers (2023-08-19T09:12:13Z)
Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn discriminative part features and explore correlations with a feature transformation module. Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on three fine-grained object recognition benchmarks.
arXiv Detail & Related papers (2022-12-28T03:45:56Z)
Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features. Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
Framework-agnostic Semantically-aware Global Reasoning for Segmentation [29.69187816377079]
We propose a component that learns to project image features into latent representations and reason between them. Our design encourages the latent regions to represent semantic concepts by ensuring that the activated regions are spatially disjoint. Our latent tokens are semantically interpretable and diverse and provide a rich set of features that can be transferred to downstream tasks.
arXiv Detail & Related papers (2022-12-06T21:42:05Z)
Progressively Dual Prior Guided Few-shot Semantic Segmentation [57.37506990980975]
The few-shot semantic segmentation task aims to segment query images using only a few annotated support samples. We propose a progressively dual prior guided few-shot semantic segmentation network.
arXiv Detail & Related papers (2022-11-20T16:19:47Z)
Unsupervised segmentation via semantic-apparent feature fusion [21.75371777263847]
This research proposes an unsupervised foreground segmentation method based on semantic-apparent feature fusion (SAFF). Semantic features respond accurately to the key regions of the foreground object, while apparent features provide richer detail. By fusing semantic and apparent features, and by cascading modules for intra-image adaptive feature weight learning and inter-image common feature learning, the research achieves performance that significantly exceeds baselines.
arXiv Detail & Related papers (2020-05-21T08:28:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.