LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension
- URL: http://arxiv.org/abs/2511.12020v1
- Date: Sat, 15 Nov 2025 04:06:57 GMT
- Title: LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension
- Authors: Xianglong Shi, Silin Cheng, Sirui Zhao, Yunhan Jiang, Enhong Chen, Yang Liu, Sebastien Ourselin,
- Abstract summary: Existing Weakly-Supervised Referring Expression Comprehension (WREC) methods are fundamentally limited by a one-to-one mapping assumption. We introduce the Weakly-Supervised Generalized Referring Expression Comprehension task (WGREC), a more practical paradigm that handles expressions with variable numbers of referents. We propose a novel WGREC framework named Linguistic Instance-Split Hyperbolic-Euclidean (LIHE), which operates in two stages.
- Score: 42.52759428579815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing Weakly-Supervised Referring Expression Comprehension (WREC) methods, while effective, are fundamentally limited by a one-to-one mapping assumption, hindering their ability to handle expressions corresponding to zero or multiple targets in realistic scenarios. To bridge this gap, we introduce the Weakly-Supervised Generalized Referring Expression Comprehension task (WGREC), a more practical paradigm that handles expressions with variable numbers of referents. However, extending WREC to WGREC presents two fundamental challenges: supervisory signal ambiguity, where weak image-level supervision is insufficient for training a model to infer the correct number and identity of referents, and semantic representation collapse, where standard Euclidean similarity forces hierarchically-related concepts into non-discriminative clusters, blurring categorical boundaries. To tackle these challenges, we propose a novel WGREC framework named Linguistic Instance-Split Hyperbolic-Euclidean (LIHE), which operates in two stages. The first stage, Referential Decoupling, predicts the number of target objects and decomposes the complex expression into simpler sub-expressions. The second stage, Referent Grounding, then localizes these sub-expressions using HEMix, our innovative hybrid similarity module that synergistically combines the precise alignment capabilities of Euclidean proximity with the hierarchical modeling strengths of hyperbolic geometry. This hybrid approach effectively prevents semantic collapse while preserving fine-grained distinctions between related concepts. Extensive experiments demonstrate LIHE establishes the first effective weakly supervised WGREC baseline on gRefCOCO and Ref-ZOM, while HEMix achieves consistent improvements on standard REC benchmarks, improving IoU@0.5 by up to 2.5%. The code is available at https://anonymous.4open.science/r/LIHE.
Related papers
- ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation [21.87321809019825]
Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions. ResAgent is a novel RES framework integrating Entropy-Based Point Discovery (EBD) and Vision-Based Reasoning (VBR). ResAgent implements a coarse-to
arXiv Detail & Related papers (2026-01-23T01:56:04Z)
- Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion [31.189038928192648]
Co2S is a semi-supervised RS segmentation framework that fuses priors from vision-language models and self-supervised models. An explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries. Experiments on six popular datasets demonstrate the superiority of the proposed method.
arXiv Detail & Related papers (2025-12-28T18:24:19Z)
- HyperbolicRAG: Enhancing Retrieval-Augmented Generation with Hyperbolic Representations [11.678218711095269]
Graph-based RAG enables large language models to access external knowledge. We propose HyperbolicRAG, a retrieval framework that integrates hyperbolic geometry into graph-based RAG.
arXiv Detail & Related papers (2025-11-24T06:27:58Z)
- SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation [58.80001825332851]
Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. Recent methods predominantly focus on simple expressions like "red car" or "left girl".
arXiv Detail & Related papers (2025-10-11T10:50:58Z)
- RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models [25.265113510539546]
Referring Remote Sensing Image Segmentation provides a flexible and fine-grained framework for remote sensing scene analysis. Current approaches utilize a three-stage pipeline encompassing dual-modal encoding, cross-modal interaction, and pixel decoding. We propose RSRefSeg 2, a decoupling paradigm that reformulates the conventional workflow into a collaborative dual-stage framework.
arXiv Detail & Related papers (2025-07-08T17:59:58Z)
- Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a three-step strategy: distributing language features, spatial semantic recurrent co-parsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z)
- LEAF: Unveiling Two Sides of the Same Coin in Semi-supervised Facial Expression Recognition [56.22672276092373]
Semi-supervised learning has emerged as a promising approach to tackle the challenge of label scarcity in facial expression recognition. We propose a unified framework termed hierarchicaL dEcoupling And Fusing (LEAF) to coordinate expression-relevant representations and pseudo-labels for semi-supervised FER.
arXiv Detail & Related papers (2024-04-23T13:43:33Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Entity-enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding [214.8003571700285]
Weakly supervised Referring Expression Grounding (REG) aims to ground a particular target in an image described by a language expression.
We design an entity-enhanced adaptive reconstruction network (EARN).
EARN includes three modules: entity enhancement, adaptive grounding, and collaborative reconstruction.
arXiv Detail & Related papers (2022-07-18T05:30:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.