Referring Video Object Segmentation with Cross-Modality Proxy Queries
- URL: http://arxiv.org/abs/2511.21139v1
- Date: Wed, 26 Nov 2025 07:45:41 GMT
- Title: Referring Video Object Segmentation with Cross-Modality Proxy Queries
- Authors: Baoli Sun, Xinzhu Ma, Ning Wang, Zhihui Wang, Zhiyong Wang,
- Abstract summary: Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. Recent approaches address cross-modality alignment through conditional queries, tracking the target object with a query-response based mechanism. We propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics.
- Score: 23.504655272754587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon a transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of the video feature encoder, ProxyFormer ensures that the video features remain focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of ProxyFormer over state-of-the-art methods.
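The decoupled temporal/spatial proxy-query interaction described in the abstract can be sketched in a few lines of PyTorch. Everything below (module names, tensor shapes, the update order, and the soft read-out that conditions video features on the queries) is an illustrative assumption reconstructed from the abstract alone, not the authors' released implementation.

```python
# Hypothetical sketch of one proxy-query update stage; the real ProxyFormer
# design may differ in structure and detail.
import torch
import torch.nn as nn

class ProxyQueryStage(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, proxy, video, text):
        """proxy: (B, Q, C) queries; video: (B, T, HW, C) frame tokens; text: (B, L, C)."""
        B, T, HW, C = video.shape
        # Inject textual semantics so the queries encode the referred object early.
        proxy = proxy + self.text_attn(proxy, text, text)[0]
        # Temporal interaction: attend over per-frame summaries, modeling
        # inter-frame dependencies at T tokens instead of T*HW.
        frame_tokens = video.mean(dim=2)                                # (B, T, C)
        proxy = proxy + self.temporal_attn(proxy, frame_tokens, frame_tokens)[0]
        # Spatial interaction: attend over within-frame tokens (flattened).
        flat = video.flatten(1, 2)                                      # (B, T*HW, C)
        proxy = proxy + self.spatial_attn(proxy, flat, flat)[0]
        proxy = proxy + self.ffn(proxy)
        # Condition the video features on the updated queries (a simple soft
        # read-out here) so later encoder stages stay on the referred object.
        weights = torch.softmax(flat @ proxy.transpose(1, 2), dim=-1)   # (B, T*HW, Q)
        video = (flat + weights @ proxy).view(B, T, HW, C)
        return proxy, video

# Hypothetical usage: 2 clips, 8 frames of 14x14 tokens, 12-token expressions.
stage = ProxyQueryStage()
proxy = torch.randn(2, 5, 256)
video = torch.randn(2, 8, 14 * 14, 256)
text = torch.randn(2, 12, 256)
proxy, video = stage(proxy, video, text)    # proxy: (2, 5, 256), video: (2, 8, 196, 256)
```

Factoring the interaction this way keeps each stage's cross-attention cost roughly linear in T plus T*HW rather than requiring joint attention over all dimensions at once, which matches the abstract's stated motivation for decoupling the temporal and spatial interactions.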
Related papers
- Mitigating Query Selection Bias in Referring Video Object Segmentation [39.39279952650532]
We propose Triple Query Former (TQF) to factorize the referring query into three specialized components. Instead of relying solely on textual embeddings, our queries are dynamically constructed by integrating both linguistic cues and visual guidance.
arXiv Detail & Related papers (2025-09-17T06:17:23Z)
- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction [65.15449703659772]
Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. We propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.
arXiv Detail & Related papers (2025-07-21T17:59:02Z)
- InterRVOS: Interaction-aware Referring Video Object Segmentation [44.55538737075162]
We introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions. It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. We present InterRVOS-127K, a large-scale dataset with over 127K automatically annotated expressions, including interaction expressions annotated with distinct masks for actor and target objects.
arXiv Detail & Related papers (2025-06-03T01:16:13Z)
- CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos [59.391265901911005]
We propose CoT-RVS, a novel framework employing the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address complex challenges through temporal-semantic reasoning. CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and selects, for each candidate object, a corresponding frame in which it can be observed effortlessly among all frames (temporal). The framework's training-free nature further allows its extension to online video streams, where the CoT is used at test time to update the object of interest when a better target starts to emerge.
arXiv Detail & Related papers (2025-05-24T07:01:31Z)
- Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). To further foster research on RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z)
- Learning Spatial-Semantic Features for Robust Video Object Segmentation [108.045326229865]
We propose a robust video object segmentation framework that learns spatial-semantic features and discriminative object queries. The proposed method achieves state-of-the-art performance on benchmark datasets, including the DAVIS 2017 test (87.8%), YoutubeVOS 2019 (88.1%), MOSE val (74.0%), and LVOS test (73.0%).
arXiv Detail & Related papers (2024-07-10T15:36:00Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Redundancy-aware Transformer for Video Question Answering [71.98116071679065]
We propose a novel transformer-based architecture that aims to model VideoQA in a redundancy-aware manner.
To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes the object-level change in neighboring frames.
As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling, which explicitly differentiates the vision-language interactions.
arXiv Detail & Related papers (2023-08-07T03:16:24Z)
- Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for Referring Video Object Segmentation [44.952526831843386]
We propose a correlation-driven inter-frame interaction Transformer, dubbed BIFIT, to address these issues in RVOS.
Specifically, we design a lightweight plug-and-play inter-frame interaction module in the decoder.
A vision-language interaction is implemented before the Transformer to facilitate the correlation between the visual and linguistic features.
arXiv Detail & Related papers (2023-07-02T10:29:35Z)
- SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z)
- Target Adaptive Context Aggregation for Video Scene Graph Generation [36.669700084337045]
This paper deals with the challenging task of video scene graph generation (VidSGG).
We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking.
arXiv Detail & Related papers (2021-08-18T12:46:28Z)