Mitigating Query Selection Bias in Referring Video Object Segmentation
- URL: http://arxiv.org/abs/2509.13722v1
- Date: Wed, 17 Sep 2025 06:17:23 GMT
- Title: Mitigating Query Selection Bias in Referring Video Object Segmentation
- Authors: Dingwei Zhang, Dong Zhang, Jinhui Tang
- Abstract summary: We propose Triple Query Former (TQF) to factorize the referring query into three specialized components. Instead of relying solely on textual embeddings, our queries are dynamically constructed by integrating both linguistic cues and visual guidance.
- Score: 39.39279952650532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, query-based methods have achieved remarkable performance in Referring Video Object Segmentation (RVOS) by using textual static object queries to drive cross-modal alignment. However, these static queries are easily misled by distractors with similar appearance or motion, resulting in \emph{query selection bias}. To address this issue, we propose Triple Query Former (TQF), which factorizes the referring query into three specialized components: an appearance query for static attributes, an intra-frame interaction query for spatial relations, and an inter-frame motion query for temporal association. Instead of relying solely on textual embeddings, our queries are dynamically constructed by integrating both linguistic cues and visual guidance. Furthermore, we introduce two motion-aware aggregation modules that enhance object token representations: Intra-frame Interaction Aggregation incorporates position-aware interactions among objects within a single frame, while Inter-frame Motion Aggregation leverages trajectory-guided alignment across frames to ensure temporal coherence. Extensive experiments on multiple RVOS benchmarks demonstrate the advantages of TQF and the effectiveness of our structured query design and motion-aware aggregation modules.
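The abstract describes dynamically constructing each of the three queries from a textual embedding plus visual guidance, rather than using a static text-only query. Below is a minimal, dependency-free sketch of that idea; all function names and the residual-mixing scheme are hypothetical illustrations, not the authors' implementation (which presumably uses learned transformer attention).

```python
# Hypothetical sketch of TQF-style dynamic query construction:
# each query = text embedding refined by attending over visual features.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def visual_guided_query(text_emb, visual_feats):
    """Attend the text embedding over visual features and mix the
    attended result back in, so the query reflects both linguistic
    cues and visual guidance instead of text alone."""
    weights = softmax([dot(text_emb, v) for v in visual_feats])
    dim = len(text_emb)
    attended = [sum(w * v[i] for w, v in zip(weights, visual_feats))
                for i in range(dim)]
    # Residual mix of linguistic cue and visual guidance.
    return [t + a for t, a in zip(text_emb, attended)]

def triple_query(text_emb, frame_feats, motion_feats):
    """Factorize the referring query into the three specialized
    components named in the abstract. In the real model each branch
    would use its own parameters and position/trajectory features."""
    appearance  = visual_guided_query(text_emb, frame_feats)   # static attributes
    interaction = visual_guided_query(text_emb, frame_feats)   # spatial relations
    motion      = visual_guided_query(text_emb, motion_feats)  # temporal association
    return appearance, interaction, motion
```

The point of the sketch is only the data flow: distractor robustness comes from each component being conditioned on a different kind of visual evidence (per-frame appearance vs. cross-frame motion), so no single static query has to disambiguate everything.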
Related papers
- Object-Centric Framework for Video Moment Retrieval [15.916994168542345]
Most existing moment retrieval methods rely on temporal sequences of frame-level features that primarily encode global visual and semantic information. In particular, temporal dynamics at the object level have been largely overlooked, limiting existing approaches in scenarios requiring object-level reasoning. Our method first extracts query-relevant objects and then constructs scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we build object-level feature sequences that encode rich visual and semantic information; these sequences are processed by a video tracklet transformer, which models relational-temporal localization among objects over time.
arXiv Detail & Related papers (2025-12-20T17:44:53Z) - Referring Video Object Segmentation with Cross-Modality Proxy Queries [23.504655272754587]
Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism. We propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics.
arXiv Detail & Related papers (2025-11-26T07:45:41Z) - CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos [59.391265901911005]
We propose CoT-RVS, a novel framework employing the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address complex challenges through temporal-semantic reasoning. CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses a corresponding frame for each object in which it can be observed effortlessly among all frames (temporal). Because the framework is training-free, it further extends to online video streams, where CoT is applied at test time to update the object of interest when a better target emerges.
arXiv Detail & Related papers (2025-05-24T07:01:31Z) - Enhanced Partially Relevant Video Retrieval through Inter- and Intra-Sample Analysis with Coherence Prediction [18.24629930062925]
Partially Relevant Video Retrieval aims to retrieve the target video that is partially relevant to a text query. Existing methods coarsely align paired videos and text queries to construct the semantic space. We propose a novel PRVR framework to systematically exploit inter-sample correlation and intra-sample redundancy.
arXiv Detail & Related papers (2025-04-28T09:52:46Z) - QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder. We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
arXiv Detail & Related papers (2025-04-03T18:47:16Z) - Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding [23.022070084937603]
We introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Our method establishes new SOTA performance on the manually annotated benchmark in keyframe selection metrics.
arXiv Detail & Related papers (2025-03-17T13:07:34Z) - Leveraging Inter-Chunk Interactions for Enhanced Retrieval in Large Language Model-Based Question Answering [12.60063463163226]
IIER captures the internal connections between document chunks by considering three types of interactions: structural, keyword, and semantic.
It identifies multiple seed nodes based on the target question and iteratively searches for relevant chunks to gather supporting evidence.
It refines the context and reasoning chain, aiding the large language model in reasoning and answer generation.
arXiv Detail & Related papers (2024-08-06T02:39:55Z) - Learning Spatial-Semantic Features for Robust Video Object Segmentation [108.045326229865]
We propose a robust video object segmentation framework that learns spatial-semantic features and discriminative object queries. The proposed method achieves state-of-the-art performance on benchmark datasets, including DAVIS 2017 test (87.8%), YouTube-VOS 2019 (88.1%), MOSE val (74.0%), and LVOS test (73.0%).
arXiv Detail & Related papers (2024-07-10T15:36:00Z) - Spatial-Temporal Multi-level Association for Video Object Segmentation [89.32226483171047]
This paper proposes spatial-temporal multi-level association, which jointly associates reference frame, test frame, and object features.
Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features.
arXiv Detail & Related papers (2024-04-09T12:44:34Z) - DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries [60.09774333024783]
We introduce Dynamic Anchor Queries (DAQ) to shorten the transition gap between the anchor and target queries.
We also introduce a query-level object Emergence and Disappearance Simulation (EDS) strategy, which unleashes DAQ's potential without any additional cost.
Experiments demonstrate that DVIS-DAQ achieves a new state-of-the-art (SOTA) performance on five mainstream video segmentation benchmarks.
arXiv Detail & Related papers (2024-03-29T17:58:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.