Fugu-MT Paper Translation (Abstract): Mitigating Query Selection Bias in Referring Video Object Segmentation

Paper summary: Mitigating Query Selection Bias in Referring Video Object Segmentation

arxiv url: http://arxiv.org/abs/2509.13722v1
Date: Wed, 17 Sep 2025 06:17:23 GMT
Status: Translation complete
System update date: 2025-09-18 18:41:50.735372
Title: Mitigating Query Selection Bias in Referring Video Object Segmentation
Authors: Dingwei Zhang, Dong Zhang, Jinhui Tang
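Abstract summary: This paper proposes Triple Query Former (TQF), which factorizes the referring query into three specialized components. Instead of relying solely on textual embeddings, the queries are dynamically constructed by integrating both linguistic cues and visual guidance.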
Reference score (independently computed attention level): 39.39279952650532
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, query-based methods have achieved remarkable performance in Referring Video Object Segmentation (RVOS) by using textual static object queries to drive cross-modal alignment. However, these static queries are easily misled by distractors with similar appearance or motion, resulting in "query selection bias". To address this issue, we propose Triple Query Former (TQF), which factorizes the referring query into three specialized components: an appearance query for static attributes, an intra-frame interaction query for spatial relations, and an inter-frame motion query for temporal association. Instead of relying solely on textual embeddings, our queries are dynamically constructed by integrating both linguistic cues and visual guidance. Furthermore, we introduce two motion-aware aggregation modules that enhance object token representations: Intra-frame Interaction Aggregation incorporates position-aware interactions among objects within a single frame, while Inter-frame Motion Aggregation leverages trajectory-guided alignment across frames to ensure temporal coherence. Extensive experiments on multiple RVOS benchmarks demonstrate the advantages of TQF and the effectiveness of our structured query design and motion-aware aggregation modules.
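The abstract states that the three queries are built from both linguistic cues and visual guidance, but it does not say how the two are fused. The following is a minimal illustrative sketch, not the authors' implementation: it assumes each textual cue attends over a set of visual tokens with scaled dot-product attention and adds the attended context back to the cue. The function names (`attend`, `build_queries`), the residual fusion, and the random toy data are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, tokens):
    """Scaled dot-product attention of one query vector over visual tokens."""
    dim = len(query)
    scores = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
        for tok in tokens
    ]
    weights = softmax(scores)
    # Weighted sum of the visual tokens, one output value per feature.
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


def build_queries(text_cues, visual_tokens):
    """Fuse each textual cue with visual context (assumed residual fusion)."""
    queries = {}
    for name, cue in text_cues.items():
        context = attend(cue, visual_tokens)
        queries[name] = [c + v for c, v in zip(cue, context)]
    return queries


# Toy data: random vectors stand in for real text and visual features.
random.seed(0)
dim = 8
visual_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
text_cues = {
    "appearance": [random.gauss(0, 1) for _ in range(dim)],
    "intra_frame_interaction": [random.gauss(0, 1) for _ in range(dim)],
    "inter_frame_motion": [random.gauss(0, 1) for _ in range(dim)],
}
queries = build_queries(text_cues, visual_tokens)
```

In this sketch each of the three queries depends on the visual tokens of the current input, which is the contrast with a static, text-only query that the abstract draws; the actual fusion used by TQF may differ.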

Related papers