Fugu-MT Paper Translation (Abstract): Towards Visual Query Segmentation in the Wild

Paper overview: Towards Visual Query Segmentation in the Wild

arxiv url: http://arxiv.org/abs/2603.08898v1
Date: Mon, 09 Mar 2026 20:09:04 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:23.80806
Title: Towards Visual Query Segmentation in the Wild
Title (reference translation): Towards Visual Query Segmentation in the Wild
Authors: Bing Fan, Minghao Li, Hanzhi Zhang, Shaohua Dong, Naga Prudhvi Mareedu, Weishi Shi, Yunhe Feng, Yan Huang, Heng Fan
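Abstract summary: We introduce visual query segmentation (VQS), a new paradigm of visual query localization (VQL). VQS aims to segment all pixel-level occurrences of an object of interest in an untrimmed video, given an external visual query. We present VQS-4K, a large-scale benchmark dedicated to VQS.
Reference score (independently computed attention score): 23.644748674190026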
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we introduce visual query segmentation (VQS), a new paradigm of visual query localization (VQL) that aims to segment all pixel-level occurrences of an object of interest in an untrimmed video, given an external visual query. Compared to existing VQL locating only the last appearance of a target using bounding boxes, VQS enables more comprehensive (i.e., all object occurrences) and precise (i.e., pixel-level masks) localization, making it more practical for real-world scenarios. To foster research on this task, we present VQS-4K, a large-scale benchmark dedicated to VQS. Specifically, VQS-4K contains 4,111 videos with more than 1.3 million frames and covers a diverse set of 222 object categories. Each video is paired with a visual query defined by a frame outside the search video and its target mask, and annotated with spatial-temporal masklets corresponding to the queried target. To ensure high quality, all videos in VQS-4K are manually labeled with meticulous inspection and iterative refinement. To the best of our knowledge, VQS-4K is the first benchmark specifically designed for VQS. Furthermore, to stimulate future research, we present a simple yet effective method, named VQ-SAM, which extends SAM 2 by leveraging target-specific and background distractor cues from the video to progressively evolve the memory through a novel multi-stage framework with an adaptive memory generation (AMG) module for VQS, significantly improving the performance. In our extensive experiments on VQS-4K, VQ-SAM achieves promising results and surpasses all existing approaches, demonstrating its effectiveness. With the proposed VQS-4K and VQ-SAM, we expect to go beyond the current VQL paradigm and inspire more future research and practical applications on VQS. Our benchmark, code, and results will be made publicly available.
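Abstract (reference translation): In this paper, we introduce visual query segmentation (VQS), a new paradigm of visual query localization (VQL). Whereas existing VQL locates only the last appearance of a target using bounding boxes, VQS enables more comprehensive and more precise (pixel-level mask) localization, making it more practical for real-world scenarios. We present VQS-4K, a large-scale benchmark dedicated to VQS. Specifically, VQS-4K contains 4,111 videos with more than 1.3 million frames and covers 222 object categories. Each video is paired with a visual query defined by a frame outside the search video and its target mask, and is annotated with spatial-temporal masklets corresponding to the queried target. To ensure high quality, all videos in VQS-4K are manually labeled with meticulous inspection and iterative refinement. To the best of our knowledge, VQS-4K is the first benchmark specifically designed for VQS. Furthermore, we propose a simple method named VQ-SAM, which extends SAM 2 by using target-specific and background distractor cues from the video to progressively evolve the memory through a novel multi-stage framework with an adaptive memory generation (AMG) module for VQS. In extensive experiments on VQS-4K, VQ-SAM achieves promising results and surpasses all existing approaches, demonstrating its effectiveness. With the proposed VQS-4K and VQ-SAM, we expect to go beyond the current VQL paradigm and inspire more future research and practical applications on VQS. Our benchmark, code, and results will be made publicly available.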
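Illustrative note: the contrast the abstract draws between VQL (boxes for the last appearance only) and VQS (masks for every occurrence) can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the VQS-4K annotation format or the authors' code: the `VQSSample` container, the nested-list mask representation, and all function names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Mask = List[List[int]]           # binary mask as rows of 0/1 (assumed representation)
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


@dataclass
class VQSSample:
    """Hypothetical container for one annotated search video."""
    video_id: str
    query_mask: Mask  # target mask on a query frame taken from outside the search video
    masklets: Dict[int, Mask] = field(default_factory=dict)  # frame index -> target mask


def occurrences(sample: VQSSample) -> List[Tuple[int, int]]:
    """Group annotated frame indices into contiguous (start, end) runs."""
    runs: List[Tuple[int, int]] = []
    for t in sorted(sample.masklets):
        if runs and t == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], t)  # extend the current run
        else:
            runs.append((t, t))          # start a new run
    return runs


def mask_to_box(mask: Mask) -> Box:
    """Tightest box around the foreground pixels of a non-empty mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    return (min(xs), min(ys), max(xs), max(ys))


def vqs_target(sample: VQSSample) -> Dict[int, Mask]:
    """VQS-style target: a pixel-level mask for every frame where the object appears."""
    return dict(sample.masklets)


def vql_style_target(sample: VQSSample) -> Dict[int, Box]:
    """VQL-style target: bounding boxes for the last appearance only."""
    runs = occurrences(sample)
    if not runs:
        return {}
    start, end = runs[-1]
    return {t: mask_to_box(sample.masklets[t]) for t in range(start, end + 1)}
```

For a video in which the target is visible in frames 2 to 3 and again in frames 7 to 9, `vqs_target` keeps masks for all five frames, while `vql_style_target` keeps boxes for frames 7 to 9 only.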

Related papers