Fugu-MT Paper Translation (Abstract): Show Me When and Where: Towards Referring Video Object Segmentation in the Wild

Paper overview: Show Me When and Where: Towards Referring Video Object Segmentation in the Wild

arxiv url: http://arxiv.org/abs/2603.14300v1
Date: Sun, 15 Mar 2026 09:30:33 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.731205
Title: Show Me When and Where: Towards Referring Video Object Segmentation in the Wild
Title (reference translation): Towards Referring Video Object Segmentation in the Wild
Authors: Mingqi Gao, Jinyu Yang, Jingnan Luo, Xiantong Zhen, Jungong Han, Giovanni Montana, Feng Zheng
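Abstract summary: This work introduces a new setting towards in-the-wild RVOS. The new benchmark challenges RVOS methods to show not only where but also when objects appear in videos. The YoURVOS dataset offers an imperative benchmark, which will push forward the advancement of RVOS methods for practical applications.
Reference score (independently computed attention score): 98.87931411432106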
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Referring video object segmentation (RVOS) has recently gained great popularity in computer vision due to its widespread applications. The existing RVOS setting contains elaborately trimmed videos, with text-referred objects always appearing in all frames, which, however, fails to fully reflect the realistic challenges of this task. This simplified setting requires RVOS methods to predict only where objects appear, with no need to show when they appear. In this work, we introduce a new setting towards in-the-wild RVOS. To this end, we collect a new benchmark dataset using YouTube Untrimmed videos for RVOS - YoURVOS, which contains 1,120 in-the-wild videos with 7 times more duration and scenes than existing datasets. Our new benchmark challenges RVOS methods to show not only where but also when objects appear in videos. To set a baseline, we propose Object-level Multimodal TransFormers (OMFormer) to tackle these challenges; OMFormer is characterized by encoding object-level multimodal interactions for efficient and global spatial-temporal localisation. We demonstrate that previous VOS methods struggle on our YoURVOS benchmark, especially as the number of target-absent frames increases, while our OMFormer consistently performs well. Our YoURVOS dataset offers an imperative benchmark, which will push forward the advancement of RVOS methods for practical applications.
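Abstract (reference translation): Referring video object segmentation (RVOS) has recently become widely popular in computer vision. The existing RVOS setting consists of elaborately trimmed videos in which the text-referred objects always appear in every frame, so it cannot fully reflect the realistic challenges of the task. Under this simplified setting, RVOS methods only need to predict where objects appear and do not need to show when they appear. This work therefore introduces a new setting for in-the-wild RVOS. To this end, the authors collected a new benchmark dataset, YoURVOS (Youtube Untrimmed videos for RVOS), which contains 1,120 in-the-wild videos with 7 times more duration and scenes than existing datasets. The new benchmark challenges RVOS methods to show not only where but also when objects appear in videos. As a baseline, the authors propose Object-level Multimodal TransFormers (OMFormer), characterized by encoding object-level multimodal interactions for efficient and global spatial-temporal localisation. They show that previous VOS methods struggle on the YoURVOS benchmark, especially as target-absent frames increase, whereas OMFormer consistently performs well. The YoURVOS dataset offers an imperative benchmark, which will push forward the advancement of RVOS methods for practical applications.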


Related papers