Fugu-MT Paper Translation (Abstract): 2nd of the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA

Paper summary: 2nd of the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA

arxiv url: http://arxiv.org/abs/2604.23935v1
Date: Mon, 27 Apr 2026 01:19:56 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.684483
Title: 2nd of the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA
Authors: Zhiyu Wang, Xudong Kang, Shutao Li
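Abstract summary: This paper presents ASR-SaSaSa2VA, a resource-efficient framework for audio-guided video segmentation. It achieved a final score of 80.7 in the 5th PVUW Challenge (MeViS-v2-Audio track), placing second.
Reference score (independently computed attention metric): 47.992210130090065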
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Audio-based video object segmentation aims to locate and segment objects in videos conditioned on audio cues, requiring precise understanding of both appearance and motion. Recent audio-driven video segmentation methods extend MLLMs by fusing audio and visual features for end-to-end localization. Despite their promise, these approaches are computationally intensive, struggle with aligning temporal audio cues to dynamic video content, and depend on large paired audio-video datasets. To address these challenges, we present ASR-SaSaSa2VA, a resource-efficient framework for audio-guided video segmentation. The key idea is to convert audio inputs into textual motion descriptions via automatic speech recognition (ASR) models and then leverage pre-trained text-based referring video segmentation models (e.g., SaSaSa2VA) for pixel-level predictions. To further enhance robustness, we incorporate a no-target expression detection module, implemented by a fine-tuned audio-based MLLM, which filters out audio clips that do not refer to any target object. This design allows the system to exploit strong pre-trained models while effectively handling ambiguous or irrelevant audio inputs. Our approach achieves a final score of 80.7 in the 5th PVUW Challenge (MeViS-v2-Audio track), earning the second-place ranking.
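The three stages named in the abstract (no-target filtering, ASR transcription, text-based referring segmentation) can be sketched as a small pipeline. This is only an illustrative sketch, not the authors' implementation: the class `AudioGuidedSegmenter`, its three callables, and the toy stubs are all hypothetical stand-ins for real models. Running the no-target check before ASR, and returning empty masks for filtered clips, are assumptions that the abstract does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[List[int]]  # toy stand-in for an image: a 2D grid of pixel values
Mask = List[List[int]]   # binary mask with the same height and width as a frame


def empty_mask(frame: Frame) -> Mask:
    """Return an all-zero mask matching the frame's shape."""
    return [[0] * len(row) for row in frame]


@dataclass
class AudioGuidedSegmenter:
    # All three callables are hypothetical placeholders for real models:
    # an audio MLLM classifier, an ASR model, and a text-based referring
    # video segmentation model.
    refers_to_target: Callable[[bytes], bool]
    transcribe: Callable[[bytes], str]
    segment: Callable[[Sequence[Frame], str], List[Mask]]

    def run(self, frames: Sequence[Frame], audio: bytes) -> List[Mask]:
        # Step 1: filter out clips that refer to no object.
        # Assumption: such clips yield an empty mask for every frame.
        if not self.refers_to_target(audio):
            return [empty_mask(frame) for frame in frames]
        # Step 2: convert the audio into a textual expression.
        expression = self.transcribe(audio)
        # Step 3: hand the text to the referring segmentation model.
        return self.segment(frames, expression)


# Toy stubs so the sketch runs without any real model.
def toy_detector(audio: bytes) -> bool:
    return len(audio) > 0


def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_segmenter(frames: Sequence[Frame], expression: str) -> List[Mask]:
    # Marks pixels equal to the integer at the end of the expression.
    target = int(expression.split()[-1])
    return [[[1 if px == target else 0 for px in row] for row in frame]
            for frame in frames]


pipeline = AudioGuidedSegmenter(toy_detector, toy_asr, toy_segmenter)
frames = [[[0, 7], [7, 0]], [[7, 7], [0, 0]]]
masks = pipeline.run(frames, b"segment value 7")
no_target = pipeline.run(frames, b"")
```

Keeping the three stages as interchangeable callables mirrors the resource-efficiency argument in the abstract: each stage can be an off-the-shelf pre-trained model, and only the no-target detector is described as fine-tuned.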

Related papers