Fugu-MT Paper Translation (Abstract): HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos

Paper overview: HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos

arxiv url: http://arxiv.org/abs/2603.06732v1
Date: Fri, 06 Mar 2026 04:10:50 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:13.003359
Title: HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos
Authors: Tingting Han, Xinsong Tao, Yufei Yin, Min Tan, Sicheng Zhao, Zhou Yu
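Abstract summary: Temporal sentence grounding in videos aims to temporally localize the segment of a video that corresponds to a given natural language query. Existing approaches operate under closed-vocabulary settings, which limits their ability to generalize to real-world queries involving novel or diverse linguistic expressions. The authors therefore introduce the Open-Vocabulary TSGV (OV-TSGV) task and construct the first dedicated benchmarks that simulate realistic vocabulary shifts and paraphrastic variations.
Reference score (independently computed attention measure): 29.677489003907095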
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Temporal Sentence Grounding in Videos (TSGV) aims to temporally localize segments of a video that correspond to a given natural language query. Despite recent progress, most existing TSGV approaches operate under closed-vocabulary settings, limiting their ability to generalize to real-world queries involving novel or diverse linguistic expressions. To bridge this critical gap, we introduce the Open-Vocabulary TSGV (OV-TSGV) task and construct the first dedicated benchmarks (Charades-OV and ActivityNet-OV) that simulate realistic vocabulary shifts and paraphrastic variations. These benchmarks facilitate systematic evaluation of model generalization beyond seen training concepts. To tackle OV-TSGV, we propose HERO (Hierarchical Embedding-Refinement for Open-Vocabulary grounding), a unified framework that leverages hierarchical linguistic embeddings and performs parallel cross-modal refinement. HERO jointly models multi-level semantics and enhances video-language alignment via semantic-guided visual filtering and contrastive masked text refinement. Extensive experiments on both standard and open-vocabulary benchmarks demonstrate that HERO consistently surpasses state-of-the-art methods, particularly under open-vocabulary scenarios, validating its strong generalization capability and underscoring the significance of OV-TSGV as a new research direction.


Related papers