Fugu-MT Paper Translation (Abstract): OVG-HQ: Online Video Grounding with Hybrid-modal Queries

Paper overview: OVG-HQ: Online Video Grounding with Hybrid-modal Queries

arxiv url: http://arxiv.org/abs/2508.11903v1
Date: Sat, 16 Aug 2025 04:21:45 GMT
Status: Translation complete
Last updated in system: 2025-08-19 14:49:10.441063
Title: OVG-HQ: Online Video Grounding with Hybrid-modal Queries
Title (reference translation): OVG-HQ: Online Video Grounding with Hybrid-modal Queries
Authors: Runhao Zeng, Jiaqi Mao, Minghao Lai, Minh Hieu Phan, Yanjie Dong, Wei Wang, Qi Chen, Xiping Hu
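Abstract summary: The video grounding task focuses on locating specific moments in a video based on a query, usually in text form. Traditional VG struggles with some scenarios, such as streaming video or queries that use visual cues. The authors propose a task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ), which enables online segment localization using text, images, video segments, and their combinations.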
Reference score (attention score computed independently by this site): 19.937584866244038
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The video grounding (VG) task focuses on locating specific moments in a video based on a query, usually in text form. However, traditional VG struggles with some scenarios like streaming video or queries using visual cues. To fill this gap, we present a new task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ), which enables online segment localization using text, images, video segments, and their combinations. This task poses two new challenges: limited context in online settings and modality imbalance during training, where dominant modalities overshadow weaker ones. To address these, we propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block (PMB) that retains previously learned knowledge to enhance current decisions and a cross-modal distillation strategy that guides the learning of non-dominant modalities. This design enables a single model to effectively handle hybrid-modal queries. Due to the lack of suitable datasets, we construct QVHighlights-Unify, an expanded dataset with multi-modal queries. In addition, since offline metrics overlook prediction timeliness, we adapt them to the online setting, introducing oR@n, IoU=m, and online mean Average Precision (omAP) to evaluate both accuracy and efficiency. Experiments show that our OVG-HQ-Unify outperforms existing models, offering a robust solution for online, hybrid-modal video grounding. Source code and datasets are available at https://github.com/maojiaqi2324/OVG-HQ.
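Abstract (reference translation): The video grounding (VG) task focuses on identifying specific moments in a video based on a query, usually in text form. However, conventional VG struggles in some scenarios, such as streaming video and queries that use visual cues. To fill this gap, a new task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ) is proposed, which enables online segment localization using text, images, video segments, and their combinations. This task brings two new challenges: limited context in online settings and modality imbalance during training. To address them, the work proposes OVG-HQ-Unify, a unified framework equipped with a Parametric Memory Block (PMB) that retains previously learned knowledge, together with a cross-modal distillation strategy that guides the learning of non-dominant modalities. This design lets a single model handle hybrid-modal queries effectively. Because no suitable dataset exists, QVHighlights-Unify, an expanded dataset with multi-modal queries, was constructed. Furthermore, because offline metrics overlook prediction timeliness, they are adapted to the online setting by introducing oR@n, IoU=m, and online mean Average Precision (omAP), which evaluate both accuracy and efficiency. Experiments show that OVG-HQ-Unify outperforms existing models and offers a robust solution for online, hybrid-modal video grounding. Source code and datasets are available at https://github.com/maojiaqi2324/OVG-HQ.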
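
The metrics named in the abstract are built on temporal IoU between a predicted segment and a ground-truth segment. The following is a minimal sketch of that standard building block and of a plain recall-at-n check at an IoU threshold. It is not the authors' code: the function names `temporal_iou` and `recall_at_n` are hypothetical, segments are assumed to be `(start, end)` pairs in seconds, and the timeliness component that distinguishes the online variants (oR@n, omAP) is not reproduced because the abstract does not define it.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two time intervals given as (start, end)."""
    (ps, pe), (gs, ge) = pred, gt
    # Overlap length, clamped at zero when the intervals are disjoint.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(preds, gt, n, iou_threshold):
    """Return 1.0 if any of the first n predictions reaches the IoU threshold.

    `preds` is assumed to be sorted by confidence, best first.
    """
    return float(any(temporal_iou(p, gt) >= iou_threshold for p in preds[:n]))
```

For example, with a ground-truth moment of `(5, 15)` and ranked predictions `[(0, 10), (4, 14)]`, the first prediction has an IoU of 1/3, so recall at n=1 with threshold 0.5 is a miss, while at n=2 the second prediction counts as a hit.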

Related papers