Fugu-MT Paper Translation (Abstract): Dynamic Reflections: Probing Video Representations with Text Alignment

Paper summary: Dynamic Reflections: Probing Video Representations with Text Alignment

arxiv url: http://arxiv.org/abs/2511.02767v1
Date: Tue, 04 Nov 2025 17:52:14 GMT
Status: Translation complete
Last updated in system: 2025-11-05 18:47:06.128807
Title: Dynamic Reflections: Probing Video Representations with Text Alignment
Title (reference translation): Dynamic Reflections: Probing Video Representations with Text Alignment
Authors: Tyler Zhu, Tengda Han, Leonidas Guibas, Viorica Pătrăucean, Maks Ovsjanikov
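Abstract summary: Cross-modal alignment depends on the richness of both the visual data (static images vs. multi-frame videos) and the text data (single caption vs. a collection) provided at test time. This work proposes parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations.
Reference score (independently computed attention metric): 36.66874523368293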
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The alignment of representations from different modalities has recently been shown to provide insights on the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Secondly, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment against text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment providing a challenging test-bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data. Project page can be found at https://video-prh.github.io/
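Abstract (reference translation): Aligning representations from different modalities has recently been shown to offer insight into the structural similarities and downstream capabilities of different encoders across diverse data types. Although aligning images with text has seen significant progress, the temporal nature of video data remains largely unexplored in this context. This work presents the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. The findings yield several key insights. First, cross-modal alignment depends strongly on the richness of both the visual data (static images vs. multi-frame videos) and the text data (single caption vs. a collection) provided at test time, especially with state-of-the-art video encoders. The work proposes parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Second, it examines the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, giving initial evidence that strong alignment against text encoders may be linked to general-purpose video representation and understanding. Finally, it correlates temporal reasoning with cross-modal alignment, providing a challenging test-bed for vision and language models. Overall, video-text alignment is introduced as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data. The project page is at https://video-prh.github.io/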

Related papers