Fugu-MT Paper Translation (Summary): DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

Paper summary: DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

arxiv url: http://arxiv.org/abs/2606.19062v1
Date: Wed, 17 Jun 2026 13:35:50 GMT
Status: Translation complete
Last updated in system: 2026-06-18 17:16:51.180792
Title: DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval
Authors: Kaleem Ullah, Altaf Hussain, Muhammad Munsif, Sung Wook Baik
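Abstract summary: We introduce DREAM: Dual-path Representation Enhancement and Alignment Model. We design a hierarchical vision encoder that integrates spatial and temporal information. We validate DREAM through comprehensive evaluations on the widely used MSRVTT, MSVD, and LSMDC benchmark datasets.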
Reference score (independently computed measure of attention): 8.127699016544822
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In today's media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM: Dual-path Representation Enhancement and Alignment Model, a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely-used MSRVTT, MSVD and LSMDC benchmark datasets, where it achieves new state-of-the-art R1 scores of 49.4%, 49.7% and 27.3%, respectively. Qualitative analyses further show the model's ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research in advancing cross-modal representation learning.


Related papers