Fugu-MT Paper Translation (Abstract): Rethinking Video-Language Model from the Language Input Perspective

Paper Abstract: Rethinking Video-Language Model from the Language Input Perspective

arxiv url: http://arxiv.org/abs/2605.27920v1
Date: Wed, 27 May 2026 03:47:05 GMT
Status: Translation complete
System update date: 2026-05-28 17:38:55.730654
Title: Rethinking Video-Language Model from the Language Input Perspective
Title (reference translation): Rethinking Video-Language Models from the Perspective of Language Input
Authors: Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, Daizong Liu,
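Abstract summary: Video-Language Models (VLMs) have become a significant yet challenging technology for bridging the gap between videos and texts. This paper proposes a plug-and-play framework for various VLM-based methods to fully bridge videos and texts.
Reference score (independently computed attention score): 57.766724144002346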
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Driven by the wave of large language models, Video-Language Models (VLMs) have become a significant yet challenging technology for bridging the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that all the texts are predefined by a specific template. In real-world applications, such a strict assumption is impossible to satisfy, since 1) predefining all the texts is extremely time-consuming and labor-intensive, and 2) these predefined text inputs are too restrictive and user-unfriendly, limiting their applications. It is observed that, given a video input, texts with similar semantics but different templates lead to varying performance. To this end, in this paper, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts. Specifically, we first generate positive and negative texts from the original ones to target specific text components. Then, we propose an attribute-based text reasoning strategy to mine fine-grained textual semantics of the generated texts. Finally, we utilize videos as guidance to conduct cross-modal bridging by designing a self-weighted loss. Extensive experiments show that the proposed method can serve as a plug-and-play module to effectively improve the performance of state-of-the-art VLMs.
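Abstract (reference translation): Driven by the wave of large language models, Video-Language Models (VLMs) have become an important yet difficult technology for bridging the gap between videos and texts. Although earlier VLM work has made great progress, almost all of it implicitly assumes that every text is predefined by a specific template. In real-world applications, such a strict assumption cannot be satisfied: 1) predefining all the texts is extremely time-consuming and labor-intensive; 2) these predefined text inputs are too restrictive and user-unfriendly, which limits their applications. Given a video input, texts with similar meaning but different templates were observed to yield differing performance. This paper therefore proposes a plug-and-play framework for various VLM-based methods that fully bridges videos and texts. Specifically, positive and negative texts are first generated from the original texts to target specific text components. An attribute-based text reasoning strategy is then proposed to extract the fine-grained textual semantics of the generated texts. Finally, a self-weighted loss is designed, and cross-modal bridging is performed with videos as guidance. Experiments show that the proposed method works as a plug-and-play module that effectively improves the performance of state-of-the-art VLMs.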

Related papers