Fugu-MT Paper Translation (Summary): LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation

Paper summary: LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation

arxiv url: http://arxiv.org/abs/2512.16891v1
Date: Thu, 18 Dec 2025 18:52:18 GMT
Status: Translation complete
Last updated in system: 2025-12-19 18:10:32.232095
Title: LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation
Title (reference translation): LinkedOut: Linking World Knowledge Representations from Video LLMs for Next-Generation Video Recommendation
Authors: Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen, Yunpeng Xu, Yun Fu
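Abstract summary: Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data. This paper presents LinkedOut, which extracts VLLM world knowledge directly from video to enable fast inference. It introduces a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation.
Reference score (independently computed attention metric): 32.57236582010967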
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
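Abstract (reference translation): Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data, and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains difficult, because real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation has high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards the fine-grained visual details that matter for downstream vision tasks. The authors argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. They present LinkedOut, which extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut uses VLLMs to extract semantically grounded, knowledge-aware tokens from raw frames, guided by promptable queries and optional auxiliary modalities. A cross-layer knowledge fusion MoE selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To the authors' knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, and it achieves state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.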
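The abstract mentions a cross-layer knowledge fusion MoE that selects a level of abstraction from VLLM features, but it does not describe the design. As a rough illustration only, the sketch below shows generic softmax gating over per-layer feature vectors, treating each layer as one "expert". The function names (`softmax`, `fuse_layers`), the gating vectors, and the pooled per-layer features are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(
    layer_features: Sequence[Vector],
    gate_vectors: Sequence[Vector],
) -> Tuple[List[float], Vector]:
    """Softly select among per-layer features (illustrative assumption).

    layer_features[i]: pooled feature vector from a hypothetical layer i.
    gate_vectors[i]:   hypothetical learned gating vector for layer i.
    Returns (mixing weights over layers, fused feature vector).
    """
    if len(layer_features) != len(gate_vectors):
        raise ValueError("need one gating vector per layer")
    # One scalar score per layer: dot product of feature and gating vector.
    scores = [
        sum(x * w for x, w in zip(feat, gate))
        for feat, gate in zip(layer_features, gate_vectors)
    ]
    weights = softmax(scores)
    dim = len(layer_features[0])
    # Fused vector is the weight-averaged combination of the layer features.
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, layer_features))
        for d in range(dim)
    ]
    return weights, fused
```

In a sketch like this, the mixing weights can be read as how strongly each layer contributes to the fused vector, which is one plausible way a layer-wise fusion could be inspected; whether LinkedOut exposes its weights this way is not stated in the abstract.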

Related papers