Fugu-MT Paper Translation (Summary): Scaling Language-Centric Omnimodal Representation Learning

Paper overview: Scaling Language-Centric Omnimodal Representation Learning

arxiv url: http://arxiv.org/abs/2510.11693v1
Date: Mon, 13 Oct 2025 17:53:52 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:30.496479
Title: Scaling Language-Centric Omnimodal Representation Learning
Title (reference translation): Scaling Language-Centric Omnimodal Representation Learning
Authors: Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong,
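Abstract summary: Multimodal embedding approaches that leverage multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining. We propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb.
Reference score (independently computed attention metric): 26.999264997449586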
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities is an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Code, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.
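Abstract (reference translation): Recent multimodal embedding approaches that leverage multimodal large language models (MLLMs) are fine-tuned with contrastive learning (CL). This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space to generate unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, so that CL serves as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework called LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities is an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL that formally links the MLLM's generative quality to the upper bound on its representation performance, and show that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Code, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.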
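
Illustrative sketch (not from the paper): the abstract mentions analysing anisotropy and kernel similarity structure but does not say which estimators are used. The Python below assumes two common choices, mean pairwise cosine similarity for anisotropy and linear CKA for kernel similarity. The function names and the toy embeddings are hypothetical and are not taken from the LCO-Emb code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anisotropy(embeddings):
    """Mean cosine similarity over all distinct pairs (assumed estimator).

    Values near 1 mean the vectors crowd into a narrow cone.
    """
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)

def centered_gram(embeddings):
    """Linear-kernel Gram matrix with rows and columns mean-centered."""
    n = len(embeddings)
    k = [[sum(a * b for a, b in zip(x, y)) for y in embeddings] for x in embeddings]
    row_mean = [sum(row) / n for row in k]
    col_mean = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[k[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
            for i in range(n)]

def linear_cka(x, y):
    """Linear CKA between two embedding sets describing the same n items."""
    kx, ky = centered_gram(x), centered_gram(y)
    inner = lambda a, b: sum(p * q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return inner(kx, ky) / math.sqrt(inner(kx, kx) * inner(ky, ky))

# Hypothetical toy data: "image" vectors are the "text" vectors rotated by
# 90 degrees and scaled by 2, so their similarity structure is identical.
text_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_emb = [[0.0, 2.0], [-2.0, 0.0], [-2.0, 2.0]]

text_anisotropy = anisotropy(text_emb)
cross_modal_cka = linear_cka(text_emb, image_emb)
```

Linear CKA is invariant to rotation and uniform scaling, so `cross_modal_cka` equals 1 up to floating-point rounding on this toy data. On real MLLM embeddings, a high value across modalities would be one possible indicator of the latent alignment the abstract describes.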

Related papers