Fugu-MT Paper Translation (Abstract): Decipher the Modality Gap in Multimodal Contrastive Learning: From Convergent Representations to Pairwise Alignment

Paper overview: Decipher the Modality Gap in Multimodal Contrastive Learning: From Convergent Representations to Pairwise Alignment

arxiv url: http://arxiv.org/abs/2510.03268v2
Date: Tue, 07 Oct 2025 18:46:38 GMT
Status: Translation complete
Last updated in system: 2025-10-09 14:21:18.185768
Title: Decipher the Modality Gap in Multimodal Contrastive Learning: From Convergent Representations to Pairwise Alignment
Authors: Lingjie Yi, Raphael Douady, Chao Chen
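Abstract summary: Multimodal contrastive learning aims to embed data from different modalities in a shared embedding space. Empirical evidence shows that representations from different modalities occupy completely separate regions of the embedding space. This paper introduces the first theoretical framework for analyzing the convergent optimal representations of MCL and the modality alignment when training is optimized.
Reference score (independently computed attention metric): 6.276865284763687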
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal contrastive learning (MCL) aims to embed data from different modalities in a shared embedding space. However, empirical evidence shows that representations from different modalities occupy completely separate regions of embedding space, a phenomenon referred to as the modality gap. Moreover, experimental findings on how the size of the modality gap influences downstream performance are inconsistent. These observations raise two key questions: (1) What causes the modality gap? (2) How does it affect downstream tasks? To address these questions, this paper introduces the first theoretical framework for analyzing the convergent optimal representations of MCL and the modality alignment when training is optimized. Specifically, we prove that without any constraint or under the cone constraint, the modality gap converges to zero. Under the subspace constraint (i.e., representations of two modalities fall into two distinct hyperplanes due to dimension collapse), the modality gap converges to the smallest angle between the two hyperplanes. This result identifies dimension collapse as the fundamental origin of the modality gap. Furthermore, our theorems demonstrate that paired samples cannot be perfectly aligned under the subspace constraint. The modality gap influences downstream performance by affecting the alignment between sample pairs. We prove that, in this case, perfect alignment between two modalities can still be achieved via two ways: hyperplane rotation and shared space projection.
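Note: The abstract states that, under the subspace constraint, the modality gap converges to the smallest angle between the two hyperplanes. The abstract does not give the paper's exact definition of that angle, so the following is only a minimal sketch under an assumption: it uses the standard notion of principal angles between two linear subspaces, computed with NumPy. The function name `smallest_principal_angle` and the toy subspaces are hypothetical and are not taken from the paper.

```python
import numpy as np


def smallest_principal_angle(A, B):
    """Smallest principal angle (radians) between the column spaces of A and B.

    Assumption: "angle between hyperplanes" is read here as the smallest
    principal angle between two linear subspaces. The paper may define it
    differently.
    """
    # Orthonormal bases for each subspace.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # The largest cosine gives the smallest angle; clip guards against round-off.
    return float(np.arccos(np.clip(cosines.max(), -1.0, 1.0)))


# Toy example (hypothetical data): two 2-D subspaces of R^4 that meet only
# at the origin, with principal angles of 30 and 60 degrees.
theta, phi = np.pi / 6, np.pi / 3
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])
B = np.array([[np.cos(theta), 0.0],
              [0.0, np.cos(phi)],
              [np.sin(theta), 0.0],
              [0.0, np.sin(phi)]])

angle = smallest_principal_angle(A, B)
print(f"smallest principal angle: {np.degrees(angle):.2f} degrees")
```

In this toy setup the two subspaces share no common direction, so the smallest principal angle is strictly positive. If the subspaces intersected in a line, this quantity would be zero.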

Related papers