Fugu-MT Paper Translation (Abstract): Negate or Embrace: On How Misalignment Shapes Multimodal Representation Learning

Paper abstract: Negate or Embrace: On How Misalignment Shapes Multimodal Representation Learning

arxiv url: http://arxiv.org/abs/2504.10143v2
Date: Wed, 16 Apr 2025 05:22:32 GMT
Status: Translation complete
Last updated in system: 2025-04-24 12:50:05.372641
Title: Negate or Embrace: On How Misalignment Shapes Multimodal Representation Learning
Title (reference translation): Negate or Embrace: On How Misalignment Shapes Multimodal Representation Learning
Authors: Yichao Cai, Yuhang Liu, Erdun Gao, Tianjiao Jiang, Zhen Zhang, Anton van den Hengel, Javen Qinfeng Shi
Abstract summary: Multimodal representation learning aims to learn powerful representations by aligning cues across modalities. Recent research has revealed that real-world datasets often exhibit misalignment.
Reference score (independently computed attention score): 37.29274397631946
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Multimodal representation learning, exemplified by multimodal contrastive learning (MMCL) using image-text pairs, aims to learn powerful representations by aligning cues across modalities. This approach relies on the core assumption that the exemplar image-text pairs constitute two representations of an identical concept. However, recent research has revealed that real-world datasets often exhibit misalignment. There are two distinct viewpoints on how to address this issue: one suggests mitigating the misalignment, and the other leveraging it. We seek here to reconcile these seemingly opposing perspectives, and to provide a practical guide for practitioners. Using latent variable models we thus formalize misalignment by introducing two specific mechanisms: selection bias, where some semantic variables are missing, and perturbation bias, where semantic variables are distorted -- both affecting latent variables shared across modalities. Our theoretical analysis demonstrates that, under mild assumptions, the representations learned by MMCL capture exactly the information related to the subset of the semantic variables invariant to selection and perturbation biases. This provides a unified perspective for understanding misalignment. Based on this, we further offer actionable insights into how misalignment should inform the design of real-world ML systems. We validate our theoretical findings through extensive empirical studies on both synthetic data and real image-text datasets, shedding light on the nuanced impact of misalignment on multimodal representation learning.
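Abstract (reference translation): Multimodal representation learning, exemplified by multimodal contrastive learning (MMCL) using image-text pairs, aims to learn powerful representations by aligning cues across modalities. This approach relies on the core assumption that the exemplar image-text pairs constitute two representations of an identical concept. However, recent research has revealed that real-world datasets often exhibit misalignment. There are two distinct viewpoints on how to address this issue. Here we reconcile these seemingly opposing perspectives and provide a practical guide for practitioners. Using latent variable models, we formalize misalignment by introducing two specific mechanisms: selection bias, where semantic variables are missing, and perturbation bias, where semantic variables are distorted. Our theoretical analysis shows that, under mild assumptions, the representations learned by MMCL capture exactly the information related to the subset of semantic variables that is invariant to selection and perturbation biases. This provides a unified perspective for understanding misalignment. Based on this, we offer actionable insights into how misalignment should inform the design of real-world ML systems. We validate our theoretical findings through extensive empirical studies on both synthetic data and real image-text datasets, shedding light on the nuanced impact of misalignment on multimodal representation learning.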
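The two bias mechanisms named in the abstract can be illustrated with a small toy generator. This is only a sketch under stated assumptions, not the authors' latent variable model: the names `sample_pair` and `invariant_subset`, the Gaussian latents, and the choice to model perturbation as an independent redraw are all hypothetical. It shows which semantic variables stay identical across the two views, which is the subset the abstract says MMCL representations capture.

```python
import random


def sample_pair(n_semantic, selected, perturbed, rng):
    """Draw one toy (image, text) latent pair.

    Assumption: the image view keeps every semantic variable, while the
    text view keeps only the `selected` indices (selection bias) and
    redraws the `perturbed` ones independently (perturbation bias).
    """
    semantic = [rng.gauss(0.0, 1.0) for _ in range(n_semantic)]
    image_latent = list(semantic)
    text_latent = {}
    for i in sorted(selected):
        if i in perturbed:
            # Distorted variable: replaced by an unrelated draw.
            text_latent[i] = rng.gauss(0.0, 1.0)
        else:
            # Faithfully described variable: shared by both views.
            text_latent[i] = semantic[i]
    return image_latent, text_latent


def invariant_subset(selected, perturbed):
    """Indices that are selected and not perturbed."""
    return sorted(set(selected) - set(perturbed))


rng = random.Random(0)
selected = {0, 1, 2, 3}   # variable 4 is never described in the text
perturbed = {3}           # variable 3 is described, but distorted

pairs = [sample_pair(5, selected, perturbed, rng) for _ in range(1000)]

# For each selected variable, check whether the two views always agree.
unchanged = {
    i: all(image[i] == text[i] for image, text in pairs)
    for i in sorted(selected)
}

print(invariant_subset(selected, perturbed))  # [0, 1, 2]
print(unchanged)
```

In this toy setting, only variables 0, 1, and 2 agree across views, so they are the only ones an alignment objective could recover from the pairs; variable 4 is dropped by selection and variable 3 is decorrelated by perturbation.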

Related papers

Understanding the Emergence of Multimodal Representation Alignment [22.81361409729974]
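A recent line of work has found that independently trained unimodal models of increasing scale and performance are implicitly aligned with one another. We show that the emergence of alignment and its relationship with task performance depend on several key data characteristics. Our findings suggest that alignment is not universally beneficial and that its impact on performance varies with the dataset and task.
Paper reference translation (metadata) (2025-02-22T16:27:31Z)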
Towards a Learning Theory of Representation Alignment [12.166663160280056]
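We propose a learning-theoretic perspective on representation alignment. The results can be seen as a first step toward casting representation alignment as a learning-theoretic problem.
Paper reference translation (metadata) (2025-02-19T19:09:14Z)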
The "Law" of the Unconscious Contrastive Learner: Probabilistic Alignment of Unpaired Modalities [23.188014611990152]
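This paper presents geometric and probabilistic interpretations of contrastive representations. We show how these representations can answer many of the same inference queries as probabilistic graphical models. The analysis suggests two new ways of using contrastive representations: in settings with pre-trained contrastive models, and for language ambiguity in reinforcement learning.
Paper reference translation (metadata) (2025-01-20T08:10:15Z)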
The Common Stability Mechanism behind most Self-Supervised Learning Approaches [64.40701218561921]
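We provide a framework to explain the stability mechanism of self-supervised learning methods. We discuss the working mechanisms of contrastive techniques such as SimCLR and non-contrastive techniques such as BYOL, SWAV, SimSiam, Barlow Twins, and DINO. We formulate different hypotheses and test them using the Imagenet100 dataset.
Paper reference translation (metadata) (2024-02-22T20:36:24Z)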
Revealing Multimodal Contrastive Representation Learning through Latent Partial Causal Models [85.67870425656368]
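We propose a unified causal model designed specifically for multimodal data. We show that multimodal contrastive representation learning excels at identifying latent coupled variables. Experiments demonstrate the robustness of our findings even when the assumptions are violated.
Paper reference translation (metadata) (2024-02-09T07:18:06Z)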
Disentangling Multi-view Representations Beyond Inductive Bias [32.15900989696017]
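This paper proposes a novel multi-view representation disentangling method that ensures both the interpretability and the generalizability of the representations. We show that the proposed method outperforms 12 comparison methods in clustering and classification performance.
Paper reference translation (metadata) (2023-08-03T09:09:28Z)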
Variational Distillation for Multi-View Learning [104.17551354374821]
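We design several information bottlenecks to exploit two key characteristics of multi-view representation learning. Under rigorous theoretical guarantees, our method enables capturing the intrinsic correlation between observations and semantic labels.
Paper reference translation (metadata) (2022-06-20T03:09:46Z)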
Contrastive Instruction-Trajectory Learning for Vision-Language Navigation [66.16980504844233]
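The vision-language navigation (VLN) task requires an agent to reach a target under the guidance of natural language instructions. Prior work fails to discriminate the similarities and differences among instruction-trajectory pairs and ignores the temporal continuity of sub-instructions. This paper proposes a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different data samples to learn distinctive representations for robust navigation.
Paper reference translation (metadata) (2021-12-08T06:32:52Z)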
Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective [72.55093886515824]
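This paper introduces a causal formalism of motion forecasting as a dynamic process with three groups of latent variables. We devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a causal graph. Experimental results on synthetic and real data show that the three proposed components significantly improve the robustness and reusability of the learned motion representations.
Paper reference translation (metadata) (2021-11-29T18:59:09Z)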
Learning Disentangled Representations with Latent Variation Predictability [102.4163768995288]
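This paper defines the variation predictability of latent disentangled representations. Within an adversarial generation process, we encourage variation predictability by maximizing the mutual information between latent variations and the corresponding image pairs. We also develop an evaluation metric that does not rely on ground-truth generative factors to measure the disentanglement of latent representations.
Paper reference translation (metadata) (2020-07-25T08:54:26Z)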

The related papers list is generated automatically from the titles and abstracts of papers on this site.
