Fugu-MT Paper Translation (Abstract): The Indra Representation Hypothesis for Multimodal Alignment

Paper summary: The Indra Representation Hypothesis for Multimodal Alignment

arxiv url: http://arxiv.org/abs/2604.04496v1
Date: Mon, 06 Apr 2026 07:46:04 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.136058
Title: The Indra Representation Hypothesis for Multimodal Alignment
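Title (reference translation): The Indra Representation Hypothesis for Multimodal Alignment
Authors: Jianglin Lu, Hailing Wang, Kuo Yang, Yitian Zhang, Simon Jenni, Yun Fu
Abstract summary: We propose the Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra's Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality.
Reference score (independently computed attention score): 46.60107187498204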
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent studies have uncovered an interesting phenomenon: unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions of samples that characterize samples independently, leading to limited expressiveness. In this paper, we propose The Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra's Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra's Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others. This formulation is shown to be unique, complete, and structure-preserving under a given cost function. We instantiate the Indra representation using angular distance and evaluate it in cross-model and cross-modal scenarios involving vision, language, and audio. Extensive experiments demonstrate that Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a theoretically grounded and practical framework for training-free alignment of unimodal foundation models. Our code is available at https://github.com/Jianglin954/Indra.
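Abstract (reference translation): Unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions that characterize each sample independently, which limits their expressiveness. In this paper, we propose the Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra's Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra's Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to the others. This formulation is shown to be unique, complete, and structure-preserving under a given cost function. We instantiate the Indra representation using angular distance and evaluate it in cross-model and cross-modal scenarios involving vision, language, and audio. Extensive experiments show that Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a theoretically grounded and practical framework for training-free alignment of unimodal foundation models. Our code is available at https://github.com/Jianglin954/Indra.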
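
Illustrative sketch (not the authors' implementation): the abstract describes the Indra representation as a relational profile of each sample with respect to the others, instantiated with angular distance. The NumPy snippet below is one way that idea might look, written from the abstract alone. The function names, the division by pi to scale distances into [0, 1], and the toy "models" (random orthonormal maps applied to shared latent vectors) are all assumptions made for illustration; see the linked repository for the actual method.

```python
import numpy as np


def angular_distance_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise angular distance between rows, scaled to [0, 1].

    Scaling by pi is an assumption of this sketch, not taken from the paper.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    cosine = np.clip(unit @ unit.T, -1.0, 1.0)       # keep arccos in its domain
    return np.arccos(cosine) / np.pi


def relational_profiles(embeddings: np.ndarray) -> np.ndarray:
    """Row i is sample i described by its distances to every sample (hypothetical helper)."""
    return angular_distance_matrix(embeddings)


rng = np.random.default_rng(0)
latent = rng.normal(size=(6, 4))  # 6 toy samples sharing one underlying structure

# Two hypothetical "models" with different embedding widths (5 and 8).
# Each applies a map with orthonormal columns, which preserves angles.
q_a, _ = np.linalg.qr(rng.normal(size=(5, 4)))
q_b, _ = np.linalg.qr(rng.normal(size=(8, 4)))
model_a = latent @ q_a.T  # shape (6, 5)
model_b = latent @ q_b.T  # shape (6, 8)

profiles_a = relational_profiles(model_a)  # shape (6, 6)
profiles_b = relational_profiles(model_b)  # shape (6, 6)

# The raw embeddings live in different spaces, but the profiles are comparable.
profiles_match = np.allclose(profiles_a, profiles_b, atol=1e-6)
print(model_a.shape, model_b.shape, profiles_a.shape, profiles_match)
```

In this toy setup the two embedding spaces have different dimensionality and cannot be compared directly, yet both yield profiles of the same length (one entry per reference sample). Real foundation models are not related by exact angle-preserving maps, so agreement between their profiles would be approximate at best.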
