Fugu-MT Paper Translation (Abstract): Fusion or Confusion? Multimodal Complexity Is Not All You Need

Paper overview: Fusion or Confusion? Multimodal Complexity Is Not All You Need

arxiv url: http://arxiv.org/abs/2512.22991v1
Date: Sun, 28 Dec 2025 16:20:36 GMT
Status: Translation complete
Last updated in system: 2025-12-30 22:37:30.31441
Title: Fusion or Confusion? Multimodal Complexity Is Not All You Need
Title (reference translation): Fusion or Confusion? Multimodal Complexity Alone Is Not Enough
Authors: Tillmann Rheude, Roland Eils, Benjamin Wild
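Abstract summary: We reimplement 19 high-impact methods under standardized conditions and evaluate them on nine diverse datasets with up to 23 modalities. We propose a Simple Baseline for Multimodal Learning (SimBaMM). We argue for a shift in focus away from the pursuit of architectural novelty and toward methodological rigor.
Reference score (independently computed attention measure): 1.2472265402088736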
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deep learning architectures for multimodal learning have increased in complexity, driven by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study reimplementing 19 high-impact methods under standardized conditions, evaluating them across nine diverse datasets with up to 23 modalities, and testing their generalizability to new tasks beyond their original scope, including settings with missing modalities. We propose a Simple Baseline for Multimodal Learning (SimBaMM), a straightforward late-fusion Transformer architecture, and demonstrate that under standardized experimental conditions with rigorous hyperparameter tuning of all methods, more complex architectures do not reliably outperform SimBaMM. Statistical analysis indicates that more complex methods perform comparably to SimBaMM and frequently do not reliably outperform well-tuned unimodal baselines, especially in the small-data regime considered in many original studies. To support our findings, we include a case study of a recent multimodal learning method highlighting the methodological shortcomings in the literature. In addition, we provide a pragmatic reliability checklist to promote comparable, robust, and trustworthy future evaluations. In summary, we argue for a shift in focus: away from the pursuit of architectural novelty and toward methodological rigor.
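Abstract (reference translation): Deep learning architectures for multimodal learning have grown more complex, driven by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study that reimplements 19 high-impact methods under standardized conditions, evaluates them across nine diverse datasets with up to 23 modalities, and tests their generalizability to new tasks beyond their original scope, including settings with missing modalities. We propose SimBaMM (Simple Baseline for Multimodal Learning), a simple late-fusion Transformer architecture, and show that under standardized experimental conditions with rigorous hyperparameter tuning of all methods, more complex architectures do not reliably outperform SimBaMM. Statistical analysis indicates that more complex methods perform comparably to SimBaMM and often do not reliably outperform well-tuned unimodal baselines, especially in the small-data regime considered in many original studies. To support these findings, we include a case study of a recent multimodal learning method that highlights methodological shortcomings in the literature. In addition, we provide a pragmatic reliability checklist to promote comparable, robust, and trustworthy future evaluations. In summary, we argue for a shift in focus away from the pursuit of architectural novelty and toward methodological rigor.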

Related papers