Fugu-MT paper translation (summary): Vision+X: A Survey on Multimodal Learning in the Light of Data

Paper summary: Vision+X: A Survey on Multimodal Learning in the Light of Data

arxiv url: http://arxiv.org/abs/2210.02884v1
Date: Wed, 5 Oct 2022 13:14:57 GMT
Status: translation complete
Last updated in system: 2022-10-07 16:53:36.109210
Title: Vision+X: A Survey on Multimodal Learning in the Light of Data
Authors: Ye Zhu, Yu Wu, Nicu Sebe, Yan Yan
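Abstract summary: Multimodal machine learning, which incorporates data from various modalities, is becoming an increasingly popular research area. We analyze the commonness and uniqueness of each data format, including vision, audio, and text, and present the technical development categorized by Vision+X combinations.
Reference score (attention score computed independently by this site): 71.07658443380264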
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified sensing system. To endow the machines with true intelligence, the multimodal machine learning that incorporates data from various modalities has become an increasingly popular research area with emerging technical advances in recent years. In this paper, we present a survey on multimodal machine learning from a novel perspective considering not only the purely technical aspects but also the nature of different data modalities. We analyze the commonness and uniqueness of each data format ranging from vision, audio, text and others, and then present the technical development categorized by the combination of Vision+X, where the vision data play a fundamental role in most multimodal learning works. We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels, and provide an additional comparison in the light of their technical connections with the data nature, e.g., the semantic consistency between image objects and textual descriptions, or the rhythm correspondence between video dance moves and musical beats. The exploitation of the alignment, as well as the existing gap between the intrinsic nature of data modality and the technical designs, will benefit future research studies to better address and solve a specific challenge related to the concrete multimodal task, and to prompt a unified multimodal machine learning framework closer to a real human intelligence system.

Related papers

Quantifying Cross-Modality Memorization in Vision-Language Models [86.82366725590508]
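We investigate the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. The results show that information learned in one modality transfers to the other, but that a large gap remains between the source and target modalities.
Paper reference translation (metadata) (2025-06-05T16:10:47Z)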
Multimodal Alignment and Fusion: A Survey [7.250878248686215]
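Multimodal integration improves model accuracy and applicability. We systematically categorize and analyze existing alignment and fusion techniques. This survey focuses on applications in fields such as social media analysis, medical imaging, and emotion recognition.
Paper reference translation (metadata) (2024-11-26T02:10:27Z)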
Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review [3.0712840129998513]
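This paper proposes a taxonomy and framework covering recent methodological advances. We introduce a new data fusion category, mid fusion, and a graph-based technique for refining the literature review, which we call citation graph pruning. Further research is needed to bridge the gap between multimodal learning and training studies and foundational AI research.
Paper reference translation (metadata) (2024-08-22T22:42:23Z)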
ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers [1.6541870997607049]
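We propose ARPA, an architecture that fuses the unparalleled contextual understanding of large language models with the advanced feature extraction capabilities of transformers. The introduction of ARPA is a significant milestone in visual word disambiguation and offers a compelling solution. We invite researchers and practitioners to explore the capabilities of our model, envisioning a future in which such hybrid models drive unprecedented advances in artificial intelligence.
Paper reference translation (metadata) (2024-08-12T10:15:13Z)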
Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions [68.6358773622615]
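This paper gives an overview of the computational and theoretical foundations of multimodal machine learning. It proposes a taxonomy of six technical challenges: representation, alignment, reasoning, generation, transference, and quantification. Recent technical achievements are presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences among new approaches.
Paper reference translation (metadata) (2022-09-07T19:21:19Z)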
Causal Reasoning Meets Visual Representation Learning: A Prospective Study [117.08431221482638]
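Lack of interpretability, robustness, and out-of-distribution generalization are challenges for existing visual models. Inspired by the strong inference ability of human-level agents, recent years have seen great effort devoted to developing causal reasoning paradigms. This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussion, and highlight the urgency of developing new causal reasoning methods.
Paper reference translation (metadata) (2022-04-26T02:22:28Z)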
Multimodal Image Synthesis and Editing: The Generative AI Era [131.9569600472503]
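Multimodal image synthesis and editing has become a hot research topic in recent years. This survey comprehensively reviews recent advances in multimodal image synthesis and editing. It describes benchmark datasets and evaluation metrics, along with the corresponding experimental results.
Paper reference translation (metadata) (2021-12-27T10:00:16Z)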
WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
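We develop a new foundation model pre-trained on a huge amount of multimodal (visual and textual) data. We show that it achieves state-of-the-art results on various downstream tasks.
Paper reference translation (metadata) (2021-10-27T12:25:21Z)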
What Matters in Learning from Offline Human Demonstrations for Robot Manipulation [64.43440450794495]
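We conduct an extensive study of six offline learning algorithms for robot manipulation. Our study analyzes the most critical challenges in learning from offline human data. We highlight the opportunities for learning from human datasets.
Paper reference translation (metadata) (2021-08-06T20:48:30Z)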

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

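This page provides information on the specified paper.
The operator of this site does not guarantee the quality of the site (including all information and translations) and accepts no responsibility for any results arising from its use.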