A Review on Methods and Applications in Multimodal Deep Learning
- URL: http://arxiv.org/abs/2202.09195v1
- Date: Fri, 18 Feb 2022 13:50:44 GMT
- Title: A Review on Methods and Applications in Multimodal Deep Learning
- Authors: Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Jabbar Abdul
- Abstract summary: Multimodal deep learning enables better understanding and analysis when multiple senses are engaged in processing information.
This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals.
A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth.
- Score: 8.152125331009389
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning has been applied across a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information from multiple modalities. Despite the extensive development of unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning enables better understanding and analysis when multiple senses are engaged in processing information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. A detailed analysis of the baseline approaches and an in-depth study of recent advancements in multimodal deep learning applications over the last five years (2017 to 2021) are provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Lastly, the main issues of each domain are highlighted separately, along with their possible future research directions.
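To make the goal concrete, the sketch below shows one common way such models "process and link" modalities: a late-fusion classifier that encodes image and text features separately and merges them before prediction. This is a minimal PyTorch illustration, not a method from the paper; all dimensions, names, and the fusion choice are assumptions.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy late-fusion model: each modality is encoded separately,
    then the embeddings are concatenated and classified jointly."""
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, num_classes=10):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)  # e.g., pooled CNN features
        self.txt_proj = nn.Linear(txt_dim, hidden)  # e.g., pooled text-encoder features
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, num_classes))

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.head(fused)

model = LateFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))  # batch of 4 image-text pairs
print(logits.shape)  # torch.Size([4, 10])
```

Late fusion is only one point in the design space the paper's taxonomy covers; early and intermediate (mid) fusion move the merge closer to the raw inputs.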
Related papers
- Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review [3.0712840129998513]
This literature review proposes a taxonomy and framework that encapsulate recent methodological advances in this field.
We introduce a novel data fusion category -- mid fusion -- and a graph-based technique for refining literature reviews, termed citation graph pruning (a minimal mid-fusion sketch follows this entry).
There remains a need for further research to bridge the divide between multimodal learning and training studies and foundational AI research.
arXiv Detail & Related papers (2024-08-22T22:42:23Z)
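The mid-fusion category flagged in the entry above merges modality streams at intermediate layers, sitting between early fusion (merging raw inputs) and late fusion (merging predictions). A minimal sketch, with all dimensions and layer choices assumed purely for illustration:

```python
import torch
import torch.nn as nn

class MidFusionNet(nn.Module):
    """Toy mid-fusion model: each modality passes through its own early
    layers, then features are merged partway through and processed jointly."""
    def __init__(self, a_dim=128, b_dim=64, hidden=256, num_classes=5):
        super().__init__()
        self.a_early = nn.Sequential(nn.Linear(a_dim, hidden), nn.ReLU())
        self.b_early = nn.Sequential(nn.Linear(b_dim, hidden), nn.ReLU())
        # Joint layers operate on the fused intermediate representation.
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, a, b):
        mid = torch.cat([self.a_early(a), self.b_early(b)], dim=-1)
        return self.joint(mid)
```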
- LLMs Meet Multimodal Generation and Editing: A Survey [89.76691959033323]
This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio.
We summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods.
We dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction.
arXiv Detail & Related papers (2024-05-29T17:59:20Z)
- Learning on Multimodal Graphs: A Survey [6.362513821299131]
Multimodal data pervades various domains, including healthcare, social media, and transportation.
Multimodal graph learning (MGL) is essential for successful artificial intelligence (AI) applications.
arXiv Detail & Related papers (2024-02-07T23:50:00Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- Domain Generalization for Mammographic Image Analysis with Contrastive Learning [62.25104935889111]
Training an effective deep learning model requires large amounts of data with diverse styles and qualities.
A novel contrastive learning scheme is developed to equip deep learning models with better style generalization capability (a generic contrastive-loss sketch follows this entry).
The proposed method has been evaluated extensively and rigorously with mammograms from various vendor style domains and several public datasets.
arXiv Detail & Related papers (2023-04-20T11:40:21Z)
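The summary above does not specify the paper's loss, so the following is only a generic InfoNCE-style contrastive loss of the kind style-generalization methods commonly build on; batch size, embedding dimension, and temperature are arbitrary:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Generic InfoNCE contrastive loss: paired embeddings (z1[i], z2[i])
    (e.g., two style-augmented views of one mammogram) are positives;
    all other pairs in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))
```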
- Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications [47.501121601856795]
Multimodality representation learning is a technique for embedding information from different modalities together with their correlations.
Cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task.
This survey presents the literature on the evolution and enhancement of deep learning multimodal architectures.
arXiv Detail & Related papers (2023-02-01T11:48:34Z)
- Multi-Task Learning for Visual Scene Understanding [7.191593674138455]
This thesis is concerned with multi-task learning in the context of computer vision.
We propose several methods that tackle important aspects of multi-task learning.
The results show several advances in the state-of-the-art of multi-task learning.
arXiv Detail & Related papers (2022-03-28T16:57:58Z)
- Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction [125.18248926508045]
We propose Channel-Exchanging-Network (CEN) which is self-adaptive, parameter-free, and more importantly, applicable for both multimodal fusion and multitask learning.
CEN dynamically exchanges channels between sub-networks of different modalities (a toy sketch of the exchange step follows this entry).
For the application of dense image prediction, the validity of CEN is tested in four different scenarios.
arXiv Detail & Related papers (2021-12-04T05:47:54Z)
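A toy rendering of the exchange step mentioned above: CEN uses batch-norm scaling factors as a channel-importance signal, replacing a modality's "unimportant" channels with the other modality's channels at the same positions. Shapes and the threshold here are assumptions for illustration, not CEN's actual implementation:

```python
import torch

def exchange_channels(feat_a, feat_b, gamma_a, gamma_b, threshold=0.02):
    """Toy channel exchange: channels whose BN scaling factor falls below
    a threshold are overwritten by the other modality's channels.
    feat_*: (B, C, H, W) feature maps; gamma_*: (C,) BN scaling factors."""
    out_a, out_b = feat_a.clone(), feat_b.clone()
    mask_a = gamma_a.abs() < threshold   # "unimportant" channels in modality A
    mask_b = gamma_b.abs() < threshold
    out_a[:, mask_a] = feat_b[:, mask_a]
    out_b[:, mask_b] = feat_a[:, mask_b]
    return out_a, out_b

a, b = torch.randn(2, 16, 8, 8), torch.randn(2, 16, 8, 8)
out_a, out_b = exchange_channels(a, b, torch.rand(16) * 0.05, torch.rand(16) * 0.05)
```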
- Recent Advances and Trends in Multimodal Deep Learning: A Review [9.11022096530605]
Multimodal deep learning aims to create models that can process and link information using various modalities.
This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals.
A fine-grained taxonomy of various multimodal deep learning applications is proposed, elaborating on different applications in more depth.
arXiv Detail & Related papers (2021-05-24T04:20:45Z)
- Deep Multimodal Neural Architecture Search [178.35131768344246]
We devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks.
Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone.
On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks (sketched after this entry).
arXiv Detail & Related papers (2020-04-25T07:00:32Z)
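The backbone-plus-heads pattern MMnas describes can be sketched as follows; the backbone here is a trivial stand-in for the searched encoder-decoder, and the task names and output sizes are made up:

```python
import torch
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    """Trivial stand-in for a searched encoder-decoder backbone."""
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x):
        return self.decoder(self.encoder(x))

class MultiTaskModel(nn.Module):
    """Shared backbone with task-specific heads attached on top."""
    def __init__(self, dim=256, tasks=None):
        super().__init__()
        tasks = tasks or {"vqa": 1000, "matching": 2}  # hypothetical tasks
        self.backbone = UnifiedBackbone(dim)
        self.heads = nn.ModuleDict({t: nn.Linear(dim, n) for t, n in tasks.items()})

    def forward(self, x, task):
        return self.heads[task](self.backbone(x))

model = MultiTaskModel()
out = model(torch.randn(2, 256), task="vqa")  # shape (2, 1000)
```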
This list is automatically generated from the titles and abstracts of the papers on this site.