Recent Advances and Trends in Multimodal Deep Learning: A Review
- URL: http://arxiv.org/abs/2105.11087v1
- Date: Mon, 24 May 2021 04:20:45 GMT
- Title: Recent Advances and Trends in Multimodal Deep Learning: A Review
- Authors: Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Songyuan Li and Jabbar Abdul
- Abstract summary: Multimodal deep learning aims to create models that can process and link information using various modalities.
This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals.
A fine-grained taxonomy of various multimodal deep learning applications is proposed, elaborating on different applications in more depth.
- Score: 9.11022096530605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning has been used in a wide range of applications and has become
increasingly popular in recent years. The goal of multimodal deep learning is
to create models that can process and link information using various
modalities. Despite the extensive development made for unimodal learning, it
still cannot cover all the aspects of human learning. Multimodal learning supports
better understanding and analysis when various senses are engaged in the
processing of information. This paper focuses on multiple types of modalities,
i.e., image, video, text, audio, body gestures, facial expressions, and
physiological signals. Detailed analysis of past and current baseline
approaches and an in-depth study of recent advancements in multimodal deep
learning applications are provided. A fine-grained taxonomy of various
multimodal deep learning applications is proposed, elaborating on different
applications in more depth. Architectures and datasets used in these
applications are also discussed, along with their evaluation metrics. Finally,
the main issues are highlighted separately for each domain, along with
possible future research directions.
Related papers
- Deep Learning-based Depth Estimation Methods from Monocular Image and Videos: A Comprehensive Survey [31.414360704020254]
Estimating depth from single RGB images and videos is of widespread interest due to its applications in many areas.
More than 500 deep learning-based papers have been published in the past 10 years.
This survey provides a taxonomy for classifying current work based on input and output modalities, network architectures, and learning methods.
arXiv Detail & Related papers (2024-06-28T06:25:21Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications [47.501121601856795]
Multimodality Representation Learning is a technique of learning to embed information from different modalities and their correlations.
Cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task.
This survey presents the literature on the evolution and enhancement of deep learning multimodal architectures.
arXiv Detail & Related papers (2023-02-01T11:48:34Z)
- Knowledge-augmented Deep Learning and Its Applications: A Survey [60.221292040710885]
Knowledge-augmented deep learning (KADL) aims to identify domain knowledge and integrate it into deep models for data-efficient, generalizable, and interpretable deep learning.
This survey subsumes existing works and offers a bird's-eye view of research in the general area of knowledge-augmented deep learning.
arXiv Detail & Related papers (2022-11-30T03:44:15Z)
- Vision+X: A Survey on Multimodal Learning in the Light of Data [64.03266872103835]
Multimodal machine learning that incorporates data from various sources has become an increasingly popular research area.
We analyze the commonness and uniqueness of each data format mainly ranging from vision, audio, text, and motions.
We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels.
arXiv Detail & Related papers (2022-10-05T13:14:57Z)
- A Review on Methods and Applications in Multimodal Deep Learning [8.152125331009389]
Multimodal deep learning supports better understanding and analysis when various senses are engaged in the processing of information.
This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals.
A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth.
arXiv Detail & Related papers (2022-02-18T13:50:44Z)
- Deep Long-Tailed Learning: A Survey [163.16874896812885]
Deep long-tailed learning aims to train well-performing deep models from a large number of images that follow a long-tailed class distribution.
Long-tailed class imbalance is a common problem in practical visual recognition tasks.
This paper provides a comprehensive survey on recent advances in deep long-tailed learning.
arXiv Detail & Related papers (2021-10-09T15:25:22Z)
- A Comprehensive Survey on Community Detection with Deep Learning [93.40332347374712]
A community reveals the features and connections of its members that are different from those in other communities in a network.
This survey devises and proposes a new taxonomy covering different categories of the state-of-the-art methods.
The main category, i.e., deep neural networks, is further divided into convolutional networks, graph attention networks, generative adversarial networks and autoencoders.
arXiv Detail & Related papers (2021-05-26T14:37:07Z)
- A Review on Explainability in Multimodal Deep Neural Nets [2.3204178451683264]
Multimodal AI techniques have achieved much success in several application domains.
Despite their outstanding performance, the complex, opaque and black-box nature of the deep neural nets limits their social acceptance and usability.
This paper extensively reviews the existing literature to present a comprehensive survey and commentary on explainability in multimodal deep neural nets.
arXiv Detail & Related papers (2021-05-17T14:17:49Z)
- Discussion of Ensemble Learning under the Era of Deep Learning [4.061135251278187]
Ensemble deep learning has performed well in improving the generalization of learning systems.
The time and space overheads of training multiple base deep learners and testing with the ensemble deep learner are far greater than those of traditional ensemble learning.
An urgent problem is how to retain the significant advantages of ensemble deep learning while reducing the required time and space overheads.
arXiv Detail & Related papers (2021-01-21T01:33:23Z)