A Review on Methods and Applications in Multimodal Deep Learning
- URL: http://arxiv.org/abs/2202.09195v1
- Date: Fri, 18 Feb 2022 13:50:44 GMT
- Title: A Review on Methods and Applications in Multimodal Deep Learning
- Authors: Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Jabbar Abdul
- Abstract summary: Multimodal deep learning enables better understanding and analysis when multiple senses are engaged in processing information.
This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals.
A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth.
- Score: 8.152125331009389
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning has been applied across a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information from multiple modalities. Despite the extensive development of unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning enables better understanding and analysis when multiple senses are engaged in processing information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. A detailed analysis of the baseline approaches and an in-depth study of recent advancements in multimodal deep learning applications over the last five years (2017 to 2021) are provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Lastly, the main issues of each domain are highlighted separately, along with their possible future research directions.
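To make the goal concrete, the sketch below shows one common way such models "process and link" modalities: a late-fusion classifier that encodes image and text features separately and merges them before prediction. This is a minimal PyTorch illustration, not a method from the paper; all dimensions, names, and the fusion choice are assumptions.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy late-fusion model: each modality is encoded separately,
    then the embeddings are concatenated and classified jointly."""
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, num_classes=10):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)  # e.g., pooled CNN features
        self.txt_proj = nn.Linear(txt_dim, hidden)  # e.g., pooled text-encoder features
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, num_classes))

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.head(fused)

model = LateFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))  # batch of 4 image-text pairs
print(logits.shape)  # torch.Size([4, 10])
```

Late fusion is only one point in the design space the paper's taxonomy covers; early and intermediate (mid) fusion move the merge closer to the raw inputs.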
Related papers
- Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review [3.0712840129998513]
This literature review proposes a taxonomy and framework that encapsulate recent methodological advances in this field.
We introduce a novel data fusion category -- mid fusion -- and a graph-based technique for refining literature reviews, termed citation graph pruning (a minimal mid-fusion sketch follows this entry).
There remains a need for further research to bridge the divide between multimodal learning and training studies and foundational AI research.
arXiv Detail & Related papers (2024-08-22T22:42:23Z)
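The mid-fusion category flagged in the entry above merges modality streams at intermediate layers, sitting between early fusion (merging raw inputs) and late fusion (merging predictions). A minimal sketch, with all dimensions and layer choices assumed purely for illustration:

```python
import torch
import torch.nn as nn

class MidFusionNet(nn.Module):
    """Toy mid-fusion model: each modality passes through its own early
    layers, then features are merged partway through and processed jointly."""
    def __init__(self, a_dim=128, b_dim=64, hidden=256, num_classes=5):
        super().__init__()
        self.a_early = nn.Sequential(nn.Linear(a_dim, hidden), nn.ReLU())
        self.b_early = nn.Sequential(nn.Linear(b_dim, hidden), nn.ReLU())
        # Joint layers operate on the fused intermediate representation.
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, a, b):
        mid = torch.cat([self.a_early(a), self.b_early(b)], dim=-1)
        return self.joint(mid)
```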
- LLMs Meet Multimodal Generation and Editing: A Survey [89.76691959033323]
This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio.
We summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods.
We dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction.
arXiv Detail & Related papers (2024-05-29T17:59:20Z)
- Learning on Multimodal Graphs: A Survey [6.362513821299131]
Multimodal data pervades various domains, including healthcare, social media, and transportation.
Multimodal graph learning (MGL) is essential for successful artificial intelligence (AI) applications.
arXiv Detail & Related papers (2024-02-07T23:50:00Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- Domain Generalization for Mammographic Image Analysis with Contrastive Learning [62.25104935889111]
Training an effective deep learning model requires large amounts of data with diverse styles and qualities.
A novel contrastive learning scheme is developed to equip deep learning models with better style generalization capability (a generic contrastive-loss sketch follows this entry).
The proposed method has been evaluated extensively and rigorously with mammograms from various vendor style domains and several public datasets.
arXiv Detail & Related papers (2023-04-20T11:40:21Z)
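The summary above does not specify the paper's loss, so the following is only a generic InfoNCE-style contrastive loss of the kind style-generalization methods commonly build on; batch size, embedding dimension, and temperature are arbitrary:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Generic InfoNCE contrastive loss: paired embeddings (z1[i], z2[i])
    (e.g., two style-augmented views of one mammogram) are positives;
    all other pairs in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))
```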
- Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications [47.501121601856795]
Multimodality representation learning is a technique for embedding information from different modalities together with their correlations.
Cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task.
This survey presents the literature on the evolution and enhancement of deep learning multimodal architectures.
arXiv Detail & Related papers (2023-02-01T11:48:34Z)
- Multi-Task Learning for Visual Scene Understanding [7.191593674138455]
This thesis is concerned with multi-task learning in the context of computer vision.
We propose several methods that tackle important aspects of multi-task learning.
The results show several advances in the state-of-the-art of multi-task learning.
arXiv Detail & Related papers (2022-03-28T16:57:58Z)
- Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction [125.18248926508045]
We propose Channel-Exchanging-Network (CEN) which is self-adaptive, parameter-free, and more importantly, applicable for both multimodal fusion and multitask learning.
CEN dynamically exchanges channels between sub-networks of different modalities (a toy sketch of the exchange step follows this entry).
For the application of dense image prediction, the validity of CEN is tested in four different scenarios.
arXiv Detail & Related papers (2021-12-04T05:47:54Z)
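A toy rendering of the exchange step mentioned above: CEN uses batch-norm scaling factors as a channel-importance signal, replacing a modality's "unimportant" channels with the other modality's channels at the same positions. Shapes and the threshold here are assumptions for illustration, not CEN's actual implementation:

```python
import torch

def exchange_channels(feat_a, feat_b, gamma_a, gamma_b, threshold=0.02):
    """Toy channel exchange: channels whose BN scaling factor falls below
    a threshold are overwritten by the other modality's channels.
    feat_*: (B, C, H, W) feature maps; gamma_*: (C,) BN scaling factors."""
    out_a, out_b = feat_a.clone(), feat_b.clone()
    mask_a = gamma_a.abs() < threshold   # "unimportant" channels in modality A
    mask_b = gamma_b.abs() < threshold
    out_a[:, mask_a] = feat_b[:, mask_a]
    out_b[:, mask_b] = feat_a[:, mask_b]
    return out_a, out_b

a, b = torch.randn(2, 16, 8, 8), torch.randn(2, 16, 8, 8)
out_a, out_b = exchange_channels(a, b, torch.rand(16) * 0.05, torch.rand(16) * 0.05)
```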
- Recent Advances and Trends in Multimodal Deep Learning: A Review [9.11022096530605]
Multimodal deep learning aims to create models that can process and link information using various modalities.
This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals.
A fine-grained taxonomy of various multimodal deep learning applications is proposed, elaborating on different applications in more depth.
arXiv Detail & Related papers (2021-05-24T04:20:45Z)
- Deep Multimodal Neural Architecture Search [178.35131768344246]
We devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks.
Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone.
On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks (sketched after this entry).
arXiv Detail & Related papers (2020-04-25T07:00:32Z)
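The backbone-plus-heads pattern MMnas describes can be sketched as follows; the backbone here is a trivial stand-in for the searched encoder-decoder, and the task names and output sizes are made up:

```python
import torch
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    """Trivial stand-in for a searched encoder-decoder backbone."""
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x):
        return self.decoder(self.encoder(x))

class MultiTaskModel(nn.Module):
    """Shared backbone with task-specific heads attached on top."""
    def __init__(self, dim=256, tasks=None):
        super().__init__()
        tasks = tasks or {"vqa": 1000, "matching": 2}  # hypothetical tasks
        self.backbone = UnifiedBackbone(dim)
        self.heads = nn.ModuleDict({t: nn.Linear(dim, n) for t, n in tasks.items()})

    def forward(self, x, task):
        return self.heads[task](self.backbone(x))

model = MultiTaskModel()
out = model(torch.randn(2, 256), task="vqa")  # shape (2, 1000)
```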
This list is automatically generated from the titles and abstracts of the papers on this site.