Related papers: Deep Learning Approaches for Multimodal Intent Recognition: A Survey

URL: http://arxiv.org/abs/2507.22934v1
Date: Thu, 24 Jul 2025 17:12:01 GMT
Title: Deep Learning Approaches for Multimodal Intent Recognition: A Survey
Authors: Jingwei Zhao, Yuhua Wen, Qifei Li, Minchi Hu, Yingying Zhou, Jingyao Xue, Junyang Wu, Yingming Gao, Zhengqi Wen, Jianhua Tao, Ya Li
Abstract summary: Intent recognition aims to identify users' underlying intentions, traditionally focusing on text in natural language processing. With growing demands for natural human-computer interaction, the field has evolved through deep learning and multimodal approaches, incorporating data from audio, vision, and physiological signals. This article surveys deep learning methods for intent recognition, covering the shift from unimodal to multimodal techniques, relevant datasets, methodologies, applications, and current challenges.
Score: 37.39741906112862
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Intent recognition aims to identify users' underlying intentions, traditionally focusing on text in natural language processing. With growing demands for natural human-computer interaction, the field has evolved through deep learning and multimodal approaches, incorporating data from audio, vision, and physiological signals. Recently, the introduction of Transformer-based models has led to notable breakthroughs in this domain. This article surveys deep learning methods for intent recognition, covering the shift from unimodal to multimodal techniques, relevant datasets, methodologies, applications, and current challenges. It provides researchers with insights into the latest developments in multimodal intent recognition (MIR) and directions for future research.

Related papers

RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks. Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs. In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications. Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders. We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection [72.36017150922504]
We propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer to a student detector. The diverse multi-modal masked language modeling is realized by an object divergence constraint upon traditional multi-modal masked language modeling (MLM).
arXiv Detail & Related papers (2023-08-30T08:33:13Z)
Reinforcement Learning Based Multi-modal Feature Fusion Network for Novel Class Discovery [47.28191501836041]
In this paper, we employ a Reinforcement Learning framework to simulate the cognitive processes of humans. We also deploy a Member-to-Leader Multi-Agent framework to extract and fuse features from multi-modal information. We demonstrate the performance of our approach in both the 3D and 2D domains by employing the OS-MN40, OS-MN40-Miss, and Cifar10 datasets.
arXiv Detail & Related papers (2023-08-26T07:55:32Z)
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications [47.501121601856795]
Multimodality Representation Learning is a technique of learning to embed information from different modalities and their correlations. Cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task. This survey presents the literature on the evolution and enhancement of deep learning multimodal architectures.
arXiv Detail & Related papers (2023-02-01T11:48:34Z)
Vision+X: A Survey on Multimodal Learning in the Light of Data [64.03266872103835]
Multimodal machine learning that incorporates data from various sources has become an increasingly popular research area. We analyze the commonness and uniqueness of each data format mainly ranging from vision, audio, text, and motions. We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels.
arXiv Detail & Related papers (2022-10-05T13:14:57Z)
A Review on Methods and Applications in Multimodal Deep Learning [8.152125331009389]
Multimodal deep learning helps to understand and analyze better when various senses are engaged in the processing of information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth.
arXiv Detail & Related papers (2022-02-18T13:50:44Z)
Recent Advances and Trends in Multimodal Deep Learning: A Review [9.11022096530605]
Multimodal deep learning aims to create models that can process and link information using various modalities. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. A fine-grained taxonomy of various multimodal deep learning applications is proposed, elaborating on different applications in more depth.
arXiv Detail & Related papers (2021-05-24T04:20:45Z)
A Review on Explainability in Multimodal Deep Neural Nets [2.3204178451683264]
Multimodal AI techniques have achieved much success in several application domains. Despite their outstanding performance, the complex, opaque and black-box nature of the deep neural nets limits their social acceptance and usability. This paper extensively reviews the present literature to present a comprehensive survey and commentary on the explainability in multimodal deep neural nets.
arXiv Detail & Related papers (2021-05-17T14:17:49Z)
Deep Learning for Sensor-based Human Activity Recognition: Overview, Challenges and Opportunities [52.59080024266596]
We present a survey of the state-of-the-art deep learning methods for sensor-based human activity recognition. We first introduce the multi-modality of the sensory data and provide information for public datasets. We then propose a new taxonomy to structure the deep methods by challenges.
arXiv Detail & Related papers (2020-01-21T09:55:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.