Emotion-Driven Personalized Recommendation for AI-Generated Content Using Multi-Modal Sentiment and Intent Analysis
- URL: http://arxiv.org/abs/2512.10963v1
- Date: Tue, 25 Nov 2025 17:52:22 GMT
- Title: Emotion-Driven Personalized Recommendation for AI-Generated Content Using Multi-Modal Sentiment and Intent Analysis
- Authors: Zheqi Hu, Xuanjing Chen, Jinlin Hu
- Abstract summary: This study proposes a Multi-Modal Emotion and Intent Recognition Model (MMEI) based on a BERT-based Cross-Modal Transformer with Attention-Based Fusion. MMEI processes visual (facial expression), auditory (speech tone), and textual (comments or utterances) modalities through the pretrained encoders ViT, Wav2Vec2, and BERT. The MMEI model achieves a 4.3% improvement in F1-score and a 12.3% reduction in cross-entropy loss compared to the best fusion-based transformer baseline.
- Score: 1.111322427650556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid growth of AI-generated content (AIGC) across domains such as music, video, and literature, the demand for emotionally aware recommendation systems has become increasingly important. Traditional recommender systems primarily rely on user behavioral data such as clicks, views, or ratings, while neglecting users' real-time emotional and intentional states during content interaction. To address this limitation, this study proposes a Multi-Modal Emotion and Intent Recognition Model (MMEI) based on a BERT-based Cross-Modal Transformer with Attention-Based Fusion, integrated into a cloud-native personalized AIGC recommendation framework. The proposed system jointly processes visual (facial expression), auditory (speech tone), and textual (comments or utterances) modalities through pretrained encoders ViT, Wav2Vec2, and BERT, followed by an attention-based fusion module to learn emotion-intent representations. These embeddings are then used to drive personalized content recommendations through a contextual matching layer. Experiments conducted on benchmark emotion datasets (AIGC-INT, MELD, and CMU-MOSEI) and an AIGC interaction dataset demonstrate that the proposed MMEI model achieves a 4.3% improvement in F1-score and a 12.3% reduction in cross-entropy loss compared to the best fusion-based transformer baseline. Furthermore, user-level online evaluations reveal that emotion-driven recommendations increase engagement time by 15.2% and enhance satisfaction scores by 11.8%, confirming the model's effectiveness in aligning AI-generated content with users' affective and intentional states. This work highlights the potential of cross-modal emotional intelligence for next-generation AIGC ecosystems, enabling adaptive, empathetic, and context-aware recommendation experiences.
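The abstract specifies the three encoders (ViT, Wav2Vec2, BERT) and an attention-based fusion module, but not the exact architecture. The following is a minimal PyTorch sketch of that style of pipeline; the layer sizes, the single `nn.MultiheadAttention` fusion block, and the two classification heads are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an MMEI-style multimodal fusion model (illustrative only).
# Encoder choices follow the abstract (ViT, Wav2Vec2, BERT); the fusion block
# and heads are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
from transformers import ViTModel, Wav2Vec2Model, BertModel

class MMEISketch(nn.Module):
    def __init__(self, d_model=768, num_emotions=7, num_intents=10):
        super().__init__()
        self.vision = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.audio = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.text = BertModel.from_pretrained("bert-base-uncased")
        # Attention-based fusion: let each modality token attend to the others.
        self.fusion = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.emotion_head = nn.Linear(d_model, num_emotions)
        self.intent_head = nn.Linear(d_model, num_intents)

    def forward(self, pixel_values, input_values, input_ids, attention_mask):
        # One pooled vector per modality (mean over the sequence dimension).
        v = self.vision(pixel_values=pixel_values).last_hidden_state.mean(1)
        a = self.audio(input_values=input_values).last_hidden_state.mean(1)
        t = self.text(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state.mean(1)
        tokens = torch.stack([v, a, t], dim=1)          # (B, 3, d_model)
        fused, _ = self.fusion(tokens, tokens, tokens)  # attention over modalities
        z = fused.mean(1)                               # joint emotion-intent embedding
        return self.emotion_head(z), self.intent_head(z), z
```

The joint embedding `z` is what would feed the contextual matching layer, for example by dot-product scoring against candidate AIGC item embeddings.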
Related papers
- A Cloud-Based Cross-Modal Transformer for Emotion Recognition and Adaptive Human-Computer Interaction [4.6927139685668315]
A Cloud-Based Cross-Modal Transformer (CMT) framework for multimodal emotion recognition and adaptive human-computer interaction.
The model integrates visual, auditory, and textual signals using pretrained encoders.
The system enables scalable, low-latency emotion recognition for large-scale user interactions.
arXiv Detail & Related papers (2025-11-21T17:29:16Z)
- Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition [54.44798086835314]
Speech emotion recognition (SER) plays a critical role in building emotion-aware speech systems, but its performance degrades significantly under noisy conditions.
We propose the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), a flexible MTL framework that applies frame-wise expert routing over self-supervised speech representations.
Experiments on the MSP-Podcast corpus show that Sparse MERIT consistently outperforms baseline models on both SER and SE tasks.
arXiv Detail & Related papers (2025-09-10T10:18:56Z)
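The frame-wise expert routing that Sparse MERIT applies over self-supervised speech representations can be illustrated with a generic top-k gated mixture-of-experts layer. The sketch below is one plausible reading of that mechanism; the expert count, top-k value, and expert architecture are assumptions, and this is not the authors' code.

```python
# Generic frame-wise sparse mixture-of-experts layer: each frame is routed
# to its top-k experts. Illustrative sketch, not the Sparse MERIT code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FramewiseSparseMoE(nn.Module):
    def __init__(self, d_model=768, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                           nn.Linear(d_model, d_model)) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, frames):                      # frames: (B, T, d_model)
        logits = self.router(frames)                # per-frame routing scores (B, T, E)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)        # renormalize over selected experts
        out = torch.zeros_like(frames)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)           # frames routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(frames[mask])
        return out
```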
- Leveraging Label Potential for Enhanced Multimodal Emotion Recognition [6.725011823614421]
Multimodal emotion recognition (MER) seeks to integrate various modalities to predict emotional states accurately.
We introduce a novel model called Label Signal-Guided Multimodal Emotion Recognition (LSGMER) to overcome this limitation.
arXiv Detail & Related papers (2025-04-07T15:00:34Z)
- On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems, which autonomously plan and act, are becoming widespread, yet their success rate on complex tasks remains low.
Inference-time alignment relies on three components: sampling, evaluation, and feedback.
We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly folds feedback extracted from different forms of critiques back into the decoding process.
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
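The three components named for inference-time alignment (sampling, evaluation, feedback) suggest a simple iterative loop. The sketch below is only a schematic reading of IAD, not the paper's algorithm; `generate`, `evaluate`, and `critique` are hypothetical callables standing in for an agent, a verifier, and a critic model.

```python
# Schematic inference-time loop in the spirit of Iterative Agent Decoding:
# sample a candidate, evaluate it, and fold the critique back into the prompt.
# The three callables are hypothetical stand-ins, not the paper's components.
from typing import Callable

def iterative_decode(task: str,
                     generate: Callable[[str], str],
                     evaluate: Callable[[str, str], float],
                     critique: Callable[[str, str], str],
                     max_rounds: int = 4,
                     threshold: float = 0.9) -> str:
    prompt, best, best_score = task, "", float("-inf")
    for _ in range(max_rounds):
        candidate = generate(prompt)              # sampling
        score = evaluate(task, candidate)         # evaluation
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:
            break
        feedback = critique(task, candidate)      # feedback
        prompt = (f"{task}\n\nPrevious attempt:\n{candidate}\n"
                  f"Critique:\n{feedback}\nRevise accordingly.")
    return best
```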
- Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations [1.0690007351232649]
Multimodal approaches benefit from the fusion of diverse modalities, thereby improving recognition accuracy.
The proposed Qieemo framework effectively utilizes a pretrained automatic speech recognition (ASR) model, which contains naturally frame-aligned textual and emotional features.
Experimental results on the IEMOCAP dataset demonstrate that Qieemo outperforms benchmark unimodal, multimodal, and self-supervised models with absolute improvements of 3.0%, 1.2%, and 1.9%, respectively.
arXiv Detail & Related papers (2025-03-05T07:02:30Z)
- Towards Empathetic Conversational Recommender Systems [77.53167131692]
We propose an empathetic conversational recommender (ECR) framework.
ECR contains two main modules: emotion-aware item recommendation and emotion-aligned response generation.
Our experiments on the ReDial dataset validate the efficacy of our framework in enhancing recommendation accuracy and improving user satisfaction.
arXiv Detail & Related papers (2024-08-30T15:43:07Z)
- Exploring the Escalation of Source Bias in User, Data, and Recommender System Feedback Loop [65.23044868332693]
We explore how AI-generated content (AIGC) affects the performance and dynamics of recommender systems.
In the short term, bias toward AIGC encourages LLM-based content creation, increasing the share of AIGC and causing unfair traffic distribution.
We propose a debiasing method based on L1-loss optimization to maintain long-term content ecosystem balance.
arXiv Detail & Related papers (2024-05-28T09:34:50Z)
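The summary names only "a debiasing method based on L1-loss optimization." One plausible instantiation, sketched below, adds an L1 penalty on the gap between the mean scores a ranker assigns to AI-generated versus human-written items; the paper's actual objective may differ.

```python
# Illustrative L1-based debiasing term for a recommender's training loss:
# penalize the absolute gap between mean predicted scores for AI-generated
# and human-written items. One plausible reading of "L1-loss optimization";
# the paper's actual objective may differ.
import torch
import torch.nn.functional as F

def debiased_loss(scores: torch.Tensor,    # raw relevance logits, shape (B,)
                  labels: torch.Tensor,    # click/relevance labels in {0, 1}, float
                  is_aigc: torch.Tensor,   # bool mask: True for AI-generated items
                  lam: float = 0.1) -> torch.Tensor:
    assert is_aigc.any() and (~is_aigc).any(), "batch must mix both sources"
    base = F.binary_cross_entropy_with_logits(scores, labels)
    gap = scores[is_aigc].mean() - scores[~is_aigc].mean()
    return base + lam * gap.abs()          # L1 penalty on the source-bias gap
```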
- Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations [15.705757672984662]
Multimodal Emotion Recognition in Conversations (MERC) is a significant development direction for machine intelligence.
MERC data naturally exhibit an imbalanced distribution of emotion categories, yet prior work largely ignores the negative impact of imbalanced data on emotion recognition.
We propose the Class Boundary Enhanced Representation Learning (CBERL) model to address the imbalanced distribution of emotion categories in raw data.
Extensive experiments on the IEMOCAP and MELD benchmark datasets show that CBERL achieves clear improvements in emotion recognition.
arXiv Detail & Related papers (2023-12-11T12:35:17Z)
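CBERL's class-boundary machinery is not detailed in this summary. The standard baseline remedy for the imbalance it targets, inverse-frequency reweighting of the cross-entropy loss, can be sketched as follows; this is the generic technique, not CBERL itself.

```python
# Generic inverse-frequency class weighting for imbalanced emotion categories.
# This is the standard baseline remedy, not the CBERL model itself.
import torch
import torch.nn as nn

def make_weighted_ce(train_labels: torch.Tensor, num_classes: int) -> nn.CrossEntropyLoss:
    counts = torch.bincount(train_labels, minlength=num_classes).float()
    # Rare classes get proportionally larger weights; clamp avoids divide-by-zero.
    weights = counts.sum() / (num_classes * counts.clamp(min=1))
    return nn.CrossEntropyLoss(weight=weights)
```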
- Guiding AI-Generated Digital Content with Wireless Perception [69.51950037942518]
We introduce an integration of wireless perception with AI-generated content (AIGC) to improve the quality of digital content production.
The framework employs a novel multi-scale perception technology to read the user's posture, which is difficult to describe accurately in words, and transmits it to the AIGC model as skeleton images.
Because the user's posture enters the production process as a constraint on the AIGC model, the generated content is better aligned with the user's requirements.
arXiv Detail & Related papers (2023-03-26T04:39:03Z)
- EmotionIC: Emotional Inertia and Contagion-Driven Dependency Modeling for Emotion Recognition in Conversation [34.24557248359872]
We propose an emotional inertia and contagion-driven dependency modeling approach (EmotionIC) for the ERC task.
EmotionIC consists of three main components: Identity Masked Multi-Head Attention (IMMHA), Dialogue-based Gated Recurrent Unit (DiaGRU), and Skip-chain Conditional Random Field (SkipCRF).
Experimental results show that our method can significantly outperform the state-of-the-art models on four benchmark datasets.
arXiv Detail & Related papers (2023-03-20T13:58:35Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
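Late fusion of separately trained speech and text models typically means combining utterance-level representations (or logits) with a small joint classifier. The sketch below shows the feature-concatenation variant; the embedding dimensions and head architecture are assumptions, not the paper's configuration.

```python
# Minimal late-fusion head: concatenate utterance-level embeddings produced by
# separately fine-tuned speech and text models, then classify jointly.
# `speech_emb` / `text_emb` are assumed to be precomputed pooled features.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, d_speech=512, d_text=768, num_emotions=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_speech + d_text, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, num_emotions))

    def forward(self, speech_emb, text_emb):
        return self.head(torch.cat([speech_emb, text_emb], dim=-1))
```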
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
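Reformulating emotion classification as masked text prediction usually means appending a prompt such as "I am [MASK]." and reading the masked-LM distribution over emotion label words. The snippet below shows that generic recipe with plain bert-base-uncased and assumed single-token verbalizers; MEmoBERT itself is multimodal and pre-trained differently, so this covers only the text side.

```python
# Generic prompt-based emotion classification as masked text prediction.
# Uses plain bert-base-uncased with assumed single-token label words;
# this is the generic recipe, not MEmoBERT's multimodal model.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()
verbalizers = ["happy", "sad", "angry", "neutral"]   # assumed label words
label_ids = tokenizer.convert_tokens_to_ids(verbalizers)

utterance = "I can't believe we actually won the finals!"
prompt = f"{utterance} I am [MASK]."
inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]     # vocab distribution at [MASK]
probs = logits[label_ids].softmax(-1)                # restrict to label words
print(dict(zip(verbalizers, probs.tolist())))
```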
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.