EmoFeedback$^2$: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback
- URL: http://arxiv.org/abs/2511.19982v2
- Date: Wed, 26 Nov 2025 03:44:53 GMT
- Authors: Jingyang Jia, Kai Shu, Gang Yang, Long Xing, Xun Chen, Aiping Liu,
- Abstract summary: We propose a novel generation-understanding-feedback reinforcement paradigm (EmoFeedback$^2$) for Continuous Emotional Image Generation (C-EICG). We introduce an emotion-aware reward feedback strategy, where the LVLM evaluates the emotional values of generated images and computes the reward against target emotions. Our approach effectively generates high-quality images with the desired emotions, outperforming existing state-of-the-art methods on our custom dataset.
- Score: 35.44748809967547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continuous emotional image generation (C-EICG) is emerging rapidly due to its ability to produce images aligned with both user descriptions and continuous emotional values. However, existing approaches lack emotional feedback from generated images, limiting control over emotional continuity. Additionally, their simple alignment between emotions and naively generated texts fails to adaptively adjust emotional prompts according to image content, leading to insufficient emotional fidelity. To address these concerns, we propose a novel generation-understanding-feedback reinforcement paradigm (EmoFeedback$^2$) for C-EICG, which exploits the reasoning capability of the fine-tuned large vision-language model (LVLM) to provide reward and textual feedback for generating high-quality images with continuous emotions. Specifically, we introduce an emotion-aware reward feedback strategy, where the LVLM evaluates the emotional values of generated images and computes the reward against target emotions, guiding the reinforcement fine-tuning of the generative model and enhancing the emotional continuity of images. Furthermore, we design a self-promotion textual feedback framework, in which the LVLM iteratively analyzes the emotional content of generated images and adaptively produces refinement suggestions for the next-round prompt, improving the emotional fidelity with fine-grained content. Extensive experimental results demonstrate that our approach effectively generates high-quality images with the desired emotions, outperforming existing state-of-the-art methods on our custom dataset. The code and dataset will be released soon.
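The two mechanisms in the abstract, an LVLM-computed emotional reward and an iterative textual-feedback loop, can be pictured with a minimal sketch. Everything below is an assumption-laden illustration, not the authors' released code: `generate`, `predict_va`, and `refine` stand in for the diffusion generator, the fine-tuned LVLM evaluator, and the LVLM prompt critic, and the reinforcement fine-tuning that the reward actually drives is omitted.

```python
# Minimal sketch of the generation-understanding-feedback loop; all helper
# names are hypothetical stand-ins (see lead-in), and no model is loaded here.
import math
from typing import Callable, Tuple

VA = Tuple[float, float]  # (valence, arousal), each roughly in [-1, 1]

def emotion_reward(pred: VA, target: VA) -> float:
    """Negative Euclidean distance between the LVLM-predicted and the target
    valence-arousal pair: the closer the emotion, the higher the reward."""
    return -math.dist(pred, target)

def feedback_loop(
    prompt: str,
    target_va: VA,
    generate: Callable[[str], object],        # diffusion generator (assumed)
    predict_va: Callable[[object], VA],       # fine-tuned LVLM evaluator (assumed)
    refine: Callable[[str, object, VA], str], # LVLM prompt critic (assumed)
    rounds: int = 3,
):
    best_reward, best_image = -math.inf, None
    for _ in range(rounds):
        image = generate(prompt)                               # generation
        reward = emotion_reward(predict_va(image), target_va)  # understanding + reward
        if reward > best_reward:
            best_reward, best_image = reward, image
        # Self-promotion textual feedback: critique the image and adaptively
        # rewrite the prompt for the next round.
        prompt = refine(prompt, image, target_va)
    return best_image, best_reward

# Dummy wiring so the sketch runs without any models:
image, reward = feedback_loop(
    "a quiet beach at dusk",
    target_va=(0.6, -0.2),
    generate=lambda p: p,                     # the "image" is just the prompt here
    predict_va=lambda im: (0.5, -0.1),
    refine=lambda p, im, t: p + ", warmer light",
)
print(reward)
```

In the paper the reward signal is used to reinforcement-fine-tune the generative model itself; the loop above only selects the best sample, which is the simplest way to see how the reward and the self-promotion textual feedback interact.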
Related papers
- EmoCtrl: Controllable Emotional Image Content Generation [9.677863079897735]
We introduce Controllable Emotional Image Content Generation (C-EICG).
C-EICG aims to generate images that remain faithful to a given content description while expressing a target emotion.
EmoCtrl is supported by a dataset annotated with content, emotion, and affective prompts.
arXiv Detail & Related papers (2025-12-27T02:18:36Z)
- EmoCAST: Emotional Talking Portrait via Emotive Text Description [56.42674612728354]
EmoCAST is a diffusion-based framework for precise text-driven emotional synthesis.
In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module.
EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
arXiv Detail & Related papers (2025-08-28T10:02:06Z)
- CoEmoGen: Towards Semantically-Coherent and Scalable Emotional Image Content Generation [3.5418954219513625]
Emotional Image Content Generation (EICG) aims to generate semantically clear and emotionally faithful images based on given emotion categories.
We propose CoEmoGen, a novel pipeline notable for its semantic coherence and high scalability.
To intuitively showcase scalability, we curate EmoArt, a large-scale dataset of emotionally evocative artistic images.
arXiv Detail & Related papers (2025-08-05T15:04:34Z)
- UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries [61.5273479616832]
We propose a unified framework that seamlessly integrates emotional understanding and generation.
We show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks.
arXiv Detail & Related papers (2025-07-31T09:39:27Z)
- Affective Image Editing: Shaping Emotional Factors via Text Descriptions [46.13506671212571]
We introduce AIEdiT for Affective Image Editing using Text descriptions.
We build a continuous emotional spectrum and extract nuanced emotional requests.
AIEdiT achieves superior performance, effectively reflecting users' emotional requests.
arXiv Detail & Related papers (2025-05-24T13:46:57Z)
- Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation [63.94836524433559]
DICE-Talk is a framework that disentangles identity from emotion and cooperates emotions with similar characteristics.
First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention.
Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks.
Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process.
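The summary above names one concrete mechanism: an emotion embedder that fuses audio and visual cues via cross-modal attention. Below is a minimal sketch of such a block; the class name, feature shapes, and mean-pooling are illustrative assumptions, not DICE-Talk's implementation.

```python
# Illustrative cross-modal attention block: audio-derived queries attend over
# visual features to yield one emotion embedding per clip.
import torch
import torch.nn as nn

class CrossModalEmotionEmbedder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, Ta, D) queries; visual_feats: (B, Tv, D) keys/values.
        fused, _ = self.attn(audio_feats, visual_feats, visual_feats)
        return self.proj(fused.mean(dim=1))  # pool over time -> (B, D)

emb = CrossModalEmotionEmbedder()
print(emb(torch.randn(2, 50, 256), torch.randn(2, 16, 256)).shape)  # (2, 256)
```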
arXiv Detail & Related papers (2025-04-25T05:28:21Z)
- EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model [23.26111054485357]
We introduce the new task of continuous emotional image content generation (C-EICG).
We present EmotiCrafter, an emotional image generation model that generates images based on text prompts and Valence-Arousal values.
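Since EmotiCrafter conditions generation on continuous Valence-Arousal values, a hedged sketch of one plausible wiring follows: a small MLP (hypothetical, not the paper's architecture) embeds the (valence, arousal) pair and adds it to a CLIP-style prompt embedding.

```python
# A sketch of conditioning on continuous Valence-Arousal values; the MLP and
# additive fusion are illustrative assumptions, not EmotiCrafter's architecture.
import torch
import torch.nn as nn

class EmotionMLP(nn.Module):
    """Map a (valence, arousal) pair into the text-embedding dimension."""
    def __init__(self, text_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 128), nn.SiLU(), nn.Linear(128, text_dim))

    def forward(self, va: torch.Tensor) -> torch.Tensor:
        return self.net(va)  # (B, text_dim)

text_emb = torch.randn(1, 77, 768)               # stand-in for a CLIP prompt embedding
va = torch.tensor([[0.7, -0.3]])                 # pleasant (valence), calm (arousal)
cond = text_emb + EmotionMLP()(va).unsqueeze(1)  # broadcast the emotion over all tokens
print(cond.shape)  # torch.Size([1, 77, 768])
```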
arXiv Detail & Related papers (2025-01-10T04:41:37Z)
- Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data.
Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge.
We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z)
- EmoEdit: Evoking Emotions through Image Manipulation [62.416345095776656]
Affective Image Manipulation (AIM) seeks to modify user-provided images to evoke specific emotional responses.
We introduce EmoEdit, which extends AIM by incorporating content modifications to enhance emotional impact.
Our method is evaluated both qualitatively and quantitatively, demonstrating superior performance compared to existing state-of-the-art techniques.
arXiv Detail & Related papers (2024-05-21T10:18:45Z)
- ECR-Chain: Advancing Generative Language Models to Better Emotion-Cause Reasoners through Reasoning Chains [61.50113532215864]
Causal Emotion Entailment (CEE) aims to identify the causal utterances in a conversation that stimulate the emotions expressed in a target utterance.
Current works in CEE mainly focus on modeling semantic and emotional interactions in conversations.
We introduce a step-by-step reasoning method, Emotion-Cause Reasoning Chain (ECR-Chain), to infer the stimulus from the target emotional expressions in conversations.
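As a rough illustration of such a reasoning chain, the sketch below chains LLM calls through theme, reaction, appraisal, and stimulus steps; the stage wording and the `ask_llm` callable are assumptions, not the paper's released prompts.

```python
# Illustrative chain of LLM calls for emotion-cause reasoning.
from typing import Callable, List

STAGES = [
    "Theme: summarize what the conversation is about.",
    "Reaction: describe the target speaker's emotional reaction.",
    "Appraisal: explain how the speaker appraises the situation.",
    "Stimulus: identify the utterances that caused the emotion.",
]

def ecr_chain(conversation: str, target: str, ask_llm: Callable[[str], str]) -> List[str]:
    context, outputs = conversation, []
    for stage in STAGES:
        answer = ask_llm(f"{context}\nTarget utterance: {target}\nStep: {stage}")
        outputs.append(answer)
        context += f"\n{stage} {answer}"  # each step conditions the next
    return outputs

# Usage with a placeholder model call:
steps = ecr_chain("A: I failed my exam. B: Oh no, I'm so sorry.",
                  "B: Oh no, I'm so sorry.",
                  ask_llm=lambda p: "(model answer)")  # plug in a real LLM here
```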
arXiv Detail & Related papers (2024-05-17T15:45:08Z)
- Enhancing Emotional Generation Capability of Large Language Models via Emotional Chain-of-Thought [50.13429055093534]
Large Language Models (LLMs) have shown remarkable performance in various emotion recognition tasks.
We propose the Emotional Chain-of-Thought (ECoT) to enhance the performance of LLMs on various emotional generation tasks.
arXiv Detail & Related papers (2024-01-12T16:42:10Z)
- EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models [11.901294654242376]
We introduce Emotional Image Content Generation (EICG), a new task to generate semantically clear and emotion-faithful images given emotion categories.
Specifically, we propose an emotion space and construct a mapping network to align it with the powerful Contrastive Language-Image Pre-training (CLIP) space.
Our method outperforms the state-of-the-art text-to-image approaches both quantitatively and qualitatively.
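As a rough sketch of the mapping-network idea in this entry (not the authors' code): each emotion category gets a learnable vector in an emotion space, and a small MLP maps it into a CLIP-dimensional embedding that can steer a text-to-image pipeline. The category list and layer sizes are assumptions.

```python
# Sketch of an emotion space aligned to the CLIP space via a mapping network.
import torch
import torch.nn as nn

EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]  # assumed 8-category set

class EmotionMapper(nn.Module):
    def __init__(self, emo_dim: int = 128, clip_dim: int = 768):
        super().__init__()
        self.emotion_space = nn.Embedding(len(EMOTIONS), emo_dim)  # learned emotion space
        self.mapper = nn.Sequential(nn.Linear(emo_dim, 512), nn.GELU(),
                                    nn.Linear(512, clip_dim))      # align to CLIP space

    def forward(self, emotion_ids: torch.Tensor) -> torch.Tensor:
        return self.mapper(self.emotion_space(emotion_ids))  # (B, clip_dim)

ids = torch.tensor([EMOTIONS.index("awe")])
clip_like = EmotionMapper()(ids)  # embedding to inject into a text-to-image pipeline
print(clip_like.shape)  # torch.Size([1, 768])
```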
arXiv Detail & Related papers (2024-01-09T15:23:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.