Related papers: PRISM of Opinions: A Persona-Reasoned Multimodal Framework for User-centric Conversational Stance Detection

URL: http://arxiv.org/abs/2511.12130v1
Date: Sat, 15 Nov 2025 09:35:58 GMT
Title: PRISM of Opinions: A Persona-Reasoned Multimodal Framework for User-centric Conversational Stance Detection
Authors: Bingbing Wang, Zhixin Bai, Zhengda Jin, Zihan Wang, Xintong Song, Jingjie Lin, Sixuan Li, Jing Li, Ruifeng Xu
Abstract summary: Multimodal Conversational Stance Detection (MCSD) aims to interpret users' attitudes toward specific targets within complex discussions. We introduce **U-MStance**, the first user-centric MCSD dataset, containing over 40k annotated comments across six real-world targets. We propose PRISM, a **P**ersona-**R**easoned mult**I**modal **S**tance **M**odel for MCSD.
Score: 27.63546120178429
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The rapid proliferation of multimodal social media content has driven research in Multimodal Conversational Stance Detection (MCSD), which aims to interpret users' attitudes toward specific targets within complex discussions. However, existing studies remain limited by: **1) pseudo-multimodality**, where visual cues appear only in source posts while comments are treated as text-only, misaligning with real-world multimodal interactions; and **2) user homogeneity**, where diverse users are treated uniformly, neglecting personal traits that shape stance expression. To address these issues, we introduce **U-MStance**, the first user-centric MCSD dataset, containing over 40k annotated comments across six real-world targets. We further propose **PRISM**, a **P**ersona-**R**easoned mult**I**modal **S**tance **M**odel for MCSD. PRISM first derives longitudinal user personas from historical posts and comments to capture individual traits, then aligns textual and visual cues within conversational context via Chain-of-Thought to bridge semantic and pragmatic gaps across modalities. Finally, a mutual task reinforcement mechanism is employed to jointly optimize stance detection and stance-aware response generation for bidirectional knowledge transfer. Experiments on U-MStance demonstrate that PRISM yields significant gains over strong baselines, underscoring the effectiveness of user-centric and context-grounded multimodal reasoning for realistic stance understanding.

Related papers

MPCI-Bench: A Benchmark for Multimodal Pairwise Contextual Integrity Evaluation of Language Model Agents [1.919885803437747]
We introduce MPCI-Bench, the first Multimodal Pairwise Contextual Integrity benchmark for evaluating privacy behavior in agentic settings. MPCI-Bench consists of paired positive and negative instances derived from the same visual source. We will open-source MPCI-Bench to facilitate future research on agentic CI.
arXiv Detail & Related papers (2026-01-13T05:39:43Z)
Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation [60.63465682731118]
The performance of egocentric AI agents is fundamentally limited by multimodal intent ambiguity. We introduce the Plug-and-Play Clarifier, a zero-shot and modular framework that decomposes the problem into discrete, solvable sub-tasks. Our framework improves the intent clarification performance of small language models by approximately 30%, making them competitive with significantly larger counterparts.
arXiv Detail & Related papers (2025-11-12T04:28:14Z)
ReMoD: Rethinking Modality Contribution in Multimodal Stance Detection via Dual Reasoning [29.400983680521733]
Multimodal Stance Detection (MSD) is a crucial task for understanding public opinion on social media. Existing work simply fuses information from various modalities to learn stance representations, overlooking the varying contributions of stance expression from different modalities. We propose **ReMoD**, a framework that **Re**thinks **Mo**dality contribution of stance expression through a **D**ual-reasoning paradigm.
arXiv Detail & Related papers (2025-11-08T15:56:24Z)
VOGUE: A Multimodal Dataset for Conversational Recommendation in Fashion [18.017186369021154]
VOGUE is a novel dataset of 60 human-human dialogues in realistic fashion shopping scenarios. Each dialogue is paired with a shared visual catalogue, item metadata, user fashion profiles and histories, and post-conversation ratings from both Seekers and Assistants. Our initial analyses of VOGUE reveal distinctive dynamics of visually grounded dialogue.
arXiv Detail & Related papers (2025-10-24T04:45:29Z)
Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z)
RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing [133.0641538589466]
RMTBench is a comprehensive user-centric bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements.
arXiv Detail & Related papers (2025-07-27T16:49:47Z)
METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
arXiv Detail & Related papers (2025-07-22T03:42:51Z)
Grounding Task Assistance with Multimodal Cues from a Single Demonstration [17.975173937253494]
We introduce MICA (Multimodal Interactive Contextualized Assistance), a framework that improves conversational agents for task assistance by integrating eye gaze and speech cues. Evaluations on questions derived from real-time chat-assisted task replication show that multimodal cues significantly improve response quality over frame-based retrieval.
arXiv Detail & Related papers (2025-05-02T20:43:11Z)
Training-Free Personalization via Retrieval and Reasoning on Fingerprints [37.54948724318688]
Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), leveraging internal knowledge of VLMs. R2P consistently outperforms state-of-the-art approaches on various downstream tasks.
arXiv Detail & Related papers (2025-03-24T12:36:24Z)
PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis [74.41260927676747]
This paper bridges the gaps by introducing multimodal conversational Aspect-based Sentiment Analysis (ABSA). To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements. To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism.
arXiv Detail & Related papers (2024-08-18T13:51:01Z)
MPCHAT: Towards Multimodal Persona-Grounded Conversation [54.800425322314105]
We extend persona-based dialogue to the multimodal domain and make two main contributions. First, we present the first multimodal persona-based dialogue dataset named MPCHAT. Second, we empirically show that incorporating multimodal persona, as measured by three proposed multimodal persona-grounded dialogue tasks, leads to statistically significant performance improvements.
arXiv Detail & Related papers (2023-05-27T06:46:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.