Multi-speaker Attention Alignment for Multimodal Social Interaction
- URL: http://arxiv.org/abs/2511.17952v1
- Date: Sat, 22 Nov 2025 07:35:47 GMT
- Title: Multi-speaker Attention Alignment for Multimodal Social Interaction
- Authors: Liangyang Ouyang, Yifei Huang, Mingfang Zhang, Caixin Kang, Ryosuke Furuta, Yoichi Sato,
- Abstract summary: Social interaction in video requires reasoning over a dynamic interplay of verbal and non-verbal cues.<n>While Multimodal Large Language Models (MLLMs) are natural candidates, simply adding visual inputs yields surprisingly inconsistent gains on social tasks.<n>We propose a multimodal multi-speaker attention alignment method that can be integrated into existing MLLMs.
- Score: 23.550501177885625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding social interaction in video requires reasoning over a dynamic interplay of verbal and non-verbal cues: who is speaking, to whom, and with what gaze or gestures. While Multimodal Large Language Models (MLLMs) are natural candidates, simply adding visual inputs yields surprisingly inconsistent gains on social tasks. Our quantitative analysis of cross-modal attention inside state-of-the-art MLLMs reveals a core failure mode: in multi-speaker scenes, visual and textual tokens lack speaker-consistent alignment, exhibiting substantially weaker cross-modal attention than in object-centric images. To address this, we propose a multimodal multi-speaker attention alignment method that can be integrated into existing MLLMs. First, we introduce dynamic cross-modal head selection to identify attention heads most responsible for grounding. Then, an adaptive social-aware attention bias, computed from existing attention patterns and speaker locations, is injected into the attention mechanism. This bias reinforces alignment between a speaker's visual representation and their utterances without introducing trainable parameters or architectural changes. We integrate our method into three distinct MLLMs (LLaVA-NeXT-Video, Qwen2.5-VL, and InternVL3) and evaluate on three benchmarks (TVQA+, MMSI, OnlineMMSI). Across four social tasks, results demonstrate that our approach improves the ability of MLLMs and achieves state-of-the-art results. Attention visualizations confirm our method successfully focuses the model on speaker-relevant regions, enabling more robust multi-party social reasoning. Our implementation and model will be available at https://github.com/ut-vision/SocialInteraction.
Related papers
- AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding [73.05946667683259]
Recent large language models (MLLMs) show strong perception but struggle in multi-speaker, dialogue-centric settings.<n>We introduce AMUSE, a benchmark designed around tasks that are inherently agentic.<n>We propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation.
arXiv Detail & Related papers (2025-12-18T07:01:47Z) - MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind [41.188841829937466]
MoMentS (Multimodal Mental States) is a benchmark for building socially intelligent multimodal agents.<n>MoMentS includes over 2,300 multiple-choice questions spanning seven distinct ToM categories.<n>We evaluate several MLLMs and find that although vision generally improves performance, models still struggle to integrate it effectively.
arXiv Detail & Related papers (2025-07-06T15:06:30Z) - V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework.<n>V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios.<n>We show V-MAGE provides actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.
arXiv Detail & Related papers (2025-04-08T15:43:01Z) - Towards Online Multi-Modal Social Interaction Understanding [36.37278022436327]
We propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams.<n>We develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting.<n>Our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI.
arXiv Detail & Related papers (2025-03-25T17:17:19Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.<n>We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.<n>We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction [114.35537839800372]
Speech plays a crucial role in multimodal dialogue systems, and implementing high-performance in both vision and speech tasks remains a significant challenge.<n>We propose a carefully designed multi-stage training methodology that progressively trains LLM to understand both visual and speech information.<n>Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules.
arXiv Detail & Related papers (2025-01-03T18:59:52Z) - Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations [20.848802791989307]
We introduce three new challenges to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction.
We propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances.
Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions.
arXiv Detail & Related papers (2024-03-04T14:46:58Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z) - Face-to-Face Contrastive Learning for Social Intelligence
Question-Answering [55.90243361923828]
multimodal methods have set the state of the art on many tasks, but have difficulty modeling the complex face-to-face conversational dynamics.
We propose Face-to-Face Contrastive Learning (F2F-CL), a graph neural network designed to model social interactions.
We experimentally evaluated the challenging Social-IQ dataset and show state-of-the-art results.
arXiv Detail & Related papers (2022-07-29T20:39:44Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.