Towards Online Multi-Modal Social Interaction Understanding
- URL: http://arxiv.org/abs/2503.19851v1
- Date: Tue, 25 Mar 2025 17:17:19 GMT
- Title: Towards Online Multi-Modal Social Interaction Understanding
- Authors: Xinpeng Li, Shijian Deng, Bolin Lai, Weiguo Pian, James M. Rehg, Yapeng Tian
- Abstract summary: We propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams. We develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting. Our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI.
- Score: 36.37278022436327
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal social interaction understanding (MMSI) is critical in human-robot interaction systems. In real-world scenarios, AI agents are required to provide real-time feedback. However, existing models often depend on both past and future context, which hinders their application to real-world problems. To bridge this gap, we propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams. To address the challenge of missing useful future context, we develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting with multi-modal large language models. First, to enrich linguistic context, multi-party conversation forecasting simulates potential future utterances in a coarse-to-fine manner, first anticipating upcoming speaker turns and then generating fine-grained conversational details. Second, to effectively incorporate visual social cues such as gaze and gesture, social-aware visual prompting highlights the social dynamics in video with bounding boxes and body keypoints for each person and frame. Extensive experiments on three tasks and two datasets demonstrate that our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI. The code and pre-trained models will be publicly released at: https://github.com/Sampson-Lee/OnlineMMSI.
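As a rough illustration of the social-aware visual prompting described in the abstract, the sketch below overlays per-person bounding boxes and body keypoints onto a video frame before it would be passed to a multi-modal language model. This is a minimal sketch, not the authors' implementation: the `annotate_frame` helper, the color palette, and the per-frame detection layout (one dict per person with `id`, `bbox`, and `keypoints`) are assumptions made for illustration only.

```python
# Minimal sketch of social-aware visual prompting (illustrative only).
# Assumes per-frame person detections and keypoints are already available,
# e.g. from an off-the-shelf detector/pose estimator; the data layout below
# is hypothetical and not taken from the paper's code.
from typing import Dict, List, Tuple

import cv2
import numpy as np

Color = Tuple[int, int, int]
PALETTE: List[Color] = [(0, 0, 255), (0, 255, 0), (255, 0, 0), (0, 255, 255)]


def annotate_frame(frame: np.ndarray, people: List[Dict]) -> np.ndarray:
    """Draw a bounding box, body keypoints, and a person ID for each detection.

    Each entry in `people` is assumed to look like:
        {"id": 0, "bbox": (x1, y1, x2, y2), "keypoints": [(x, y), ...]}
    """
    out = frame.copy()
    for person in people:
        color = PALETTE[person["id"] % len(PALETTE)]
        x1, y1, x2, y2 = person["bbox"]
        # Bounding box and person label make each participant visually distinct.
        cv2.rectangle(out, (x1, y1), (x2, y2), color, 2)
        cv2.putText(out, f"P{person['id']}", (x1, max(y1 - 5, 10)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
        # Body keypoints expose gesture/pose cues to the downstream model.
        for (kx, ky) in person["keypoints"]:
            cv2.circle(out, (int(kx), int(ky)), 3, color, -1)
    return out


if __name__ == "__main__":
    # Dummy black frame with two hypothetical detections.
    frame = np.zeros((480, 640, 3), dtype=np.uint8)
    people = [
        {"id": 0, "bbox": (50, 60, 200, 400), "keypoints": [(120, 100), (110, 180)]},
        {"id": 1, "bbox": (300, 80, 450, 420), "keypoints": [(370, 130), (360, 210)]},
    ]
    cv2.imwrite("socially_prompted_frame.png", annotate_frame(frame, people))
```

Frames annotated in this spirit, together with the forecasted multi-party conversation, would then form the input to the multi-modal LLM; the authors' actual implementation is in the repository linked above.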
Related papers
- Towards Anthropomorphic Conversational AI Part I: A Practical Framework [49.62013440962072]
We introduce a multi-module framework designed to replicate the key aspects of human intelligence involved in conversations. In the second stage of our approach, these conversational data, after filtering and labeling, can serve as training and testing data for reinforcement learning.
arXiv Detail & Related papers (2025-02-28T03:18:39Z)
- Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations [20.848802791989307]
We introduce three new challenges to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction.
We propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances.
Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions.
arXiv Detail & Related papers (2024-03-04T14:46:58Z)
- MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms [25.73585435351771]
This paper introduces MM-Soc, a benchmark designed to evaluate Multimodal Large Language Models' understanding of social media content.
MM-Soc compiles prominent multimodal datasets and incorporates a novel large-scale YouTube tagging dataset.
Our analysis reveals that, in a zero-shot setting, various types of MLLMs generally exhibit difficulties in handling social media tasks.
arXiv Detail & Related papers (2024-02-21T22:27:40Z)
- SoMeLVLM: A Large Vision Language Model for Social Media Processing [78.47310657638567]
We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM).
SoMeLVLM is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation.
Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
arXiv Detail & Related papers (2024-02-20T14:02:45Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
A primary challenge of this task lies in the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z)
- Face-to-Face Contrastive Learning for Social Intelligence Question-Answering [55.90243361923828]
Multimodal methods have set the state of the art on many tasks but have difficulty modeling the complex face-to-face conversational dynamics.
We propose Face-to-Face Contrastive Learning (F2F-CL), a graph neural network designed to model social interactions.
We experimentally evaluate on the challenging Social-IQ dataset and show state-of-the-art results.
arXiv Detail & Related papers (2022-07-29T20:39:44Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
- Multi-Modal Open-Domain Dialogue [28.69395893943413]
Recent work in open-domain conversational agents has demonstrated that significant improvements in model engagingness and humanness metrics can be achieved via massive scaling.
We investigate combining components from state-of-the-art open-domain dialogue agents with those from state-of-the-art vision models.
We show that our best resulting model outperforms strong existing models in multi-modal dialogue while simultaneously performing as well as its predecessor.
arXiv Detail & Related papers (2020-10-02T16:20:39Z)