CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences
- URL: http://arxiv.org/abs/2512.10918v1
- Date: Thu, 11 Dec 2025 18:44:44 GMT
- Title: CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences
- Authors: Yiyang Wang, Chen Chen, Tica Lin, Vishnu Raj, Josh Kimball, Alex Cabral, Josiah Hester
- Abstract summary: Social presence is central to the enjoyment of watching content together, yet modern media consumption is increasingly solitary. We investigate whether multi-agent conversational AI systems can recreate the dynamics of shared viewing experiences across diverse content types. We present CompanionCast, a general framework for orchestrating multiple role-specialized AI agents that respond to video content.
- Score: 10.985715950187519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social presence is central to the enjoyment of watching content together, yet modern media consumption is increasingly solitary. We investigate whether multi-agent conversational AI systems can recreate the dynamics of shared viewing experiences across diverse content types. We present CompanionCast, a general framework for orchestrating multiple role-specialized AI agents that respond to video content using multimodal inputs, speech synthesis, and spatial audio. Distinctly, CompanionCast integrates an LLM-as-a-Judge module that iteratively scores and refines conversations across five dimensions (relevance, authenticity, engagement, diversity, personality consistency). We validate this framework through sports viewing, a domain with rich dynamics and strong social traditions, where a pilot study with soccer fans suggests that multi-agent interaction improves perceived social presence compared to solo viewing. We contribute: (1) a generalizable framework for orchestrating multi-agent conversations around multimodal video content, (2) a novel evaluator-agent pipeline for conversation quality control, and (3) exploratory evidence of increased social presence in AI-mediated co-viewing. We discuss challenges and future directions for applying this approach to diverse viewing contexts including entertainment, education, and collaborative watching experiences.
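The abstract does not give implementation details, but the evaluator-agent pipeline it describes (an LLM-as-a-Judge that iteratively scores and refines the agents' conversation across five dimensions) can be pictured roughly as follows. This is a minimal Python sketch under stated assumptions: a generic `call_llm(prompt) -> str` client is assumed, and all names (`AgentTurn`, `score_conversation`, `refine_conversation`, `judge_and_refine`, the 1-5 scale, the threshold of 4, the three-round cap) are illustrative, not CompanionCast's actual API.

```python
# Hypothetical sketch of an evaluator-agent ("LLM-as-a-Judge") refinement loop.
# Names and thresholds are assumptions for illustration, not the paper's code.
from dataclasses import dataclass
import json

# The five evaluation dimensions named in the abstract.
DIMENSIONS = ["relevance", "authenticity", "engagement",
              "diversity", "personality_consistency"]

@dataclass
class AgentTurn:
    agent_role: str   # e.g. "analyst", "superfan" (roles are illustrative)
    text: str         # line to be spoken via TTS / spatial audio

def score_conversation(turns, video_context, call_llm):
    """Ask a judge LLM to rate the conversation on the five dimensions (1-5)."""
    transcript = "\n".join(f"{t.agent_role}: {t.text}" for t in turns)
    prompt = (
        "You are evaluating an AI co-viewing conversation about this video "
        f"segment:\n{video_context}\n\nConversation:\n{transcript}\n\n"
        f"Rate each dimension from 1 to 5 and return JSON with keys: "
        f"{', '.join(DIMENSIONS)}."
    )
    # Assumes the judge returns a JSON object, e.g. {"relevance": 4, ...}
    return json.loads(call_llm(prompt))

def refine_conversation(turns, scores, video_context, call_llm):
    """Ask a writer LLM to revise the turns, targeting the weakest dimensions."""
    weak = [d for d in DIMENSIONS if scores.get(d, 0) < 4]
    transcript = "\n".join(f"{t.agent_role}: {t.text}" for t in turns)
    prompt = (
        f"Revise this co-viewing conversation to improve: {', '.join(weak)}.\n"
        f"Keep each agent's persona consistent.\nVideo context: {video_context}\n"
        f"Conversation:\n{transcript}\nReturn one 'role: line' per line."
    )
    revised = []
    for line in call_llm(prompt).splitlines():
        role, _, text = line.partition(":")
        if text:
            revised.append(AgentTurn(role.strip(), text.strip()))
    return revised or turns  # fall back to the original turns if parsing fails

def judge_and_refine(turns, video_context, call_llm,
                     threshold=4.0, max_rounds=3):
    """Iteratively score and refine until every dimension passes the threshold."""
    for _ in range(max_rounds):
        scores = score_conversation(turns, video_context, call_llm)
        if all(scores.get(d, 0) >= threshold for d in DIMENSIONS):
            break
        turns = refine_conversation(turns, scores, video_context, call_llm)
    return turns
```

In a full system, each accepted turn would presumably be passed to speech synthesis and rendered at a distinct virtual position via spatial audio so that each agent is heard as a separate co-viewer, though the abstract does not specify that rendering pipeline.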
Related papers
- MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation [59.23161833385837]
We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation.
Our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users' multimodal queries.
arXiv Detail & Related papers (2025-12-02T18:55:53Z) - MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind [41.188841829937466]
MoMentS (Multimodal Mental States) is a benchmark for building socially intelligent multimodal agents.
MoMentS includes over 2,300 multiple-choice questions spanning seven distinct ToM categories.
We evaluate several MLLMs and find that although vision generally improves performance, models still struggle to integrate it effectively.
arXiv Detail & Related papers (2025-07-06T15:06:30Z) - SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning [53.16179295245888]
We introduce SIV-Bench, a novel video benchmark for evaluating the capabilities of Multimodal Large Language Models (MLLMs) across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP).
SIV-Bench features 2,792 video clips and 8,792 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline.
It also includes a dedicated setup for analyzing the impact of different textual cues: original on-screen text, added dialogue, or no text.
arXiv Detail & Related papers (2025-06-05T05:51:35Z) - Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions [13.341099059080936]
This study aims to equip chatbots with "eyes and ears" capable of more immersive interactions with humans.
We introduce a new multimodal conversation dataset, Multimodal Multi-Session Multi-Party Conversation (M3C).
Our model, trained on M3C, demonstrates the ability to seamlessly engage in long-term conversations with multiple speakers.
arXiv Detail & Related papers (2025-05-31T06:50:51Z) - Towards Online Multi-Modal Social Interaction Understanding [36.37278022436327]
We propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams.
We develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting.
Our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on online MMSI.
arXiv Detail & Related papers (2025-03-25T17:17:19Z) - Towards Anthropomorphic Conversational AI Part I: A Practical Framework [49.62013440962072]
We introduce a multi-module framework designed to replicate the key aspects of human intelligence involved in conversations.
In the second stage of our approach, the conversational data produced by this framework, after filtering and labeling, can serve as training and testing data for reinforcement learning.
arXiv Detail & Related papers (2025-02-28T03:18:39Z) - Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning [56.873534081386]
A new task suite, HIREST, is presented, spanning video retrieval, moment retrieval, moment segmentation, and step-captioning.
We propose a query-centric audio-visual cognition network to construct a reliable multi-modal representation for the three tasks.
This allows the network to capture user-preferred content and thus attain a query-centric audio-visual representation for the three tasks.
arXiv Detail & Related papers (2024-12-18T06:43:06Z) - SoMeLVLM: A Large Vision Language Model for Social Media Processing [78.47310657638567]
We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM).
SoMeLVLM is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation.
Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
arXiv Detail & Related papers (2024-02-20T14:02:45Z) - LiveChat: Video Comment Generation from Audio-Visual Multimodal Contexts [8.070778830276275]
We create a large-scale audio-visual multimodal dialogue dataset to facilitate the development of live commenting technologies.
The data is collected from Twitch, with 11 different categories and 575 streamers for a total of 438 hours of video and 3.2 million comments.
We propose a novel multimodal generation model capable of generating live comments that align with the temporal and spatial events within the video.
arXiv Detail & Related papers (2023-10-01T02:35:58Z) - VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles [63.32111010686954]
We propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO).
The main challenge in this task is to jointly model the temporal dependency of the video with the semantic meaning of the article.
We propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and multimodal generator.
arXiv Detail & Related papers (2020-10-12T02:19:16Z)