Proact-VL: A Proactive VideoLLM for Real-Time AI Companions
- URL: http://arxiv.org/abs/2603.03447v1
- Date: Tue, 03 Mar 2026 19:02:46 GMT
- Title: Proact-VL: A Proactive VideoLLM for Real-Time AI Companions
- Authors: Weicai Yan, Yuhong Dai, Qi Ran, Haodong Li, Wang Lin, Hao Liao, Xing Xie, Tao Jin, Jianxun Lian,
- Abstract summary: We instantiate AI companions through two gaming scenarios, commentator and guide, selected for automatic evaluation. We present Proact-VL, a framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction.
- Score: 52.23988809605433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Proactive and real-time interactive experiences are essential for human-like AI companions, yet realizing them poses three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show that Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.
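The three challenges map naturally onto a streaming agent loop: a bounded frame buffer for low-latency perception, a gate that decides when to speak, and a length-capped generation step. The sketch below illustrates such a loop in Python; all names (FrameBuffer, should_respond, generate_reply) and the rate-limit heuristic are hypothetical illustrations, not Proact-VL's actual design.

```python
# Minimal sketch of a proactive streaming loop reflecting the three challenges above.
# All identifiers and heuristics here are assumptions for illustration only.
import time
from collections import deque

class FrameBuffer:
    """Challenge (1): keep only a sliding window of recent frames to bound latency."""
    def __init__(self, max_frames=32):
        self.frames = deque(maxlen=max_frames)

    def push(self, frame):
        self.frames.append(frame)

def should_respond(frames, last_reply_time, min_gap_s=2.0):
    """Challenge (2): decide autonomously whether to speak now.
    Placeholder gate that simply rate-limits replies; a real system would
    score the visual context (e.g. with the model itself) to detect events."""
    if time.time() - last_reply_time < min_gap_s:
        return False
    return len(frames) > 0  # placeholder trigger condition

def generate_reply(frames, max_new_tokens=48):
    """Challenge (3): cap output length so commentary stays timely."""
    return f"[commentary over {len(frames)} frames, <= {max_new_tokens} tokens]"

def streaming_loop(video_stream):
    """Consume frames continuously and emit proactive replies."""
    buf = FrameBuffer()
    last_reply = 0.0
    for frame in video_stream:          # continuous streaming input
        buf.push(frame)
        if should_respond(buf.frames, last_reply):
            print(generate_reply(buf.frames))
            last_reply = time.time()
```

In a practical system both the gate and the reply would typically come from the multimodal model, with the gating pass run at much lower per-frame cost than full generation.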
Related papers
- LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks [71.05217306468857]
LifeEval is a multimodal benchmark designed to evaluate real-time, task-oriented human-AI collaboration in daily life. LifeEval emphasizes three key aspects: task-oriented holistic evaluation, egocentric real-time perception from continuous first-person streams, and human-assistant collaborative interaction through natural dialogues.
arXiv Detail & Related papers (2026-02-28T06:05:31Z) - TeleEgo: Benchmarking Egocentric AI Assistants in the Wild [55.53194302888826]
Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text). We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains.
arXiv Detail & Related papers (2025-10-28T01:24:24Z) - PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments [36.84821207878773]
Visual reasoning in multimodal large language models (MLLMs) has primarily been studied in static, fully observable settings. We introduce the Active Visual Reasoning (AVR) task, extending visual reasoning to partially observable, interactive environments. We present a benchmark featuring multi-round interactive environments designed to assess both reasoning and information-gathering efficiency.
arXiv Detail & Related papers (2025-10-24T02:59:00Z) - VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting [66.90028121194636]
Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm. VITA-E is a novel embodied interaction framework designed for both behavioral concurrency and nearly real-time interruption.
arXiv Detail & Related papers (2025-10-21T17:59:56Z) - Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video [36.94345183020698]
We focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. We propose a comprehensive technical pipeline to enable models to tackle this challenging task.
arXiv Detail & Related papers (2025-10-16T11:11:13Z) - FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Full-duplex speech models enable real-time spoken dialogue systems. Modelling and benchmarking these systems remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex human-LLM spoken interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z) - STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving [16.602141801221364]
STSBench is a framework to benchmark holistic understanding of vision-language models (VLMs) for autonomous driving. The benchmark features 43 diverse scenarios spanning multiple views, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers shortcomings in existing models' ability to reason about fundamental traffic dynamics in complex environments.
arXiv Detail & Related papers (2025-06-06T16:25:22Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in context, zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction [5.958765450103163]
We present the QEVD benchmark and dataset, which explores human-AI interaction in the challenging, yet controlled, real-world domain of fitness coaching. The benchmark requires vision-language models to recognize complex human actions, identify possible mistakes, and provide appropriate feedback in real-time. Motivated by this, we propose a simple end-to-end streaming baseline that can respond asynchronously to human actions with appropriate feedback at the appropriate time.
arXiv Detail & Related papers (2024-07-11T00:10:45Z)