Proact-VL: A Proactive VideoLLM for Real-Time AI Companions
- URL: http://arxiv.org/abs/2603.03447v1
- Date: Tue, 03 Mar 2026 19:02:46 GMT
- Title: Proact-VL: A Proactive VideoLLM for Real-Time AI Companions
- Authors: Weicai Yan, Yuhong Dai, Qi Ran, Haodong Li, Wang Lin, Hao Liao, Xing Xie, Tao Jin, Jianxun Lian,
- Abstract summary: We instantiate AI companions through two gaming scenarios, commentator and guide, selected for automatic evaluation. We present Proact-VL, a framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction.
- Score: 52.23988809605433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Proactive and real-time interactive experiences are essential for human-like AI companions, yet realizing them poses three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show that Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.
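The three challenges map naturally onto a streaming agent loop: a bounded frame buffer for low-latency perception, a gate that decides when to speak, and a length-capped generation step. The sketch below illustrates such a loop in Python; all names (FrameBuffer, should_respond, generate_reply) and the rate-limit heuristic are hypothetical illustrations, not Proact-VL's actual design.

```python
# Minimal sketch of a proactive streaming loop reflecting the three challenges above.
# All identifiers and heuristics here are assumptions for illustration only.
import time
from collections import deque

class FrameBuffer:
    """Challenge (1): keep only a sliding window of recent frames to bound latency."""
    def __init__(self, max_frames=32):
        self.frames = deque(maxlen=max_frames)

    def push(self, frame):
        self.frames.append(frame)

def should_respond(frames, last_reply_time, min_gap_s=2.0):
    """Challenge (2): decide autonomously whether to speak now.
    Placeholder gate that simply rate-limits replies; a real system would
    score the visual context (e.g. with the model itself) to detect events."""
    if time.time() - last_reply_time < min_gap_s:
        return False
    return len(frames) > 0  # placeholder trigger condition

def generate_reply(frames, max_new_tokens=48):
    """Challenge (3): cap output length so commentary stays timely."""
    return f"[commentary over {len(frames)} frames, <= {max_new_tokens} tokens]"

def streaming_loop(video_stream):
    """Consume frames continuously and emit proactive replies."""
    buf = FrameBuffer()
    last_reply = 0.0
    for frame in video_stream:          # continuous streaming input
        buf.push(frame)
        if should_respond(buf.frames, last_reply):
            print(generate_reply(buf.frames))
            last_reply = time.time()
```

In a practical system both the gate and the reply would typically come from the multimodal model, with the gating pass run at much lower per-frame cost than full generation.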
Related papers
- LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks [71.05217306468857]
LifeEval is a multimodal benchmark designed to evaluate real-time, task-oriented human-AI collaboration in daily life. LifeEval emphasizes three key aspects: task-oriented holistic evaluation, egocentric real-time perception from continuous first-person streams, and human-assistant collaborative interaction through natural dialogues.
arXiv Detail & Related papers (2026-02-28T06:05:31Z) - TeleEgo: Benchmarking Egocentric AI Assistants in the Wild [55.53194302888826]
Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text). We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains.
arXiv Detail & Related papers (2025-10-28T01:24:24Z) - PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments [36.84821207878773]
Visual reasoning in multimodal large language models (MLLMs) has primarily been studied in static, fully observable settings. We introduce the Active Visual Reasoning (AVR) task, extending visual reasoning to partially observable, interactive environments. We present a benchmark featuring multi-round interactive environments designed to assess both reasoning and information-gathering efficiency.
arXiv Detail & Related papers (2025-10-24T02:59:00Z) - VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting [66.90028121194636]
Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm. VITA-E is a novel embodied interaction framework designed for both behavioral concurrency and nearly real-time interruption.
arXiv Detail & Related papers (2025-10-21T17:59:56Z) - Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video [36.94345183020698]
We focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. We propose a comprehensive technical pipeline to enable models to tackle this challenging task.
arXiv Detail & Related papers (2025-10-16T11:11:13Z) - FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Full-duplex speech models enable real-time spoken dialogue systems. Modelling and benchmarking these systems remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex human-LLM spoken interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z) - STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving [16.602141801221364]
STSBench is a framework to benchmark holistic understanding of vision-language models (VLMs) for autonomous driving. The benchmark features 43 diverse scenarios spanning multiple views, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers shortcomings in existing models' ability to reason about fundamental traffic dynamics in complex environments.
arXiv Detail & Related papers (2025-06-06T16:25:22Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in context, zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction [5.958765450103163]
We present the QEVD benchmark and dataset, which explores human-AI interaction in the challenging, yet controlled, real-world domain of fitness coaching. The benchmark requires vision-language models to recognize complex human actions, identify possible mistakes, and provide appropriate feedback in real-time. Motivated by this, we propose a simple end-to-end streaming baseline that can respond asynchronously to human actions with appropriate feedback at the appropriate time.
arXiv Detail & Related papers (2024-07-11T00:10:45Z)