TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
- URL: http://arxiv.org/abs/2510.23981v2
- Date: Thu, 30 Oct 2025 07:09:32 GMT
- Title: TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
- Authors: Jiaqi Yan, Ruilong Ren, Jingren Liu, Shuning Xu, Ling Wang, Yiheng Wang, Yun Wang, Long Zhang, Xiangyu Chen, Changzhi Sun, Jixiang Luo, Dell Zhang, Hao Sun, Chi Zhang, Xuelong Li
- Abstract summary: Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text). We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains.
- Score: 55.53194302888826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce \textbf{TeleEgo}, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work \& study, lifestyle \& routines, social activities, and outings \& culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement. TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics -- Real-Time Accuracy and Memory Persistence Time -- to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.
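The abstract names the two evaluation metrics but does not give their formulas. The sketch below shows one plausible way such metrics could be computed over streaming QA results; the `QAResult` fields, the per-item averaging, and the zero contribution for items never answered correctly are illustrative assumptions, not the paper's actual protocol.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical record of one streaming QA item: when it was asked on the
# unified global timeline, whether the model was correct at that moment,
# and (if re-probed later) the last time the answer was still correct.
@dataclass
class QAResult:
    asked_at: float                           # seconds on the global timeline
    correct_at_ask: bool                      # correct when first asked (real time)
    last_correct_at: Optional[float] = None   # last probe time still answered correctly

def real_time_accuracy(results: List[QAResult]) -> float:
    """Fraction of QA items answered correctly at the moment they are asked."""
    if not results:
        return 0.0
    return sum(r.correct_at_ask for r in results) / len(results)

def memory_persistence_time(results: List[QAResult]) -> float:
    """Mean duration (seconds) for which an initially correct answer remains
    correct under later re-probing; never-correct items contribute zero."""
    if not results:
        return 0.0
    spans = [
        (r.last_correct_at - r.asked_at)
        if (r.correct_at_ask and r.last_correct_at is not None) else 0.0
        for r in results
    ]
    return sum(spans) / len(spans)

# Toy usage with made-up values: accuracy = 2/3, persistence = 130.0 s.
toy = [QAResult(10.0, True, 400.0), QAResult(50.0, False), QAResult(120.0, True, 120.0)]
print(real_time_accuracy(toy), memory_persistence_time(toy))
```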
Related papers
- Proact-VL: A Proactive VideoLLM for Real-Time AI Companions [52.23988809605433]
We instantiate AI companions through two gaming scenarios, commentator and guide, selected for automatic evaluation. We present Proact-VL, a framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction.
arXiv Detail & Related papers (2026-03-03T19:02:46Z)
- Game-Time: Evaluating Temporal Dynamics in Spoken Language Models [93.844257719952]
We introduce the Game-Time Benchmark framework to assess temporal capabilities. Our evaluation of diverse SLMs reveals a clear performance disparity. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally-aware conversational AI.
arXiv Detail & Related papers (2025-09-30T15:23:39Z)
- Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions [110.43343503158306]
This paper embeds the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. Under this setting, we present InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data. We establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis.
arXiv Detail & Related papers (2025-08-06T17:46:23Z)
- HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding [120.84817886550765]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z)
- Proactive Assistant Dialogue Generation from Streaming Egocentric Videos [48.30863954384779]
This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses.
arXiv Detail & Related papers (2025-06-06T09:23:29Z)
- EgoLife: Towards Egocentric Life Assistant [60.51196061794498]
We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. We conduct a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. We introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide
arXiv Detail & Related papers (2025-03-05T18:54:16Z)
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.90258030688256]
This project introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. It simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
arXiv Detail & Related papers (2024-12-12T18:58:30Z)
- Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains [4.9347081318119015]
We introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) video question answering. In tandem, the two tasks quantify a model's ability to: (1) generalize to novel domains; (2) utilize long temporal context and multimodal (e.g. visual and speech) information. We discover a promising adaptation via summarization technique that leads to significant performance improvement without model fine-tuning.
arXiv Detail & Related papers (2023-11-30T18:19:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.