LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks
- URL: http://arxiv.org/abs/2603.00490v1
- Date: Sat, 28 Feb 2026 06:05:31 GMT
- Title: LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks
- Authors: Hengjian Gao, Kaiwei Zhang, Shibo Wang, Mingjie Chen, Qihang Cao, Xianfeng Wang, Yucheng Zhu, Xiongkuo Min, Wei Sun, Dandan Zhu, Guangtao Zhai
- Abstract summary: LifeEval is a multimodal benchmark designed to evaluate real-time, task-oriented human-AI collaboration in daily life. LifeEval emphasizes three key aspects: task-oriented holistic evaluation, egocentric real-time perception from continuous first-person streams, and human-assistant collaborative interaction through natural dialogues.
- Score: 71.05217306468857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) marks a significant step toward artificial general intelligence, offering great potential for augmenting human capabilities. However, their ability to provide effective assistance in dynamic, real-world environments remains largely underexplored. Existing video benchmarks predominantly assess passive understanding through retrospective analysis or isolated perception tasks, failing to capture the interactive and adaptive nature of real-time user assistance. To bridge this gap, we introduce LifeEval, a multimodal benchmark designed to evaluate real-time, task-oriented human-AI collaboration in daily life from an egocentric perspective. LifeEval emphasizes three key aspects: task-oriented holistic evaluation, egocentric real-time perception from continuous first-person streams, and human-assistant collaborative interaction through natural dialogues. Constructed via a rigorous annotation pipeline, the benchmark comprises 4,075 high-quality question-answer pairs across 6 core capability dimensions. Extensive evaluations of 26 state-of-the-art MLLMs on LifeEval reveal substantial challenges in achieving timely, effective and adaptive interaction, highlighting essential directions for advancing human-centered interactive intelligence.
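The abstract describes the benchmark only at a high level (egocentric video streams, timed user queries, 4,075 QA pairs grouped into 6 capability dimensions, per-model evaluation). As one way to picture that structure, below is a minimal, hypothetical sketch of a LifeEval-style QA record and a per-capability scoring loop. The names (`LifeEvalItem`, `evaluate_model`, `answer_fn`, `score_fn`) and the field layout are illustrative assumptions, not the authors' released schema or code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical layout of one LifeEval-style QA pair; field names and
# dimension labels are assumptions, not the benchmark's released schema.
@dataclass
class LifeEvalItem:
    video_clip: str        # path or URI of the egocentric video segment
    query_time_s: float    # timestamp at which the user query arrives
    question: str          # natural-language user query
    reference_answer: str  # annotated ground-truth response
    capability: str        # one of the 6 capability dimensions (labels assumed)

def evaluate_model(items: List[LifeEvalItem],
                   answer_fn: Callable[[str, float, str], str],
                   score_fn: Callable[[str, str], float]) -> Dict[str, float]:
    """Illustrative per-capability scoring loop.

    answer_fn wraps the MLLM under test: given (video_clip, query_time_s,
    question), it returns a textual answer using only the stream observed
    up to query_time_s. score_fn compares prediction and reference (e.g.
    exact match or an LLM-as-judge score in [0, 1]).
    """
    totals: Dict[str, Tuple[float, int]] = {}
    for item in items:
        pred = answer_fn(item.video_clip, item.query_time_s, item.question)
        s = score_fn(pred, item.reference_answer)
        correct, count = totals.get(item.capability, (0.0, 0))
        totals[item.capability] = (correct + s, count + 1)
    # Mean score per capability dimension
    return {cap: correct / count for cap, (correct, count) in totals.items()}
```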
Related papers
- Proact-VL: A Proactive VideoLLM for Real-Time AI Companions [52.23988809605433]
We instantiate AI companions through two gaming scenarios, commentator and guide, selected for automatic evaluation. We present Proact-VL, a framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction.
arXiv Detail & Related papers (2026-03-03T19:02:46Z) - EgoEMS: A High-Fidelity Multimodal Egocentric Dataset for Cognitive Assistance in Emergency Medical Services [3.0776354206437664]
EgoEMS is the first end-to-end, high-fidelity, multimodal, multiperson dataset capturing over 20 hours of realistic, procedural EMS activities. Developed in collaboration with EMS experts and aligned with national standards, EgoEMS is captured using an open-source, low-cost, and replicable data collection system. We present a suite of benchmarks for real-time multimodal keystep recognition and action quality estimation, essential for developing AI support tools for EMS.
arXiv Detail & Related papers (2025-11-13T02:55:40Z) - TeleEgo: Benchmarking Egocentric AI Assistants in the Wild [55.53194302888826]
Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text). We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains.
arXiv Detail & Related papers (2025-10-28T01:24:24Z) - VLM4D: Towards Spatiotemporal Awareness in Vision Language Models [66.833085504228]
We introduce VLM4D, the first benchmark specifically designed to evaluate spatiotemporal awareness in vision language models (VLMs). Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs. We identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models.
arXiv Detail & Related papers (2025-08-04T06:06:06Z) - SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions [36.010107260144586]
The SoMi-ToM benchmark is designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1,225 expert-annotated multiple-choice questions. Results show that LVLMs perform significantly worse than humans on SoMi-ToM.
arXiv Detail & Related papers (2025-06-29T00:54:13Z) - EgoLife: Towards Egocentric Life Assistant [60.51196061794498]
We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. We conduct a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. We introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide ...
arXiv Detail & Related papers (2025-03-05T18:54:16Z) - HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs. HumanVBench comprises 16 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects. A comprehensive evaluation across 22 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and emotion perception.
arXiv Detail & Related papers (2024-12-23T13:45:56Z)