EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark
- URL: http://arxiv.org/abs/2510.06218v1
- Date: Tue, 07 Oct 2025 17:59:47 GMT
- Title: EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark
- Authors: Deheng Zhang, Yuqian Fu, Runyi Yang, Yang Miao, Tianwen Qian, Xu Zheng, Guolei Sun, Ajad Chhatkuli, Xuanjing Huang, Yu-Gang Jiang, Luc Van Gool, Danda Pani Paudel,
- Abstract summary: EgoNight is the first comprehensive benchmark for nighttime egocentric vision.<n>Day-night aligned videos enhance night annotation quality.<n> EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types.
- Score: 108.87311276892491
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Most existing benchmarks for egocentric vision understanding focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refinement through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types, with more than 300 hours of human work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and egocentric depth estimation at night, that further explore the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All the data and code will be made available upon acceptance.
Related papers
- DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments [24.527536145236894]
Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents.<n>Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations.<n>We present DarkEQA, an open-source benchmark for evaluating EQA-relevant primitives under multi-level low-light conditions.
arXiv Detail & Related papers (2025-12-31T17:31:29Z) - Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence [70.2803680525165]
We introduce Open-o3 Video, a non-agent framework that integrates explicit evidence into video reasoning.<n>The model highlights key objects and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations.<n>On V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mL timestamp by 24.2%.
arXiv Detail & Related papers (2025-10-23T14:05:56Z) - Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset [20.470784087903514]
Oxford Day-and-Night is a large-scale, egocentric dataset for novel view synthesis (NVS) and visual relocalisation under challenging lighting conditions.<n>It supports two core benchmarks, NVS and relocalisation, providing a unique platform for evaluating models in realistic and diverse environments.
arXiv Detail & Related papers (2025-06-04T17:59:02Z) - Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos [51.8995932557911]
EgoTempo is a dataset designed to evaluate temporal understanding in the egocentric domain.<n>We show that state-of-the-art Multi-Modal Large Language Models (MLLMs) on benchmarks achieve remarkably high performance using just text or a single frame as input.<n>We hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics.
arXiv Detail & Related papers (2025-03-17T18:50:36Z) - EgoBlind: Towards Egocentric Visual Assistance for the Blind [69.6161191190939]
EgoBlind is the first egocentric VideoQA dataset collected from blind individuals.<n>It comprises 1,392 videos that record the daily lives of real blind users from a first-person perspective.<n>It also features 5,311 questions directly posed or generated by blind individuals to reflect their in-situation needs for visual assistance.
arXiv Detail & Related papers (2025-03-11T09:40:31Z) - ESVQA: Perceptual Quality Assessment of Egocentric Spatial Videos [71.62145804686062]
Assessing the quality of experience (QoE) of egocentric spatial videos is crucial to ensure a high-quality viewing experience.<n>We introduce the first Egocentric Spatial Video Quality Assessment Database (ESVQAD), which comprises 600 egocentric spatial videos.<n>We also propose a novel multi-dimensional binocular feature fusion model, termed ESVQAnet, which integrates binocular spatial, motion, and semantic features to predict the overall perceptual quality.
arXiv Detail & Related papers (2024-12-29T10:13:30Z) - MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding.<n>We automatically generate 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long in Ego4D based on human-annotated data.<n>We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z) - Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions? [48.702973928321946]
Egocentric video-language pretraining is a crucial step in advancing the understanding of hand-object interactions in first-person scenarios.<n>Despite successes on existing testbeds, we find that current EgoVLMs can be easily misled by simple modifications.<n>This raises the question: Do EgoVLMs truly understand hand-object interactions?
arXiv Detail & Related papers (2024-05-28T00:27:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.