DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments
- URL: http://arxiv.org/abs/2512.24985v2
- Date: Tue, 06 Jan 2026 05:24:09 GMT
- Title: DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments
- Authors: Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh
- Abstract summary: Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations. We present DarkEQA, an open-source benchmark for evaluating EQA-relevant primitives under multi-level low-light conditions.
- Score: 24.527536145236894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments--a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Project website: https://darkeqa-benchmark.github.io/
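The abstract describes degradations applied in linear RAW space: an illumination drop, sensor noise, and then an ISP-inspired rendering step. The following is a minimal sketch of that general idea, not the DarkEQA pipeline itself. It assumes sRGB input values in [0, 1], uses the standard sRGB transfer function as a stand-in for both the inverse and forward ISP, and approximates shot plus read noise with a signal-dependent Gaussian. The names `simulate_low_light`, `light_scale`, `shot_gain`, and `read_sigma` are hypothetical and their default values are illustrative only.

```python
import random


def srgb_to_linear(v: float) -> float:
    """Invert the standard sRGB transfer function (v in [0, 1])."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4


def linear_to_srgb(v: float) -> float:
    """Apply the standard sRGB transfer function (v in [0, 1])."""
    return 12.92 * v if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055


def simulate_low_light(pixels, light_scale, shot_gain=0.01, read_sigma=0.002, seed=0):
    """Darken sRGB pixel values and add sensor-like noise in linear space.

    Assumption: this is a simplified illustration, not the benchmark's code.
    Real pipelines also handle mosaicing, white balance, and quantization.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    out = []
    for p in pixels:
        # 1. Move to linear intensity and apply the illumination drop.
        lin = srgb_to_linear(p) * light_scale
        # 2. Signal-dependent Gaussian approximation of shot + read noise:
        #    variance = shot_gain * signal + read_sigma^2.
        sigma = (shot_gain * lin + read_sigma ** 2) ** 0.5
        noisy = lin + rng.gauss(0.0, sigma)
        # 3. Clip to the valid range and render back to display space.
        noisy = min(1.0, max(0.0, noisy))
        out.append(linear_to_srgb(noisy))
    return out


if __name__ == "__main__":
    # Three sample pixel values darkened to 10% of their linear intensity.
    print(simulate_low_light([0.2, 0.5, 0.9], light_scale=0.1))
```

Adding the noise in linear space rather than to the display-encoded values is what makes it scale with the dimmed signal, so darker settings lose proportionally more detail after rendering.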
Related papers
- Zoom-IQA: Image Quality Assessment with Reliable Region-Aware Reasoning [32.30800226412995]
We introduce Zoom-IQA, a VLM-based IQA model to explicitly emulate key cognitive behaviors. We show that Zoom-IQA achieves improved robustness, explainability, and generalization. The application to downstream tasks, such as image restoration, further demonstrates the effectiveness of Zoom-IQA.
arXiv Detail & Related papers (2026-01-06T11:00:17Z)
- Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation [18.67176370944511]
Real-world dark images commonly exhibit not only low visibility and contrast but also complex noise and blur, posing significant restoration challenges. We propose a generative framework based on visual autoregressive (VAR) modeling, guided by perceptual priors from the vision-language model (VLM). Our framework is fully unsupervised and achieves state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2025-11-23T19:08:45Z)
- NovisVQ: A Streaming Convolutional Neural Network for No-Reference Opinion-Unaware Frame Quality Assessment [39.76658525158528]
Video quality assessment (VQA) is vital for computer vision tasks, but existing approaches face major limitations. We present a scalable, streaming-based VQA model that is both no-reference and opinion-unaware.
arXiv Detail & Related papers (2025-11-06T18:23:55Z)
- EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark [108.87311276892491]
EgoNight is the first comprehensive benchmark for nighttime egocentric vision. Day-night aligned videos enhance night annotation quality. EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types.
arXiv Detail & Related papers (2025-10-07T17:59:47Z)
- Evaluating Robustness of Vision-Language Models Under Noisy Conditions [0.0176290054713643]
Vision-Language Models (VLMs) have attained exceptional success across multimodal tasks such as image captioning and visual question answering. We present a comprehensive evaluation framework to evaluate the performance of several state-of-the-art VLMs under controlled perturbations.
arXiv Detail & Related papers (2025-09-15T22:31:21Z)
- OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding [50.72259772580637]
We introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We find that both complex-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes.
arXiv Detail & Related papers (2025-07-10T17:56:07Z)
- EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment [68.77813885751308]
EyeSimVQA is a novel VQA framework that incorporates free-energy-based self-repair. We show EyeSimVQA achieves competitive or superior performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-06-13T08:00:54Z)
- VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning [70.44416154144001]
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. We propose VisuRiddles, a benchmark for PRS, featuring tasks meticulously constructed to assess models' reasoning capacities. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions.
arXiv Detail & Related papers (2025-06-03T07:24:00Z)
- V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. We show V-MAGE provides actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.
arXiv Detail & Related papers (2025-04-08T15:43:01Z)
- OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions [0.0]
Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. The study focuses on evaluating state-of-the-art semantic mapping algorithms under varying indoor lighting conditions.
arXiv Detail & Related papers (2025-03-13T13:07:51Z)
- Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models [0.6715525121432597]
This research presents a novel vision language model (VLM) framework to enhance feature extraction, scalability, and efficiency. We evaluate the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
arXiv Detail & Related papers (2025-03-08T01:22:10Z)
- Dusk Till Dawn: Self-supervised Nighttime Stereo Depth Estimation using Visual Foundation Models [16.792458193160407]
Self-supervised depth estimation algorithms rely heavily on frame-warping relationships. We introduce an algorithm designed to achieve accurate self-supervised stereo depth estimation focusing on nighttime conditions.
arXiv Detail & Related papers (2024-05-18T03:07:23Z)
- NPHardEval4V: Dynamic Evaluation of Large Vision-Language Models with Effects of Vision [64.83085920775316]
We introduce NPHardEval4V, a multimodal benchmark suite grounded in four classical NP-hard problems. Each task is presented through a combination of structured visual layouts and textual prompts, designed to assess the ability of LVLMs to perform reasoning under visual-linguistic constraints. Our results show that while these models perform reasonably well on perception-based inputs, they struggle with global optimization, abstraction, and constraint satisfaction.
arXiv Detail & Related papers (2024-03-04T07:10:31Z)