Related papers: DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments

URL: http://arxiv.org/abs/2512.24985v2
Date: Tue, 06 Jan 2026 05:24:09 GMT
Title: DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments
Authors: Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh
Abstract summary: Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations. We present DarkEQA, an open-source benchmark for evaluating EQA-relevant primitives under multi-level low-light conditions.
Score: 24.527536145236894
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments--a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Project website: https://darkeqa-benchmark.github.io/
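The abstract describes degradations applied in linear space: an illumination drop, then sensor noise, then an ISP-style rendering. A minimal sketch of that general idea follows. It is not the DarkEQA pipeline: it starts from an sRGB image rather than RAW data, uses the standard sRGB transfer curve as a stand-in for the ISP step, and the function names and the `light_scale`, `full_well`, and `read_noise_std` parameters are assumptions chosen for illustration.

```python
import numpy as np


def srgb_to_linear(img):
    """Invert the standard sRGB transfer curve (input values in [0, 1])."""
    img = np.asarray(img, dtype=np.float64)
    return np.where(img <= 0.04045, img / 12.92, ((img + 0.055) / 1.055) ** 2.4)


def linear_to_srgb(lin):
    """Apply the standard sRGB transfer curve after clipping to [0, 1]."""
    lin = np.clip(lin, 0.0, 1.0)
    return np.where(lin <= 0.0031308, lin * 12.92, 1.055 * lin ** (1 / 2.4) - 0.055)


def simulate_low_light(img, light_scale=0.1, full_well=1000.0,
                       read_noise_std=2.0, rng=None):
    """Darken an sRGB image in linear space and add shot and read noise.

    All parameters are hypothetical illustration values, not values
    taken from the DarkEQA paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Illumination drop: scale the linear signal.
    linear = srgb_to_linear(img) * light_scale
    # Shot noise: photon arrival is modeled as a Poisson process.
    electrons = rng.poisson(linear * full_well).astype(np.float64)
    # Read noise: additive Gaussian noise from the sensor electronics.
    electrons += rng.normal(0.0, read_noise_std, size=electrons.shape)
    # Rendering: normalize and re-encode with the sRGB curve.
    return linear_to_srgb(electrons / full_well)


if __name__ == "__main__":
    frame = np.full((4, 4, 3), 0.5)  # flat mid-gray test image
    dark = simulate_low_light(frame, rng=np.random.default_rng(0))
    print(dark.shape, float(dark.mean()))
```

Applying the scaling and noise before the transfer curve, rather than directly on display-encoded pixels, keeps the noise proportional to the captured signal, which is the property the abstract refers to as physical fidelity.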

Related papers

Zoom-IQA: Image Quality Assessment with Reliable Region-Aware Reasoning [32.30800226412995]
We introduce Zoom-IQA, a VLM-based IQA model to explicitly emulate key cognitive behaviors. We show that Zoom-IQA achieves improved robustness, explainability, and generalization. The application to downstream tasks, such as image restoration, further demonstrates the effectiveness of Zoom-IQA.
arXiv Detail & Related papers (2026-01-06T11:00:17Z)
Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation [18.67176370944511]
Real-world dark images commonly exhibit not only low visibility and contrast but also complex noise and blur, posing significant restoration challenges. We propose a generative framework based on visual autoregressive (VAR) modeling, guided by perceptual priors from the vision-language model (VLM). Our framework is fully unsupervised and achieves state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2025-11-23T19:08:45Z)
NovisVQ: A Streaming Convolutional Neural Network for No-Reference Opinion-Unaware Frame Quality Assessment [39.76658525158528]
Video quality assessment (VQA) is vital for computer vision tasks, but existing approaches face major limitations. We present a scalable, streaming-based VQA model that is both no-reference and opinion-unaware.
arXiv Detail & Related papers (2025-11-06T18:23:55Z)
EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark [108.87311276892491]
EgoNight is the first comprehensive benchmark for nighttime egocentric vision. Day-night aligned videos enhance night annotation quality. EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types.
arXiv Detail & Related papers (2025-10-07T17:59:47Z)
Evaluating Robustness of Vision-Language Models Under Noisy Conditions [0.0176290054713643]
Vision-Language Models (VLMs) have attained exceptional success across multimodal tasks such as image captioning and visual question answering. We present a comprehensive evaluation framework to evaluate the performance of several state-of-the-art VLMs under controlled perturbations.
arXiv Detail & Related papers (2025-09-15T22:31:21Z)
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding [50.72259772580637]
We introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We find that both complex-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes.
arXiv Detail & Related papers (2025-07-10T17:56:07Z)
EyeSim-VQA: A Free-Energy-Guided Eye Simulation Framework for Video Quality Assessment [68.77813885751308]
EyeSimVQA is a novel VQA framework that incorporates free-energy-based self-repair. We show EyeSimVQA achieves competitive or superior performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-06-13T08:00:54Z)
VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning [70.44416154144001]
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. We propose VisuRiddles, a benchmark for PRS, featuring tasks meticulously constructed to assess models' reasoning capacities. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions.
arXiv Detail & Related papers (2025-06-03T07:24:00Z)
V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. We show V-MAGE provides actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.
arXiv Detail & Related papers (2025-04-08T15:43:01Z)
OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions [0.0]
Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. The study focuses on evaluating state-of-the-art semantic mapping algorithms under varying indoor lighting conditions.
arXiv Detail & Related papers (2025-03-13T13:07:51Z)
Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models [0.6715525121432597]
This research presents a novel vision language model (VLM) framework to enhance feature extraction, scalability, and efficiency. We evaluate the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
arXiv Detail & Related papers (2025-03-08T01:22:10Z)
Dusk Till Dawn: Self-supervised Nighttime Stereo Depth Estimation using Visual Foundation Models [16.792458193160407]
Self-supervised depth estimation algorithms rely heavily on frame-warping relationships. We introduce an algorithm designed to achieve accurate self-supervised stereo depth estimation focusing on nighttime conditions.
arXiv Detail & Related papers (2024-05-18T03:07:23Z)
NPHardEval4V: Dynamic Evaluation of Large Vision-Language Models with Effects of Vision [64.83085920775316]
We introduce NPHardEval4V, a multimodal benchmark suite grounded in four classical NP-hard problems. Each task is presented through a combination of structured visual layouts and textual prompts, designed to assess the ability of LVLMs to perform reasoning under visual-linguistic constraints. Our results show that while these models perform reasonably well on perception-based inputs, they struggle with global optimization, abstraction, and constraint satisfaction.
arXiv Detail & Related papers (2024-03-04T07:10:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.