ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery
- URL: http://arxiv.org/abs/2602.14989v1
- Date: Mon, 16 Feb 2026 18:16:19 GMT
- Title: ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery
- Authors: Ayush Shrivastava, Kirtan Gangani, Laksh Jain, Mayank Goel, Nipun Batra
- Abstract summary: Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. We introduce ThermEval-B, a benchmark to assess the foundational primitives required for thermal vision language understanding.
- Score: 11.547362584832769
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision language modeling.
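The abstract's two key technical ingredients, dense per-pixel temperature maps and colormap transformations, can be illustrated with a minimal sketch. This is not the benchmark's actual pipeline; `apply_colormap` and its toy three-channel ramp are hypothetical stand-ins for the standard colormaps (e.g. matplotlib's `inferno` or `jet`) under which the paper reports model degradation.

```python
import numpy as np

def apply_colormap(temp_map, vmin, vmax):
    """Normalize a per-pixel temperature map (degrees C) to [0, 1] and
    render it as RGB. Toy ramp: red rises first, then green, then blue.
    Hypothetical sketch only; real pipelines use standard colormaps."""
    t = np.clip((temp_map - vmin) / (vmax - vmin), 0.0, 1.0)
    r = np.clip(3 * t, 0.0, 1.0)
    g = np.clip(3 * t - 1.0, 0.0, 1.0)
    b = np.clip(3 * t - 2.0, 0.0, 1.0)
    return np.stack([r, g, b], axis=-1)

# A 2x2 "scene": ambient background at 20 C, one warm body pixel at 36.5 C.
temps = np.array([[20.0, 20.0],
                  [20.0, 36.5]])
rgb = apply_colormap(temps, vmin=15.0, vmax=40.0)
```

The point of such transformations is that the same physical temperatures can map to very different pixel statistics depending on the colormap, which is exactly the axis along which the benchmark probes model robustness.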
Related papers
- TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation [12.591408054941027]
TherA is a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both the scene and object level. TherA achieves state-of-the-art translation performance, improving zero-shot translation by up to 33% averaged across all metrics.
arXiv Detail & Related papers (2026-02-23T01:56:29Z) - GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models [61.786094845872576]
We propose GenColorBench, the first comprehensive benchmark for text-to-image color generation. It is grounded in color systems like ISCC-NBS and CSS3/X11, including numerical colors which are absent elsewhere. With 44K color-focused prompts covering 400+ colors, it reveals models' true capabilities via perceptual and automated assessments.
arXiv Detail & Related papers (2025-10-23T14:12:55Z) - ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation [14.108149959967095]
Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution. We propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation.
arXiv Detail & Related papers (2025-09-29T14:55:51Z) - RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models [11.050867144875435]
We introduce RGB-Th-Bench, the first benchmark designed to evaluate the ability of Vision-Language Models (VLMs) to comprehend RGB-Thermal image pairs. We conduct extensive evaluations on 19 state-of-the-art VLMs, revealing significant performance gaps in RGB-Thermal understanding. Our results show that even the strongest models struggle with thermal image comprehension, with performance heavily constrained by their RGB-based capabilities.
arXiv Detail & Related papers (2025-03-25T13:43:47Z) - ThermoNeRF: Joint RGB and Thermal Novel View Synthesis for Building Facades using Multimodal Neural Radiance Fields [5.66229031510643]
Thermal scene reconstruction holds great potential for various applications, such as analyzing building energy consumption and performing non-destructive infrastructure testing. Existing methods typically require dense scene measurements and often rely on RGB images for 3D geometry reconstruction, projecting thermal information post-reconstruction. We propose ThermoNeRF, a novel approach based on Neural Radiance Fields that jointly renders new RGB and thermal views of a scene, and ThermoScenes, a dataset of paired RGB+thermal images comprising 8 scenes of building facades and 8 scenes of everyday objects.
arXiv Detail & Related papers (2024-03-18T18:10:34Z) - Does Thermal Really Always Matter for RGB-T Salient Object Detection? [153.17156598262656]
This paper proposes a network named TNet to solve the RGB-T salient object detection (SOD) task.
In this paper, we introduce a global illumination estimation module to predict the global illuminance score of the image.
On the other hand, we introduce a two-stage localization and complementation module in the decoding phase to transfer object localization cue and internal integrity cue in thermal features to the RGB modality.
arXiv Detail & Related papers (2022-10-09T13:50:12Z) - Maximizing Self-supervision from Thermal Image for Effective Self-supervised Learning of Depth and Ego-motion [78.19156040783061]
Self-supervised learning of depth and ego-motion from thermal images shows strong robustness and reliability under challenging scenarios.
Inherent thermal image properties such as weak contrast, blurry edges, and noise hinder the generation of effective self-supervision from thermal images.
We propose an effective thermal image mapping method that significantly increases image information, such as overall structure, contrast, and details, while preserving temporal consistency.
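The mapping idea above can be sketched minimally: normalizing a clip of raw thermal frames with one global range, rather than per-frame min-max, boosts usable contrast while keeping the same raw value mapped to the same output across frames (temporal consistency). This is a hedged illustration of the general principle only; `temporally_consistent_rescale` is a hypothetical helper, and the paper's actual mapping is more elaborate.

```python
import numpy as np

def temporally_consistent_rescale(frames, eps=1e-6):
    """Rescale a sequence of raw thermal frames to [0, 1] using a single
    global min/max across the whole clip. Per-frame normalization would
    make identical raw values flicker between frames; a shared range
    preserves temporal consistency."""
    lo = min(float(f.min()) for f in frames)
    hi = max(float(f.max()) for f in frames)
    return [(f - lo) / (hi - lo + eps) for f in frames]

# Two raw frames sharing a background value (100) but differing in peak.
frames = [np.array([[100.0, 200.0]]),
          np.array([[100.0, 300.0]])]
out = temporally_consistent_rescale(frames)
```

Here the background pixel (raw 100) maps to the same output in both frames, whereas per-frame min-max would map it to 0 in one frame and a different value in another.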
arXiv Detail & Related papers (2022-01-12T09:49:24Z) - Meta-UDA: Unsupervised Domain Adaptive Thermal Object Detection using Meta-Learning [64.92447072894055]
Infrared (IR) cameras are robust under adverse illumination and lighting conditions.
We propose an algorithm-agnostic meta-learning framework to improve existing UDA methods.
We produce a state-of-the-art thermal detector for the KAIST and DSIAC datasets.
arXiv Detail & Related papers (2021-10-07T02:28:18Z) - A Large-Scale, Time-Synchronized Visible and Thermal Face Dataset [62.193924313292875]
We present the DEVCOM Army Research Laboratory Visible-Thermal Face dataset (ARL-VTF).
With over 500,000 images from 395 subjects, the ARL-VTF dataset represents, to the best of our knowledge, the largest collection of paired visible and thermal face images to date.
This paper presents benchmark results and analysis on thermal face landmark detection and thermal-to-visible face verification by evaluating state-of-the-art models on the ARL-VTF dataset.
arXiv Detail & Related papers (2021-01-07T17:17:12Z) - Exploring Thermal Images for Object Detection in Underexposure Regions for Autonomous Driving [67.69430435482127]
Underexposure regions are vital to construct a complete perception of the surroundings for safe autonomous driving.
The availability of thermal cameras provides an essential alternative for exploring regions where other optical sensors fail to capture interpretable signals.
This work proposes a domain adaptation framework which employs a style transfer technique for transfer learning from visible spectrum images to thermal images.
arXiv Detail & Related papers (2020-06-01T09:59:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.