RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models
- URL: http://arxiv.org/abs/2503.19654v3
- Date: Sun, 30 Mar 2025 15:08:23 GMT
- Title: RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models
- Authors: Mehdi Moshtaghi, Siavash H. Khajavi, Joni Pajarinen
- Abstract summary: We introduce RGB-Th-Bench, the first benchmark designed to evaluate the ability of Vision-Language Models (VLMs) to comprehend RGB-Thermal image pairs. We conduct extensive evaluations on 19 state-of-the-art VLMs, revealing significant performance gaps in RGB-Thermal understanding. Our results show that even the strongest models struggle with thermal image comprehension, with performance heavily constrained by their RGB-based capabilities.
- Score: 11.050867144875435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce RGB-Th-Bench, the first benchmark designed to evaluate the ability of Vision-Language Models (VLMs) to comprehend RGB-Thermal image pairs. While VLMs have demonstrated remarkable progress in visual reasoning and multimodal understanding, their evaluation has been predominantly limited to RGB-based benchmarks, leaving a critical gap in assessing their capabilities in infrared vision tasks. Existing visible-infrared datasets are either task-specific or lack the high-quality annotations necessary for rigorous model evaluation. To address these limitations, RGB-Th-Bench provides a comprehensive evaluation framework covering 14 distinct skill dimensions, with a total of 1,600+ expert-annotated Yes/No questions. The benchmark employs two accuracy metrics: a standard question-level accuracy and a stricter skill-level accuracy, which evaluates model robustness across multiple questions within each skill dimension. This design ensures a thorough assessment of model performance, including resilience to adversarial and hallucinated responses. We conduct extensive evaluations on 19 state-of-the-art VLMs, revealing significant performance gaps in RGB-Thermal understanding. Our results show that even the strongest models struggle with thermal image comprehension, with performance heavily constrained by their RGB-based capabilities. Additionally, the lack of large-scale, application-specific, expert-annotated thermal-caption-pair datasets in pre-training is an important reason for the observed performance gap. RGB-Th-Bench highlights the urgent need for further advancements in multimodal learning to bridge the gap between visible and thermal image understanding. The dataset is available through this link, and the evaluation code will also be made publicly available.
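To make the two metrics concrete, the following minimal sketch computes both; the record fields (`skill`, `prediction`, `answer`) and the all-questions-correct reading of skill-level accuracy are illustrative assumptions, not the benchmark's released evaluation code.

```python
# Hypothetical record format: {"skill": str, "prediction": "Yes"/"No", "answer": "Yes"/"No"}.
from collections import defaultdict

def question_level_accuracy(records):
    """Standard metric: fraction of individual Yes/No questions answered correctly."""
    return sum(r["prediction"] == r["answer"] for r in records) / len(records)

def skill_level_accuracy(records):
    """Stricter metric: a skill dimension counts as correct only if every
    question belonging to it is answered correctly."""
    per_skill = defaultdict(list)
    for r in records:
        per_skill[r["skill"]].append(r["prediction"] == r["answer"])
    return sum(all(flags) for flags in per_skill.values()) / len(per_skill)
```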
Related papers
- ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery [11.547362584832769]
Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. We introduce ThermEval-B, a benchmark to assess the foundational primitives required for thermal vision language understanding.
arXiv Detail & Related papers (2026-02-16T18:16:19Z) - RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding [69.98331019544166]
Multimodal Large Language Models (MLLMs) are primarily pre-trained on the RGB modality. We propose RGBX-R1, a framework to enhance MLLMs' perception and reasoning capacities across various X visual modalities.
arXiv Detail & Related papers (2026-01-31T04:13:57Z) - RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios [37.32297511767527]
We present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and thermal-infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions.
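As a rough illustration of the annotation structure described in that abstract, one sample might be represented as below; the field names and comment examples are assumptions inferred from the abstract, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class RGBTGroundSample:
    rgb_path: str                             # visible image
    tir_path: str                             # spatially aligned thermal-infrared image
    expression: str                           # natural-language referring expression
    bbox: Tuple[float, float, float, float]   # referred object box (x, y, w, h)
    scene_label: str                          # scene-level annotation
    environment_label: str                    # environment-level annotation (e.g. lighting, weather)
    object_label: str                         # object-level annotation
```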
arXiv Detail & Related papers (2025-12-31T02:01:02Z) - Rethinking Evaluation of Infrared Small Target Detection [105.59753496831739]
This paper introduces a hybrid-level metric that incorporates pixel- and target-level performance, proposes a systematic error analysis method, and emphasizes the importance of cross-dataset evaluation. An open-source toolkit has been released to facilitate standardized benchmarking.
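One plausible form of such a hybrid-level metric is sketched below, combining pixel-level IoU with a target-level detection rate over connected components; the weighting and matching rule are illustrative assumptions, since the abstract does not specify them.

```python
import numpy as np
from scipy import ndimage

def hybrid_metric(pred_mask, gt_mask, alpha=0.5, min_overlap=1):
    """Blend a pixel-level and a target-level score for infrared small target detection."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    # Pixel-level term: intersection over union of the binary masks.
    union = np.logical_or(pred, gt).sum()
    iou = np.logical_and(pred, gt).sum() / union if union else 1.0
    # Target-level term: fraction of ground-truth targets hit by the prediction.
    labels, n_targets = ndimage.label(gt)
    if n_targets == 0:
        detection_rate = 1.0
    else:
        hits = sum(np.logical_and(pred, labels == i).sum() >= min_overlap
                   for i in range(1, n_targets + 1))
        detection_rate = hits / n_targets
    return alpha * iou + (1 - alpha) * detection_rate
```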
arXiv Detail & Related papers (2025-09-21T02:45:07Z) - HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding [79.06209664703258]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z) - Spectral-Aware Global Fusion for RGB-Thermal Semantic Segmentation [10.761216101789774]
We propose the Spectral-aware Global Fusion Network (SGFNet) to enhance and fuse the multi-modal features. SGFNet outperforms the state-of-the-art methods on the MFNet and PST900 datasets.
arXiv Detail & Related papers (2025-05-21T13:17:57Z) - KAN-SAM: Kolmogorov-Arnold Network Guided Segment Anything Model for RGB-T Salient Object Detection [35.52055285209549]
We propose a novel prompt learning-based RGB-T SOD method, named KAN-SAM, which reveals the potential of visual foundational models for RGB-T SOD tasks.
Specifically, we extend Segment Anything Model 2 (SAM2) for RGB-T SOD by introducing thermal features as guiding prompts through efficient and accurate Kolmogorov-Arnold Network (KAN) adapters.
We also introduce a mutually exclusive random masking strategy to reduce reliance on RGB data and improve generalization.
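A minimal sketch of what a mutually exclusive random masking step could look like is given below: patches masked in the RGB input stay visible in the thermal input and vice versa, forcing the model to rely on thermal cues where RGB is missing. Patch size, mask ratio, and the 50/50 assignment are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def mutually_exclusive_mask(rgb, thermal, patch=16, ratio=0.3):
    """rgb, thermal: (B, C, H, W) tensors with H and W divisible by `patch`."""
    B, _, H, W = rgb.shape
    gh, gw = H // patch, W // patch
    # Randomly pick patches to mask, then split them into two disjoint sets
    # so no spatial location is masked in both modalities at once.
    masked = torch.rand(B, 1, gh, gw, device=rgb.device) < ratio
    assign_to_rgb = torch.rand(B, 1, gh, gw, device=rgb.device) < 0.5
    rgb_mask = (masked & assign_to_rgb).float()
    thermal_mask = (masked & ~assign_to_rgb).float()
    # Upsample patch masks to pixel resolution and zero out the masked regions.
    up = lambda m: F.interpolate(m, scale_factor=patch, mode="nearest")
    return rgb * (1 - up(rgb_mask)), thermal * (1 - up(thermal_mask))
```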
arXiv Detail & Related papers (2025-04-08T10:07:02Z) - IAM: Enhancing RGB-D Instance Segmentation with New Benchmarks [4.3266254914862445]
RGB-D segmentation promises richer scene understanding than RGB-only methods. There is a relative scarcity of instance-level RGB-D segmentation datasets. We introduce three RGB-D instance segmentation benchmarks, distinguished at the instance level. We propose a simple yet effective method for RGB-D data integration.
arXiv Detail & Related papers (2025-01-03T08:03:24Z) - Leveraging Color Channel Independence for Improved Unsupervised Object Detection [7.030688465389997]
We challenge the common assumption that RGB images are the optimal color space for unsupervised learning in computer vision. We show that models improve when they are required to predict additional color channels. The use of composite color spaces can be implemented with essentially no computational overhead.
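A composite color space can be built as a simple channel stack, as in the hedged sketch below; the choice of HSV and LAB as the extra spaces is an assumption for illustration, not necessarily the paper's configuration.

```python
import cv2
import numpy as np

def to_composite(rgb_uint8: np.ndarray) -> np.ndarray:
    """(H, W, 3) uint8 RGB image -> (H, W, 9) float32 composite image."""
    hsv = cv2.cvtColor(rgb_uint8, cv2.COLOR_RGB2HSV)
    lab = cv2.cvtColor(rgb_uint8, cv2.COLOR_RGB2LAB)
    composite = np.concatenate([rgb_uint8, hsv, lab], axis=-1)
    # Rough common scaling; note OpenCV's uint8 hue channel only spans 0-179.
    return composite.astype(np.float32) / 255.0
```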
arXiv Detail & Related papers (2024-12-19T18:28:37Z) - Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer [10.982521876026281]
We introduce a diffusion-based framework to address the RGB-D semantic segmentation problem.
We demonstrate that utilizing a Deformable Attention Transformer as the encoder to extract features from depth images effectively captures the characteristics of invalid regions in depth measurements.
arXiv Detail & Related papers (2024-09-23T15:23:01Z) - Towards RGB-NIR Cross-modality Image Registration and Beyond [21.475871648254564]
This paper focuses on the area of RGB(visible)-NIR(near-infrared) cross-modality image registration.
We first present the RGB-NIR Image Registration (RGB-NIR-IRegis) benchmark, which, for the first time, enables fair and comprehensive evaluations.
We then design several metrics to reveal the toxic impact of inconsistent local features between visible and infrared images on the model performance.
arXiv Detail & Related papers (2024-05-30T10:25:50Z) - UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning [19.510261890672165]
We propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks.
Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module.
Experimental results on various RGB-IR semantic tasks demonstrate that our method can achieve state-of-the-art performance.
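The abstract does not detail the MFP/SFI designs, but the general adapter-tuning idea of injecting infrared features into a frozen RGB backbone can be sketched generically as below; the cross-attention injector here is an assumption for illustration, not UniRGB-IR's actual module.

```python
import torch
import torch.nn as nn

class IRFeatureInjector(nn.Module):
    """Generic adapter that supplements frozen ViT tokens with infrared features."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vit_tokens, ir_tokens):
        """vit_tokens: (B, N, D) from the frozen RGB ViT; ir_tokens: (B, M, D)."""
        # Tokens query the IR features and add the result residually, so the
        # frozen backbone's representations are only supplemented, not replaced.
        q = self.norm(vit_tokens)
        injected, _ = self.cross_attn(q, ir_tokens, ir_tokens)
        return vit_tokens + injected
```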
arXiv Detail & Related papers (2024-04-26T12:21:57Z) - Confidence-Aware RGB-D Face Recognition via Virtual Depth Synthesis [48.59382455101753]
2D face recognition encounters challenges in unconstrained environments due to varying illumination, occlusion, and pose.
Recent studies focus on RGB-D face recognition to improve robustness by incorporating depth information.
In this work, we first construct a diverse depth dataset generated by 3D Morphable Models for depth model pre-training.
Then, we propose a domain-independent pre-training framework that utilizes readily available pre-trained RGB and depth models to separately perform face recognition without needing additional paired data for retraining.
arXiv Detail & Related papers (2024-03-11T09:12:24Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose the Dual Swin-Transformer based Mutual Interactive Network (DTMINet).
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - RGB-D Saliency Detection via Cascaded Mutual Information Minimization [122.8879596830581]
Existing RGB-D saliency detection models do not explicitly encourage RGB and depth to achieve effective multi-modal learning.
We introduce a novel multi-stage cascaded learning framework via mutual information minimization to "explicitly" model the multi-modal information between RGB image and depth data.
arXiv Detail & Related papers (2021-09-15T12:31:27Z) - Learning Selective Mutual Attention and Contrast for RGB-D Saliency Detection [145.4919781325014]
How to effectively fuse cross-modal information is the key problem for RGB-D salient object detection.
Many models use the feature fusion strategy but are limited by the low-order point-to-point fusion methods.
We propose a novel mutual attention model by fusing attention and contexts from different modalities.
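A generic rendering of the mutual-attention idea (each modality's queries attend over the other modality's keys and values) is sketched below; it illustrates the cross-modal fusion pattern only and is not the paper's exact selective-attention formulation.

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Cross-modal attention: RGB attends over depth features and vice versa."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.rgb_from_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_feat, depth_feat):
        """rgb_feat, depth_feat: (B, N, D) flattened spatial features."""
        rgb_ctx, _ = self.rgb_from_depth(rgb_feat, depth_feat, depth_feat)
        depth_ctx, _ = self.depth_from_rgb(depth_feat, rgb_feat, rgb_feat)
        # Residual fusion keeps each stream's own context alongside the other's.
        return rgb_feat + rgb_ctx, depth_feat + depth_ctx
```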
arXiv Detail & Related papers (2020-10-12T08:50:10Z) - Siamese Network for RGB-D Salient Object Detection and Beyond [113.30063105890041]
A novel framework is proposed to learn from both RGB and depth inputs through a shared network backbone.
Comprehensive experiments using five popular metrics show that the designed framework yields a robust RGB-D saliency detector.
We also link JL-DCF to the RGB-D semantic segmentation field, showing its capability of outperforming several semantic segmentation models.
arXiv Detail & Related papers (2020-08-26T06:01:05Z) - RGB-D Salient Object Detection: A Survey [195.83586883670358]
We provide a comprehensive survey of RGB-D based SOD models from various perspectives.
We also review SOD models and popular benchmark datasets from this domain.
We discuss several challenges and open directions of RGB-D based SOD for future research.
arXiv Detail & Related papers (2020-08-01T10:01:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.