UNIV: Unified Foundation Model for Infrared and Visible Modalities
- URL: http://arxiv.org/abs/2509.15642v1
- Date: Fri, 19 Sep 2025 06:07:53 GMT
- Title: UNIV: Unified Foundation Model for Infrared and Visible Modalities
- Authors: Fangyuan Mao, Shuo Wang, Jilin Mei, Chen Min, Shun Lu, Fuyang Liu, Yu Hu
- Abstract summary: We propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV). PCCL is an attention-guided distillation framework that mimics retinal horizontal cells' lateral inhibition. Our dual-knowledge preservation mechanism emulates the retina's bipolar cell signal routing.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The demand for joint visible (RGB) and infrared perception is growing rapidly, particularly to achieve robust performance under diverse weather conditions. Although pre-trained models for visible and infrared data excel in their respective domains, they often underperform in multimodal scenarios, such as autonomous vehicles equipped with both sensors. To address this challenge, we propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV), featuring two key innovations. First, we introduce Patch-wise Cross-modality Contrastive Learning (PCCL), an attention-guided distillation framework that mimics retinal horizontal cells' lateral inhibition, which enables effective cross-modal feature alignment while remaining compatible with any transformer-based architecture. Second, our dual-knowledge preservation mechanism emulates the retina's bipolar cell signal routing, combining LoRA adapters (2% added parameters) with synchronous distillation to prevent catastrophic forgetting, thereby replicating the retina's photopic (cone-driven) and scotopic (rod-driven) functionality. To support cross-modal learning, we introduce the MVIP dataset, the most comprehensive visible-infrared benchmark to date. It contains 98,992 precisely aligned image pairs spanning diverse scenarios. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU in semantic segmentation and +0.7 mAP in object detection) while maintaining 99%+ of the baseline performance on visible (RGB) tasks. Our code is available at https://github.com/fangyuanmao/UNIV.
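The abstract's two key mechanisms can be sketched in a few lines. The block below is a hypothetical, dependency-free illustration, not the authors' implementation: (1) a patch-wise InfoNCE-style contrastive loss in the spirit of PCCL, where each infrared patch treats the visible patch at the same spatial index as its positive and all other visible patches as negatives (the attention-guided distillation part is omitted); and (2) a LoRA-style low-rank update on a frozen linear layer, the mechanism behind the "2% added parameters" figure. All names, dimensions, and the temperature value are invented for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (no-op for the zero vector)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def patchwise_contrastive_loss(ir_patches, vis_patches, tau=0.07):
    """InfoNCE over spatially aligned patch features: for each IR patch,
    the visible patch at the same index is the positive; every other
    visible patch in the pair acts as a negative."""
    ir = [l2_normalize(p) for p in ir_patches]
    vis = [l2_normalize(p) for p in vis_patches]
    loss = 0.0
    for i, q in enumerate(ir):
        logits = [sum(a * b for a, b in zip(q, k)) / tau for k in vis]
        m = max(logits)  # stabilise the log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax at the positive index
    return loss / len(ir)

class LoRALinear:
    """Frozen weight W plus a low-rank update: y = W x + (alpha / r) * B (A x).
    Only A (r x d_in) and B (d_out x r) are trainable, so the added parameter
    count is r * (d_in + d_out) versus d_in * d_out for W itself -- a few
    percent of the base layer for small ranks r."""
    def __init__(self, W, A, B, alpha=1.0):
        self.W, self.A, self.B = W, A, B
        self.scale = alpha / len(A)  # len(A) == rank r
    def __call__(self, x):
        base = [sum(w * xi for w, xi in zip(row, x)) for row in self.W]
        ax = [sum(a * xi for a, xi in zip(row, x)) for row in self.A]
        return [b + self.scale * sum(br * a for br, a in zip(row, ax))
                for b, row in zip(base, self.B)]
```

With B initialised to zero, the adapted layer reproduces the frozen layer exactly, which is why LoRA-style adapters start from the pretrained behaviour and only gradually specialise; likewise, perfectly aligned IR/visible patch features drive the contrastive loss toward its minimum.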
Related papers
- MI-DETR: A Strong Baseline for Moving Infrared Small Target Detection with Bio-Inspired Motion Integration [63.87179575890912]
We propose Motion Integration DETR (MI-DETR), a bio-inspired dual-pathway detector for infrared small target detection. First, a retina-inspired cellular automaton (RCA) converts raw frame sequences into a motion map defined on the same pixel grid as the appearance image. Second, a Parvocellular-Magnocellular Interconnection (PMI) Block facilitates bidirectional feature interaction between the two pathways.
arXiv Detail & Related papers (2026-03-05T11:39:31Z) - IrisNet: Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection [92.56025546608699]
IrisNet is a novel meta-learned framework that adapts detection strategies to the input infrared image status. Our approach establishes a dynamic mapping between infrared image features and entire decoder parameters. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets demonstrate the superiority of our IrisNet.
arXiv Detail & Related papers (2025-11-25T13:53:54Z) - MambaRefine-YOLO: A Dual-Modality Small Object Detector for UAV Imagery [1.005854289245731]
Small object detection in Unmanned Aerial Vehicle (UAV) imagery is a persistent challenge, hindered by low resolution and background clutter. We introduce Mamba-YOLO, a fusion module that balances RGB and IR modalities through illumination-aware and difference-aware gating mechanisms. Our work presents a superior balance between accuracy and speed, making it highly suitable for real-world UAV applications.
arXiv Detail & Related papers (2025-11-24T13:59:01Z) - CDUPatch: Color-Driven Universal Adversarial Patch Attack for Dual-Modal Visible-Infrared Detectors [6.8163437709379835]
Adversarial patches are widely used to evaluate the robustness of object detection systems in real-world scenarios. We propose CDUPatch, a universal cross-modal patch attack against visible-infrared object detectors across scales, views, and scenarios. By learning an optimal color distribution on the adversarial patch, we can manipulate its thermal response and generate an adversarial infrared texture.
arXiv Detail & Related papers (2025-04-15T05:46:00Z) - Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset [65.76480665062363]
Human Activity Recognition has primarily relied on traditional RGB cameras to achieve high performance. Challenges in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras. In this work, we rethink human activity recognition by combining RGB and event cameras.
arXiv Detail & Related papers (2025-04-08T09:14:24Z) - Multi-Domain Biometric Recognition using Body Embeddings [51.36007967653781]
We show that body embeddings perform better than face embeddings in medium-wave infrared (MWIR) and long-wave infrared (LWIR) domains. We leverage a vision transformer architecture to establish benchmark results on the IJB-MDF dataset. We also show that finetuning a body model, pretrained exclusively on VIS data, with a simple combination of cross-entropy and triplet losses achieves state-of-the-art mAP scores.
arXiv Detail & Related papers (2025-03-13T22:38:18Z) - Bringing RGB and IR Together: Hierarchical Multi-Modal Enhancement for Robust Transmission Line Detection [67.02804741856512]
We propose a novel Hierarchical Multi-Modal Enhancement Network (HMMEN) that integrates RGB and IR data for robust and accurate TL detection. Our method introduces two key components: (1) a Mutual Multi-Modal Enhanced Block (MMEB), which fuses and enhances hierarchical RGB and IR feature maps in a coarse-to-fine manner, and (2) a Feature Alignment Block (FAB) that corrects misalignments between decoder outputs and IR feature maps by leveraging deformable convolutions.
arXiv Detail & Related papers (2025-01-25T06:21:06Z) - MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection [12.462709547836289]
Using multiple modalities like visible (RGB) and infrared (IR) can greatly improve the performance of a predictive task such as object detection (OD).
In this paper, we tackle a different way to employ RGB and IR modalities, where only one modality or the other is observed by a single shared vision encoder.
This work investigates how to efficiently leverage RGB and IR modalities to train a common transformer-based OD vision encoder, while countering the effects of modality imbalance.
arXiv Detail & Related papers (2024-04-29T16:42:58Z) - UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning [19.510261890672165]
We propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks. Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module. Experimental results on various RGB-IR semantic tasks demonstrate that our method can achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-04-26T12:21:57Z) - CycleTrans: Learning Neutral yet Discriminative Features for Visible-Infrared Person Re-Identification [79.84912525821255]
Visible-infrared person re-identification (VI-ReID) is a task of matching the same individuals across the visible and infrared modalities.
Existing VI-ReID methods mainly focus on learning general features across modalities, often at the expense of feature discriminability.
We present a novel cycle-construction-based network for neutral yet discriminative feature learning, termed CycleTrans.
arXiv Detail & Related papers (2022-08-21T08:41:40Z) - Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection [16.64781797503128]
RGB-thermal salient object detection (RGB-T SOD) aims to locate the common prominent objects of an aligned visible and thermal infrared image pair.
In this paper, we propose a novel mirror complementary Transformer network (MCNet) for RGB-T SOD.
Experiments on benchmark and VT723 datasets show that the proposed method outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2022-07-07T20:26:09Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose a Dual Swin-Transformer based Mutual Interactive Network (DTMINet).
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - Drone-based RGB-Infrared Cross-Modality Vehicle Detection via
Uncertainty-Aware Learning [59.19469551774703]
Drone-based vehicle detection aims at finding the vehicle locations and categories in an aerial image.
We construct a large-scale drone-based RGB-Infrared vehicle detection dataset, termed DroneVehicle.
Our DroneVehicle collects 28,439 RGB-Infrared image pairs, covering urban roads, residential areas, parking lots, and other scenarios from day to night.
arXiv Detail & Related papers (2020-03-05T05:29:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated summaries (including all information) and is not responsible for any consequences of their use.