UNIV: Unified Foundation Model for Infrared and Visible Modalities
- URL: http://arxiv.org/abs/2509.15642v1
- Date: Fri, 19 Sep 2025 06:07:53 GMT
- Title: UNIV: Unified Foundation Model for Infrared and Visible Modalities
- Authors: Fangyuan Mao, Shuo Wang, Jilin Mei, Chen Min, Shun Lu, Fuyang Liu, Yu Hu
- Abstract summary: We propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV). PCCL is an attention-guided distillation framework that mimics retinal horizontal cells' lateral inhibition. Our dual-knowledge preservation mechanism emulates the retina's bipolar cell signal routing.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The demand for joint visible (RGB) and infrared perception is growing rapidly, particularly to achieve robust performance under diverse weather conditions. Although pre-trained models for visible and infrared data excel in their respective domains, they often underperform in multimodal scenarios, such as autonomous vehicles equipped with both sensors. To address this challenge, we propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV), featuring two key innovations. First, we introduce Patch-wise Cross-modality Contrastive Learning (PCCL), an attention-guided distillation framework that mimics retinal horizontal cells' lateral inhibition, which enables effective cross-modal feature alignment while remaining compatible with any transformer-based architecture. Second, our dual-knowledge preservation mechanism emulates the retina's bipolar cell signal routing, combining LoRA adapters (2% added parameters) with synchronous distillation to prevent catastrophic forgetting, thereby replicating the retina's photopic (cone-driven) and scotopic (rod-driven) functionality. To support cross-modal learning, we introduce the MVIP dataset, the most comprehensive visible-infrared benchmark to date. It contains 98,992 precisely aligned image pairs spanning diverse scenarios. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU in semantic segmentation and +0.7 mAP in object detection) while maintaining 99%+ of the baseline performance on visible (RGB) tasks. Our code is available at https://github.com/fangyuanmao/UNIV.
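The abstract's two key mechanisms can be sketched in a few lines. The block below is a hypothetical, dependency-free illustration, not the authors' implementation: (1) a patch-wise InfoNCE-style contrastive loss in the spirit of PCCL, where each infrared patch treats the visible patch at the same spatial index as its positive and all other visible patches as negatives (the attention-guided distillation part is omitted); and (2) a LoRA-style low-rank update on a frozen linear layer, the mechanism behind the "2% added parameters" figure. All names, dimensions, and the temperature value are invented for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (no-op for the zero vector)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def patchwise_contrastive_loss(ir_patches, vis_patches, tau=0.07):
    """InfoNCE over spatially aligned patch features: for each IR patch,
    the visible patch at the same index is the positive; every other
    visible patch in the pair acts as a negative."""
    ir = [l2_normalize(p) for p in ir_patches]
    vis = [l2_normalize(p) for p in vis_patches]
    loss = 0.0
    for i, q in enumerate(ir):
        logits = [sum(a * b for a, b in zip(q, k)) / tau for k in vis]
        m = max(logits)  # stabilise the log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax at the positive index
    return loss / len(ir)

class LoRALinear:
    """Frozen weight W plus a low-rank update: y = W x + (alpha / r) * B (A x).
    Only A (r x d_in) and B (d_out x r) are trainable, so the added parameter
    count is r * (d_in + d_out) versus d_in * d_out for W itself -- a few
    percent of the base layer for small ranks r."""
    def __init__(self, W, A, B, alpha=1.0):
        self.W, self.A, self.B = W, A, B
        self.scale = alpha / len(A)  # len(A) == rank r
    def __call__(self, x):
        base = [sum(w * xi for w, xi in zip(row, x)) for row in self.W]
        ax = [sum(a * xi for a, xi in zip(row, x)) for row in self.A]
        return [b + self.scale * sum(br * a for br, a in zip(row, ax))
                for b, row in zip(base, self.B)]
```

With B initialised to zero, the adapted layer reproduces the frozen layer exactly, which is why LoRA-style adapters start from the pretrained behaviour and only gradually specialise; likewise, perfectly aligned IR/visible patch features drive the contrastive loss toward its minimum.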
Related papers
- MI-DETR: A Strong Baseline for Moving Infrared Small Target Detection with Bio-Inspired Motion Integration [63.87179575890912]
We propose Motion Integration DETR (MI-DETR), a bio-inspired dual-pathway detector for infrared small target detection. First, a retina-inspired cellular automaton (RCA) converts raw frame sequences into a motion map defined on the same pixel grid as the appearance image. Second, a Parvocellular-Magnocellular Interconnection (PMI) Block facilitates bidirectional feature interaction between the two pathways.
arXiv Detail & Related papers (2026-03-05T11:39:31Z) - IrisNet: Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection [92.56025546608699]
IrisNet is a novel meta-learned framework that adapts detection strategies to the input infrared image status. Our approach establishes a dynamic mapping between infrared image features and entire decoder parameters. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets demonstrate the superiority of our IrisNet.
arXiv Detail & Related papers (2025-11-25T13:53:54Z) - MambaRefine-YOLO: A Dual-Modality Small Object Detector for UAV Imagery [1.005854289245731]
Small object detection in Unmanned Aerial Vehicle (UAV) imagery is a persistent challenge, hindered by low resolution and background clutter. We introduce Mamba-YOLO, a fusion module that balances RGB and IR modalities through illumination-aware and difference-aware gating mechanisms. Our work presents a superior balance between accuracy and speed, making it highly suitable for real-world UAV applications.
arXiv Detail & Related papers (2025-11-24T13:59:01Z) - CDUPatch: Color-Driven Universal Adversarial Patch Attack for Dual-Modal Visible-Infrared Detectors [6.8163437709379835]
Adversarial patches are widely used to evaluate the robustness of object detection systems in real-world scenarios. We propose CDUPatch, a universal cross-modal patch attack against visible-infrared object detectors across scales, views, and scenarios. By learning an optimal color distribution on the adversarial patch, we can manipulate its thermal response and generate an adversarial infrared texture.
arXiv Detail & Related papers (2025-04-15T05:46:00Z) - Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset [65.76480665062363]
Human Activity Recognition has primarily relied on traditional RGB cameras to achieve high performance. Challenges in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras. In this work, we rethink human activity recognition by combining RGB and event cameras.
arXiv Detail & Related papers (2025-04-08T09:14:24Z) - Multi-Domain Biometric Recognition using Body Embeddings [51.36007967653781]
We show that body embeddings perform better than face embeddings in medium-wave infrared (MWIR) and long-wave infrared (LWIR) domains. We leverage a vision transformer architecture to establish benchmark results on the IJB-MDF dataset. We also show that finetuning a body model, pretrained exclusively on VIS data, with a simple combination of cross-entropy and triplet losses achieves state-of-the-art mAP scores.
arXiv Detail & Related papers (2025-03-13T22:38:18Z) - Bringing RGB and IR Together: Hierarchical Multi-Modal Enhancement for Robust Transmission Line Detection [67.02804741856512]
We propose a novel Hierarchical Multi-Modal Enhancement Network (HMMEN) that integrates RGB and IR data for robust and accurate TL detection. Our method introduces two key components: (1) a Mutual Multi-Modal Enhanced Block (MMEB), which fuses and enhances hierarchical RGB and IR feature maps in a coarse-to-fine manner, and (2) a Feature Alignment Block (FAB) that corrects misalignments between decoder outputs and IR feature maps by leveraging deformable convolutions.
arXiv Detail & Related papers (2025-01-25T06:21:06Z) - MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection [12.462709547836289]
Using multiple modalities like visible (RGB) and infrared (IR) can greatly improve the performance of a predictive task such as object detection (OD).
In this paper, we tackle a different way to employ RGB and IR modalities, where only one modality or the other is observed by a single shared vision encoder.
This work investigates how to efficiently leverage RGB and IR modalities to train a common transformer-based OD vision encoder, while countering the effects of modality imbalance.
arXiv Detail & Related papers (2024-04-29T16:42:58Z) - UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning [19.510261890672165]
We propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks. Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module. Experimental results on various RGB-IR semantic tasks demonstrate that our method can achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-04-26T12:21:57Z) - CycleTrans: Learning Neutral yet Discriminative Features for Visible-Infrared Person Re-Identification [79.84912525821255]
Visible-infrared person re-identification (VI-ReID) is a task of matching the same individuals across the visible and infrared modalities.
Existing VI-ReID methods mainly focus on learning general features across modalities, often at the expense of feature discriminability.
We present a novel cycle-construction-based network for neutral yet discriminative feature learning, termed CycleTrans.
arXiv Detail & Related papers (2022-08-21T08:41:40Z) - Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection [16.64781797503128]
RGB-thermal salient object detection (RGB-T SOD) aims to locate the common prominent objects of an aligned visible and thermal infrared image pair.
In this paper, we propose a novel mirror complementary Transformer network (MCNet) for RGB-T SOD.
Experiments on benchmark and VT723 datasets show that the proposed method outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2022-07-07T20:26:09Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose a Dual Swin-Transformer based Mutual Interactive Network (DTMINet).
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - Drone-based RGB-Infrared Cross-Modality Vehicle Detection via
Uncertainty-Aware Learning [59.19469551774703]
Drone-based vehicle detection aims at finding the vehicle locations and categories in an aerial image.
We construct a large-scale drone-based RGB-Infrared vehicle detection dataset, termed DroneVehicle.
Our DroneVehicle collects 28,439 RGB-Infrared image pairs, covering urban roads, residential areas, parking lots, and other scenarios from day to night.
arXiv Detail & Related papers (2020-03-05T05:29:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated summaries (including all information) and is not responsible for any consequences of their use.