IVGF: The Fusion-Guided Infrared and Visible General Framework
- URL: http://arxiv.org/abs/2409.00973v2
- Date: Sat, 14 Sep 2024 15:05:59 GMT
- Title: IVGF: The Fusion-Guided Infrared and Visible General Framework
- Authors: Fangcen Liu, Chenqiang Gao, Fang Chen, Pengcheng Li, Junjie Guo, Deyu Meng,
- Abstract summary: Infrared and visible dual-modality tasks can achieve robust performance even in extreme scenes by fusing complementary information.
We propose a fusion-guided infrared and visible general framework, IVGF, which can be easily extended to many high-level vision tasks.
- Score: 41.07925395888705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Infrared and visible dual-modality tasks such as semantic segmentation and object detection can achieve robust performance even in extreme scenes by fusing complementary information. Most current methods design task-specific frameworks, which are limited in generalization across multiple tasks. In this paper, we propose a fusion-guided infrared and visible general framework, IVGF, which can be easily extended to many high-level vision tasks. Firstly, we adopt the SOTA infrared and visible foundation models to extract the general representations. Then, to enrich the semantics information of these general representations for high-level vision tasks, we design the feature enhancement module and token enhancement module for feature maps and tokens, respectively. Besides, the attention-guided fusion module is proposed for effectively fusing by exploring the complementary information of two modalities. Moreover, we also adopt the cutout&mix augmentation strategy to conduct the data augmentation, which further improves the ability of the model to mine the regional complementary between the two modalities. Extensive experiments show that the IVGF outperforms state-of-the-art dual-modality methods in the semantic segmentation and object detection tasks. The detailed ablation studies demonstrate the effectiveness of each module, and another experiment explores the anti-missing modality ability of the proposed method in the dual-modality semantic segmentation task.
Related papers
- A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion [41.34335755315773]
Multi-modality image fusion aims at fusing specific-modality and shared-modality information from two source images.
We propose a three-branch encoder-decoder architecture along with corresponding fusion layers as the fusion strategy.
Our method has obtained competitive results compared with state-of-the-art methods in visible/infrared image fusion and medical image fusion tasks.
arXiv Detail & Related papers (2024-06-11T09:32:40Z) - Dual-modal Prior Semantic Guided Infrared and Visible Image Fusion for Intelligent Transportation System [22.331591533400402]
Infrared and visible image fusion (IVF) plays an important role in intelligent transportation system (ITS)
We propose a novel prior semantic guided image fusion method based on the dual-modality strategy.
arXiv Detail & Related papers (2024-03-24T16:41:50Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction [13.454953507205278]
Multi-Modal Relation Extraction aims at identifying the relation between two entities in texts that contain visual clues.
We propose a novel MMRE framework to better capture the deeper correlations of text, entity pair, and image/objects.
Our approach achieves excellent performance compared to strong competitors, even in the few-shot situation.
arXiv Detail & Related papers (2023-06-19T15:31:34Z) - Bi-level Dynamic Learning for Jointly Multi-modality Image Fusion and
Beyond [50.556961575275345]
We build an image fusion module to fuse complementary characteristics and cascade dual task-related modules.
We develop an efficient first-order approximation to compute corresponding gradients and present dynamic weighted aggregation to balance the gradients for fusion learning.
arXiv Detail & Related papers (2023-05-11T10:55:34Z) - Semantics-Depth-Symbiosis: Deeply Coupled Semi-Supervised Learning of
Semantics and Depth [83.94528876742096]
We tackle the MTL problem of two dense tasks, ie, semantic segmentation and depth estimation, and present a novel attention module called Cross-Channel Attention Module (CCAM)
In a true symbiotic spirit, we then formulate a novel data augmentation for the semantic segmentation task using predicted depth called AffineMix, and a simple depth augmentation using predicted semantics called ColorAug.
Finally, we validate the performance gain of the proposed method on the Cityscapes dataset, which helps us achieve state-of-the-art results for a semi-supervised joint model based on depth and semantic
arXiv Detail & Related papers (2022-06-21T17:40:55Z) - Target-aware Dual Adversarial Learning and a Multi-scenario
Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection [65.30079184700755]
This study addresses the issue of fusing infrared and visible images that appear differently for object detection.
Previous approaches discover commons underlying the two modalities and fuse upon the common space either by iterative optimization or deep networks.
This paper proposes a bilevel optimization formulation for the joint problem of fusion and detection, and then unrolls to a target-aware Dual Adversarial Learning (TarDAL) network for fusion and a commonly used detection network.
arXiv Detail & Related papers (2022-03-30T11:44:56Z) - One for All: An End-to-End Compact Solution for Hand Gesture Recognition [8.321276216978637]
This paper proposes an end-to-end compact CNN framework: fine grained feature attentive network for hand gesture recognition (Fit-Hand)
The pipeline of the proposed architecture consists of two main units: FineFeat module and dilated convolutional (Conv) layer.
The effectiveness of Fit-Hand is evaluated by using subject dependent (SD) and subject independent (SI) validation setup over seven benchmark datasets.
arXiv Detail & Related papers (2021-05-15T05:10:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.