MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection
- URL: http://arxiv.org/abs/2404.18849v2
- Date: Fri, 2 Aug 2024 16:13:40 GMT
- Title: MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection
- Authors: Heitor R. Medeiros, David Latortue, Eric Granger, Marco Pedersoli
- Abstract summary: Using multiple modalities like visible (RGB) and infrared (IR) can greatly improve the performance of a predictive task such as object detection (OD).
In this paper, we explore a different way to employ RGB and IR modalities, where only one modality or the other is observed by a single shared vision encoder.
This work investigates how to efficiently leverage RGB and IR modalities to train a common transformer-based OD vision encoder while countering the effects of modality imbalance.
- Score: 12.462709547836289
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In real-world scenarios, using multiple modalities like visible (RGB) and infrared (IR) can greatly improve the performance of a predictive task such as object detection (OD). Multimodal learning is a common way to leverage these modalities, where multiple modality-specific encoders and a fusion module are used to improve performance. In this paper, we explore a different way to employ RGB and IR modalities, where only one modality or the other is observed by a single shared vision encoder. This realistic setting has a lower memory footprint and is better suited to applications such as autonomous driving and surveillance, which commonly rely on RGB and IR data. However, when learning a single encoder on multiple modalities, one modality can dominate the other, producing uneven recognition results. This work investigates how to efficiently leverage RGB and IR modalities to train a common transformer-based OD vision encoder while countering the effects of modality imbalance. For this, we introduce a novel training technique to Mix Patches (MiPa) from the two modalities, in conjunction with a patch-wise modality agnostic module, for learning a common representation of both modalities. Our experiments show that MiPa learns a representation that reaches competitive results on traditional RGB/IR benchmarks while requiring only a single modality during inference. Our code is available at: https://github.com/heitorrapela/MiPa.
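To make the patch-mixing idea concrete, here is a minimal PyTorch sketch of mixing RGB and IR patch tokens before a shared transformer encoder. It is an illustration under stated assumptions (ViT-style patch embedding, single-channel IR input, a Bernoulli per-patch mixing ratio `rho`), not the authors' implementation; for that, see the linked repository.

```python
# Minimal sketch of mixing RGB and IR patches for a shared encoder.
# Illustrative only: module names, the 1-channel IR input, and the
# mixing ratio `rho` are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class MixedPatchEmbed(nn.Module):
    """Patchify paired RGB/IR images and mix their patch tokens."""

    def __init__(self, img_size=224, patch_size=16, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # One patch-embedding layer per modality; both feed the same encoder.
        self.rgb_proj = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)
        self.ir_proj = nn.Conv2d(1, embed_dim, patch_size, stride=patch_size)

    def forward(self, rgb, ir, rho=0.5):
        # Patch tokens for each modality: (B, N, D).
        rgb_tok = self.rgb_proj(rgb).flatten(2).transpose(1, 2)
        ir_tok = self.ir_proj(ir).flatten(2).transpose(1, 2)
        # Per-patch Bernoulli mask: True -> take the IR patch, False -> RGB.
        mask = (torch.rand(rgb_tok.shape[:2], device=rgb.device) < rho).unsqueeze(-1)
        return torch.where(mask, ir_tok, rgb_tok)


# Usage: the mixed token sequence is fed to a single shared transformer encoder.
embed = MixedPatchEmbed()
rgb = torch.randn(2, 3, 224, 224)   # visible images
ir = torch.randn(2, 1, 224, 224)    # infrared images
tokens = embed(rgb, ir, rho=0.5)    # (2, 196, 768) mixed patch tokens
```

Because every patch position carries either modality at random, the shared encoder cannot rely exclusively on the dominant modality, which is the intuition behind countering modality imbalance; the balance can be tuned through the mixing ratio.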
Related papers
- UniRGB-IR: A Unified Framework for RGB-Infrared Semantic Tasks via Adapter Tuning [17.36726475620881]
We propose a general and efficient framework called UniRGB-IR to unify RGB-IR semantic tasks.
A novel adapter is developed to efficiently introduce richer RGB-IR features into the pre-trained foundation model.
Experimental results on various RGB-IR downstream tasks demonstrate that our method can achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-04-26T12:21:57Z) - Modality Translation for Object Detection Adaptation Without Forgetting Prior Knowledge [11.905387325966311]
This paper focuses on adapting a large object detection model trained on RGB images to new data extracted from IR images.
We propose Modality Translator (ModTr) as an alternative to the common approach of fine-tuning a large model to the new modality.
arXiv Detail & Related papers (2024-04-01T21:28:50Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets [50.6643933702394]
We present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed CoMAE.
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
arXiv Detail & Related papers (2023-02-13T07:09:45Z) - Students taught by multimodal teachers are superior action recognizers [41.821485757189656]
The focal point of egocentric video understanding is modelling hand-object interactions.
Standard models -- CNNs, Vision Transformers, etc. -- which receive RGB frames as input perform well; however, their performance improves further by employing additional modalities such as object detections, optical flow, and audio.
The goal of this work is to retain the performance of such multimodal approaches, while using only the RGB images as input at inference time.
arXiv Detail & Related papers (2022-10-09T19:37:17Z) - A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers [0.0]
We propose a recipe for transferring pretrained ViTs in RGB-D domains for single-view 3D object recognition.
We show that our adapted ViTs score up to 95.1% top-1 accuracy on the Washington RGB-D Objects dataset, achieving new state-of-the-art results on this benchmark.
arXiv Detail & Related papers (2022-10-03T12:08:09Z) - Unified Object Detector for Different Modalities based on Vision Transformers [1.14219428942199]
We develop a unified detector that achieves superior performance across diverse modalities.
Our research envisions an application scenario for robotics, where the unified system seamlessly switches between RGB cameras and depth sensors.
We evaluate our unified model on the SUN RGB-D dataset, and demonstrate that it achieves similar or better performance in terms of mAP50.
arXiv Detail & Related papers (2022-07-03T16:01:04Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient
Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both the RGB and depth modalities to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - Flexible-Modal Face Anti-Spoofing: A Benchmark [66.18359076810549]
Face anti-spoofing (FAS) plays a vital role in securing face recognition systems from presentation attacks.
We establish the first flexible-modal FAS benchmark with the principle of 'train one for all'.
We also investigate prevalent deep models and feature fusion strategies for flexible-modal FAS.
arXiv Detail & Related papers (2022-02-16T16:55:39Z) - Self-Supervised Representation Learning for RGB-D Salient Object Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets for pre-training, which enables the network to capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z) - Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Cross-modality Guided Encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)