Multi-Modality Driven LoRA for Adverse Condition Depth Estimation
- URL: http://arxiv.org/abs/2412.20162v1
- Date: Sat, 28 Dec 2024 14:23:58 GMT
- Title: Multi-Modality Driven LoRA for Adverse Condition Depth Estimation
- Authors: Guanglei Yang, Rui Tian, Yongqiang Zhang, Zhun Zhong, Yongqiang Li, Wangmeng Zuo
- Abstract summary: We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation.
It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL).
It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
- Score: 61.525312117638116
- Abstract: The autonomous driving community is increasingly focused on addressing corner-case problems, particularly those related to ensuring driving safety under adverse conditions (e.g., nighttime, fog, rain). To this end, the task of Adverse Condition Depth Estimation (ACDE) has gained significant attention. Previous approaches in ACDE have primarily relied on generative models, which require additional target images to convert sunny-condition scenes into adverse weather, or on learnable parameters for feature augmentation to bridge domain gaps, resulting in increased model complexity and tuning effort. Furthermore, unlike CLIP-based methods, where textual and visual features have been pre-aligned, depth estimation models lack sufficient alignment between multimodal features, hindering coherent understanding under adverse conditions. To address these limitations, we propose Multi-Modality Driven LoRA (MMD-LoRA), which leverages low-rank adaptation matrices for efficient fine-tuning from the source domain to the target domain. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL). In PDDA, the image encoder with MMD-LoRA generates target-domain visual representations, supervised by an alignment loss enforcing that the source-to-target shift in the image space equals the corresponding shift in the language space. Meanwhile, VTCCL bridges the gap between textual features from CLIP and visual features from the diffusion model, pushing apart representations of different weather conditions (visual and textual) and drawing together similar ones. Through extensive experiments, the proposed method achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets, underscoring its robustness and efficiency in adapting to varied adverse environments.
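The LoRA update and the PDDA alignment objective described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the array names, dimensions, and scaling convention are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8  # feature dim, LoRA rank, LoRA scaling (assumed values)

# Frozen pretrained projection; only the low-rank factors A and B are trained.
W = rng.standard_normal((d, d))
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))  # zero-init so the adapter starts as a no-op

def lora_forward(x):
    """y = W x + (alpha / r) * B A x  -- low-rank adapted projection."""
    return W @ x + (alpha / r) * (B @ (A @ x))

def pdda_alignment_loss(v_src, v_tgt, t_src, t_tgt):
    """Encourage the source-to-target shift in image space to match the
    source-to-target shift in text space, as PDDA is described."""
    return np.sum(((v_tgt - v_src) - (t_tgt - t_src)) ** 2)
```

With `B` zero-initialized, `lora_forward` initially reproduces the frozen model exactly, so fine-tuning starts from the pretrained behavior and only the rank-`r` factors (a small fraction of `d * d` parameters) receive gradients.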
Related papers
- Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack [51.16384207202798]
Vision-language pre-training models are vulnerable to multimodal adversarial examples (AEs)
Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process.
We propose sampling from adversarial evolution triangles composed of clean, historical, and current adversarial examples to enhance adversarial diversity.
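The triangle-sampling idea above can be illustrated as drawing a random convex combination of the clean image and two adversarial iterates; this is only a sketch, and the actual weighting rule in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_triangle(x_clean, x_hist, x_cur):
    """Sample a point inside the 'evolution triangle' spanned by the clean
    example, a historical adversarial example, and the current one, using
    random convex weights (illustrative assumption)."""
    w = rng.dirichlet(np.ones(3))  # w >= 0 and w.sum() == 1
    return w[0] * x_clean + w[1] * x_hist + w[2] * x_cur
```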
arXiv Detail & Related papers (2024-11-04T23:07:51Z)
- Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z)
- Boosting Visual Recognition in Real-world Degradations via Unsupervised Feature Enhancement Module with Deep Channel Prior [22.323789227447755]
Fog, low-light, and motion blur degrade image quality and pose threats to the safety of autonomous driving.
This work proposes a novel Deep Channel Prior (DCP) for degraded visual recognition.
Based on this, a novel plug-and-play Unsupervised Feature Enhancement Module (UFEM) is proposed to achieve unsupervised feature correction.
arXiv Detail & Related papers (2024-04-02T07:16:56Z)
- Reinforcement Learning for SAR View Angle Inversion with Differentiable SAR Renderer [7.112962861847319]
This study aims to reverse radar view angles in synthetic aperture radar (SAR) images given a target model.
An electromagnetic simulator named differentiable SAR renderer (DSR) is embedded to facilitate the interaction between the agent and the environment.
arXiv Detail & Related papers (2024-01-02T11:47:58Z)
- DARTH: Holistic Test-time Adaptation for Multiple Object Tracking [87.72019733473562]
Multiple object tracking (MOT) is a fundamental component of perception systems for autonomous driving.
Despite the safety-critical nature of driving systems, no solution to the MOT adaptation problem under test-time domain shift had previously been proposed.
We introduce DARTH, a holistic test-time adaptation framework for MOT.
arXiv Detail & Related papers (2023-10-03T10:10:42Z)
- Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective [37.45565756522847]
We consider the generation of cross-domain videos from two sets of latent factors.
TranSVAE framework is then developed to model such generation.
Experiments on the UCF-HMDB, Jester, and Epic-Kitchens datasets verify the effectiveness and superiority of TranSVAE.
arXiv Detail & Related papers (2022-08-15T17:59:31Z)
- Unsupervised Foggy Scene Understanding via Self Spatial-Temporal Label Diffusion [51.11295961195151]
We exploit the characteristics of the foggy image sequence of driving scenes to densify the confident pseudo labels.
Based on the two discoveries of local spatial similarity and adjacent temporal correspondence of the sequential image data, we propose a novel Target-Domain driven pseudo label Diffusion scheme.
Our scheme helps the adaptive model achieve 51.92% and 53.84% mean intersection-over-union (mIoU) on two publicly available natural foggy datasets.
arXiv Detail & Related papers (2022-06-10T05:16:50Z)
- Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization [7.4262579052708535]
We argue that this effect is a consequence of conflicting gradients during multimodal VAE training.
We show how to detect the sub-graphs in the computational graphs where gradients conflict.
We empirically show that our framework significantly improves the reconstruction performance, conditional generation, and coherence of the latent space across modalities.
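A common way to detect conflicting gradients, consistent with the description above, is to check the sign of their inner product; the following is a minimal sketch, and the paper's exact criterion over computational sub-graphs is not reproduced here.

```python
import numpy as np

def gradients_conflict(g1, g2):
    """Two per-modality gradients conflict when their inner product is
    negative, i.e. stepping along one increases the other's loss."""
    return float(np.dot(g1, g2)) < 0.0
```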
arXiv Detail & Related papers (2022-06-09T13:29:25Z)
- FCL-GAN: A Lightweight and Real-Time Baseline for Unsupervised Blind Image Deblurring [72.43250555622254]
We propose a lightweight and real-time unsupervised BID baseline, termed Frequency-domain Contrastive Loss Constrained Lightweight CycleGAN.
FCL-GAN has attractive properties, i.e., no image domain limitation, no image resolution limitation, 25x lighter than SOTA, and 5x faster than SOTA.
Experiments on several image datasets demonstrate the effectiveness of FCL-GAN in terms of performance, model size and inference time.
arXiv Detail & Related papers (2022-03-07T18:10:40Z)
- An Unsupervised Domain Adaptive Approach for Multimodal 2D Object Detection in Adverse Weather Conditions [5.217255784808035]
We propose an unsupervised domain adaptation framework to bridge the domain gap between source and target domains.
We use a data augmentation scheme that simulates weather distortions to add domain confusion and prevent overfitting on the source data.
Experiments performed on the DENSE dataset show that our method can substantially alleviate the domain gap.
arXiv Detail & Related papers (2022-03-07T18:10:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.