CoLA: Conditional Dropout and Language-driven Robust Dual-modal Salient Object Detection
- URL: http://arxiv.org/abs/2407.06780v1
- Date: Tue, 9 Jul 2024 11:49:24 GMT
- Title: CoLA: Conditional Dropout and Language-driven Robust Dual-modal Salient Object Detection
- Authors: Shuang Hao, Chunlin Zhong, He Tang
- Abstract summary: We introduce the Conditional Dropout and LAnguage-driven (CoLA) framework.
The framework mitigates the impact of noisy inputs, while preserving its performance under complete modalities.
It outperforms state-of-the-art dual-modal SOD models, under both modality-complete and modality-missing conditions.
- Score: 12.780661306169474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Depth/thermal information is beneficial for detecting salient objects with conventional RGB images. However, in dual-modal salient object detection (SOD) models, robustness against noisy inputs and missing modalities is crucial but rarely studied. To tackle this problem, we introduce the Conditional Dropout and LAnguage-driven (CoLA) framework, comprising two core components. 1) Language-driven Quality Assessment (LQA): Leveraging a pretrained vision-language model with a prompt learner, the LQA recalibrates image contributions without requiring additional quality annotations. This approach effectively mitigates the impact of noisy inputs. 2) Conditional Dropout (CD): A learning method to strengthen the model's adaptability in scenarios with missing modalities, while preserving its performance under complete modalities. The CD serves as a plug-in training scheme that treats modality-missing as conditions, strengthening the overall robustness of various dual-modal SOD models. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art dual-modal SOD models, under both modality-complete and modality-missing conditions. We will release source code upon acceptance.
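The two components described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (the source code is not available here): the condition labels, the zero-filling of a dropped modality, the `p_missing` parameter, and the softmax weighting over quality scores are all assumptions chosen only to make the ideas concrete. In the paper, the quality scores would come from a vision-language model with a prompt learner; in this sketch they are plain numbers supplied by the caller.

```python
import math
import random

# Hypothetical condition labels; the abstract does not specify an encoding.
COMPLETE, RGB_MISSING, AUX_MISSING = 0, 1, 2


def conditional_dropout(rgb, aux, p_missing=0.5, rng=random):
    """Randomly simulate a missing modality for one training sample.

    rgb, aux: flat lists of floats standing in for image tensors.
    Returns (rgb, aux, condition). A dropped input is zero-filled
    (an assumption), and `condition` records which case was produced.
    """
    if rng.random() >= p_missing:
        return rgb, aux, COMPLETE
    if rng.random() < 0.5:
        return [0.0] * len(rgb), aux, RGB_MISSING
    return rgb, [0.0] * len(aux), AUX_MISSING


def quality_weighted_fusion(rgb_feat, aux_feat, rgb_score, aux_score):
    """Fuse two feature vectors using softmax weights over quality scores."""
    m = max(rgb_score, aux_score)  # subtract the max for numerical stability
    e_rgb = math.exp(rgb_score - m)
    e_aux = math.exp(aux_score - m)
    w_rgb = e_rgb / (e_rgb + e_aux)
    w_aux = 1.0 - w_rgb
    fused = [w_rgb * r + w_aux * a for r, a in zip(rgb_feat, aux_feat)]
    return fused, (w_rgb, w_aux)


if __name__ == "__main__":
    rng = random.Random(0)
    rgb, aux, cond = conditional_dropout([0.2, 0.8], [0.5, 0.1], rng=rng)
    print("condition:", cond)
    fused, weights = quality_weighted_fusion([1.0, 0.0], [0.0, 1.0], 2.0, 0.0)
    print("weights:", weights, "fused:", fused)
```

With a higher quality score for the RGB input, the fused feature leans toward the RGB feature; with equal scores, both inputs contribute equally.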
Related papers
- Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration [40.720288165545476]
We introduce an enhanced diffusion model as a pluggable mid-stage training module to effectively restore missing features.
Our strategy introduces two key innovations: (I) Dynamic Modality Gating, which adaptively leverages conditional features to steer the generation of semantically consistent features; (II) Cross-Modal Mutual Learning mechanism, which bridges the semantic spaces of dual encoders to achieve bidirectional alignment.
arXiv Detail & Related papers (2026-02-03T06:06:35Z) - DIS2: Disentanglement Meets Distillation with Classwise Attention for Robust Remote Sensing Segmentation under Missing Modalities [28.992992584085787]
DIS2 is a new paradigm shifting from modality-shared feature dependence to active, guided missing features compensation.
Compensatory features are explicitly captured which, when fused with the features of the available modality, approximate the ideal fused representation of the full-modality case.
Our proposed approach significantly outperforms state-of-the-art methods across benchmarks.
arXiv Detail & Related papers (2026-01-20T01:33:54Z) - Dual-level Modality Debiasing Learning for Unsupervised Visible-Infrared Person Re-Identification [59.59359638389348]
We propose a Dual-level Modality Debiasing Learning framework that implements debiasing at both the model and optimization levels.
Experiments on benchmark datasets demonstrate that DMDL could enable modality-invariant feature learning and a more generalized model.
arXiv Detail & Related papers (2025-12-03T12:43:16Z) - Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce Sycophancy Mitigation through Adaptive Reasoning Trajectories (SMART).
We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models.
MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - Dynamic Modality Scheduling for Multimodal Large Models via Confidence, Uncertainty, and Semantic Consistency [0.0]
We propose Dynamic Modality Scheduling (DMS), a novel framework that adaptively adjusts the contribution of each modality at a per-sample level.
Experimental results on VQA, image-text retrieval, and captioning tasks show that DMS significantly improves both clean and robust performance.
arXiv Detail & Related papers (2025-06-15T05:15:52Z) - TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models [6.8298782282181865]
We introduce Temporally Adaptive Interpolated Distillation (TAID), a novel knowledge distillation approach.
We show TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios.
These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
arXiv Detail & Related papers (2025-01-28T13:31:18Z) - Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities [8.517830626176641]
Any2Seg is a novel framework that can achieve robust segmentation from any combination of modalities in any visual conditions.
Experiments on two benchmarks with four modalities demonstrate that Any2Seg achieves the state-of-the-art under the multi-modal setting.
arXiv Detail & Related papers (2024-07-16T03:34:38Z) - Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation [33.140210057065644]
This paper introduces a novel approach named Stealing Stable Diffusion (SSD) prior for robust monocular depth estimation.
The approach addresses this limitation by utilizing stable diffusion to generate synthetic images that mimic challenging conditions.
The effectiveness of the approach is evaluated on nuScenes and Oxford RobotCar, two challenging public datasets.
arXiv Detail & Related papers (2024-03-08T05:06:31Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Dynamic Weighted Combiner for Mixed-Modal Image Retrieval [8.683144453481328]
Mixed-Modal Image Retrieval (MMIR) as a flexible search paradigm has attracted wide attention.
Previous approaches always achieve limited performance, due to two critical factors.
We propose a Dynamic Weighted Combiner (DWC) to tackle the above challenges.
arXiv Detail & Related papers (2023-12-11T07:36:45Z) - What Makes for Robust Multi-Modal Models in the Face of Missing Modalities? [35.19295402483624]
We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective.
We introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA).
UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing modality data augmentation techniques to better adapt to situations with missing modalities.
arXiv Detail & Related papers (2023-10-10T07:47:57Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z) - Towards Improving Faithfulness in Abstractive Summarization [37.19777407790153]
We propose a Faithfulness Enhanced Summarization model (FES) to improve fidelity in abstractive summarization.
Our model outperforms strong baselines in experiments on CNN/DM and XSum.
arXiv Detail & Related papers (2022-10-04T19:52:09Z) - Self-attention fusion for audiovisual emotion recognition with incomplete data [103.70855797025689]
We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition.
We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
arXiv Detail & Related papers (2022-01-26T18:04:29Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z) - Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.