BDC-Adapter: Brownian Distance Covariance for Better Vision-Language
Reasoning
- URL: http://arxiv.org/abs/2309.01256v1
- Date: Sun, 3 Sep 2023 19:45:02 GMT
- Title: BDC-Adapter: Brownian Distance Covariance for Better Vision-Language
Reasoning
- Authors: Yi Zhang, Ce Zhang, Zihan Liao, Yushun Tang, Zhihai He
- Abstract summary: We introduce Brownian Distance Covariance (BDC) to the field of vision-language reasoning.
BDC can model all possible relations, providing a robust metric for measuring feature dependence.
We present a novel method called BDC-Adapter, which integrates BDC prototype similarity reasoning and multi-modal reasoning network prediction.
- Score: 26.75156572762166
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP and
ALIGN, have introduced a new paradigm for learning transferable visual
representations. Recently, there has been a surge of interest among researchers
in developing lightweight fine-tuning techniques to adapt these models to
downstream visual tasks. We recognize that current state-of-the-art fine-tuning
methods, such as Tip-Adapter, simply consider the covariance between the query
image feature and features of support few-shot training samples, which only
captures linear relations and potentially instigates a deceptive perception of
independence. To address this issue, in this work, we innovatively introduce
Brownian Distance Covariance (BDC) to the field of vision-language reasoning.
The BDC metric can model all possible relations, providing a robust metric for
measuring feature dependence. Based on this, we present a novel method called
BDC-Adapter, which integrates BDC prototype similarity reasoning and
multi-modal reasoning network prediction to perform classification tasks. Our
extensive experimental results show that the proposed BDC-Adapter can freely
handle non-linear relations and fully characterize independence, outperforming
the current state-of-the-art methods by large margins.
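For intuition, here is a minimal NumPy sketch of the classical Brownian distance covariance estimator the abstract refers to. It illustrates the dependence measure itself, not the paper's exact BDC-Adapter module, and all names are illustrative.

```python
import numpy as np

def double_center(D):
    """Subtract row and column means from a pairwise-distance matrix, then add back the grand mean."""
    return D - D.mean(axis=0, keepdims=True) - D.mean(axis=1, keepdims=True) + D.mean()

def distance_covariance(X, Y):
    """Sample Brownian distance covariance between paired observations X (n, p) and Y (n, q)."""
    A = double_center(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))
    B = double_center(np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1))
    return np.sqrt(max(float((A * B).mean()), 0.0))

def distance_correlation(X, Y):
    """Normalized to [0, 1]; zero (in the population) if and only if X and Y are independent."""
    dxy, dxx, dyy = distance_covariance(X, Y), distance_covariance(X, X), distance_covariance(Y, Y)
    return 0.0 if dxx * dyy == 0 else dxy / np.sqrt(dxx * dyy)
```

A non-linear relation such as y = x**2 with x drawn symmetrically around zero has near-zero Pearson covariance yet a clearly positive distance correlation, which is exactly the failure mode of covariance-only adapters that the abstract points at.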
Related papers
- Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models [85.51753014478315]
We introduce AdaptPrune, a novel plug-and-play training-free pruning method.
It builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach.
Our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions.
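The lines above describe a greedy, NMS-like selection over token importance. A rough NumPy sketch of that idea follows; the thresholds and function name are made up for illustration and are not AdaptPrune's actual algorithm.

```python
import numpy as np

def nms_token_prune(attn_scores, positions, feats, keep, dist_thresh=1.5, sim_thresh=0.9):
    """Visit tokens in order of attention score and drop a candidate when a kept
    token is both spatially close and highly similar in feature space."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    kept = []
    for i in np.argsort(-attn_scores):
        if len(kept) == keep:
            break
        redundant = any(
            np.linalg.norm(positions[i] - positions[j]) < dist_thresh
            and float(feats[i] @ feats[j]) > sim_thresh
            for j in kept
        )
        if not redundant:
            kept.append(int(i))
    return np.array(kept)
```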
arXiv Detail & Related papers (2025-03-11T03:58:17Z)
- Enhancing Multimodal Entity Linking with Jaccard Distance-based Conditional Contrastive Learning and Contextual Visual Augmentation [37.22528391940295]
We propose JD-CCL (Jaccard Distance-based Conditional Contrastive Learning), a novel approach to enhance the matching ability of multimodal entity linking models.
To address the limitations caused by variations within the visual modality among mentions and entities, we introduce a novel method, CVaCPT (Contextual Visual-aid Controllable Patch Transform).
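For reference, the distance the method is named after is the plain Jaccard distance over sets; how JD-CCL actually builds its conditional negatives is not specified in this summary, so the selection step below is only a hypothetical illustration.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|: 0 for identical sets, 1 for disjoint ones."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

# Hypothetical attribute sets: entities closer to the mention make harder contrastive negatives.
mention = {"person", "athlete", "tennis"}
candidates = {"e1": {"person", "athlete", "golf"}, "e2": {"location", "city"}}
hard_negatives_first = sorted(candidates, key=lambda k: jaccard_distance(mention, candidates[k]))
```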
arXiv Detail & Related papers (2025-01-24T01:35:10Z)
- Debias your Large Multi-Modal Model at Test-Time via Non-Contrastive Visual Attribute Steering [7.471995248769638]
We propose a training-free debiasing framework for Large Multi-Modal Models (LMMs).
Our framework intervenes on the model's representations during text generation by constructing a steering vector that reduces reference to protected attributes.
Our experiments show that these interventions effectively reduce the propensity of LMMs to generate text related to protected attributes while maintaining sentiment and fluency.
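The mechanism described is an intervention on hidden representations via a steering vector. The sketch below shows a generic difference-of-means variant of that idea with illustrative names; the paper's non-contrastive construction may differ.

```python
import numpy as np

def attribute_direction(h_with_attr, h_without_attr):
    """Estimate a protected-attribute direction from hidden states collected on prompts
    that do / do not mention the attribute; both arrays have shape (n, d)."""
    v = h_with_attr.mean(axis=0) - h_without_attr.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def steer(hidden, v, alpha=1.0):
    """Dampen the component of a single hidden state (shape (d,)) along the attribute
    direction before it is decoded into the next token."""
    return hidden - alpha * (hidden @ v) * v
```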
arXiv Detail & Related papers (2024-11-15T20:06:09Z)
- What Representational Similarity Measures Imply about Decodable Information [6.5879381737929945]
We show that some neural network similarity measures can be equivalently motivated from a decoding perspective.
Measures like CKA and CCA quantify the average alignment between optimal linear readouts across a distribution of decoding tasks.
Overall, our work demonstrates a tight link between the geometry of neural representations and the ability to linearly decode information.
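For concreteness, linear CKA between two representation matrices can be computed as below; the paper's contribution is the decoding-based interpretation of such measures, not this standard formula.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representations X (n, p) and Y (n, q) of the same n stimuli."""
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return cross / norm
```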
arXiv Detail & Related papers (2024-11-12T21:37:10Z)
- Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment [31.402736873762418]
Motivated by language model alignment methods, we propose Condition Contrastive Alignment (CCA) to facilitate guidance-free AR visual generation with high performance.
Experimental results show that CCA can significantly enhance the guidance-free performance of all tested models with just one epoch of fine-tuning.
This experimentally confirms the strong theoretical connection between language-targeted alignment and visual-targeted guidance methods.
arXiv Detail & Related papers (2024-10-12T03:31:25Z)
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
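A minimal PyTorch sketch of the Mixture-of-Experts adapter idea on top of frozen CLIP features is given below; the expert count, sizes, and soft routing rule are illustrative, and the Distribution Discriminative Auto-Selector is not reproduced.

```python
import torch
import torch.nn as nn

class MoEAdapter(nn.Module):
    """Router softly combines several small bottleneck adapters added to a frozen feature."""
    def __init__(self, dim=512, num_experts=4, bottleneck=64):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x):                                        # x: (batch, dim) frozen CLIP features
        gate = torch.softmax(self.router(x), dim=-1)             # (batch, num_experts)
        out = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, num_experts, dim)
        return x + (gate.unsqueeze(-1) * out).sum(dim=1)         # residual adapter output
```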
arXiv Detail & Related papers (2024-03-18T08:00:23Z)
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts in Underspecified Visual Tasks [92.32670915472099]
We propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs).
We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.
arXiv Detail & Related papers (2023-10-03T17:37:52Z)
- Relational Concept Bottleneck Models [13.311396882130033]
Concept Bottleneck Models (CBMs) are not designed to solve relational problems.
R-CBMs can represent both standard CBMs and relational GNNs.
In particular, we show that R-CBMs support the generation of concept-based explanations.
arXiv Detail & Related papers (2023-08-23T08:25:33Z)
- Can Offline Reinforcement Learning Help Natural Language Understanding? [31.788133426611587]
We investigate the potential connection between offline reinforcement learning (RL) and language modeling (LM).
RL and LM are similar in predicting the next states based on the current and previous states, which rely on both local and long-range dependency across states.
Experimental results show that our RL pre-trained models can give close performance compared with the models using the LM training objective.
arXiv Detail & Related papers (2022-09-15T02:55:10Z)
- Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z)
- Minimizing subject-dependent calibration for BCI with Riemannian transfer learning [0.8399688944263843]
We present a scheme to train a classifier on data recorded from different subjects, reducing calibration while preserving good performance.
To demonstrate the robustness of this approach, we conducted a meta-analysis on multiple datasets for three BCI paradigms.
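A common way to reduce subject-dependent calibration is to re-center each subject's trial covariance matrices around a common reference before classification. The sketch below uses a simple arithmetic mean for the reference; the paper's scheme operates on the Riemannian manifold of SPD matrices.

```python
import numpy as np

def inv_sqrtm(M):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(np.clip(w, 1e-12, None))) @ V.T

def recenter(covs):
    """Whiten each trial covariance (shape (n_trials, c, c)) by the subject's mean
    covariance so that trials from different subjects share a common reference."""
    W = inv_sqrtm(np.mean(covs, axis=0))
    return np.array([W @ C @ W for C in covs])
```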
arXiv Detail & Related papers (2021-11-23T18:37:58Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic Inductive Bias from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
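A compact PyTorch sketch of such a layer is shown below: a depthwise convolution branch runs in parallel with multi-head self-attention, the two outputs are fused, and the result is fed to the feed-forward network. Shapes and hyper-parameters are illustrative, not ViTAE's exact configuration.

```python
import torch
import torch.nn as nn

class ParallelConvAttnBlock(nn.Module):
    """Transformer layer with a convolution branch parallel to self-attention."""
    def __init__(self, dim=384, heads=6, grid=14):
        super().__init__()
        self.grid = grid
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                       # x: (batch, grid*grid, dim) patch tokens
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        b, n, d = x.shape                       # n must equal grid * grid
        img = h.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        conv_out = self.conv(img).flatten(2).transpose(1, 2)
        x = x + attn_out + conv_out             # fuse the parallel branches
        return x + self.ffn(self.norm2(x))
```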
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
- Trusted Multi-View Classification [76.73585034192894]
We propose a novel multi-view classification method, termed trusted multi-view classification.
It provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level.
The proposed algorithm jointly utilizes multiple views to promote both classification reliability and robustness.
arXiv Detail & Related papers (2021-02-03T13:30:26Z)