p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
- URL: http://arxiv.org/abs/2312.10613v1
- Date: Sun, 17 Dec 2023 05:30:35 GMT
- Title: p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
- Authors: Haoyuan Wu, Xinyun Zhang, Peng Xu, Peiyu Liao, Xufeng Yao, Bei Yu
- Abstract summary: Vision-Language models (VLMs) pre-trained on large corpora have demonstrated notable success across a range of downstream tasks.
Parameter-efficient transfer learning (PETL) has garnered attention as a viable alternative to full fine-tuning.
We propose a new adapter architecture, $p$-adapter, which employs $p$-Laplacian message passing in Graph Neural Networks (GNNs).
- Score: 10.713680139939354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language models (VLMs) pre-trained on large corpora have demonstrated
notable success across a range of downstream tasks. In light of the rapidly
increasing size of pre-trained VLMs, parameter-efficient transfer learning
(PETL) has garnered attention as a viable alternative to full fine-tuning. One
such approach is the adapter, which introduces a few trainable parameters into
the pre-trained models while preserving the original parameters during
adaptation. In this paper, we present a novel modeling framework that recasts
adapter tuning after attention as a graph message passing process on attention
graphs, where the projected query and value features and attention matrix
constitute the node features and the graph adjacency matrix, respectively.
Within this framework, tuning adapters in VLMs necessitates handling
heterophilic graphs, owing to the disparity between the projected query and
value spaces. To address this challenge, we propose a new adapter architecture,
$p$-adapter, which employs $p$-Laplacian message passing in Graph Neural
Networks (GNNs). Specifically, the attention weights are re-normalized based on
the features, and the features are then aggregated using the calibrated
attention matrix, enabling the dynamic exploitation of information with varying
frequencies in the heterophilic attention graphs. We conduct extensive
experiments on different pre-trained VLMs and multi-modal tasks, including
visual question answering, visual entailment, and image captioning. The
experimental results validate our method's significant superiority over other
PETL methods.
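To make the mechanism concrete, here is a minimal PyTorch sketch of $p$-Laplacian message passing over an attention graph. It illustrates the idea described in the abstract and is not the authors' implementation: the bottleneck design, the residual wiring, and the exact calibration term $\lVert h_i - h_j \rVert^{p-2}$ are assumptions based on standard $p$-Laplacian smoothing.

```python
import torch
import torch.nn as nn


class PLaplacianAdapter(nn.Module):
    """Sketch of p-Laplacian message passing over an attention graph."""

    def __init__(self, dim: int, bottleneck: int = 64, p: float = 1.5, eps: float = 1e-6):
        super().__init__()
        self.p, self.eps = p, eps
        self.down = nn.Linear(dim, bottleneck)  # adapter down-projection
        self.up = nn.Linear(bottleneck, dim)    # adapter up-projection

    def forward(self, attn: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # attn:   (batch, n, n) row-stochastic attention matrix = graph adjacency
        # values: (batch, n, dim) projected value features = node features
        h = self.down(values)
        # Pairwise distances ||h_i - h_j|| between node features.
        dist = torch.cdist(h, h) + self.eps
        # p-Laplacian re-normalization: scale each edge weight by
        # ||h_i - h_j||^(p-2); for p < 2 this boosts edges between
        # dissimilar nodes, which matters on heterophilic graphs.
        calibrated = attn * dist.pow(self.p - 2)
        calibrated = calibrated / calibrated.sum(dim=-1, keepdim=True).clamp_min(self.eps)
        # Aggregate with the calibrated attention matrix, then project back.
        return values + self.up(calibrated @ h)  # residual adapter update
```

Note that for $p = 2$ the calibration term is constant, so the update reduces to plain attention-weighted aggregation; choosing $p < 2$ up-weights edges between dissimilar nodes, which is the behavior the abstract targets for heterophilic attention graphs.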
Related papers
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language tuning method built on pretrained large language models.
Our framework surpasses existing methods by an average accuracy gain of 0.77% on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter [19.557300178619382]
We propose a novel Heterogeneous Graph Adapter to tune VLMs for downstream tasks.
We employ a specific Heterogeneous Graph Neural Network to mine multi-modal structural knowledge for downstream tasks.
Experimental results on 11 benchmark datasets demonstrate the effectiveness and benefits of the proposed HeGraphAdapter.
arXiv Detail & Related papers (2024-10-10T12:20:58Z)
- A Pure Transformer Pretraining Framework on Text-attributed Graphs [50.833130854272774]
We introduce a feature-centric pretraining perspective by treating graph structure as a prior.
Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks (a toy sketch follows this entry).
GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
arXiv Detail & Related papers (2024-06-19T22:30:08Z)
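For readers unfamiliar with random-walk context sampling as used by GSPT above, here is a toy sketch; the adjacency format, walk length, and function name are illustrative choices, not taken from the paper.

```python
import random


def sample_context(adj: dict[int, list[int]], start: int, walk_len: int = 8) -> list[int]:
    """Toy random-walk context sampler: from `start`, repeatedly hop to a
    uniformly chosen neighbor, collecting the visited nodes as the context."""
    walk = [start]
    for _ in range(walk_len - 1):
        nbrs = adj.get(walk[-1], [])
        if not nbrs:  # dead end: stop the walk early
            break
        walk.append(random.choice(nbrs))
    return walk


# Example: sample a context on a 4-node path graph 0-1-2-3.
print(sample_context({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}, start=0))
```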
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (the prior state of the art) on novel classes across 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Class Incremental Learning with Pre-trained Vision-Language Models [59.15538370859431]
We propose an approach to exploiting pre-trained vision-language models (e.g., CLIP) that enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
arXiv Detail & Related papers (2023-10-31T10:45:03Z)
- G-Adapter: Towards Structure-Aware Parameter-Efficient Transfer Learning for Graph Transformer Networks [0.7118812771905295]
We show that it is sub-optimal to directly transfer existing PEFTs to graph-based tasks due to the issue of feature distribution shift.
We propose a novel structure-aware PEFT approach, named G-Adapter, to guide the updating process.
Extensive experiments demonstrate that G-Adapter achieves state-of-the-art performance compared with its counterparts on nine graph benchmark datasets.
arXiv Detail & Related papers (2023-05-17T16:10:36Z)
- Multimodal Graph Transformer for Multimodal Question Answering [9.292566397511763]
We propose a novel Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities.
We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information (a toy sketch follows this entry).
We validate the effectiveness of Multimodal Graph Transformer over its Transformer baselines on GQA, VQAv2, and MultiModalQA datasets.
arXiv Detail & Related papers (2023-04-30T21:22:35Z)
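As a rough illustration of the quasi-attention idea in the entry above, the sketch below adds a graph-derived mask to ordinary scaled dot-product attention; the paper's actual formulation may differ.

```python
import torch


def graph_masked_attention(q, k, v, graph_mask):
    """Toy sketch of graph-involved quasi-attention: a multimodal graph
    supplies an additive mask (0 on edges, a large negative value on
    non-edges) so attention is steered toward graph-connected tokens."""
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(logits + graph_mask, dim=-1) @ v
```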
- SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models [9.017387427570538]
Vision-language models such as CLIP are pretrained on large volumes of internet-sourced image-text pairs.
Due to their size, fine-tuning these models on new datasets can be prohibitively expensive, both in terms of the supervision and compute required.
We present a new approach called SVL-Adapter that combines the complementary strengths of both vision-language pretraining and self-supervised representation learning.
arXiv Detail & Related papers (2022-10-07T19:35:08Z)
- Towards a Unified View on Visual Parameter-Efficient Transfer Learning [96.99924127527002]
We propose a framework with a unified view, called visual-PETL (V-PETL), to investigate the different aspects affecting the trade-off between performance and parameter efficiency.
An effective scheme, Swin-BAPAT, derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin.
arXiv Detail & Related papers (2022-10-03T09:54:39Z)
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
- Pre-Trained Models for Heterogeneous Information Networks [57.78194356302626]
We propose a self-supervised pre-training and fine-tuning framework, PF-HIN, to capture the features of a heterogeneous information network.
PF-HIN consistently and significantly outperforms state-of-the-art alternatives on each evaluated task across four datasets.
arXiv Detail & Related papers (2020-07-07T03:36:28Z)