RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models
- URL: http://arxiv.org/abs/2512.06811v1
- Date: Sun, 07 Dec 2025 12:04:46 GMT
- Title: RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models
- Authors: Xiang Lin, Weixin Li, Shu Guo, Lihong Wang, Di Huang
- Abstract summary: We introduce a novel Reconstruction-based Multimodal Adapter (RMAdapter). RMAdapter consists of: (1) an adaptation branch that injects task-specific knowledge through parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent space features back into the original feature space. By computing reconstruction loss locally at each layer and sharing projection modules, the overall computational overhead is kept minimal.
- Score: 36.97549106050972
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained Vision-Language Models (VLMs), e.g., CLIP, have become essential tools in multimodal transfer learning. However, fine-tuning VLMs in few-shot scenarios poses significant challenges in balancing task-specific adaptation and generalization in the obtained model. Meanwhile, current research has predominantly focused on prompt-based adaptation methods, leaving adapter-based approaches underexplored and revealing notable performance gaps. To address these challenges, we introduce a novel Reconstruction-based Multimodal Adapter (RMAdapter), which leverages a dual-branch architecture. Unlike conventional single-branch adapters, RMAdapter consists of: (1) an adaptation branch that injects task-specific knowledge through parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent space features back into the original feature space. This design facilitates a dynamic balance between general and task-specific knowledge. Importantly, although RMAdapter introduces an additional reconstruction branch, it is carefully optimized to remain lightweight. By computing reconstruction loss locally at each layer and sharing projection modules, the overall computational overhead is kept minimal. A consistency constraint is also incorporated to better regulate the trade-off between discriminability and generalization. We comprehensively evaluate the effectiveness of RMAdapter on three representative tasks: generalization to new categories, generalization to new target datasets, and domain generalization. Without relying on data augmentation or duplicate prompt designs, our RMAdapter consistently outperforms state-of-the-art approaches across all evaluation metrics.
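The dual-branch idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give layer shapes, activations, initialization, or which projection is shared, so the names `init_adapter` and `adapter_forward`, the ReLU bottleneck, the zero-initialized adaptation projection, and the choice to share the down-projection between branches are all assumptions made for illustration. The consistency constraint mentioned in the abstract is omitted.

```python
import numpy as np

def init_adapter(dim, bottleneck, rng):
    """Hypothetical parameters for one adapter layer (all names are assumptions)."""
    scale = 0.02
    return {
        # Assumed to be shared by both branches ("sharing projection modules").
        "down": rng.normal(0.0, scale, (dim, bottleneck)),
        # Adaptation branch; zero init makes the layer an identity map at the start.
        "up_adapt": np.zeros((bottleneck, dim)),
        # Reconstruction branch; maps latent features back to the original space.
        "up_recon": rng.normal(0.0, scale, (bottleneck, dim)),
    }

def adapter_forward(params, x):
    """Return adapted features and a reconstruction loss local to this layer."""
    z = np.maximum(x @ params["down"], 0.0)        # latent features (ReLU bottleneck)
    adapted = x + z @ params["up_adapt"]           # residual task-specific update
    recon = z @ params["up_recon"]                 # reconstruct original features
    recon_loss = float(np.mean((recon - x) ** 2))  # mean squared error at this layer
    return adapted, recon_loss

rng = np.random.default_rng(0)
params = init_adapter(dim=8, bottleneck=2, rng=rng)
x = rng.normal(size=(4, 8))                        # 4 tokens with 8 features each
adapted, recon_loss = adapter_forward(params, x)
```

In a setup like this, the per-layer reconstruction losses would presumably be summed and added to the task loss with a weighting factor; that weighting is likewise not specified in the abstract.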
Related papers
- Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model [2.2099003320482393]
Attn-Adapter is a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.
arXiv Detail & Related papers (2025-09-04T05:42:02Z)
- GENRE-CMR: Generalizable Deep Learning for Diverse Multi-Domain Cardiac MRI Reconstruction [0.8749675983608171]
We propose GENRE-CMR, a generative adversarial network (GAN)-based architecture to enhance reconstruction fidelity and generalization. Experiments confirm that GENRE-CMR surpasses state-of-the-art methods on training and unseen data, achieving 0.9552 SSIM and 38.90 dB PSNR on unseen distributions. Our framework presents a unified and robust solution for high-quality CMR reconstruction, paving the way for clinically adaptable deployment across heterogeneous acquisition protocols.
arXiv Detail & Related papers (2025-08-28T09:43:59Z)
- Exploring Sparse Adapters for Scalable Merging of Parameter Efficient Experts [72.22148263683037]
We study the properties of sparse adapters, which train only a subset of weights in the base neural network, as potential building blocks of modular architectures. First, we propose a simple method for training highly effective sparse adapters, which is conceptually simpler than existing methods in the literature. Next, we investigate the merging properties of these sparse adapters by merging adapters for up to 20 natural language processing tasks.
arXiv Detail & Related papers (2025-07-09T03:25:45Z)
- Towards Generalizable Trajectory Prediction Using Dual-Level Representation Learning And Adaptive Prompting [107.4034346788744]
Existing vehicle trajectory prediction models struggle with generalizability, prediction uncertainties, and handling complex interactions. We propose Perceiver with Register queries (PerReg+), a novel trajectory prediction framework that introduces: (1) Dual-Level Representation Learning via Self-Distillation (SD) and Masked Reconstruction (MR), capturing global context and fine-grained details; (2) Enhanced Multimodality using register-based queries and pretraining, eliminating the need for clustering and suppression; and (3) Adaptive Prompt Tuning during fine-tuning, freezing the main architecture and optimizing a small number of prompts for efficient adaptation.
arXiv Detail & Related papers (2025-01-08T20:11:09Z)
- Modality-Collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition [35.15390769958969]
We propose a unified framework, Modality-Collaborative Transformer with Hybrid Feature Reconstruction (MCT-HFR).
MCT-HFR consists of a novel attention-based encoder which concurrently extracts and dynamically balances the intra- and inter-modality relations.
During model training, LFI leverages complete features as supervisory signals to recover local missing features, while GFA is designed to reduce the global semantic gap between pairwise complete and incomplete representations.
arXiv Detail & Related papers (2023-12-26T01:59:23Z)
- Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration [58.11518043688793]
MPerceiver is a novel approach to enhance adaptiveness, generalizability and fidelity for all-in-one image restoration.
MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods across most tasks.
arXiv Detail & Related papers (2023-12-05T17:47:11Z)
- Vision Transformer Adapters for Generalizable Multitask Learning [61.79647180647685]
We introduce the first multitasking vision transformer adapters that learn generalizable task affinities.
Our adapters can simultaneously solve multiple dense vision tasks in a parameter-efficient manner.
In contrast to concurrent methods, we do not require retraining or fine-tuning whenever a new task or domain is added.
arXiv Detail & Related papers (2023-08-23T18:40:48Z)
- Generalized Few-Shot Continual Learning with Contrastive Mixture of Adapters [59.82088750033897]
We set up a Generalized FSCL (GFSCL) protocol involving both class- and domain-incremental situations.
We find that common continual learning methods have poor generalization ability on unseen domains.
In this way, we propose a rehearsal-free framework based on Vision Transformer (ViT) named Contrastive Mixture of Adapters (CMoA).
arXiv Detail & Related papers (2023-02-12T15:18:14Z)
- An Optimization-Based Meta-Learning Model for MRI Reconstruction with Diverse Dataset [4.9259403018534496]
We develop a generalizable MRI reconstruction model in the meta-learning framework.
The proposed network learns a regularization function in a learner adaptional model.
We test quick training on unseen tasks after meta-training, which saves half of the time.
arXiv Detail & Related papers (2021-10-02T03:21:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.