Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning
- URL: http://arxiv.org/abs/2510.10022v1
- Date: Sat, 11 Oct 2025 04:58:21 GMT
- Title: Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning
- Authors: Junan Chen, Trung Thanh Nguyen, Takahiro Komamizu, Ichiro Ide
- Abstract summary: We propose a lightweight visual adapter module designed to enhance Multimodal Large Language Models (MLLMs). Q-Adapter introduces learnable query tokens and a gating layer into the Vision Encoder, enabling effective extraction of sparse, caption-relevant features without relying on external supervision. We evaluate Q-Adapter on two well-known video captioning datasets, MSR-VTT and MSVD, where it achieves state-of-the-art performance among PEFT-based methods.
- Score: 5.762008844570409
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in video captioning are driven by large-scale pretrained models, which follow the standard "pre-training followed by fine-tuning" paradigm, where the full model is fine-tuned for downstream tasks. Although effective, this approach becomes computationally prohibitive as the model size increases. The Parameter-Efficient Fine-Tuning (PEFT) approach offers a promising alternative, but primarily focuses on the language components of Multimodal Large Language Models (MLLMs). Despite recent progress, PEFT remains underexplored in multimodal tasks and lacks sufficient understanding of visual information when fine-tuning the model. To bridge this gap, we propose Query-Adapter (Q-Adapter), a lightweight visual adapter module designed to enhance MLLMs by enabling efficient fine-tuning for the video captioning task. Q-Adapter introduces learnable query tokens and a gating layer into the Vision Encoder, enabling effective extraction of sparse, caption-relevant features without relying on external textual supervision. We evaluate Q-Adapter on two well-known video captioning datasets, MSR-VTT and MSVD, where it achieves state-of-the-art performance among PEFT-based methods across the BLEU@4, METEOR, ROUGE-L, and CIDEr metrics. Q-Adapter also achieves competitive performance compared to full fine-tuning while requiring only 1.4% of the parameters. We further analyze the impact of key hyperparameters and design choices on fine-tuning effectiveness, providing insights into optimization strategies for adapter-based learning. These results highlight the strong potential of Q-Adapter in balancing caption quality and parameter efficiency, demonstrating its scalability for video-language modeling.
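For intuition, the sketch below shows one way a query-adapter block of this kind could be attached to a frozen vision encoder: learnable query tokens cross-attend to patch features and a gating layer modulates the adapted signal. The module name, dimensions, placement, and exact gating formulation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class QueryAdapterSketch(nn.Module):
    """Minimal sketch of a query-adapter block: learnable query tokens
    cross-attend to (frozen) vision-encoder features, and a zero-initialized
    gate controls how much adapted signal is injected. Illustrative only."""

    def __init__(self, num_queries: int = 16, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Learnable query tokens meant to pull caption-relevant visual features.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Gating layer: starts at zero so training begins from the frozen
        # encoder's original behaviour.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, dim) from a frozen vision encoder.
        b = patch_feats.size(0)
        q = self.queries.unsqueeze(0).repeat(b, 1, 1)
        attended, _ = self.cross_attn(q, patch_feats, patch_feats)
        # Gated residual update; only the adapter parameters are trainable.
        return self.norm(q + torch.tanh(self.gate) * attended)


if __name__ == "__main__":
    adapter = QueryAdapterSketch()
    frames = torch.randn(2, 196, 768)   # e.g. ViT patch tokens for a frame
    query_feats = adapter(frames)       # (2, 16, 768): sparse visual summary
    print(query_feats.shape)
```

In such a design, only the query tokens, cross-attention, and gate are trained, which is what keeps the trainable parameter count to a small fraction of the full model.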
Related papers
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language tuning method built on pretrained large language models.
Our framework surpasses existing methods by an average accuracy of 0.77% on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - Lightweight Modular Parameter-Efficient Tuning for Open-Vocabulary Object Detection [2.1155908599769764]
We propose UniProj-Det, a lightweight modular framework for parameter-efficient open-vocabulary object detection.
UniProj-Det freezes pretrained backbones and introduces a Universal Projection module with a learnable modality token, enabling unified vision-language adaptation at minimal cost.
arXiv Detail & Related papers (2024-08-20T12:27:53Z) - CROME: Cross-Modal Adapters for Efficient Multimodal LLM [28.337072921099494]
Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities.
Existing approaches often necessitate expensive language model retraining and offer limited adaptability.
We propose CROME, an efficient vision-language instruction tuning framework.
arXiv Detail & Related papers (2024-08-13T03:45:11Z) - CLIPVQA: Video Quality Assessment via CLIP [56.94085651315878]
We propose an efficient CLIP-based Transformer method for the VQA problem (CLIPVQA).
The proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods.
arXiv Detail & Related papers (2024-07-06T02:32:28Z) - CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z) - p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models [10.713680139939354]
Vision-Language models (VLMs) pre-trained on large corpora have demonstrated notable success across a range of downstream tasks.
Parameter-Efficient Transfer Learning (PETL) has garnered attention as a viable alternative to full fine-tuning.
We propose a new adapter architecture, $p$-adapter, which employs $p$-Laplacian message passing in Graph Neural Networks (GNNs)
arXiv Detail & Related papers (2023-12-17T05:30:35Z) - SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models [35.5601603013045]
We propose SmartTrim, an adaptive acceleration framework for Vision-Language Models (VLMs)
We integrate lightweight modules into the original backbone to identify and prune redundant token representations and attention heads within each layer.
We devise a self-distillation strategy to enhance the consistency between the predictions of the pruned model and its full-capacity counterpart.
arXiv Detail & Related papers (2023-05-24T11:18:00Z) - Towards a Unified View on Visual Parameter-Efficient Transfer Learning [96.99924127527002]
We propose a framework with a unified view called visual-PETL (V-PETL) to investigate the different aspects affecting the trade-off.
An effective scheme Swin-BAPAT derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin.
arXiv Detail & Related papers (2022-10-03T09:54:39Z) - Towards Parameter-Efficient Integration of Pre-Trained Language Models in Temporal Video Grounding [37.199310579532884]
This paper explores the task of Temporal Video Grounding (TVG)
In TVG, given an untrimmed video and a natural language sentence query, the goal is to recognize and determine the temporal boundaries of action instances in the video.
Recent works have tackled this task by improving query inputs with large pre-trained language models (PLMs), at the cost of more expensive training.
arXiv Detail & Related papers (2022-09-26T08:11:19Z) - Parameter-Efficient Image-to-Video Transfer Learning [66.82811235484607]
Large pre-trained models for various downstream tasks of interest have recently emerged with promising performance.
Due to the ever-growing model size, the standard full fine-tuning based task adaptation strategy becomes costly in terms of model training and storage.
We propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task.
arXiv Detail & Related papers (2022-06-27T18:02:29Z) - VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks [71.40656211497162]
Recently, fine-tuning language models pre-trained on large text corpora has provided huge improvements on vision-and-language (V&L) tasks.
We introduce adapter-based parameter-efficient transfer learning techniques to V&L models such as VL-BART and VL-T5.
Our results demonstrate that training the adapter with the weight-sharing technique can match the performance of fine-tuning the entire model.
arXiv Detail & Related papers (2021-12-13T17:35:26Z)
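To make the adapter idea that recurs across these PEFT methods concrete, the sketch below shows a minimal bottleneck adapter with weight sharing across layers, in the spirit of VL-Adapter; the dimensions, initialization, and placement are illustrative assumptions rather than the implementation of any of the papers above.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter (down-project, nonlinearity, up-project,
    residual add) of the kind inserted into frozen transformers by
    adapter-based PEFT methods. Sizes and init are illustrative."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so the adapter starts as an identity map.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(self.act(self.down(hidden)))


# Weight sharing in the spirit of VL-Adapter: reuse one adapter instance
# across layers instead of allocating a separate one per layer.
shared = BottleneckAdapter()
adapters_per_layer = [shared for _ in range(12)]  # e.g. 12 transformer layers

x = torch.randn(2, 20, 768)  # dummy hidden states (batch, tokens, dim)
for adapter in adapters_per_layer:
    x = adapter(x)  # frozen attention/FFN sublayers omitted for brevity
print(x.shape)
```

Sharing one adapter across layers keeps the trainable parameter count nearly constant in depth, which is one way such methods report matching full fine-tuning at a fraction of the cost.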