READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for
Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling
- URL: http://arxiv.org/abs/2312.06950v1
- Date: Tue, 12 Dec 2023 03:09:30 GMT
- Title: READ-PVLA: Recurrent Adapter with Partial Video-Language Alignment for
Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling
- Authors: Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Khoi Le, Zhiyuan Hu, Cong-Duy
Nguyen, See-Kiong Ng, Luu Anh Tuan
- Abstract summary: We introduce lightweight adapters to the pre-trained model and only update them at fine-tuning time.
Existing adapters fail to capture intrinsic temporal relations among video frames or textual words.
We propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling capability.
- Score: 33.11253005768816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fully fine-tuning pretrained large-scale transformer models has become a
popular paradigm for video-language modeling tasks, such as temporal language
grounding and video-language summarization. With a growing number of tasks and
limited training data, such a full fine-tuning approach leads to costly model
storage and unstable training. To overcome these shortcomings, we introduce
lightweight adapters to the pre-trained model and only update them at
fine-tuning time. However, existing adapters fail to capture intrinsic temporal
relations among video frames or textual words. Moreover, they neglect the
preservation of critical task-related information that flows from the raw
video-language input into the adapter's low-dimensional space. To address these
issues, we first propose a novel REcurrent ADapter (READ) that employs
recurrent computation to enable temporal modeling capability. Second, we
propose a Partial Video-Language Alignment (PVLA) objective that uses partial
optimal transport to maintain task-related information flowing into our
READ modules. We validate our READ-PVLA framework through extensive experiments
where READ-PVLA significantly outperforms all existing fine-tuning strategies
on multiple low-resource temporal language grounding and video-language
summarization benchmarks.
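For intuition, a minimal sketch of a recurrent bottleneck adapter in the spirit of READ is shown below, written in PyTorch. The GRU cell, layer names, and sizes are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of a recurrent bottleneck adapter (not the official READ code).
import torch
import torch.nn as nn

class RecurrentAdapter(nn.Module):
    """Bottleneck adapter whose hidden path is a recurrent cell, so the
    adapter can model temporal relations across frame/word positions."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)                   # project to low-dim space
        self.rnn = nn.GRU(bottleneck, bottleneck, batch_first=True)  # recurrence = temporal modeling
        self.up = nn.Linear(bottleneck, d_model)                     # project back
        nn.init.zeros_(self.up.weight)                               # adapter starts as identity
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), e.g. frame or word features
        h = torch.tanh(self.down(x))
        h, _ = self.rnn(h)             # recurrent pass over the sequence
        return x + self.up(h)          # residual: frozen backbone output + adapter update

adapter = RecurrentAdapter(d_model=768)   # only these parameters are trained
out = adapter(torch.randn(2, 16, 768))    # 2 clips, 16 frames each
```

The PVLA objective can likewise be approximated with an off-the-shelf partial optimal transport solver; the sketch below uses the POT library (`pip install pot`), with a cosine cost matrix and mass fraction `m` chosen purely for illustration, not the paper's exact objective.

```python
# Hypothetical partial-OT alignment cost (illustrative, not the paper's exact objective).
import numpy as np
import ot  # POT: Python Optimal Transport

def partial_alignment_cost(video_feats: np.ndarray, text_feats: np.ndarray, m: float = 0.7) -> float:
    # video_feats: (Tv, d), text_feats: (Tt, d), both L2-normalized
    a = np.full(len(video_feats), 1.0 / len(video_feats))  # uniform mass on frames
    b = np.full(len(text_feats), 1.0 / len(text_feats))    # uniform mass on words
    M = 1.0 - video_feats @ text_feats.T                   # cosine distance cost
    plan = ot.partial.partial_wasserstein(a, b, M, m=m)    # transport only a fraction m of mass
    return float((plan * M).sum())                         # cost of the matched (task-relevant) part
```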
Related papers
- RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter [77.0205013713008]
Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries.
To date, most state-of-the-art TVR methods perform image-to-video transfer learning based on large-scale pre-trained vision models.
We propose a sparse-and-correlated AdaPter (RAP) to fine-tune the pre-trained model with a few parameterized layers.
arXiv Detail & Related papers (2024-05-29T19:23:53Z)
- LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding [48.83009641950664]
We introduce a novel approach called Language-guided Spatial-Temporal Prompt Learning (LSTP).
This approach features two key components: a Temporal Prompt Sampler (TPS) with an optical flow prior that leverages temporal information to efficiently extract relevant video content, and a Spatial Prompt Solver (SPS) that adeptly captures the intricate spatial relationships between visual and textual elements.
By harmonizing TPS and SPS with a cohesive training strategy, our framework significantly enhances computational efficiency, temporal understanding, and spatial-temporal alignment.
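As a rough illustration of the TPS idea, the sketch below keeps the frames with the largest optical-flow magnitude; the top-k rule and per-frame flow scores are simplifying assumptions, not the paper's exact sampler.

```python
# Hypothetical flow-prior frame sampling (illustrative simplification of TPS).
import torch

def sample_frames_by_flow(frames: torch.Tensor, flow_mag: torch.Tensor, k: int = 8):
    # frames: (T, C, H, W); flow_mag: (T,) mean optical-flow magnitude per frame
    idx = torch.topk(flow_mag, k).indices.sort().values  # high-motion frames, temporal order kept
    return frames[idx], idx

frames = torch.randn(64, 3, 224, 224)
flow_mag = torch.rand(64)                                # would come from an optical-flow model
kept, idx = sample_frames_by_flow(frames, flow_mag, k=8)
```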
arXiv Detail & Related papers (2024-02-25T10:27:46Z)
- Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z)
- LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling [48.283659682112926]
We propose LiteVL, which adapts the pre-trained image-language model BLIP into a video-text model directly on downstream tasks.
We also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text.
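A text-conditioned, parameter-free pooling of this kind might look like the sketch below, where frame embeddings are reweighted by softmax similarity to the text embedding; the temperature and shapes are assumptions for illustration, not LiteVL's exact mechanism.

```python
# Hypothetical text-conditioned pooling (illustrative, no learnable parameters).
import torch
import torch.nn.functional as F

def text_conditioned_pool(frame_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.07):
    # frame_emb: (T, d); text_emb: (d,)
    sims = F.normalize(frame_emb, dim=-1) @ F.normalize(text_emb, dim=-1)  # (T,) cosine similarities
    weights = torch.softmax(sims / tau, dim=0)                             # reweight frames by the text
    return (weights.unsqueeze(-1) * frame_emb).sum(dim=0)                  # pooled video embedding (d,)

pooled = text_conditioned_pool(torch.randn(16, 512), torch.randn(512))
```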
arXiv Detail & Related papers (2022-10-21T13:03:49Z)
- Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z)
- Revisiting the "Video" in Video-Language Understanding [56.15777956496518]
We propose the atemporal probe (ATP), a new model for video-language analysis.
We characterize the limitations and potential of current video-language benchmarks.
We show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
arXiv Detail & Related papers (2022-06-03T17:57:33Z)
- PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z)