DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
- URL: http://arxiv.org/abs/2401.10588v1
- Date: Fri, 19 Jan 2024 09:58:06 GMT
- Title: DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
- Authors: Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, Yi Yang
- Abstract summary: Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query.
We propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention.
In contrast to previous prompt tuning methods, we employ the shared latent space to generate local-level text and frame prompts.
- Score: 73.82017200889906
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text-video retrieval is a critical multi-modal task to find the most relevant
video for a text query. Although pretrained models like CLIP have demonstrated
impressive potential in this area, the rising cost of fully finetuning these
models due to increasing model size continues to pose a problem. To address
this challenge, prompt tuning has emerged as an alternative. However, existing
works still face two problems when adapting pretrained image-text models to
downstream video-text tasks: (1) The visual encoder could only encode
frame-level features and failed to extract global-level general video
information. (2) Equipping the visual and text encoder with separated prompts
failed to mitigate the visual-text modality gap. To this end, we propose DGL, a
cross-modal Dynamic prompt tuning method with Global-Local video attention. In
contrast to previous prompt tuning methods, we employ the shared latent space
to generate local-level text and frame prompts that encourage inter-modal
interaction. Furthermore, we propose modeling video in a global-local attention
mechanism to capture global video information from the perspective of prompt
tuning. Extensive experiments reveal that when only 0.67% parameters are tuned,
our cross-modal prompt tuning strategy DGL outperforms or is comparable to
fully finetuning methods on MSR-VTT, VATEX, LSMDC, and ActivityNet datasets.
Code will be available at https://github.com/knightyxp/DGL
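Until the official code is released, the two components described in the abstract can be pictured with a minimal PyTorch-style sketch: local text/frame prompts generated from one shared latent space, and a global-local attention over frame tokens. All class names, dimensions, and the choice to reuse one attention module for both paths are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two ideas in the abstract (illustrative only; names,
# shapes, and design details are assumptions, not the released DGL code).
import torch
import torch.nn as nn


class SharedLatentPrompts(nn.Module):
    """Generate local-level text and frame prompts from one shared latent space,
    so the two modalities are coupled instead of receiving separate prompts."""

    def __init__(self, num_prompts=4, latent_dim=256, text_dim=512, frame_dim=768):
        super().__init__()
        # Shared learnable latent prompt tokens (modality-agnostic parameters).
        self.latent = nn.Parameter(torch.randn(num_prompts, latent_dim) * 0.02)
        # Light projections map the shared latent into each encoder's token space.
        self.to_text = nn.Linear(latent_dim, text_dim)
        self.to_frame = nn.Linear(latent_dim, frame_dim)

    def forward(self):
        text_prompts = self.to_text(self.latent)    # (num_prompts, text_dim)
        frame_prompts = self.to_frame(self.latent)  # (num_prompts, frame_dim)
        return text_prompts, frame_prompts


class GlobalLocalAttention(nn.Module):
    """Global video prompts attend over all frame tokens, while local (frame)
    tokens attend only within their own frame. One attention module is reused
    for both paths here purely to keep the sketch short."""

    def __init__(self, dim=768, num_heads=8, num_global=4):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.randn(num_global, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, num_frames, tokens_per_frame, dim)
        b, f, t, d = frame_tokens.shape
        # Global path: global prompts query every token of every frame.
        all_tokens = frame_tokens.reshape(b, f * t, d)
        g = self.global_tokens.unsqueeze(0).expand(b, -1, -1)
        global_video, _ = self.attn(g, all_tokens, all_tokens)  # (b, num_global, d)
        # Local path: self-attention restricted to tokens of the same frame.
        local = frame_tokens.reshape(b * f, t, d)
        local, _ = self.attn(local, local, local)
        local = local.reshape(b, f, t, d)
        return global_video, local


if __name__ == "__main__":
    prompts = SharedLatentPrompts()
    text_p, frame_p = prompts()
    gla = GlobalLocalAttention()
    video = torch.randn(2, 12, 50, 768)  # 2 clips, 12 frames, 50 patch tokens each
    global_feat, local_feat = gla(video)
    print(text_p.shape, frame_p.shape, global_feat.shape, local_feat.shape)
```

In a prompt-tuning setup of this kind, only the prompt parameters and the small projection/attention layers would be trained while the CLIP backbones stay frozen, which is consistent with the roughly 0.67% trainable-parameter figure reported in the abstract.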
Related papers
- MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing.
First, we introduce the 3D Inverted Vector-Quantization Variational Autoencoder.
Second, we present MotionAura, a text-to-video generation framework.
Third, we propose a spectral transformer-based denoising network.
Fourth, we introduce a downstream task of Sketch Guided Video Inpainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z)
- VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization [115.64739269488965]
VimTS enhances the generalization ability of the model by achieving better synergy among different tasks.
We propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm.
For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2.
arXiv Detail & Related papers (2024-04-30T15:49:03Z)
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts.
We obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- GL-RG: Global-Local Representation Granularity for Video Captioning [52.56883051799501]
We propose a GL-RG framework for video captioning, namely a Global-Local Representation Granularity.
Our GL-RG demonstrates three advantages over prior efforts: 1) we explicitly exploit extensive visual representations from different video ranges to improve linguistic expression; 2) we devise a novel global-local encoder to produce a rich semantic vocabulary that yields a descriptive granularity of video contents across frames; and 3) we develop an incremental training strategy which organizes model learning in an incremental fashion to incur an optimal captioning behavior.
arXiv Detail & Related papers (2022-05-22T02:00:09Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language and forces the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.