HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models
- URL: http://arxiv.org/abs/2404.05083v1
- Date: Sun, 7 Apr 2024 21:46:47 GMT
- Title: HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models
- Authors: Yimu Wang, Shuai Yuan, Xiangru Jian, Wei Pang, Mushi Wang, Ning Yu
- Abstract summary: We present a novel video-text learning paradigm, HaVTR, which augments video and text data to learn more generalized features.
To bring richer information into video and text, we propose a hallucination-based augmentation method.
Benefiting from the enriched data, extensive experiments on several video-text retrieval benchmarks demonstrate the superiority of HaVTR over existing methods.
- Score: 11.883785732720094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While recent progress in video-text retrieval has been driven by the exploration of powerful model architectures and training strategies, the representation learning ability of video-text retrieval models is still limited due to low-quality and scarce training data annotations. To address this issue, we present a novel video-text learning paradigm, HaVTR, which augments video and text data to learn more generalized features. Specifically, we first adopt a simple augmentation method, which generates self-similar data by randomly duplicating or dropping subwords and frames. In addition, inspired by the recent advancement in visual and language generative models, we propose a more powerful augmentation method through textual paraphrasing and video stylization using large language models (LLMs) and visual generative models (VGMs). Further, to bring richer information into video and text, we propose a hallucination-based augmentation method, where we use LLMs and VGMs to generate and add new relevant information to the original data. Benefiting from the enriched data, extensive experiments on several video-text retrieval benchmarks demonstrate the superiority of HaVTR over existing methods.
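As a concrete illustration, here is a minimal Python sketch of the simple duplicate/drop augmentation described in the abstract; the probability values, whitespace tokenization, and frame-list representation are illustrative assumptions, not the paper's actual settings:

```python
import random

def augment_tokens(tokens, p_dup=0.1, p_drop=0.1):
    """Randomly duplicate or drop subwords to create a self-similar caption."""
    out = []
    for tok in tokens:
        r = random.random()
        if r < p_drop:
            continue            # drop this subword
        out.append(tok)
        if r > 1.0 - p_dup:
            out.append(tok)     # duplicate this subword
    return out or tokens        # never return an empty caption

def augment_frames(frames, p_dup=0.1, p_drop=0.1):
    """The same duplicate/drop scheme applied to a list of video frames."""
    out = []
    for frame in frames:
        r = random.random()
        if r < p_drop:
            continue
        out.append(frame)
        if r > 1.0 - p_dup:
            out.append(frame)
    return out or frames

caption = "a dog catches a frisbee in the park".split()
print(augment_tokens(caption))
```

The LLM-based paraphrasing and hallucination augmentations would replace these random edits with generative rewrites, which are not sketched here.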
Related papers
- VideoRAG: Retrieval-Augmented Generation over Video Corpus [57.68536380621672] (arXiv, 2025-01-10)
VideoRAG is a novel framework that dynamically retrieves videos based on their relevance to queries.
We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines.
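A minimal sketch of what query-conditioned video retrieval of this kind can look like, assuming videos and queries are already embedded in a shared space; the embedding dimensions and the `retrieve_top_k` helper are illustrative, not VideoRAG's actual components:

```python
import numpy as np

def retrieve_top_k(query_emb, video_embs, k=5):
    """Rank corpus videos by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy corpus: 100 videos embedded in a 512-d shared space.
rng = np.random.default_rng(0)
video_embs = rng.normal(size=(100, 512))
query_emb = rng.normal(size=512)
ids, scores = retrieve_top_k(query_emb, video_embs)
print(ids, scores)
```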
- ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement [49.513401043490305] (arXiv, 2024-12-25)
This work explores the continual general pre-training of text-to-video models.
We break this task into two key aspects: increasing model capacity and improving semantic understanding.
For semantic understanding, we propose a method that leverages large language models as advanced text encoders.
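A minimal sketch of using a pretrained language model as a text encoder via mean-pooled hidden states, using Hugging Face `transformers`; the `gpt2` checkpoint and the pooling choice are assumptions for illustration, not ModelGrow's actual design:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

def encode_text(prompts):
    """Mean-pool the LM's last hidden states into one embedding per prompt."""
    batch = tokenizer(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)        # (B, D)

emb = encode_text(["a timelapse of clouds over a city"])
print(emb.shape)  # torch.Size([1, 768])
```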
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885] (arXiv, 2024-08-14)
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
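A minimal sketch of the segment-and-caption idea, assuming a zero-shot captioner is available; `caption_fn`, the clip length, and the de-duplication step are illustrative placeholders, not GQE's exact procedure:

```python
def segment_clips(num_frames, clip_len=32, stride=32):
    """Split a video (given as a frame count) into short clip index ranges."""
    return [(s, min(s + clip_len, num_frames))
            for s in range(0, num_frames, stride)]

def expand_caption(frames, caption_fn):
    """Caption each short clip and join the results into one rich description."""
    clips = segment_clips(len(frames))
    captions = [caption_fn(frames[s:e]) for s, e in clips]
    return " ".join(dict.fromkeys(captions))  # de-duplicate, keep order

# caption_fn would wrap a zero-shot captioner such as BLIP-2; stubbed here.
fake_caption = lambda clip: f"a scene of {len(clip)} frames"
print(expand_caption(list(range(80)), fake_caption))
```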
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396] (arXiv, 2024-06-19)
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The plain and simple text descriptions and the visual-only focus of current language-video tasks result in limited capability on real-world natural language video retrieval tasks.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models [52.23899502520261] (arXiv, 2024-06-17)
We introduce a novel framework named ARTIST, which incorporates a dedicated textual diffusion model to focus specifically on learning text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
This disentangled architecture design and training strategy significantly enhance the text rendering ability of the diffusion models for text-rich image generation.
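A toy sketch of the disentangled training idea, with small linear layers standing in for the two diffusion models: the pretrained textual-structure model is frozen while the visual model is finetuned to match its features. The alignment loss shown here is an assumption for illustration, not ARTIST's actual objective:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two disentangled models; the real ARTIST components
# are diffusion models, which are omitted here for brevity.
text_struct_model = nn.Linear(64, 128)   # pretrained, kept frozen
visual_model = nn.Linear(64, 128)        # finetuned to absorb its features

for p in text_struct_model.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(visual_model.parameters(), lr=1e-4)
x = torch.randn(8, 64)                   # toy batch of latent inputs

# Align the visual model's features with the frozen textual-structure model.
loss = nn.functional.mse_loss(visual_model(x), text_struct_model(x))
loss.backward()
opt.step()
print(float(loss))
```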
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426] (arXiv, 2024-04-04)
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1,200 long videos, each with high-quality summaries annotated by professionals.
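A hedged sketch of what such a caption-then-score pipeline might look like; `caption_fn`, `score_fn`, the segment length, and the budget are all illustrative placeholders, not the paper's actual pipeline:

```python
def build_summary_example(frames, caption_fn, score_fn, budget=0.15):
    """Caption segments, score their importance, and keep the top ones."""
    segments = [frames[i:i + 16] for i in range(0, len(frames), 16)]
    captions = [caption_fn(seg) for seg in segments]
    scores = [score_fn(c, context=" ".join(captions)) for c in captions]
    k = max(1, int(budget * len(segments)))
    top = sorted(range(len(segments)), key=lambda i: -scores[i])[:k]
    keep = sorted(top)  # restore temporal order
    return [segments[i] for i in keep], [captions[i] for i in keep]

# caption_fn / score_fn would wrap a captioner and an LLM prompt; stubbed here.
summary, caps = build_summary_example(
    list(range(160)),
    caption_fn=lambda seg: f"segment starting at frame {seg[0]}",
    score_fn=lambda c, context: float(c.split()[-1]),
)
print(caps)
```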
- Leveraging Large Language Models for Node Generation in Few-Shot Learning on Text-Attributed Graphs [5.587264586806575] (arXiv, 2023-10-15)
We propose a plug-and-play approach to empower text-attributed graphs through node generation using Large Language Models (LLMs).
LLMs extract semantic information from class labels and generate exemplar samples that belong to those categories.
We employ an edge predictor to capture structural information inherent in the raw dataset and integrate the newly generated samples into the original graph.
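A minimal sketch of label-conditioned node generation plus a similarity-threshold edge predictor; `llm_fn`, the prompt, and the cosine threshold are illustrative assumptions, not the paper's exact components:

```python
import numpy as np

def generate_nodes(labels, llm_fn, per_label=3):
    """Ask an LLM for synthetic exemplar texts for each class label."""
    return [(lab, llm_fn(f"Write a short document about: {lab}"))
            for lab in labels for _ in range(per_label)]

def connect_new_nodes(new_embs, graph_embs, threshold=0.8):
    """Toy edge predictor: link a new node to existing nodes whose
    embeddings exceed a cosine-similarity threshold."""
    g = graph_embs / np.linalg.norm(graph_embs, axis=1, keepdims=True)
    n = new_embs / np.linalg.norm(new_embs, axis=1, keepdims=True)
    sim = n @ g.T
    return [(i, j) for i, j in zip(*np.where(sim > threshold))]

# Toy demo with random embeddings and a loose threshold.
rng = np.random.default_rng(0)
edges = connect_new_nodes(rng.normal(size=(4, 32)),
                          rng.normal(size=(50, 32)), threshold=0.2)
print(edges[:5])
```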
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504] (arXiv, 2023-09-16)
We propose a new setting, text-video retrieval with uncurated and unpaired data, in which training uses only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
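A minimal sketch of restyling an uncurated web caption with an LLM; `llm_fn`, the prompt wording, and the target style are illustrative assumptions, not In-Style's actual transfer model:

```python
def restyle_caption(raw_caption, llm_fn, style="human search query"):
    """Rewrite an uncurated web-video caption into a target text style."""
    prompt = (f"Rewrite the following video description as a {style}, "
              f"keeping its meaning:\n{raw_caption}")
    return llm_fn(prompt)

# llm_fn would call any instruction-following LLM; stubbed for illustration.
print(restyle_caption("dog frisbee park 4k compilation", llm_fn=lambda p: p))
```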