In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
- URL: http://arxiv.org/abs/2309.08928v1
- Date: Sat, 16 Sep 2023 08:48:21 GMT
- Title: In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
- Authors: Nina Shvetsova, Anna Kukleva, Bernt Schiele, Hilde Kuehne
- Abstract summary: We propose a new setting, text-video retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
- Score: 72.98185525653504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale noisy web image-text datasets have been proven to be efficient
for learning robust vision-language models. However, when transferring them to
the task of video retrieval, models still need to be fine-tuned on hand-curated
paired text-video data to adapt to the diverse styles of video descriptions. To
address this problem without the need for hand-annotated pairs, we propose a
new setting, text-video retrieval with uncurated & unpaired data, that during
training utilizes only text queries together with uncurated web videos without
any paired text-video data. To this end, we propose an approach, In-Style, that
learns the style of the text queries and transfers it to uncurated web videos.
Moreover, to improve generalization, we show that one model can be trained with
multiple text styles. To this end, we introduce a multi-style contrastive
training procedure that improves the generalizability over several datasets
simultaneously. We evaluate our model on retrieval performance over multiple
datasets to demonstrate the advantages of our style transfer framework on the
new task of uncurated & unpaired text-video retrieval and improve
state-of-the-art performance on zero-shot text-video retrieval.
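The abstract does not spell out the multi-style contrastive objective, so the following is only a minimal sketch of how such a procedure could look in PyTorch, assuming a symmetric InfoNCE loss in which pseudo-captions rewritten into several text styles are contrasted against the same batch of video embeddings; `video_encoder`, `text_encoder`, `restyle`, and `styles` are hypothetical placeholders, not the authors' API.

```python
# Hedged sketch of a multi-style contrastive training step (not the authors' code).
# Assumes: video_encoder / text_encoder return fixed-size embeddings, and
# restyle(captions, style) rewrites pseudo-captions into a given text style.
import torch
import torch.nn.functional as F

def multi_style_contrastive_loss(video_encoder, text_encoder, restyle,
                                 videos, captions, styles, temperature=0.05):
    v = F.normalize(video_encoder(videos), dim=-1)           # (B, D) video embeddings
    targets = torch.arange(v.size(0), device=v.device)       # matched pairs sit on the diagonal
    loss = 0.0
    for style in styles:                                      # e.g. ["caption", "query", "title"]
        t = F.normalize(text_encoder(restyle(captions, style)), dim=-1)  # (B, D)
        logits = v @ t.t() / temperature                      # video-to-text similarity matrix
        # symmetric InfoNCE over both retrieval directions
        loss = loss + 0.5 * (F.cross_entropy(logits, targets) +
                             F.cross_entropy(logits.t(), targets))
    return loss / len(styles)
```

Averaging the per-style losses is just one possible way to combine styles; the actual style set and weighting used in the paper are not given in this summary.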
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The current plain and simple text descriptions and the visual-only focus of language-video tasks result in limited capacity for real-world natural-language video retrieval.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
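The Temporal Perceiving entry above names moment retrieval as the localization pre-text task but does not give its architecture; the sketch below is one plausible reading, assuming per-frame video features, a pooled text embedding, and a small head that scores every frame as a start or end boundary. `MomentBoundaryHead` and its shapes are illustrative assumptions, not the paper's model.

```python
# Illustrative moment-retrieval head (an assumption, not the paper's architecture):
# given per-frame video features and a pooled text embedding, score each frame
# as a candidate start or end boundary of the described moment.
import torch
import torch.nn as nn

class MomentBoundaryHead(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)
        self.boundary = nn.Linear(dim, 2)                     # per-frame (start, end) logits

    def forward(self, frame_feats, text_feat):
        # frame_feats: (B, T, D), text_feat: (B, D)
        text = text_feat.unsqueeze(1).expand(-1, frame_feats.size(1), -1)
        fused = torch.relu(self.proj(torch.cat([frame_feats, text], dim=-1)))
        logits = self.boundary(fused)                         # (B, T, 2)
        start = logits[..., 0].argmax(dim=1)                  # predicted start frame index
        end = logits[..., 1].argmax(dim=1)                    # predicted end frame index
        return logits, start, end
```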
- Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text.
We build a new model that can better learn video-span correlation without manually designed features.
Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z)
- Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval [125.55386778388818]
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web.
We propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video.
Our method outperforms state-of-the-art methods on four public video retrieval datasets.
arXiv Detail & Related papers (2022-09-27T11:13:48Z)
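The prototype idea in the Text-Adaptive Multiple Visual Prototype Matching entry above can be illustrated with a short sketch: a video is summarized by several prototype embeddings, and the text query is scored against its best-matching prototype. This is a simplified guess at the general mechanism, not the published model; `prototype_match_score` and its shapes are assumptions.

```python
# Sketch of multi-prototype video-text matching (illustrative assumption):
# each video is represented by K prototype embeddings, and the retrieval score
# for a text query is the similarity of its best-matching prototype.
import torch
import torch.nn.functional as F

def prototype_match_score(prototypes, text_emb):
    # prototypes: (B, K, D), text_emb: (B, D)
    p = F.normalize(prototypes, dim=-1)
    t = F.normalize(text_emb, dim=-1).unsqueeze(-1)           # (B, D, 1)
    sims = torch.bmm(p, t).squeeze(-1)                        # (B, K) cosine similarities
    return sims.max(dim=-1).values                            # text-adaptive: keep the best prototype
```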
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
However, there is not a clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [13.270902407320005]
We present CLIP2Video network to transfer the image-language training model to video-text retrieval in an end-to-end manner.
We conduct thorough ablation studies, and achieve state-of-the-art performance on text-to-video and video-to-text retrieval benchmarks.
arXiv Detail & Related papers (2021-06-21T13:30:33Z)
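CLIP2Video itself adds temporal modules on top of CLIP, but the image-to-video transfer it builds on can be approximated by encoding sampled frames with a CLIP image encoder and mean-pooling them into a single video embedding. The snippet below is that simplified baseline only, assuming the open-source `openai/CLIP` package; it is not the CLIP2Video architecture.

```python
# Simplified image-to-video transfer baseline with CLIP (not the CLIP2Video model):
# encode sampled frames with the CLIP image encoder, mean-pool them into one
# video embedding, and rank videos by cosine similarity with the text query.
import clip                                   # assumed dependency: github.com/openai/CLIP
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_video(frames):                     # frames: list of PIL images sampled from a video
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)     # (T, D) frame embeddings
    return F.normalize(feats.mean(dim=0, keepdim=True), dim=-1)   # (1, D) video embedding

def rank_videos(query, video_embs):           # video_embs: (N, D), assumed L2-normalized
    with torch.no_grad():
        q = F.normalize(model.encode_text(clip.tokenize([query]).to(device)), dim=-1)
    return (q @ video_embs.t()).squeeze(0).argsort(descending=True)   # indices, best first
```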
- Learning Video Representations from Textual Web Supervision [97.78883761035557]
We propose to use text as a method for learning video representations.
We collect 70M video clips shared publicly on the Internet and train a model to pair each video with its associated text.
We find that this approach is an effective method of pre-training video representations.
arXiv Detail & Related papers (2020-07-29T16:19:50Z)