A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval
- URL: http://arxiv.org/abs/2208.02080v1
- Date: Wed, 3 Aug 2022 14:05:20 GMT
- Title: A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval
- Authors: Alex Falcon and Giuseppe Serra and Oswald Lanz
- Abstract summary: Text-video retrieval methods have received increased attention over the past few years.
Data augmentation techniques were introduced to increase the performance on unseen test examples.
We propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples.
- Score: 16.548016892117083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Every hour, huge amounts of visual content are posted on social media and
user-generated content platforms. To find relevant videos by means of a natural
language query, text-video retrieval methods have received increased attention
over the past few years. Data augmentation techniques were introduced to
improve performance on unseen test examples by creating new training samples
through semantics-preserving transformations, such as color-space or geometric
transformations on images. Yet, these techniques are usually applied to raw
data, leading to more resource-demanding solutions that also require the raw
data to be shareable, which is not always the case, e.g. due to copyright
issues with clips from movies or TV series. To address this shortcoming, we
propose a multimodal data augmentation technique which works in the feature
space and creates new videos and captions by mixing semantically similar
samples. We evaluate our solution on a large-scale public dataset,
EPIC-Kitchens-100, achieve considerable improvements over a baseline method
and improved state-of-the-art performance, and perform multiple ablation
studies. We release code and pretrained models on GitHub at
https://github.com/aranciokov/FSMMDA_VideoRetrieval.
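The core idea lends itself to a short sketch. The snippet below is a minimal illustration of feature-space multimodal mixing, not the authors' implementation: for each video-caption pair in a batch, it picks the most semantically similar other pair (approximated here by caption-feature cosine similarity) and interpolates both modalities with a Beta-sampled coefficient, mixup-style. All function names, the similarity criterion, and the mixing distribution are illustrative assumptions.

```python
# Minimal sketch of feature-space multimodal mixing between semantically
# similar video-caption pairs. Details (cosine similarity to pick the partner,
# Beta-sampled mixing weight) are assumptions for illustration, not the
# paper's exact implementation.
import torch
import torch.nn.functional as F


def mix_similar_pairs(video_feats, text_feats, alpha=0.4):
    """video_feats, text_feats: (B, D) pre-extracted features for B pairs."""
    # For each caption, find the most similar *other* caption in the batch.
    sim = F.cosine_similarity(text_feats.unsqueeze(1), text_feats.unsqueeze(0), dim=-1)
    sim.fill_diagonal_(-float("inf"))   # exclude self-matches
    partner = sim.argmax(dim=1)         # index of the semantically closest pair

    # Sample a mixing coefficient and interpolate both modalities with the
    # same weight, so the new (video, caption) pair stays aligned.
    lam = torch.distributions.Beta(alpha, alpha).sample((video_feats.size(0), 1))
    mixed_video = lam * video_feats + (1 - lam) * video_feats[partner]
    mixed_text = lam * text_feats + (1 - lam) * text_feats[partner]
    return mixed_video, mixed_text


# Usage: augment a batch of 512-d features before computing the retrieval loss.
v, t = torch.randn(8, 512), torch.randn(8, 512)
v_aug, t_aug = mix_similar_pairs(v, t)
```

Because the same coefficient is applied to both modalities, the mixed video and the mixed caption remain aligned and can serve as an additional positive pair during training, without touching the raw clips or captions.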
Related papers
- HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models [11.883785732720094]
We present a novel video-text learning paradigm, HaVTR, which augments video and text data to learn more generalized features.
To bring richer information into video and text, we propose a hallucination-based augmentation method.
Benefiting from the enriched data, extensive experiments on several video-text retrieval benchmarks demonstrate the superiority of HaVTR over existing methods.
arXiv Detail & Related papers (2024-04-07T21:46:47Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated and unpaired data, in which training uses only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval [55.088635195893325]
We propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ).
HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos.
Experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods.
arXiv Detail & Related papers (2022-02-07T18:04:10Z)
- Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is trained directly on a single pair of original and processed videos rather than on a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)
- TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z)
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
A canonical approach to video-and-language learning dictates that a neural model learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z)