Related papers: A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval

URL: http://arxiv.org/abs/2208.02080v1
Date: Wed, 3 Aug 2022 14:05:20 GMT
Title: A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval
Authors: Alex Falcon and Giuseppe Serra and Oswald Lanz
Abstract summary: Text-video retrieval methods have received increased attention over the past few years. Data augmentation techniques were introduced to increase the performance on unseen test examples. We propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples.
Score: 16.548016892117083
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. To find relevant videos by means of a natural language query, text-video retrieval methods have received increased attention over the past few years. Data augmentation techniques were introduced to increase the performance on unseen test examples by creating new training samples with the application of semantics-preserving techniques, such as color space or geometric transformations on images. Yet, these techniques are usually applied to raw data, leading to more resource-demanding solutions and also requiring the raw data to be shareable, which may not always be possible, e.g. because of copyright issues with clips from movies or TV series. To address this shortcoming, we propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples. We evaluate our solution on a large-scale public dataset, EPIC-Kitchens-100, and achieve considerable improvements over a baseline method as well as improved state-of-the-art performance, and we also perform multiple ablation studies. We release code and pretrained models on GitHub at https://github.com/aranciokov/FSMMDA_VideoRetrieval.
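
The idea of creating new samples by mixing in feature space can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it assumes that videos and captions are already encoded as fixed-length feature vectors, that the caller has already chosen a semantically similar pair, and that mixing is a simple convex combination with one shared weight for both modalities. The names `mix_features` and `augment_pair` are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mix_features(a: Sequence[float], b: Sequence[float], lam: float) -> List[float]:
    """Convex combination lam * a + (1 - lam) * b of two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def augment_pair(
    video_a: Sequence[float],
    text_a: Sequence[float],
    video_b: Sequence[float],
    text_b: Sequence[float],
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], List[float], float]:
    """Build one synthetic (video, caption) feature pair from two similar samples.

    Assumption: the same mixing weight is applied to both modalities so the
    new video and the new caption stay aligned with each other.
    """
    rng = rng or random.Random()
    lam = rng.random()  # uniform in [0, 1); the sampling scheme is an assumption
    new_video = mix_features(video_a, video_b, lam)
    new_text = mix_features(text_a, text_b, lam)
    return new_video, new_text, lam
```

Because the sketch operates only on precomputed features, it needs neither the raw clips nor the raw captions, which is the property the abstract highlights.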

Related papers

Video-T1: Test-Time Scaling for Video Generation [19.089876374170167]
Researchers in Large Language Models (LLMs) have extended scaling to test time. We aim to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? Experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in the quality of videos.
arXiv Detail & Related papers (2025-03-24T17:59:04Z)
VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion [9.465414294387507]
We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models. We introduce a large-scale video referring matting dataset with 10,000 videos.
arXiv Detail & Related papers (2025-03-11T06:12:35Z)
Video Decomposition Prior: A Methodology to Decompose Videos into Layers [74.36790196133505]
This paper introduces a novel video decomposition prior (VDP) framework which derives inspiration from professional video editing practices. The VDP framework decomposes a video sequence into a set of multiple RGB layers and associated opacity levels. We address tasks such as video object segmentation, dehazing, and relighting.
arXiv Detail & Related papers (2024-12-06T10:35:45Z)
Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation [98.92677830223786]
This work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. We propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Our proposed method achieves performance comparable to or even superior to baselines trained with many more samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models [11.883785732720094]
We present a novel video-text learning paradigm, HaVTR, which augments video and text data to learn more generalized features. To bring richer information into video and text, we propose a hallucination-based augmentation method. Benefiting from the enriched data, extensive experiments on several video-text retrieval benchmarks demonstrate the superiority of HaVTR over existing methods.
arXiv Detail & Related papers (2024-04-07T21:46:47Z)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the need to model video dynamics. In this paper, we address such limitations in video pre-training with an efficient video decomposition. Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos. To improve generalization, we show that one model can be trained with multiple text styles. We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval [55.088635195893325]
We propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ). HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos. Experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods.
arXiv Detail & Related papers (2022-02-07T18:04:10Z)
Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency. Our method is only trained on a pair of original and processed videos directly instead of a large dataset. We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)
TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining. We extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time. Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z)
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
A canonical approach to video-and-language learning dictates a neural model to learn from offline-extracted dense video features. We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks. Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.