A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
- URL: http://arxiv.org/abs/2312.00115v2
- Date: Mon, 09 Dec 2024 23:46:06 GMT
- Title: A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
- Authors: Matthew Gwilliam, Michael Cogswell, Meng Ye, Karan Sikka, Abhinav Shrivastava, Ajay Divakaran
- Abstract summary: Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime.
This neglects the richness and variety of possible valid descriptions of a video, which could range anywhere from moment-by-moment detail to a single phrase summary.
We propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos.
- Abstract: Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could range anywhere from moment-by-moment detail to a single phrase summary. To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos. We validate this pipeline's fidelity via rigorous human inspection. We use synthetic captions from this pipeline to perform a benchmark of a representative set of video language models using long video datasets, and show that the models struggle on shorter captions. We show that finetuning on this data can both mitigate these issues (+2.8% R@1 over SOTA on ActivityNet with diverse captions), and even improve performance on standard paragraph-to-video retrieval (+1.0% R@1 on ActivityNet). We also use synthetic data from our pipeline as query expansion in the zero-shot setting (+3.4% R@1 on ActivityNet). We derive insights by analyzing failure cases for retrieval with short captions. For data access and other details, please refer to our project website at https://mgwillia.github.io/10k-words.
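As a rough illustration of the zero-shot query-expansion idea described in the abstract, the sketch below pools the embeddings of several diverse captions for the same video into a single query vector and scores retrieval with R@1. The function names (`recall_at_1`, `expanded_query_embedding`), the mean-pooling choice, the 512-dimensional embeddings, and the toy data are illustrative assumptions, not the authors' implementation or prompts.

```python
import numpy as np

def recall_at_1(sim: np.ndarray) -> float:
    """Fraction of queries whose top-ranked video is the ground-truth match.

    sim[i, j] is the similarity between text query i and video j; the
    ground-truth video for query i is assumed to be video i.
    """
    top1 = sim.argmax(axis=1)
    return float((top1 == np.arange(sim.shape[0])).mean())

def expanded_query_embedding(caption_embs: np.ndarray) -> np.ndarray:
    """Average the embeddings of a query's diverse captions into one vector.

    caption_embs has shape (num_captions, dim); the result is L2-normalized
    so it can be compared to video embeddings by dot product.
    """
    mean = caption_embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Toy example: 3 videos, each query expanded with 4 synthetic captions.
rng = np.random.default_rng(0)
video_embs = rng.normal(size=(3, 512))
video_embs /= np.linalg.norm(video_embs, axis=1, keepdims=True)
query_embs = np.stack([
    expanded_query_embedding(rng.normal(size=(4, 512)) + video_embs[i])
    for i in range(3)
])
print("R@1:", recall_at_1(query_embs @ video_embs.T))
```

In practice the caption and video embeddings would come from a pretrained video-language model rather than random vectors, and other pooling schemes (e.g., taking the maximum similarity over captions) are equally plausible.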
Related papers
- Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval [56.05621657583251]
Cross-modal (e.g., image-text, video-text) retrieval is an important task in information retrieval and multimodal vision-language understanding.
We introduce RTime, a novel temporal-emphasized video-text retrieval dataset.
Our RTime dataset currently consists of 21k videos with 10 captions per video, totalling about 122 hours.
arXiv Detail & Related papers (2024-12-26T11:32:00Z) - Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of about 145 words, over 10x longer than the captions in most video-text datasets.
Vriptor, the model trained on this corpus, is capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z) - CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios [15.058793892803008]
Vision-Language Models pre-trained on large-scale image-text datasets have shown superior performance in downstream tasks such as image retrieval.
We establish the first large-scale cover-text benchmark for Chinese short video search scenarios.
UniCLIP has been deployed in Tencent's online video search systems, which serve hundreds of millions of visits, and has achieved significant gains.
arXiv Detail & Related papers (2024-01-19T03:54:58Z) - DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of vision-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z) - QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z) - Text Synopsis Generation for Egocentric Videos [72.52130695707008]
We propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video.
Users can read the short text to gain insight into the video and, more importantly, efficiently search through the content of a large video database.
arXiv Detail & Related papers (2020-05-08T00:28:00Z)