A Video is Worth 10,000 Words: Training and Benchmarking with Diverse
Captions for Better Long Video Retrieval
- URL: http://arxiv.org/abs/2312.00115v1
- Date: Thu, 30 Nov 2023 18:59:45 GMT
- Title: A Video is Worth 10,000 Words: Training and Benchmarking with Diverse
Captions for Better Long Video Retrieval
- Authors: Matthew Gwilliam and Michael Cogswell and Meng Ye and Karan Sikka and
Abhinav Shrivastava and Ajay Divakaran
- Abstract summary: Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime.
This neglects the richness and variety of possible valid descriptions of a video.
We propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos.
- Score: 43.58794386905177
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing long video retrieval systems are trained and tested in the
paragraph-to-video retrieval regime, where every long video is described by a
single long paragraph. This neglects the richness and variety of possible valid
descriptions of a video, which could be described in moment-by-moment detail,
or in a single phrase summary, or anything in between. To provide a more
thorough evaluation of the capabilities of long video retrieval systems, we
propose a pipeline that leverages state-of-the-art large language models to
carefully generate a diverse set of synthetic captions for long videos. We
validate this pipeline's fidelity via rigorous human inspection. We then
benchmark a representative set of video language models on these synthetic
captions using a few long video datasets, showing that they struggle with the
transformed data, especially the shortest captions. We also propose a
lightweight fine-tuning method, where we use a contrastive loss to learn a
hierarchical embedding loss based on the differing levels of information among
the various captions. Our method improves performance both on the downstream
paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet), as well as for
the various long video retrieval metrics we compute using our synthetic data
(+3.6% R@1 for short descriptions on ActivityNet). For data access and other
details, please refer to our project website at
https://mgwillia.github.io/10k-words.
Related papers
- Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval [56.05621657583251]
Cross-modal (e.g. image-text, video-text) retrieval is an important task in information retrieval and multimodal vision-language understanding field.
We introduce RTime, a novel temporal-emphasized video-text retrieval dataset.
Our RTime dataset currently consists of 21k videos with 10 captions per video, totalling about 122 hours.
arXiv Detail & Related papers (2024-12-26T11:32:00Z) - Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of 145 words, which is over 10x longer than most video-text datasets.
Vript is a powerful model capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z) - CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short
Video Search Scenarios [15.058793892803008]
Vision-Language Models pre-trained on large-scale image-text datasets have shown superior performance in downstream tasks such as image retrieval.
We establish the first large-scale cover-text benchmark for Chinese short video search scenarios.
UniCLIP has been deployed to Tencent's online video search systems with hundreds of millions of visits and achieved significant gains.
arXiv Detail & Related papers (2024-01-19T03:54:58Z) - DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human annotated dataset for evaluating the ability for visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z) - QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z) - Text Synopsis Generation for Egocentric Videos [72.52130695707008]
We propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric videos.
Users can read the short text to gain insight about the video, and more importantly, efficiently search through the content of a large video database.
arXiv Detail & Related papers (2020-05-08T00:28:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.