A Video is Worth 10,000 Words: Training and Benchmarking with Diverse
Captions for Better Long Video Retrieval
- URL: http://arxiv.org/abs/2312.00115v1
- Date: Thu, 30 Nov 2023 18:59:45 GMT
- Title: A Video is Worth 10,000 Words: Training and Benchmarking with Diverse
Captions for Better Long Video Retrieval
- Authors: Matthew Gwilliam and Michael Cogswell and Meng Ye and Karan Sikka and
Abhinav Shrivastava and Ajay Divakaran
- Abstract summary: Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime.
This neglects the richness and variety of possible valid descriptions of a video.
We propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos.
- Score: 43.58794386905177
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing long video retrieval systems are trained and tested in the
paragraph-to-video retrieval regime, where every long video is described by a
single long paragraph. This neglects the richness and variety of possible valid
descriptions of a video, which could be described in moment-by-moment detail,
or in a single phrase summary, or anything in between. To provide a more
thorough evaluation of the capabilities of long video retrieval systems, we
propose a pipeline that leverages state-of-the-art large language models to
carefully generate a diverse set of synthetic captions for long videos. We
validate this pipeline's fidelity via rigorous human inspection. We then
benchmark a representative set of video language models on these synthetic
captions using a few long video datasets, showing that they struggle with the
transformed data, especially the shortest captions. We also propose a
lightweight fine-tuning method that uses a contrastive loss to learn a
hierarchical embedding based on the differing levels of information among the
various captions. Our method improves performance both on the downstream
paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet) and on the various
long video retrieval metrics we compute using our synthetic data (+3.6% R@1 for
short descriptions on ActivityNet). For data access and other
details, please refer to our project website at
https://mgwillia.github.io/10k-words.
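The abstract does not spell out the training objective, so the following is only a minimal, hypothetical sketch of what a contrastive fine-tuning step over captions of differing granularity could look like. It assumes precomputed video embeddings, a symmetric InfoNCE formulation, and three illustrative caption levels (short, medium, paragraph) with per-level weights; none of these choices are taken from the paper.

```python
# Minimal sketch (not the authors' released code): a symmetric InfoNCE loss
# applied at several caption granularities, combined with per-level weights.
# Level names, weights, and encoders are illustrative placeholders.
import torch
import torch.nn.functional as F


def info_nce(video_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched (video, caption) pairs."""
    v = F.normalize(video_emb, dim=-1)          # (B, D)
    t = F.normalize(text_emb, dim=-1)           # (B, D)
    logits = v @ t.t() / temperature            # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def hierarchical_contrastive_loss(video_emb: torch.Tensor,
                                  caption_emb_by_level: dict,
                                  level_weights: dict) -> torch.Tensor:
    """Weighted sum of contrastive losses over caption granularities.

    caption_emb_by_level: e.g. {"short": (B, D), "medium": (B, D),
    "paragraph": (B, D)} caption embeddings for the same batch of videos.
    """
    total = video_emb.new_zeros(())
    for level, text_emb in caption_emb_by_level.items():
        total = total + level_weights[level] * info_nce(video_emb, text_emb)
    return total
```

In a fine-tuning loop this loss would be back-propagated through the video and text encoders for each batch; how the paper actually ties the levels together (equal weights, margins, or an explicit embedding hierarchy) is not stated in the abstract, so the weighting above is a placeholder.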
Related papers
- CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval [24.203328970223527]
We present CaReBench, a testing benchmark for fine-grained video captioning and retrieval.
Uniquely, it provides manually separated spatial annotations and temporal annotations for each video.
Based on this design, we introduce two evaluation metrics, ReBias and CapST, specifically tailored for video retrieval and video captioning tasks.
arXiv Detail & Related papers (2024-12-31T15:53:50Z)
- Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval [56.05621657583251]
Cross-modal (e.g. image-text, video-text) retrieval is an important task in the fields of information retrieval and multimodal vision-language understanding.
We introduce RTime, a novel temporal-emphasized video-text retrieval dataset.
Our RTime dataset currently consists of 21k videos with 10 captions per video, totalling about 122 hours.
arXiv Detail & Related papers (2024-12-26T11:32:00Z)
- Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos.
Our model uses a novel autoregressive factorized decoding architecture.
Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
arXiv Detail & Related papers (2024-11-22T02:46:44Z)
- Personalized Video Summarization by Multimodal Video Understanding [2.1372652192505703]
We present a pipeline called Video Summarization with Language (VSL) for user-preferred video summarization.
VSL is based on pre-trained visual language models (VLMs) to avoid the need to train a video summarization system on a large training dataset.
We show that our method is more adaptable across different datasets compared to supervised query-based video summarization models.
arXiv Detail & Related papers (2024-11-05T22:14:35Z)
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of about 145 words, which is over 10x longer than the captions in most video-text datasets.
The accompanying model trained on Vript is capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z)
- CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios [15.058793892803008]
Vision-Language Models pre-trained on large-scale image-text datasets have shown superior performance in downstream tasks such as image retrieval.
We establish the first large-scale cover-text benchmark for Chinese short video search scenarios.
UniCLIP has been deployed in Tencent's online video search systems, which serve hundreds of millions of visits, and has achieved significant gains.
arXiv Detail & Related papers (2024-01-19T03:54:58Z)
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that can take into account the longer text of subtitles, allowing us to capture contextual information beyond a single sentence (a generic prompting sketch appears after this list).
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-01-05T21:53:19Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
- Text Synopsis Generation for Egocentric Videos [72.52130695707008]
We propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video.
Users can read the short text to gain insight into the video and, more importantly, efficiently search through the content of a large video database.
arXiv Detail & Related papers (2020-05-08T00:28:00Z)
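The HowToCaption entry above and the main abstract both describe prompting an LLM to rewrite existing annotations (subtitles or paragraph descriptions) into new captions. Below is a generic, hypothetical sketch of such a step; the prompt wording, the three granularity levels, and the `generate` callback are stand-ins for whichever LLM interface is actually used, not code from either paper.

```python
# Hypothetical sketch of LLM-based caption transformation. `generate` stands in
# for any text-generation API; prompts and granularities are illustrative.
from typing import Callable, Dict

PROMPTS: Dict[str, str] = {
    "short": "Summarize this video description in one short phrase.",
    "medium": "Rewrite this video description in two or three sentences.",
    "detailed": "Rewrite this video description as a moment-by-moment account.",
}


def diversify_captions(annotation: str,
                       generate: Callable[[str], str]) -> Dict[str, str]:
    """Turn one paragraph-level annotation into captions at several lengths."""
    captions = {}
    for level, instruction in PROMPTS.items():
        prompt = f"{instruction}\n\nDescription:\n{annotation}"
        captions[level] = generate(prompt).strip()
    return captions
```

Generated captions would still need screening before use; the main paper reports validating its pipeline's outputs with rigorous human inspection.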