QuerYD: A video dataset with high-quality text and audio narrations
- URL: http://arxiv.org/abs/2011.11071v2
- Date: Wed, 17 Feb 2021 13:38:19 GMT
- Title: QuerYD: A video dataset with high-quality text and audio narrations
- Authors: Andreea-Maria Oncescu, João F. Henriques, Yang Liu, Andrew Zisserman, Samuel Albanie
- Abstract summary: We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
- Score: 85.6468286746623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce QuerYD, a new large-scale dataset for retrieval and event
localisation in video. A unique feature of our dataset is the availability of
two audio tracks for each video: the original audio, and a high-quality spoken
description of the visual content. The dataset is based on YouDescribe, a
volunteer project that assists visually-impaired people by attaching voiced
narrations to existing YouTube videos. This ever-growing collection of videos
contains highly detailed, temporally aligned audio and text annotations. The
content descriptions are more relevant than dialogue, and more detailed than
previous description attempts, which can be observed to contain many
superficial or uninformative descriptions. To demonstrate the utility of the
QuerYD dataset, we show that it can be used to train and benchmark strong
models for retrieval and event localisation. Data, code and models are made
publicly available, and we hope that QuerYD inspires further research on video
understanding with written and spoken natural language.
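Since the abstract highlights retrieval benchmarking, here is a minimal sketch of how recall@K might be computed for text-to-video retrieval over paired description and video embeddings. The cosine-similarity choice, array layout, and random placeholder embeddings are illustrative assumptions, not details drawn from the paper or its released code.

```python
import numpy as np

def recall_at_k(text_emb: np.ndarray, video_emb: np.ndarray, ks=(1, 5, 10)):
    """Text-to-video retrieval recall@K for paired embeddings.

    Both inputs are (N, D) arrays; row i of each is assumed to belong to
    the same description/video pair (an assumption about batching, not a
    QuerYD API).
    """
    # L2-normalise so the dot product becomes cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sims = t @ v.T  # (N, N): similarity of every description to every video

    # Rank of the ground-truth video for each description (0 = retrieved first).
    order = np.argsort(-sims, axis=1)
    ranks = np.argmax(order == np.arange(len(sims))[:, None], axis=1)

    return {f"R@{k}": float(np.mean(ranks < k)) for k in ks}

# Placeholder embeddings standing in for real model outputs.
rng = np.random.default_rng(0)
print(recall_at_k(rng.normal(size=(100, 256)), rng.normal(size=(100, 256))))
```

Recall@1/5/10, usually reported alongside median rank, are the standard figures for retrieval benchmarks of this kind.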
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systemic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos.
Our dataset potentially paves the path for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Fine-grained Audible Video Description [61.81122862375985]
We construct the first fine-grained audible video description benchmark (FAVDBench).
For each video clip, we first provide a one-sentence summary of the video, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end.
We demonstrate that fine-grained video descriptions can be used to generate more intricate videos than captions.
arXiv Detail & Related papers (2023-03-27T22:03:48Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content (a minimal prompt-composition sketch appears after this list).
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Multi-modal Dense Video Captioning [18.592384822257948]
We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
arXiv Detail & Related papers (2020-03-17T15:15:17Z)
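The VidIL entry above describes composing frame captions together with object, attribute, and event phrases into a few-shot prompt for a language model; the sketch below illustrates only that composition pattern. The dictionary keys, prompt wording, and task string are hypothetical and not taken from the paper.

```python
def compose_vidil_style_prompt(examples, query, task="video captioning"):
    """Assemble a few-shot prompt from per-video visual text.

    `examples` and `query` are dicts with hypothetical keys:
    "frame_captions", "objects", "attributes", "events" (lists of str)
    and, for in-context examples only, "target" (str).
    """
    def render(sample, with_target):
        lines = [
            "Frame captions: " + " | ".join(sample["frame_captions"]),
            "Objects: " + ", ".join(sample["objects"]),
            "Attributes: " + ", ".join(sample["attributes"]),
            "Events: " + ", ".join(sample["events"]),
            "Output: " + (sample["target"] if with_target else ""),
        ]
        return "\n".join(lines)

    blocks = [f"Task: {task}"]
    blocks += [render(ex, True) for ex in examples]  # completed in-context examples
    blocks.append(render(query, False))              # query video, output left blank
    return "\n\n".join(blocks)
```

The returned string would then be passed to whichever language model is being prompted, which completes the final "Output:" line for the query video.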
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.