MAD: A Scalable Dataset for Language Grounding in Videos from Movie
Audio Descriptions
- URL: http://arxiv.org/abs/2112.00431v1
- Date: Wed, 1 Dec 2021 11:47:09 GMT
- Title: MAD: A Scalable Dataset for Language Grounding in Videos from Movie
Audio Descriptions
- Authors: Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba
Heilbron, Chen Zhao, Silvio Giancola, Bernard Ghanem
- Abstract summary: We present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations.
MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets.
- Score: 109.84031235538002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent and increasing interest in video-language research has driven the
development of large-scale datasets that enable data-intensive machine learning
techniques. In comparison, limited effort has been made to assess the
fitness of these datasets for the video-language grounding task. Recent works
have begun to discover significant limitations in these datasets, suggesting
that state-of-the-art techniques commonly overfit to hidden dataset biases. In
this work, we present MAD (Movie Audio Descriptions), a novel benchmark that
departs from the paradigm of augmenting existing video datasets with text
annotations and focuses on crawling and aligning available audio descriptions
of mainstream movies. MAD contains over 384,000 natural language sentences
grounded in over 1,200 hours of video and exhibits a significant reduction in
the currently diagnosed biases for video-language grounding datasets. MAD's
collection strategy enables a novel and more challenging version of
video-language grounding, where short temporal moments (typically seconds long)
must be accurately grounded in diverse long-form videos that can last up to
three hours.
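To make the task concrete, each annotation pairs a short sentence with a start/end timestamp inside a full-length movie, and grounding predictions are commonly scored by temporal intersection-over-union (IoU) against that span (e.g., Recall@K at fixed IoU thresholds). The following minimal Python sketch is illustrative only; the field names and values are hypothetical and do not reflect MAD's actual data schema.

    # Minimal sketch of a video-language grounding sample and its evaluation.
    # Field names and values are hypothetical, not MAD's actual schema.

    def temporal_iou(pred, gt):
        """Intersection-over-union of two (start, end) intervals in seconds."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = max(pred[1], gt[1]) - min(pred[0], gt[0])
        return inter / union if union > 0 else 0.0

    # Hypothetical annotation: a seconds-long moment inside a ~2-hour movie.
    sample = {
        "movie_id": "movie_0001",                          # illustrative identifier
        "sentence": "She opens the door and steps into the rain.",
        "gt_span": (4821.3, 4825.9),                       # ground truth, in seconds
    }

    predicted_span = (4820.0, 4827.0)                      # a model's predicted moment
    iou = temporal_iou(predicted_span, sample["gt_span"])  # ~0.66
    print(f"IoU = {iou:.2f}, hit at IoU>=0.5: {iou >= 0.5}")

Under such a protocol, a prediction counts as correct at a given threshold only if its IoU with the ground-truth span meets that threshold, which is what makes grounding seconds-long moments in hours-long videos challenging.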
Related papers
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [0.0]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z) - CinePile: A Long Video Question Answering Dataset and Benchmark [58.08209212057164]
Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges.
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
arXiv Detail & Related papers (2024-05-14T17:59:02Z) - VidLA: Video-Language Alignment at Scale [48.665918882615195]
We propose VidLA, an approach for video-language alignment at scale.
Our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks.
arXiv Detail & Related papers (2024-03-21T22:36:24Z) - Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding [108.79026216923984]
Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query.
This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task.
arXiv Detail & Related papers (2023-12-31T13:53:37Z) - FSVVD: A Dataset of Full Scene Volumetric Video [2.9151420469958533]
In this paper, we focus on the currently most widely used data format, point clouds, and release the first full-scene volumetric video dataset.
A comprehensive dataset description and analysis are provided, along with potential uses of this dataset.
arXiv Detail & Related papers (2023-03-07T02:31:08Z) - Language-free Training for Zero-shot Video Grounding [50.701372436100684]
Video grounding aims to localize the time interval by understanding the text and video simultaneously.
One of the most challenging issues is the extremely time- and cost-consuming collection of annotations.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
arXiv Detail & Related papers (2022-10-24T06:55:29Z) - QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.