MAD: A Scalable Dataset for Language Grounding in Videos from Movie
Audio Descriptions
- URL: http://arxiv.org/abs/2112.00431v1
- Date: Wed, 1 Dec 2021 11:47:09 GMT
- Title: MAD: A Scalable Dataset for Language Grounding in Videos from Movie
Audio Descriptions
- Authors: Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba
Heilbron, Chen Zhao, Silvio Giancola, Bernard Ghanem
- Abstract summary: We present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations.
MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets.
- Score: 109.84031235538002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent and increasing interest in video-language research has driven the
development of large-scale datasets that enable data-intensive machine learning
techniques. In comparison, limited effort has been made to assess the
fitness of these datasets for the video-language grounding task. Recent works
have begun to discover significant limitations in these datasets, suggesting
that state-of-the-art techniques commonly overfit to hidden dataset biases. In
this work, we present MAD (Movie Audio Descriptions), a novel benchmark that
departs from the paradigm of augmenting existing video datasets with text
annotations and focuses on crawling and aligning available audio descriptions
of mainstream movies. MAD contains over 384,000 natural language sentences
grounded in over 1,200 hours of video and exhibits a significant reduction in
the currently diagnosed biases for video-language grounding datasets. MAD's
collection strategy enables a novel and more challenging version of
video-language grounding, where short temporal moments (typically seconds long)
must be accurately grounded in diverse long-form videos that can last up to
three hours.
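To make this grounding setup concrete, below is a minimal sketch (not from the paper) of a single query/window pair and the temporal IoU measure commonly used to score moment predictions in long videos. The GroundingSample fields and the example values are illustrative assumptions, not MAD's actual annotation schema.
```python
# Minimal sketch (assumptions, not MAD's real data format): a natural-language
# sentence is grounded to a short window inside a very long movie, and a
# predicted window is scored against the ground truth with temporal IoU.
from dataclasses import dataclass

@dataclass
class GroundingSample:
    sentence: str          # audio-description sentence (the language query)
    gt_start: float        # ground-truth start time, in seconds
    gt_end: float          # ground-truth end time, in seconds
    video_duration: float  # full movie length, in seconds (can be ~3 hours)

def temporal_iou(pred_start: float, pred_end: float,
                 gt_start: float, gt_end: float) -> float:
    """Intersection-over-union of two temporal windows, in seconds."""
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = max(pred_end, gt_end) - min(pred_start, gt_start)
    return intersection / union if union > 0 else 0.0

# Hypothetical example: a few-second moment inside a two-hour movie.
sample = GroundingSample("She opens the car door and steps out.",
                         gt_start=3621.4, gt_end=3624.9,
                         video_duration=7200.0)
pred = (3620.0, 3626.0)  # hypothetical model prediction
print(f"temporal IoU = {temporal_iou(*pred, sample.gt_start, sample.gt_end):.2f}")
```
The difficulty highlighted by MAD is the ratio between the two scales above: a correct window of a few seconds must be found anywhere within hours of video.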
Related papers
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models [53.235170710385006]
We introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner.
We sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge.
In experiments, Grounded-VideoLLM excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA.
arXiv Detail & Related papers (2024-10-04T10:04:37Z)
- ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models [53.9661582975843]
Video Temporal Grounding aims to ground specific segments within an untrimmed video corresponding to a given natural language query.
Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases.
We present ChatVTG, a novel approach that utilizes Video Dialogue Large Language Models (LLMs) for zero-shot video temporal grounding.
arXiv Detail & Related papers (2024-10-01T08:27:56Z)
- SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses [58.488812405557]
Video grounding aims to localize specific natural language queries in an untrimmed video.
We present a large-scale video grounding dataset named SynopGround.
We introduce a more complex setting of video grounding dubbed Multi-Paragraph Video Grounding (MPVG)
arXiv Detail & Related papers (2024-08-03T05:35:13Z)
- Multilingual Synopses of Movie Narratives: A Dataset for Vision-Language Story Understanding [19.544839928488972]
We construct a large-scale multilingual video story dataset named Multilingual Synopses of Movie Narratives (M-SYMON)
M-SYMON contains 13,166 movie summary videos from 7 languages, as well as manual annotation of fine-grained video-text correspondences for 101.5 hours of video.
Training on the human-annotated data from SyMoN outperforms SOTA methods by 15.7 and 16.2 percentage points on Clip Accuracy and Sentence IoU, respectively.
arXiv Detail & Related papers (2024-06-18T22:44:50Z)
- VidLA: Video-Language Alignment at Scale [48.665918882615195]
We propose VidLA, an approach for video-language alignment at scale.
Our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks.
arXiv Detail & Related papers (2024-03-21T22:36:24Z)
- FSVVD: A Dataset of Full Scene Volumetric Video [2.9151420469958533]
In this paper, we focus on the most widely used data format, point clouds, and for the first time release a full-scene volumetric video dataset.
A comprehensive description and analysis of the dataset are provided, along with its potential uses.
arXiv Detail & Related papers (2023-03-07T02:31:08Z)
- Language-free Training for Zero-shot Video Grounding [50.701372436100684]
Video grounding aims to localize the time interval by understanding the text and video simultaneously.
One of the most challenging issues is the extremely time- and cost-consuming collection of annotations.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
arXiv Detail & Related papers (2022-10-24T06:55:29Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.