Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
- URL: http://arxiv.org/abs/2105.04489v1
- Date: Mon, 10 May 2021 16:30:46 GMT
- Title: Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
- Authors: Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio
Feris, James Glass, Aude Oliva
- Abstract summary: We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
- Score: 75.77044856100349
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When people observe events, they are able to abstract key information and
build concise summaries of what is happening. These summaries include
contextual and semantic information describing the important high-level details
(what, where, who and how) of the observed event and exclude background
information that is deemed unimportant to the observer. With this in mind, the
descriptions people generate for videos of different dynamic events can greatly
improve our understanding of the key information of interest in each video.
These descriptions can be captured in captions that provide expanded attributes
for video labeling (e.g. actions/objects/scenes/sentiment/etc.) while allowing
us to gain new insight into what people find important or necessary to
summarize specific events. Existing caption datasets for video understanding
are either small in scale or restricted to a specific domain. To address this,
we present the Spoken Moments (S-MiT) dataset of 500k spoken captions each
attributed to a unique short video depicting a broad range of different events.
We collect our descriptions using audio recordings to ensure that they remain
as natural and concise as possible while allowing us to scale the size of a
large classification dataset. In order to utilize our proposed dataset, we
present a novel Adaptive Mean Margin (AMM) approach to contrastive learning and
evaluate our models on video/caption retrieval on multiple datasets. We show
that our AMM approach consistently improves our results and that models trained
on our Spoken Moments dataset generalize better than those trained on other
video-caption datasets.
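The abstract names an Adaptive Mean Margin (AMM) objective for contrastive video/caption retrieval but does not spell out its form here. The snippet below is a minimal, hypothetical PyTorch sketch of an adaptive-margin contrastive loss, assuming the margin for each anchor is derived from the mean similarity of its in-batch negatives; the function name `adaptive_mean_margin_loss` and the exact margin rule are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F


def adaptive_mean_margin_loss(video_emb, caption_emb, base_margin=0.1):
    """Hypothetical adaptive-margin contrastive loss for video/caption retrieval.

    video_emb, caption_emb: (B, D) tensors; row i of each forms a positive pair.
    The margin rule (base margin plus the gap between the positive and the mean
    negative similarity) is an assumption, not the paper's exact definition.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)

    sim = video_emb @ caption_emb.t()                      # (B, B) cosine similarities
    B = sim.size(0)
    pos = sim.diag().unsqueeze(1)                          # (B, 1) positive-pair similarities
    neg_mask = ~torch.eye(B, dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(~neg_mask, 0.0)                  # zero the diagonal (positives)

    # Per-anchor mean negative similarity, for each retrieval direction.
    mean_neg_v = neg.sum(dim=1, keepdim=True) / (B - 1)    # video -> caption
    mean_neg_c = neg.sum(dim=0, keepdim=True) / (B - 1)    # caption -> video

    # Assumed adaptive rule: the margin grows with the (detached) gap between
    # the positive similarity and the mean negative similarity.
    margin_v = base_margin + (pos - mean_neg_v).clamp(min=0.0).detach()
    margin_c = base_margin + (pos.t() - mean_neg_c).clamp(min=0.0).detach()

    # Hinge losses over both retrieval directions, averaged over the negatives.
    loss_v2c = F.relu(margin_v + sim - pos)[neg_mask].mean()
    loss_c2v = F.relu(margin_c + sim - pos.t())[neg_mask].mean()
    return 0.5 * (loss_v2c + loss_c2v)
```

In a training loop, `video_emb` and `caption_emb` would come from the video encoder and the spoken-caption encoder (audio or ASR transcript), respectively.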
Related papers
- Enhancing Long Video Understanding via Hierarchical Event-Based Memory [9.800516656566774]
We propose a Hierarchical Event-based Memory-enhanced LLM (HEM-LLM) for better understanding of long videos.
Firstly, we design a novel adaptive sequence segmentation scheme to divide a long video into multiple events.
Secondly, while modeling the current event, we compress and inject information from the previous event to enhance long-term inter-event dependencies in videos.
arXiv Detail & Related papers (2024-09-10T07:53:10Z)
- SPOT! Revisiting Video-Language Models for Event Understanding [31.49859545456809]
We introduce SPOT Prober to benchmark existing video-language models' capacity to distinguish event-level discrepancies.
We evaluate the existing video-language models with these positive and negative captions and find they fail to distinguish most of the manipulated events.
Based on our findings, we propose to plug in these manipulated event captions as hard negative samples and find them effective in enhancing models for event understanding.
arXiv Detail & Related papers (2023-11-21T18:43:07Z)
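The SPOT entry above proposes plugging manipulated event captions in as hard negative samples. As a rough, generic illustration of that idea (not the SPOT paper's actual objective), the sketch below appends per-video manipulated-caption embeddings as extra negatives in an InfoNCE-style loss; the names `info_nce_with_hard_negatives` and `hard_neg_emb` are hypothetical.

```python
import torch
import torch.nn.functional as F


def info_nce_with_hard_negatives(video_emb, caption_emb, hard_neg_emb, temperature=0.07):
    """Generic InfoNCE-style loss with manipulated-event captions as extra negatives.

    video_emb, caption_emb: (B, D) paired embeddings (positives on the diagonal).
    hard_neg_emb: (B, K, D) embeddings of K manipulated captions per video.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    hard_neg_emb = F.normalize(hard_neg_emb, dim=-1)

    sim_batch = video_emb @ caption_emb.t() / temperature                          # (B, B)
    sim_hard = torch.einsum('bd,bkd->bk', video_emb, hard_neg_emb) / temperature   # (B, K)
    logits = torch.cat([sim_batch, sim_hard], dim=1)                               # (B, B + K)
    targets = torch.arange(video_emb.size(0), device=video_emb.device)             # positive indices
    return F.cross_entropy(logits, targets)
```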
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our automatically constructed data achieve downstream performance similar to models trained on existing video datasets of comparable scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and gates that pass the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as the premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
- OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail Enhancement [44.228748086927375]
We introduce the video-based object-oriented video captioning network (OVC-Net) via a temporal graph and detail enhancement.
To demonstrate its effectiveness, we conduct experiments on the new dataset and compare against state-of-the-art video captioning methods.
arXiv Detail & Related papers (2020-03-08T04:34:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.