Audiovisual Moments in Time: A Large-Scale Annotated Dataset of
Audiovisual Actions
- URL: http://arxiv.org/abs/2308.09685v1
- Date: Fri, 18 Aug 2023 17:13:45 GMT
- Title: Audiovisual Moments in Time: A Large-Scale Annotated Dataset of
Audiovisual Actions
- Authors: Michael Joannou, Pia Rotshtein, Uta Noppeney
- Abstract summary: We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events.
The dataset includes the annotation of 57,177 audiovisual videos, each independently evaluated by 3 of 11 trained participants.
- Score: 1.1510009152620668
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of
audiovisual action events. In an extensive annotation task, 11 participants
labelled a subset of 3-second audiovisual videos from the Moments in Time
dataset (MIT). For each trial, participants assessed whether the labelled
audiovisual action event was present and whether it was the most prominent
feature of the video. The dataset includes the annotation of 57,177 audiovisual
videos, each independently evaluated by 3 of 11 trained participants. From this
initial collection, we created a curated test set of 16 distinct action
classes, with 60 videos each (960 videos). We also offer 2 sets of pre-computed
audiovisual feature embeddings, using VGGish/YamNet for audio data and
VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for
audiovisual DNN research. We explored the advantages of AVMIT annotations and
feature embeddings to improve performance on audiovisual event recognition. A
series of 6 Recurrent Neural Networks (RNNs) was trained on either
AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and
then tested on our audiovisual test set. In all RNNs, top-1 accuracy was
increased by 2.71-5.94% by training exclusively on audiovisual events, even
outweighing a three-fold increase in training data. We anticipate that the
newly annotated AVMIT dataset will serve as a valuable resource for research
and comparative experiments involving computational models and human
participants, specifically when addressing research questions where audiovisual
correspondence is of critical importance.
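The pre-computed embeddings described above can be reproduced in spirit with publicly available backbones. Below is a minimal sketch, assuming the VGGish and YAMNet checkpoints hosted on TensorFlow Hub and the Keras VGG16/EfficientNetB0 applications; the exact preprocessing and frame-sampling choices of the AVMIT release are not reproduced here.

```python
# Minimal sketch: extracting audiovisual feature embeddings in the spirit of AVMIT.
# Model sources and preprocessing are illustrative assumptions, not the exact AVMIT pipeline.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Audio backbones (both expect a mono 16 kHz float32 waveform in [-1, 1]).
vggish = hub.load("https://tfhub.dev/google/vggish/1")   # waveform -> [frames, 128]
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")   # waveform -> (scores, [frames, 1024], spectrogram)

# Visual backbones (global average pooling gives one vector per frame).
vgg16 = tf.keras.applications.VGG16(weights="imagenet", include_top=False, pooling="avg")
effnet = tf.keras.applications.EfficientNetB0(weights="imagenet", include_top=False, pooling="avg")

def embed_audio(waveform_16k: np.ndarray):
    """Per-frame VGGish and YAMNet embeddings for a 3-second audio track.

    Note: the two models use different frame hops, so their sequence lengths differ.
    """
    wav = tf.convert_to_tensor(waveform_16k, dtype=tf.float32)
    vggish_emb = vggish(wav)                              # [frames, 128]
    _, yamnet_emb, _ = yamnet(wav)                        # [frames', 1024]
    return vggish_emb.numpy(), yamnet_emb.numpy()

def embed_frames(frames_rgb: np.ndarray):
    """Per-frame VGG16 and EfficientNetB0 embeddings for sampled 224x224 RGB frames."""
    x = frames_rgb.astype(np.float32)                     # [T, 224, 224, 3]
    vgg_emb = vgg16(tf.keras.applications.vgg16.preprocess_input(x.copy()))           # [T, 512]
    eff_emb = effnet(tf.keras.applications.efficientnet.preprocess_input(x.copy()))   # [T, 1280]
    return vgg_emb.numpy(), eff_emb.numpy()
```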
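The recognition experiments train recurrent classifiers on such embeddings. The sketch below is a minimal, assumed configuration (a single GRU over per-timestep concatenated audio and visual features with a 16-way softmax matching the curated test set); the layer sizes, sequence length and training settings are illustrative and are not the six RNN configurations reported in the paper.

```python
# Minimal sketch of an audiovisual RNN classifier over precomputed embeddings.
# Architecture and hyperparameters are illustrative assumptions, not the paper's models.
import tensorflow as tf

NUM_CLASSES = 16     # curated AVMIT test set: 16 action classes
AUDIO_DIM = 128      # e.g. VGGish embedding size
VISUAL_DIM = 1280    # e.g. EfficientNetB0 pooled embedding size
TIMESTEPS = 3        # assumed: one embedding per second of the 3-second clip

def build_audiovisual_rnn() -> tf.keras.Model:
    """GRU over per-timestep concatenated audio and visual embeddings."""
    audio_in = tf.keras.Input(shape=(TIMESTEPS, AUDIO_DIM), name="audio")
    visual_in = tf.keras.Input(shape=(TIMESTEPS, VISUAL_DIM), name="visual")
    fused = tf.keras.layers.Concatenate(axis=-1)([audio_in, visual_in])
    h = tf.keras.layers.GRU(256)(fused)                   # final state summarises the clip
    out = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(h)
    model = tf.keras.Model([audio_in, visual_in], out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])                   # top-1 accuracy
    return model

# Example usage (audio_embs: [N, 3, 128], visual_embs: [N, 3, 1280], labels: [N]):
# model = build_audiovisual_rnn()
# model.fit({"audio": audio_embs, "visual": visual_embs}, labels, epochs=10)
```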
Related papers
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline [53.07236039168652]
We focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video.
We introduce the first Untrimmed Audio-Visual dataset, which contains 10K untrimmed videos with over 30K audio-visual events.
Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass.
arXiv Detail & Related papers (2023-03-22T22:00:17Z)
- OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset [14.619865864254924]
The Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset is the largest among publicly available audio-visual speech datasets.
The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers in a studio setup with nine different viewpoints and various noise situations.
arXiv Detail & Related papers (2023-01-16T11:40:50Z)
- Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z)
- Audiovisual transfer learning for audio tagging and sound event detection [21.574781022415372]
We study the merit of transfer learning for two sound recognition problems, i.e., audio tagging and sound event detection.
We adapt a baseline system utilizing only spectral acoustic inputs to make use of pretrained auditory and visual features.
We perform experiments with these modified models on an audiovisual multi-label data set.
arXiv Detail & Related papers (2021-06-09T21:55:05Z)
- APES: Audiovisual Person Search in Untrimmed Video [87.4124877066541]
We present the Audiovisual Person Search dataset (APES).
APES contains over 1.9K identities labeled along 36 hours of video.
A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity.
arXiv Detail & Related papers (2021-06-03T08:16:42Z)
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most of the information, and that including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.