The Influence of Audio on Video Memorability with an Audio Gestalt
Regulated Video Memorability System
- URL: http://arxiv.org/abs/2104.11568v1
- Date: Fri, 23 Apr 2021 12:53:33 GMT
- Title: The Influence of Audio on Video Memorability with an Audio Gestalt
Regulated Video Memorability System
- Authors: Lorin Sweeney, Graham Healy, Alan F. Smeaton
- Abstract summary: We find evidence to suggest that audio can facilitate the overall recognition memorability of videos rich in high-level (gestalt) audio features.
We introduce a novel multimodal deep learning-based late-fusion system that uses audio gestalt to estimate the influence of a given video's audio on its overall short-term recognition memorability.
We benchmark our audio gestalt based system on the Memento10k short-term video memorability dataset, achieving top-2 state-of-the-art results.
- Score: 1.8506048493564673
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Memories are the tethering threads that tie us to the world, and memorability
is the measure of their tensile strength. The threads of memory are spun from
fibres of many modalities, obscuring the contribution of a single fibre to a
thread's overall tensile strength. Unfurling these fibres is the key to
understanding the nature of their interaction, and how we can ultimately create
more meaningful media content. In this paper, we examine the influence of audio
on video recognition memorability, finding evidence to suggest that it can
facilitate the overall recognition memorability of videos rich in high-level (gestalt)
audio features. We introduce a novel multimodal deep learning-based late-fusion
system that uses audio gestalt to estimate the influence of a given video's
audio on its overall short-term recognition memorability, and selectively
leverages audio features to make a prediction accordingly. We benchmark our
audio gestalt based system on the Memento10k short-term video memorability
dataset, achieving top-2 state-of-the-art results.
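To make the regulation mechanism concrete, the sketch below shows one way a late-fusion predictor could gate its audio branch on an estimated audio-gestalt score, as described in the abstract. The feature dimensions, module names, gating threshold, and equal-weight fusion rule are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of an audio-gestalt regulated late-fusion memorability predictor.
# All dimensions, thresholds, and module names are illustrative assumptions,
# not the paper's released code.
import torch
import torch.nn as nn


class AudioGestaltLateFusion(nn.Module):
    def __init__(self, vis_dim=2048, aud_dim=128, gestalt_threshold=0.5):
        super().__init__()
        # Independent branch heads, each producing a memorability score in [0, 1].
        self.visual_head = nn.Sequential(nn.Linear(vis_dim, 256), nn.ReLU(),
                                         nn.Linear(256, 1), nn.Sigmoid())
        self.audio_head = nn.Sequential(nn.Linear(aud_dim, 64), nn.ReLU(),
                                        nn.Linear(64, 1), nn.Sigmoid())
        # Audio-gestalt estimator: scores how informative the clip's high-level
        # audio is; used only to regulate the fusion, not as a prediction itself.
        self.gestalt_head = nn.Sequential(nn.Linear(aud_dim, 64), nn.ReLU(),
                                          nn.Linear(64, 1), nn.Sigmoid())
        self.gestalt_threshold = gestalt_threshold

    def forward(self, visual_feats, audio_feats):
        vis_score = self.visual_head(visual_feats)   # (B, 1)
        aud_score = self.audio_head(audio_feats)     # (B, 1)
        gestalt = self.gestalt_head(audio_feats)     # (B, 1)
        # Late fusion: only clips whose audio gestalt exceeds the threshold
        # let the audio branch influence the final memorability score.
        use_audio = (gestalt > self.gestalt_threshold).float()
        fused = use_audio * 0.5 * (vis_score + aud_score) + (1 - use_audio) * vis_score
        return fused
```

A full system in this spirit would also fold in semantic (caption) features and learn the branch weights; the hard threshold here simply stands in for whatever rule the gestalt score drives when deciding whether audio features are leveraged.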
Related papers
- ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos.
Our approach decomposes the process into three complementary stages: semantically coherent audio generation, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions.
Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z) - Seeing Voices: Generating A-Roll Video from Audio with Mirage [12.16029287095035]
Current approaches to video generation often ignore sound, focusing on general-purpose but silent image sequence generation.
We introduce Mirage, an audio-to-video foundation model that excels at generating realistic, expressive output imagery from scratch given an audio input.
arXiv Detail & Related papers (2025-06-09T22:56:02Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z) - AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this (a minimal two-stream sketch appears after this list).
arXiv Detail & Related papers (2021-05-17T08:36:10Z) - Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most of the information, and that including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z) - Leveraging Audio Gestalt to Predict Media Memorability [1.8506048493564673]
Memorability determines what evanesces into emptiness, and what worms its way into the deepest furrows of our minds.
The Predicting Media Memorability task in MediaEval 2020 aims to address the question of media memorability by setting the task of automatically predicting video memorability.
Our approach is a multimodal deep learning-based late fusion that combines visual, semantic, and auditory features.
arXiv Detail & Related papers (2020-12-31T14:50:42Z) - Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z) - Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
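For the AudioVisual Recurrent Network (AVRN) entry above, the sketch below gives a rough sense of what joint audiovisual recurrent modelling for video summarization can look like. The two-stream GRU layout, feature dimensions, and per-segment importance head are assumptions made for illustration, not the published AVRN architecture.

```python
# Illustrative two-stream recurrent fusion for audiovisual video summarization.
# Layer sizes and the fusion scheme are assumptions, not the published AVRN design.
import torch
import torch.nn as nn


class TwoStreamAVSummarizer(nn.Module):
    def __init__(self, vis_dim=1024, aud_dim=128, hidden=256):
        super().__init__()
        self.visual_rnn = nn.GRU(vis_dim, hidden, batch_first=True)
        self.audio_rnn = nn.GRU(aud_dim, hidden, batch_first=True)
        # Per-segment importance score from the concatenated hidden states.
        self.scorer = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, visual_seq, audio_seq):
        # visual_seq: (B, T, vis_dim); audio_seq: (B, T, aud_dim), aligned per segment.
        v_out, _ = self.visual_rnn(visual_seq)
        a_out, _ = self.audio_rnn(audio_seq)
        fused = torch.cat([v_out, a_out], dim=-1)   # (B, T, 2 * hidden)
        return self.scorer(fused).squeeze(-1)       # (B, T) segment importance
```

A complete summarizer would then select the top-scoring segments subject to a length budget; the sketch only covers fusing per-segment audio and visual states before scoring.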
This list is automatically generated from the titles and abstracts of the papers in this site.