Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
- URL: http://arxiv.org/abs/2210.16428v3
- Date: Mon, 29 May 2023 03:53:01 GMT
- Title: Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
- Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan
Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan
Kılıç, Wenwu Wang
- Abstract summary: How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
- Score: 54.4258176885084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio captioning aims to generate text descriptions of audio clips. In the
real world, many objects produce similar sounds. How to accurately recognize
ambiguous sounds is a major challenge for audio captioning. In this work,
inspired by inherent human multimodal perception, we propose visually-aware
audio captioning, which makes use of visual information to help the description
of ambiguous sounding objects. Specifically, we introduce an off-the-shelf
visual encoder to extract video features and incorporate the visual features
into an audio captioning system. Furthermore, to better exploit complementary
audio-visual contexts, we propose an audio-visual attention mechanism that
adaptively integrates audio and visual context and removes the redundant
information in the latent space. Experimental results on AudioCaps, the largest
audio captioning dataset, show that our proposed method achieves
state-of-the-art results on machine translation metrics.
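To make the described architecture more concrete, here is a minimal sketch of how an adaptive audio-visual attention block of this kind could be wired in PyTorch. It assumes pre-extracted audio and video embeddings of matching dimension; all class, module, and parameter names are illustrative, not the authors' released implementation.
```python
# Illustrative sketch only: adaptive fusion of audio and visual context via
# cross-attention and a learned gate. Names and shapes are assumptions, not
# the paper's actual code.
import torch
import torch.nn as nn

class AdaptiveAudioVisualFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Audio tokens attend over visual tokens to pull in complementary context.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # A per-token gate decides how much visual context to keep,
        # suppressing redundant information.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T_a, dim) audio embeddings
        # visual: (batch, T_v, dim) video-frame embeddings
        visual_ctx, _ = self.cross_attn(query=audio, key=visual, value=visual)
        g = self.gate(torch.cat([audio, visual_ctx], dim=-1))
        fused = self.norm(audio + g * visual_ctx)  # gated residual fusion
        return fused  # would feed the caption decoder in place of audio-only features

# Dummy usage
fusion = AdaptiveAudioVisualFusion(dim=512)
audio_feats = torch.randn(2, 32, 512)
video_feats = torch.randn(2, 16, 512)
print(fusion(audio_feats, video_feats).shape)  # torch.Size([2, 32, 512])
```
In this sketch the learned gate stands in for the adaptive integration step: where the visual context duplicates what the audio already conveys, the gate can push its contribution toward zero, in line with the abstract's goal of removing redundant information in the latent space.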
Related papers
- Can Textual Semantics Mitigate Sounding Object Segmentation Preference? [10.368382203643739]
We argue that audio lacks robust semantics compared to vision, resulting in weak audio guidance over the visual space.
Motivated by the fact that the text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance.
Our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets.
arXiv Detail & Related papers (2024-07-15T17:45:20Z)
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) to generate captions, guided by a pre-trained audio-language model; a rough sketch of this guidance idea appears after the related-papers list.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Speech inpainting: Context-based speech synthesis guided by video [29.233167442719676]
This paper focuses on the problem of audio-visual speech inpainting, the task of synthesizing the missing speech in a corrupted audio segment.
We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio.
We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
arXiv Detail & Related papers (2023-06-01T09:40:47Z)
- Epic-Sounds: A Large-scale Dataset of Actions That Sound [64.24297230981168]
Epic-Sounds is a large-scale dataset of audio annotations capturing the temporal extents and class labels of audible events.
We identify actions that can be discriminated purely from audio by grouping free-form descriptions of the audio into classes.
Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions distributed across 44 classes, as well as 39.2k non-categorised segments.
arXiv Detail & Related papers (2023-02-01T18:19:37Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Multi-modal Dense Video Captioning [18.592384822257948]
We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
arXiv Detail & Related papers (2020-03-17T15:15:17Z)
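As a rough illustration of the audio-language-model guidance idea mentioned in the ZerAuCap entry above, here is a minimal sketch in which a language model proposes candidate captions and a pre-trained audio-text model (for example, a CLAP-style model) picks the one closest to the audio clip in a shared embedding space. The helper functions and names are hypothetical placeholders, not the ZerAuCap implementation or API.
```python
# Illustrative sketch of audio-language-model guidance for zero-shot audio
# captioning (ZerAuCap-style reranking). `propose_captions` stands in for an
# LLM prompted with audio context keywords; `embed_text` and the precomputed
# audio embedding stand in for a pre-trained audio-text model. None of these
# names come from the paper.
from typing import Callable, List, Tuple
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def guided_caption(
    audio_embedding: np.ndarray,
    propose_captions: Callable[[], List[str]],
    embed_text: Callable[[str], np.ndarray],
) -> Tuple[str, float]:
    # The LLM proposes candidates; the audio-text model keeps the caption
    # that best matches the audio in the shared embedding space.
    candidates = propose_captions()
    scores = [cosine(audio_embedding, embed_text(c)) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Dummy usage with random vectors standing in for real model outputs
rng = np.random.default_rng(0)
caption, score = guided_caption(
    audio_embedding=rng.normal(size=512),
    propose_captions=lambda: ["a dog barks nearby", "rain falls on a roof"],
    embed_text=lambda text: rng.normal(size=512),
)
print(caption, round(score, 3))
```
In a real system, the candidate generation would call the LLM with audio context keywords, and both embedding functions would come from the pre-trained audio-language model.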
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.