Audio Difference Captioning Utilizing Similarity-Discrepancy
Disentanglement
- URL: http://arxiv.org/abs/2308.11923v1
- Date: Wed, 23 Aug 2023 05:13:25 GMT
- Title: Audio Difference Captioning Utilizing Similarity-Discrepancy
Disentanglement
- Authors: Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio
Kashino
- Abstract summary: The ADC solves the problem that conventional audio captioning sometimes generates similar captions for similar audio clips, failing to describe the difference in content.
We also propose a cross-attention-concentrated transformer encoder to extract differences by comparing a pair of audio clips and a similarity-discrepancy disentanglement to emphasize the difference in the latent space.
Experiments on the AudioDiffCaps dataset showed that the proposed methods solve the ADC task effectively and, as visualizations of the attention weights in the transformer encoder confirm, improve those weights for extracting the differences.
- Score: 22.924746293106715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Audio Difference Captioning (ADC) as a new extension task of
audio captioning for describing the semantic differences between input pairs of
similar but slightly different audio clips. The ADC solves the problem that
conventional audio captioning sometimes generates similar captions for similar
audio clips, failing to describe the difference in content. We also propose a
cross-attention-concentrated transformer encoder to extract differences by
comparing a pair of audio clips and a similarity-discrepancy disentanglement to
emphasize the difference in the latent space. To evaluate the proposed methods,
we built an AudioDiffCaps dataset consisting of pairs of similar but slightly
different audio clips with human-annotated descriptions of their differences.
Experiments on the AudioDiffCaps dataset showed that the proposed methods
solve the ADC task effectively and, as confirmed by visualizing the attention
weights in the transformer encoder, improve those weights for extracting the differences.
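The paper's implementation is not reproduced here, but a minimal PyTorch-style sketch may clarify the kind of pipeline the abstract describes: each clip cross-attends to the other, and two projections split the pooled latent into a "similar" part (pulled together across the pair) and a "discrepant" part (which a caption decoder, not shown, would consume). All module names, shapes, and loss terms below are illustrative assumptions rather than the authors' method.

```python
# Minimal sketch, assuming frame-level clip embeddings of size 256; this is an
# illustration of cross-attention plus a similarity/discrepancy split, not the
# authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairDifferenceEncoder(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_similar = nn.Linear(dim, dim)     # projection to the "shared content" part
        self.to_discrepant = nn.Linear(dim, dim)  # projection to the "difference" part

    def forward(self, a, b):
        # a, b: (batch, time, dim) embeddings of the two clips in a pair
        a_ctx, _ = self.cross_attn(query=a, key=b, value=b)  # clip A attends to clip B
        b_ctx, _ = self.cross_attn(query=b, key=a, value=a)  # clip B attends to clip A
        a_vec, b_vec = a_ctx.mean(dim=1), b_ctx.mean(dim=1)  # pool over time
        return (self.to_similar(a_vec), self.to_discrepant(a_vec),
                self.to_similar(b_vec), self.to_discrepant(b_vec))

def disentanglement_loss(sim_a, dis_a, sim_b, dis_b):
    # One illustrative choice, not the paper's loss: pull the "similar" parts of the
    # pair together and decorrelate each clip's "similar" and "discrepant" parts.
    pull = F.mse_loss(sim_a, sim_b)
    ortho = (F.cosine_similarity(sim_a, dis_a).abs().mean()
             + F.cosine_similarity(sim_b, dis_b).abs().mean())
    return pull + ortho

if __name__ == "__main__":
    enc = PairDifferenceEncoder()
    a, b = torch.randn(2, 100, 256), torch.randn(2, 100, 256)  # 2 pairs, 100 frames
    print(disentanglement_loss(*enc(a, b)).item())
```

In such a setup the discrepant vectors would be passed to a caption decoder trained with the usual cross-entropy loss, with the disentanglement term added as an auxiliary objective.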
Related papers
- LAVSS: Location-Guided Audio-Visual Spatial Audio Separation [52.44052357829296]
We propose a location-guided audio-visual spatial audio separator.
The proposed LAVSS is inspired by the correlation between spatial audio and visual location.
In addition, we adopt a pre-trained monaural separator to transfer knowledge from rich mono sounds to boost spatial audio separation.
arXiv Detail & Related papers (2023-10-31T13:30:24Z) - Audio Difference Learning for Audio Captioning [44.55621877667949]
This study introduces a novel training paradigm, audio difference learning, for improving audio captioning.
In the experiments using the Clotho and ESC50 datasets, the proposed method demonstrated an improvement in the SPIDEr score by 7% compared to conventional methods.
arXiv Detail & Related papers (2023-09-15T04:11:37Z) - Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges' Award at Task 6A of the DCASE Challenge 2022.
arXiv Detail & Related papers (2023-04-06T07:58:27Z) - Towards Generating Diverse Audio Captions via Adversarial Training [33.76154801580643]
We propose a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems.
A caption generator and two hybrid discriminators compete and are learned jointly; the caption generator can be any standard encoder-decoder captioning model.
The results show that our proposed model can generate captions with better diversity as compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-12-05T05:06:19Z) - Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Caption Feature Space Regularization for Audio Captioning [24.40864471466915]
General audio captioning models handle the one-to-many mapping between an audio clip and its captions by randomly selecting one correlated caption as the ground truth for each audio.
We propose a two-stage framework for audio captioning: (i) in the first stage, via contrastive learning, we construct a proxy feature space to reduce the distances between captions correlated to the same audio, and (ii) in the second stage, the proxy feature space is utilized as additional supervision to encourage the model to be optimized in the direction that benefits all the correlated captions.
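As a rough illustration of the first stage just described, the snippet below builds such a proxy space with a supervised-contrastive (InfoNCE-style) objective that pulls together captions sharing the same audio clip; the caption encoder, shapes, and temperature are assumptions, not the paper's implementation.

```python
# Illustrative sketch only: captions of the same audio clip are treated as
# positives in a supervised-contrastive loss over caption embeddings.
import torch
import torch.nn.functional as F

def caption_contrastive_loss(caption_emb, audio_ids, temperature=0.07):
    """caption_emb: (N, dim) caption embeddings; audio_ids: (N,) audio-clip id per caption."""
    z = F.normalize(caption_emb, dim=1)
    sim = z @ z.t() / temperature                        # pairwise similarities
    positives = audio_ids.unsqueeze(0) == audio_ids.unsqueeze(1)
    positives.fill_diagonal_(False)                      # a caption is not its own positive
    sim = sim.masked_fill(torch.eye(len(z), dtype=torch.bool), float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    per_pos = log_prob.masked_fill(~positives, 0.0)      # keep only positive pairs
    denom = positives.sum(dim=1).clamp(min=1)
    loss = -per_pos.sum(dim=1) / denom
    return loss[positives.any(dim=1)].mean()             # skip captions with no positive

# toy usage: 8 caption embeddings, two captions per audio clip
emb = torch.randn(8, 128)
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(caption_contrastive_loss(emb, ids).item())
```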
arXiv Detail & Related papers (2022-04-18T17:07:31Z) - Using multiple reference audios and style embedding constraints for
speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z) - Audio Captioning with Composition of Acoustic and Semantic Information [1.90365714903665]
We present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings.
To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings.
Our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics.
arXiv Detail & Related papers (2021-05-13T15:30:14Z)
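For the BiGRU entry above, a hedged sketch of what fusing the three feature types might look like: the features are simply concatenated per frame and encoded bidirectionally. Frame-aligned inputs, dimensions, and module names are assumptions for illustration, not the paper's code.

```python
# Illustrative sketch only: concatenate log-Mel, VGGish-style, and PANN-style
# features (assumed frame-aligned) and encode them with a bidirectional GRU.
import torch
import torch.nn as nn

class BiGRUAudioEncoder(nn.Module):
    def __init__(self, mel_dim=64, vggish_dim=128, pann_dim=2048, hidden=256):
        super().__init__()
        self.bigru = nn.GRU(mel_dim + vggish_dim + pann_dim, hidden,
                            batch_first=True, bidirectional=True)

    def forward(self, mel, vggish, pann):
        # each input: (batch, time, dim); the output would feed a caption decoder (not shown)
        x = torch.cat([mel, vggish, pann], dim=-1)
        out, _ = self.bigru(x)
        return out  # (batch, time, 2 * hidden)

if __name__ == "__main__":
    enc = BiGRUAudioEncoder()
    mel, vgg, pann = torch.randn(2, 50, 64), torch.randn(2, 50, 128), torch.randn(2, 50, 2048)
    print(enc(mel, vgg, pann).shape)  # torch.Size([2, 50, 512])
```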