Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags
- URL: http://arxiv.org/abs/2010.14171v1
- Date: Tue, 27 Oct 2020 10:13:17 GMT
- Title: Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags
- Authors: Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra
- Abstract summary: We propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention mechanism.
Our results show that using multiple attention heads in the tag-based network can yield better learned audio representations.
- Score: 32.456824945999465
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and the words/tags associated with it do not employ text processing models capable of generalizing to tags unknown during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends to the output of the WEM, providing a contextualized representation of the tags associated with the audio, and we align the output of MHA with the output of the encoder of the AAE using a contrastive loss. We jointly optimize the AAE and MHA and evaluate the audio representations (i.e. the output of the encoder of the AAE) on three downstream tasks, namely sound, music genre, and music instrument classification. Our results show that using multiple attention heads in the tag-based network can yield better learned audio representations.
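To make the alignment concrete, below is a minimal PyTorch sketch of the setup described in the abstract: a multi-head self-attention layer contextualizes pretrained tag word embeddings, and a contrastive loss pulls the pooled tag representation towards the matching audio encoder output. All module sizes, the mean-pooling step, the temperature, and the names (`TagContextualizer`, `contrastive_alignment_loss`) are illustrative assumptions rather than the paper's exact configuration, and the autoencoder's reconstruction term is omitted.

```python
# Minimal sketch (assumed shapes and hyperparameters, not the paper's exact setup).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TagContextualizer(nn.Module):
    """Self-attention over the word embeddings of an audio clip's tags."""

    def __init__(self, emb_dim=300, num_heads=4, out_dim=128):
        super().__init__()
        self.mha = nn.MultiheadAttention(emb_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(emb_dim, out_dim)

    def forward(self, tag_embs, pad_mask=None):
        # tag_embs: (batch, num_tags, emb_dim) from a pretrained word embeddings model.
        ctx, _ = self.mha(tag_embs, tag_embs, tag_embs, key_padding_mask=pad_mask)
        # Mean-pool the contextualized tags into one vector per clip
        # (the pooling choice is an assumption).
        return self.proj(ctx.mean(dim=1))


def contrastive_alignment_loss(audio_z, tag_z, temperature=0.1):
    """NT-Xent-style loss treating matching (audio, tag) pairs as positives."""
    audio_z = F.normalize(audio_z, dim=-1)
    tag_z = F.normalize(tag_z, dim=-1)
    logits = audio_z @ tag_z.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(audio_z.size(0), device=audio_z.device)
    # Symmetric cross-entropy: audio -> tags and tags -> audio.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Example joint step: `audio_encoder` stands in for the encoder of the audio
# autoencoder; in practice its reconstruction loss would be added to this term.
audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(96 * 96, 128))
tag_net = TagContextualizer()
mel_patches = torch.randn(8, 1, 96, 96)     # dummy log-mel spectrogram patches
tag_word_embs = torch.randn(8, 5, 300)      # dummy word embeddings for 5 tags per clip
loss = contrastive_alignment_loss(audio_encoder(mel_patches), tag_net(tag_word_embs))
loss.backward()
```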
Related papers
- EnCodecMAE: Leveraging neural codecs for universal audio representation learning [16.590638305972632]
We propose masking representations of the audio signal, and training a MAE to reconstruct the masked segments.
We evaluate this approach, which we call EnCodecMAE, on a wide range of tasks involving speech, music and environmental sounds.
arXiv Detail & Related papers (2023-09-14T02:21:53Z)
- Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Automatic Audio Captioning using Attention weighted Event based Embeddings [25.258177951665594]
We propose an encoder-decoder architecture with lightweight (i.e. fewer learnable parameters) Bi-LSTM recurrent layers for AAC.
Our results show that an efficient AED based embedding extractor combined with temporal attention and augmentation techniques is able to surpass existing literature.
arXiv Detail & Related papers (2022-01-28T05:54:19Z)
- Audio Captioning with Composition of Acoustic and Semantic Information [1.90365714903665]
We present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings.
To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings.
Our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics.
arXiv Detail & Related papers (2021-05-13T15:30:14Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations [32.456824945999465]
We propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags.
We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks.
arXiv Detail & Related papers (2020-06-15T13:17:18Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.