Unsupervised Improvement of Audio-Text Cross-Modal Representations
- URL: http://arxiv.org/abs/2305.01864v3
- Date: Mon, 31 Jul 2023 18:28:36 GMT
- Title: Unsupervised Improvement of Audio-Text Cross-Modal Representations
- Authors: Zhepei Wang, Cem Subakan, Krishna Subramani, Junkai Wu, Tiago Tavares,
Fabio Ayres, Paris Smaragdis
- Abstract summary: We study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio.
We show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance.
- Score: 19.960695758478153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in using language models to obtain cross-modal audio-text
representations have overcome the limitations of conventional training
approaches that use predefined labels. This has allowed the community to make
progress in tasks like zero-shot classification, which would otherwise not be
possible. However, learning such representations requires a large amount of
human-annotated audio-text pairs. In this paper, we study unsupervised
approaches to improve the learning framework of such representations with
unpaired text and audio. We explore domain-unspecific and domain-specific
curation methods to create audio-text pairs that we use to further improve the
model. We also show that when domain-specific curation is used in conjunction
with a soft-labeled contrastive loss, we are able to obtain significant
improvement in terms of zero-shot classification performance on downstream
sound event classification or acoustic scene classification tasks.
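The abstract's two key ingredients — a soft-labeled contrastive loss over audio-text pairs and zero-shot classification by embedding similarity — can be sketched as below. This is an illustrative minimal sketch, not the authors' implementation: the function names, the temperature value, and the assumption that soft targets are given as a row-normalized similarity matrix are all assumptions made here for clarity.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def soft_contrastive_loss(audio_emb, text_emb, soft_targets, temperature=0.07):
    """Symmetric contrastive loss with soft (non-one-hot) targets.

    soft_targets[i, j] is an assumed similarity weight between audio clip i
    and caption j; each row should sum to 1. With soft_targets = identity,
    this reduces to the standard CLIP/CLAP-style contrastive loss.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature
    loss_a2t = -(soft_targets * log_softmax(logits)).sum(axis=1).mean()
    loss_t2a = -(soft_targets.T * log_softmax(logits.T)).sum(axis=1).mean()
    return 0.5 * (loss_a2t + loss_t2a)

def zero_shot_classify(audio_emb, class_text_emb):
    """Assign each clip to the class whose text embedding is most similar."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    c = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    return np.argmax(a @ c.T, axis=1)
```

In zero-shot use, `class_text_emb` would hold text-encoder outputs for prompts like "the sound of a dog barking", so no labeled audio is needed at classification time.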
Related papers
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Audio-visual Generalized Zero-shot Learning the Easy Way [20.60905505473906]
We introduce EZ-AVGZL, which aligns audio-visual embeddings with transformed text representations.
We conduct extensive experiments on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks.
arXiv Detail & Related papers (2024-07-18T01:57:16Z)
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z)
- Listenable Maps for Zero-Shot Audio Classifiers [12.446324804274628]
We introduce LMAC-Z (Listenable Maps for Audio) for the first time in the Zero-Shot context.
We show that our method produces meaningful explanations that correlate well with different text prompts.
arXiv Detail & Related papers (2024-05-27T19:25:42Z)
- Enhance audio generation controllability through representation similarity regularization [23.320569279485472]
We propose an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training.
Our proposed methods lead to improvements in objective metrics for both audio and music generation, as well as an enhancement in the human perception for audio generation.
arXiv Detail & Related papers (2023-09-15T21:32:20Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Leveraging Language Model Capabilities for Sound Event Detection [10.792576135806623]
We propose an end-to-end framework for understanding audio features while simultaneously generating sound events and their temporal locations.
Specifically, we employ pretrained acoustic models to capture discriminative features across different categories and language models for autoregressive text generation.
arXiv Detail & Related papers (2023-08-22T15:59:06Z)
- Cross-domain Voice Activity Detection with Self-Supervised Representations [9.02236667251654]
Voice Activity Detection (VAD) aims at detecting speech segments on an audio signal.
Current state-of-the-art methods focus on training a neural network exploiting features directly contained in the acoustics.
We show that representations based on Self-Supervised Learning (SSL) can adapt well to different domains.
arXiv Detail & Related papers (2022-09-22T14:53:44Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.