Furnishing Sound Event Detection with Language Model Abilities
- URL: http://arxiv.org/abs/2308.11530v1
- Date: Tue, 22 Aug 2023 15:59:06 GMT
- Title: Furnishing Sound Event Detection with Language Model Abilities
- Authors: Hualei Wang, Jianguo Mao, Zhifang Guo, Jiarui Wan, Hong Liu, Xiangdong Wang
- Abstract summary: We propose an elegant method that aligns audio features and text features to accomplish sound event classification and temporal localization.
The framework consists of an acoustic encoder, a contrastive module that aligns the corresponding representations of the text and audio, and a decoupled language decoder.
- Score: 11.435984426303419
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the capabilities of language models (LMs) have attracted
increasing attention in visual cross-modal tasks. In this paper, we further
explore the generation capacity of LMs for sound event detection (SED), beyond
the visual domain. Specifically, we propose an elegant method that aligns audio
features and text features to accomplish sound event classification and
temporal localization. The framework consists of an acoustic encoder, a
contrastive module that aligns the corresponding representations of the text
and audio, and a decoupled language decoder that generates temporal and event
sequences from the audio features. Compared with conventional works that
require complicated processing and barely utilize limited audio features, our
model is more concise and comprehensive, since the language model directly
leverages its semantic capabilities to generate the sequences. We investigate
different decoupling modules to demonstrate their effectiveness for timestamp
capture and event classification. Evaluation results show that the proposed
method produces accurate sound event detection sequences.
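A minimal PyTorch sketch of the framework the abstract describes: an acoustic encoder, a contrastive module aligning audio and text representations, and decoupled outputs for timestamps and event labels. All module choices, sizes, and names below are illustrative assumptions, not the authors' implementation; in the paper the decoder is a language model that generates token sequences, which is approximated here by two linear heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEDWithLM(nn.Module):
    """Illustrative sketch: acoustic encoder + contrastive audio-text
    alignment + decoupled heads for timestamps and event labels."""
    def __init__(self, n_mels=64, d_model=256, vocab_size=1000):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)
        # Acoustic encoder over mel-spectrogram frames
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Projections for contrastive audio-text alignment
        self.audio_proj = nn.Linear(d_model, d_model)
        self.text_proj = nn.Linear(d_model, d_model)
        # Decoupled heads: temporal sequence vs. event sequence
        self.time_head = nn.Linear(d_model, 2)            # onset/offset per frame
        self.event_head = nn.Linear(d_model, vocab_size)  # event-token logits

    def forward(self, mels, text_emb):
        h = self.encoder(self.in_proj(mels))              # (B, T, d_model)
        a = F.normalize(self.audio_proj(h.mean(dim=1)), dim=-1)
        t = F.normalize(self.text_proj(text_emb), dim=-1)
        sim = a @ t.T                                     # (B, B) audio-text similarity
        return self.time_head(h), self.event_head(h), sim

model = SEDWithLM()
mels = torch.randn(4, 100, 64)       # 4 clips, 100 frames, 64 mel bins
text_emb = torch.randn(4, 256)       # paired text embeddings (e.g. from an LM)
times, events, sim = model(mels, text_emb)
align_loss = F.cross_entropy(sim / 0.07, torch.arange(4))  # InfoNCE-style alignment
```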
Related papers
- Classification of Spontaneous and Scripted Speech for Multilingual Audio [9.925703861731506]
Distinguishing scripted from spontaneous speech is essential for understanding how speaking styles influence speech processing.
This paper addresses the challenge of building a classifier that generalises well across different formats and languages.
We systematically evaluate models ranging from traditional, handcrafted acoustic and prosodic features to advanced audio transformers.
arXiv Detail & Related papers (2024-12-16T15:45:10Z)
- Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs [3.8300818830608345]
Multi-modal contrastive learning strategies for audio and text have rapidly gained interest.
The ability of these models to understand natural language and temporal relations remains a largely unexplored and open field of research.
We propose to equip multi-modal ALMs with temporal understanding, without losing their inherent prior capabilities on audio-language tasks, via a temporal instillation method, TeminAL.
arXiv Detail & Related papers (2024-08-17T18:53:17Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
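A toy sketch of the chain-of-information idea as summarized above: model semantic tokens first, then perceptual tokens conditioned on the semantic states. The GRU backbone, vocabulary sizes, and equal-length assumption are illustrative only, not SpeechGPT-Gen's actual architecture.

```python
import torch
import torch.nn as nn

class ChainOfInformationToy(nn.Module):
    """Toy two-stage generator: semantic tokens first, then perceptual
    tokens conditioned on the semantic states (the decoupling CoIG
    describes). All components here are assumptions for illustration."""
    def __init__(self, sem_vocab=512, per_vocab=1024, d=256):
        super().__init__()
        self.sem_emb = nn.Embedding(sem_vocab, d)
        self.sem_rnn = nn.GRU(d, d, batch_first=True)
        self.sem_out = nn.Linear(d, sem_vocab)
        self.per_emb = nn.Embedding(per_vocab, d)
        self.per_rnn = nn.GRU(2 * d, d, batch_first=True)
        self.per_out = nn.Linear(d, per_vocab)

    def forward(self, sem_tokens, per_tokens):
        # Stage 1: autoregressive modeling of semantic content
        s, _ = self.sem_rnn(self.sem_emb(sem_tokens))
        # Stage 2: perceptual tokens conditioned on semantic states
        # (toy assumption: both sequences share the same length)
        p_in = torch.cat([self.per_emb(per_tokens), s], dim=-1)
        p, _ = self.per_rnn(p_in)
        return self.sem_out(s), self.per_out(p)

m = ChainOfInformationToy()
sem = torch.randint(0, 512, (2, 30))
per = torch.randint(0, 1024, (2, 30))
sem_logits, per_logits = m(sem, per)
```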
- Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings [8.660203441911554]
We propose a methodology for training language models leveraging spoken language audio data.
This leads to an improved language model for analyzing spoken transcripts while avoiding audio-processing overhead at test time.
In our experiments, the student model achieves consistent improvement over traditional language models on tasks analyzing spoken transcripts.
arXiv Detail & Related papers (2023-11-13T01:53:12Z)
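A hedged sketch of the teacher-student setup the summary suggests: during training, align the text LM's representations with embeddings from a speech model, so only the text LM is needed at test time. The projection, dimensions, and cosine objective are assumptions for illustration, not the paper's exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_loss(student_hidden, audio_emb, proj):
    """Align pooled text-LM states with speech-teacher embeddings.
    student_hidden: (B, d_text) from the text LM on the transcript
    audio_emb:      (B, d_audio) from a speech model (e.g. Whisper)
    proj:           nn.Linear mapping d_text -> d_audio
    Only the text LM is kept at test time, so audio processing is avoided."""
    s = F.normalize(proj(student_hidden), dim=-1)
    t = F.normalize(audio_emb, dim=-1)
    return 1.0 - (s * t).sum(dim=-1).mean()  # cosine-alignment term

proj = nn.Linear(768, 512)                   # assumed dimensions
loss = distill_loss(torch.randn(8, 768), torch.randn(8, 512), proj)
```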
- Enhance audio generation controllability through representation similarity regularization [23.320569279485472]
We propose an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training.
Our proposed methods lead to improvements in objective metrics for both audio and music generation, as well as in human perceptual ratings of the generated audio.
arXiv Detail & Related papers (2023-09-15T21:32:20Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
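A minimal sketch of diffusion-style boundary denoising as described: noise ground-truth (onset, offset) pairs in a forward process, then train a network to recover the clean boundaries conditioned on audio features. The noise schedule, shapes, and MLP denoiser are assumptions, not DiffSED's implementation.

```python
import torch
import torch.nn as nn

T_STEPS = 100
betas = torch.linspace(1e-4, 0.02, T_STEPS)        # assumed noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

# Tiny MLP denoiser: (noisy boundaries, timestep, audio features) -> clean boundaries
denoiser = nn.Sequential(nn.Linear(2 + 1 + 128, 128), nn.ReLU(), nn.Linear(128, 2))

def train_step(boundaries, audio_feat):
    """boundaries: (B, 2) normalized (onset, offset); audio_feat: (B, 128)."""
    B = boundaries.size(0)
    t = torch.randint(0, T_STEPS, (B,))
    ab = alphas_bar[t].unsqueeze(-1)               # (B, 1)
    noise = torch.randn_like(boundaries)
    noisy = ab.sqrt() * boundaries + (1 - ab).sqrt() * noise  # forward noising
    inp = torch.cat([noisy, t.float().unsqueeze(-1) / T_STEPS, audio_feat], dim=-1)
    pred = denoiser(inp)                           # predict the clean boundaries
    return ((pred - boundaries) ** 2).mean()

loss = train_step(torch.rand(8, 2), torch.randn(8, 128))
```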
- Unsupervised Improvement of Audio-Text Cross-Modal Representations [19.960695758478153]
We study unsupervised approaches to improving the learning framework of such representations with unpaired text and audio.
We show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we obtain significant improvements in zero-shot classification performance.
arXiv Detail & Related papers (2023-05-03T02:30:46Z)
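A sketch of a soft-labeled contrastive loss as mentioned above: replace the one-hot identity target of standard InfoNCE with a soft target distribution over the batch. Deriving the soft labels from text-text similarity is purely an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(audio_emb, text_emb, soft_targets, tau=0.07):
    """Contrastive loss whose target is a soft distribution over the
    batch rather than the one-hot identity of standard InfoNCE.
    audio_emb, text_emb: (B, d); soft_targets: (B, B), rows sum to 1."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / tau
    return F.cross_entropy(logits, soft_targets)   # accepts probability targets

B, d = 8, 256
text = torch.randn(B, d)
tn = F.normalize(text, dim=-1)
soft = F.softmax(tn @ tn.T / 0.1, dim=-1)          # illustrative soft labels
loss = soft_contrastive_loss(torch.randn(B, d), text, soft)
```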
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
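A minimal sketch of an LSTM language model over sub-word linguistic units such as phonemes, as the summary describes; the vocabulary size and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class PhonemeLM(nn.Module):
    """Minimal LSTM language model over sub-word linguistic units
    (e.g. phonemes or syllables); sizes are illustrative assumptions."""
    def __init__(self, vocab=64, d=128, layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.lstm = nn.LSTM(d, d, num_layers=layers, batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, tokens):             # tokens: (B, T) unit ids
        h, _ = self.lstm(self.emb(tokens))
        return self.out(h)                 # next-unit logits, (B, T, vocab)

lm = PhonemeLM()
tokens = torch.randint(0, 64, (4, 20))
logits = lm(tokens)
loss = nn.functional.cross_entropy(        # next-unit prediction objective
    logits[:, :-1].reshape(-1, 64), tokens[:, 1:].reshape(-1))
```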
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations [20.239063010740853]
We present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language.
We observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification.
arXiv Detail & Related papers (2021-09-01T04:18:19Z)
- Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and language Models for Intent Classification [81.80311855996584]
We propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model.
We achieve 90.86% and 99.07% accuracy on ATIS and Fluent speech corpus, respectively.
arXiv Detail & Related papers (2021-02-15T07:20:06Z)
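A sketch of the fusion this summary describes: concatenate an acoustic embedding from a pretrained speech recognition system with a linguistic embedding from a pretrained language model, then classify intent with a small head. The dimensions and number of intents are assumptions.

```python
import torch
import torch.nn as nn

class FusionIntentClassifier(nn.Module):
    """Sketch: concatenate a pretrained acoustic embedding with a
    pretrained linguistic embedding and classify intent.
    All dimensions below are illustrative assumptions."""
    def __init__(self, d_acoustic=512, d_linguistic=768, n_intents=26):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_acoustic + d_linguistic, 256),
            nn.ReLU(),
            nn.Linear(256, n_intents))

    def forward(self, acoustic_emb, linguistic_emb):
        return self.head(torch.cat([acoustic_emb, linguistic_emb], dim=-1))

clf = FusionIntentClassifier()
logits = clf(torch.randn(4, 512), torch.randn(4, 768))
```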