AudioBERT: Audio Knowledge Augmented Language Model
- URL: http://arxiv.org/abs/2409.08199v2
- Date: Thu, 16 Jan 2025 12:17:18 GMT
- Title: AudioBERT: Audio Knowledge Augmented Language Model
- Authors: Hyunjong Ok, Suho Yoo, Jaeho Lee
- Abstract summary: Recent studies have identified that language models, pretrained on text-only datasets, often lack elementary visual knowledge. We construct a new dataset called AuditoryBench, which consists of two novel tasks for evaluating auditory knowledge. Based on our analysis using the benchmark, we find that language models also suffer from a severe lack of auditory knowledge. We propose AudioBERT, a novel method to augment the auditory knowledge of BERT through a retrieval-based approach.
- Score: 11.136112399898481
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have identified that language models, pretrained on text-only datasets, often lack elementary visual knowledge, e.g., colors of everyday objects. Motivated by this observation, we ask whether a similar shortcoming exists in terms of auditory knowledge. To answer this question, we construct a new dataset called AuditoryBench, which consists of two novel tasks for evaluating auditory knowledge. Based on our analysis using the benchmark, we find that language models also suffer from a severe lack of auditory knowledge. To address this limitation, we propose AudioBERT, a novel method to augment the auditory knowledge of BERT through a retrieval-based approach. First, we detect auditory knowledge spans in prompts to query our retrieval model efficiently. Then, we inject audio knowledge into BERT and switch on low-rank adaptation for effective adaptation when audio knowledge is required. Our experiments demonstrate that AudioBERT is quite effective, achieving superior performance on AuditoryBench. The dataset and code are available at https://github.com/HJ-Ok/AudioBERT.
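The pipeline described in the abstract (detect an auditory span, retrieve matching audio knowledge, and switch on low-rank adaptation only when audio knowledge is needed) can be illustrated with a toy sketch. This is not the authors' implementation, which is in the linked repository. The keyword-based `detect_span`, the three-dimensional embeddings in `AUDIO_LIBRARY`, and all function names are hypothetical stand-ins chosen for illustration; a real system would use a trained span detector and learned audio-text embeddings.

```python
import math

# Hypothetical toy audio library: label -> embedding (made-up numbers).
AUDIO_LIBRARY = {
    "dog barking": [0.9, 0.1, 0.0],
    "rain falling": [0.1, 0.9, 0.1],
    "violin": [0.0, 0.2, 0.9],
}


def detect_span(prompt):
    """Stand-in span detector: return the first library label found in the prompt."""
    lowered = prompt.lower()
    for label in AUDIO_LIBRARY:
        if label in lowered:
            return label
    return None


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec):
    """Return the label of the library entry most similar to the query vector."""
    return max(AUDIO_LIBRARY, key=lambda k: cosine(query_vec, AUDIO_LIBRARY[k]))


def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_linear(x, W, A, B, use_adapter):
    """Compute W x, adding the low-rank update B (A x) only when the adapter is on."""
    y = matvec(W, x)
    if use_adapter:
        delta = matvec(B, matvec(A, x))
        y = [yi + di for yi, di in zip(y, delta)]
    return y


# Toy weights: frozen 2x2 identity base layer and a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]    # shape 1 x 2
B = [[1.0], [0.0]]  # shape 2 x 1

prompt = "The sound of a dog barking is usually loud."
needs_audio = detect_span(prompt) is not None
output = lora_linear([1.0, 2.0], W, A, B, use_adapter=needs_audio)
```

Gating the adapter on span detection means that prompts without auditory content pass through the frozen base weights unchanged, which is the behavior the abstract attributes to switching on low-rank adaptation only when audio knowledge is required.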
Related papers
- Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models [11.136112399898481]
We propose Imagine to Hear, a novel approach that dynamically generates auditory knowledge using generative models.
Our framework detects multiple audio-related textual spans from the given prompt and generates corresponding auditory knowledge.
Our experiments show that our method achieves state-of-the-art performance on AuditoryBench without relying on external databases.
arXiv Detail & Related papers (2025-03-21T04:56:22Z) - Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z) - Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z) - AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations [1.2101820447447276]
Multi-modal learning in the audio-language domain has seen significant advancements in recent years.
However, audio-language learning faces challenges due to limited and lower-quality data compared to image-language tasks.
Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and corresponding audio signal processing operations.
This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio related models.
arXiv Detail & Related papers (2024-05-17T21:08:58Z) - Learning Audio Concepts from Counterfactual Natural Language [34.118579918018725]
This study introduces causal reasoning and counterfactual analysis in the audio domain.
Our model considers acoustic characteristics and sound source information from human-annotated reference texts.
Specifically, the top-1 accuracy in open-ended language-based audio retrieval task increased by more than 43%.
arXiv Detail & Related papers (2024-01-10T05:15:09Z) - Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings [8.660203441911554]
We propose a methodology for training language models leveraging spoken language audio data.
This leads to an improved language model for analyzing spoken transcripts while avoiding an audio processing overhead at test time.
In our experiments, the student model achieves consistent improvement over traditional language models on tasks analyzing spoken transcripts.
arXiv Detail & Related papers (2023-11-13T01:53:12Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model [53.492751392755636]
We propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality.
We validate the effectiveness of the proposed method through extensive experiments, and achieve new state-of-the-art performances on the widely-used LRS3 dataset.
arXiv Detail & Related papers (2023-08-15T06:38:38Z) - Improving Natural-Language-based Audio Retrieval with Transfer Learning and Audio & Text Augmentations [7.817685358710508]
We propose a system to project recordings and textual descriptions into a shared audio-caption space.
Our results show that the augmentation strategies used reduce overfitting and improve retrieval performance.
We further show that pre-training the system on the AudioCaps dataset leads to additional improvements.
arXiv Detail & Related papers (2022-08-24T11:54:42Z) - Leveraging Pre-trained BERT for Audio Captioning [45.16535378268039]
BERT is a pre-trained language model that has been extensively used in Natural Language Processing (NLP) tasks.
We conduct an empirical study on the use of these BERT models for the decoder in the audio captioning model.
Our models achieve competitive results with the existing audio captioning methods on the AudioCaps dataset.
arXiv Detail & Related papers (2022-03-06T00:05:58Z) - Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z) - Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT is capable of achieving competitive performance with those huge models in the downstream tasks.
In probing experiments, we find that the latent representations encode richer information of both phoneme and speaker than that of the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.