Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
- URL: http://arxiv.org/abs/2502.05139v1
- Date: Fri, 07 Feb 2025 18:15:57 GMT
- Title: Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
- Authors: Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, Carleigh Wood, Ann Lee, Wei-Ning Hsu
- Abstract summary: This paper addresses the need for automated systems capable of predicting audio aesthetics without human intervention.
We propose new annotation guidelines that decompose human listening perspectives into four distinct axes.
We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality.
- Score: 46.7144966835279
- License:
- Abstract: The quantification of audio aesthetics remains a complex challenge in audio processing, primarily due to its subjective nature, which is influenced by human perception and cultural context. Traditional methods often depend on human listeners for evaluation, leading to inconsistencies and high resource demands. This paper addresses the growing need for automated systems capable of predicting audio aesthetics without human intervention. Such systems are crucial for applications like data filtering, pseudo-labeling large datasets, and evaluating generative audio models, especially as these models become more sophisticated. In this work, we introduce a novel approach to audio aesthetic evaluation by proposing new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality. Our models are evaluated against human mean opinion scores (MOS) and existing methods, demonstrating comparable or superior performance. This research not only advances the field of audio aesthetics but also provides open-source models and datasets to facilitate future work and benchmarking. We release our code and pre-trained model at: https://github.com/facebookresearch/audiobox-aesthetics
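The abstract above describes no-reference, per-item predictors that score each recording along four aesthetic axes (Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness). The sketch below only illustrates that general setup; the encoder choice, head sizes, pooling, and axis identifiers are assumptions, and the authors' actual implementation lives in the linked repository.

```python
# Minimal sketch of a no-reference, per-item aesthetic predictor with one
# regression head per axis. This illustrates the setup described in the
# abstract, not the released Audiobox Aesthetics model; see
# https://github.com/facebookresearch/audiobox-aesthetics for the real code.
import torch
import torch.nn as nn

AXES = ["production_quality", "production_complexity",
        "content_enjoyment", "content_usefulness"]

class AestheticPredictor(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int = 768):
        super().__init__()
        self.encoder = encoder  # any pretrained audio encoder returning frame features
        self.heads = nn.ModuleDict({
            axis: nn.Sequential(nn.Linear(embed_dim, 256), nn.GELU(), nn.Linear(256, 1))
            for axis in AXES  # independent regression head per aesthetic axis
        })

    def forward(self, waveform: torch.Tensor) -> dict:
        # waveform: (batch, samples); encoder assumed to return (batch, frames, embed_dim)
        feats = self.encoder(waveform)
        pooled = feats.mean(dim=1)  # per-item embedding via mean pooling
        return {axis: head(pooled).squeeze(-1) for axis, head in self.heads.items()}
```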
Related papers
- Evaluation of Deep Audio Representations for Hearables [1.5646349560044959]
This dataset includes 1,158 audio tracks, each 30 seconds long, created by spatially mixing proprietary monologues with high-quality recordings of everyday acoustic scenes.
Our benchmark encompasses eight tasks that assess the general context, speech sources, and technical acoustic properties of the audio scenes.
These results underscore the advantage of models trained on diverse audio collections, confirming their applicability to a wide array of auditory tasks, including encoding the environment properties necessary for hearable steering.
arXiv Detail & Related papers (2025-02-10T16:51:11Z)
- Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation [8.170174172545831]
This paper addresses issues through the Sound Scene Synthesis challenge held as part of the Detection and Classification of Acoustic Scenes and Events 2024.
We present an evaluation protocol that combines an objective metric, the Fréchet Audio Distance, with perceptual assessments, using a structured prompt format to enable diverse captions and effective evaluation.
arXiv Detail & Related papers (2024-10-23T06:35:41Z)
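The Fréchet Audio Distance used in the challenge above compares Gaussians fitted to embeddings of reference and generated audio. A minimal computation under that definition (the embedding model, e.g. VGGish, is assumed to have been applied beforehand) might look like:

```python
# Fréchet Audio Distance between two sets of audio embeddings (one row per clip).
import numpy as np
from scipy import linalg

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical error can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```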
- Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling? [40.3708221702947]
We aim to evaluate the quality of generated audio by examining its effectiveness as training data.
Specifically, we conduct studies to explore the use of synthetic audio for audio recognition.
We also investigate whether synthetic audio can serve as a resource for data augmentation in speech-related modeling.
arXiv Detail & Related papers (2024-06-13T04:33:05Z)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important part of creative workflows in the music and film industry.
We hypothesize that focusing on these aspects of audio generation can improve performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
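Direct Preference Optimization, as used by Tango 2 above, trains the generator to prefer the winner over the loser audio for each prompt. The generic pairwise DPO loss is sketched below; Tango 2 applies a diffusion-specific variant, so the per-item log-likelihoods here are a simplification.

```python
# Generic DPO loss over (winner, loser) pairs. Inputs are per-item log-likelihoods
# under the trained policy and a frozen reference model; Tango 2 itself uses a
# diffusion-based formulation, so this is only a schematic illustration.
import torch.nn.functional as F

def dpo_loss(policy_logp_win, policy_logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    win_margin = policy_logp_win - ref_logp_win     # implicit reward of winner
    lose_margin = policy_logp_lose - ref_logp_lose  # implicit reward of loser
    return -F.logsigmoid(beta * (win_margin - lose_margin)).mean()
```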
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification on AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data [9.072124914105325]
We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings.
Experiments on the large scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model.
arXiv Detail & Related papers (2020-05-29T01:30:14Z)
- NAViDAd: A No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder [0.0]
We propose a No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd).
The model is formed by a 2-layer framework that includes a deep autoencoder layer and a classification layer.
The model performed well when tested against the UnB-AV and the LiveNetflix-II databases.
arXiv Detail & Related papers (2020-01-30T15:40:08Z)
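The NAViDAd entry above describes a two-stage design: a deep autoencoder followed by a classification layer. A schematic of that kind of stack is sketched below; the input features, layer widths, and the five-class quality scale are placeholders rather than the published configuration.

```python
# Schematic autoencoder + classifier stack for no-reference quality prediction.
# Dimensions and the number of quality classes are illustrative assumptions only.
import torch
import torch.nn as nn

class AutoencoderQualityClassifier(nn.Module):
    def __init__(self, in_dim: int = 512, latent_dim: int = 64, n_classes: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))
        self.classifier = nn.Linear(latent_dim, n_classes)

    def forward(self, features: torch.Tensor):
        z = self.encoder(features)       # compressed audio-visual features
        recon = self.decoder(z)          # reconstruction target for autoencoder training
        logits = self.classifier(z)      # predicted quality class
        return recon, logits
```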