SAM Audio Judge: A Unified Multimodal Framework for Perceptual Evaluation of Audio Separation
- URL: http://arxiv.org/abs/2601.19702v1
- Date: Tue, 27 Jan 2026 15:29:02 GMT
- Title: SAM Audio Judge: A Unified Multimodal Framework for Perceptual Evaluation of Audio Separation
- Authors: Helin Wang, Bowen Shi, Andros Tjandra, John Hoffman, Yi-Chiao Wu, Apoorv Vyas, Najim Dehak, Ann Lee, Wei-Ning Hsu,
- Abstract summary: This paper addresses the need for automated systems capable of evaluating audio separation without human intervention.<n>The proposed evaluation metric, SAM Audio Judge (SAJ), is a multimodal fine-grained reference-free objective metric.<n>SAJ supports three audio domains (speech, music and general sound events) and three prompt inputs (text, visual and span), covering four different dimensions of evaluation.
- Score: 52.468945848774844
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The performance evaluation remains a complex challenge in audio separation, and existing evaluation metrics are often misaligned with human perception, course-grained, relying on ground truth signals. On the other hand, subjective listening tests remain the gold standard for real-world evaluation, but they are expensive, time-consuming, and difficult to scale. This paper addresses the growing need for automated systems capable of evaluating audio separation without human intervention. The proposed evaluation metric, SAM Audio Judge (SAJ), is a multimodal fine-grained reference-free objective metric, which shows highly alignment with human perceptions. SAJ supports three audio domains (speech, music and general sound events) and three prompt inputs (text, visual and span), covering four different dimensions of evaluation (recall, percision, faithfulness, and overall). SAM Audio Judge also shows potential applications in data filtering, pseudo-labeling large datasets and reranking in audio separation models. We release our code and pre-trained models at: https://github.com/facebookresearch/sam-audio.
Related papers
- SAM Audio: Segment Anything in Audio [55.50609519820557]
General audio source separation is a key capability for multimodal AI systems.<n>We present SAM Audio, a foundation model for general audio separation.<n>It unifies text, visual, and temporal span prompting within a single framework.
arXiv Detail & Related papers (2025-12-19T22:14:23Z) - JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation [16.067014259345743]
We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset.<n>Even the best-performing Omni-LLM achieves an average accuracy of only 62.6%, outperforming uni-modal baselines.
arXiv Detail & Related papers (2025-12-14T17:23:21Z) - AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
multimodal audio-language models (ALMs) take interleaved audio and text as input and output text.<n>AHELM is a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench.<n>We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z) - AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation [55.607230723223346]
This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges.<n>We explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking.<n>We introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on
arXiv Detail & Related papers (2025-07-17T00:39:18Z) - Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound [46.7144966835279]
This paper addresses the need for automated systems capable of predicting audio aesthetics without human intervention.<n>We propose new annotation guidelines that decompose human listening perspectives into four distinct axes.<n>We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality.
arXiv Detail & Related papers (2025-02-07T18:15:57Z) - Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
arXiv Detail & Related papers (2024-09-27T12:06:53Z) - PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores [18.26082503192707]
We develop a PEAVS (Perceptual Evaluation of Audio-Visual Synchrony) score, a novel automatic metric with a 5-point scale that evaluates the quality of audio-visual synchronization.
In our experiments, we observe a relative gain 50% over a natural extension of Fr'eche't based metrics for Audio-Visual synchrony.
arXiv Detail & Related papers (2024-04-10T20:32:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.