SEE: Signal Embedding Energy for Quantifying Noise Interference in Large Audio Language Models
- URL: http://arxiv.org/abs/2601.07331v1
- Date: Mon, 12 Jan 2026 08:57:55 GMT
- Title: SEE: Signal Embedding Energy for Quantifying Noise Interference in Large Audio Language Models
- Authors: Yuanhe Zhang, Jiayu Tian, Yibo Zhang, Shilinlu Yan, Liang Lin, Zhenhong Zhou, Li Sun, Sen Su
- Abstract summary: Signal Embedding Energy (SEE) is a method for quantifying the impact of noise intensity on LALM inputs. SEE exhibits a strong correlation with LALM performance, achieving a correlation of 0.98. This paper introduces a novel metric for noise quantification in LALMs, providing guidance for robustness improvements in real-world deployments.
- Score: 49.313324100819955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Audio Language Models (LALMs) have been widely applied in real-time scenarios, such as in-car assistants and online meeting comprehension. In practice, audio inputs are often corrupted by device and environmental noise, leading to performance degradation. However, existing LALM studies on noise lack quantitative analysis and rely mainly on intuition and empirical observation, thus failing to understand practical robustness. To address this issue, we introduce Signal Embedding Energy (SEE), a method for quantifying the impact of noise intensity on LALM inputs, enabling the differentiation of LALM robustness in real-world deployments. SEE introduces a perspective based on structured activation subspaces derived from the model's internal representations, which more accurately captures its perception of noise than raw audio features. Across experiments, SEE exhibits a strong correlation with LALM performance, achieving a correlation of 0.98. Surprisingly, traditional audio denoising methods are only marginally effective for LALMs, and, in some cases, even increase SEE and impair performance. This suggests a mismatch between speech-centric denoising objectives and the noise sensitivity of modern LALMs. Therefore, we propose a mitigation strategy derived from SEE to denoise LALM inputs, outperforming existing denoising methods. This paper introduces a novel metric for noise quantification in LALMs, providing guidance for robustness improvements in real-world deployments.
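The abstract does not give SEE's formula, but its description (energy of the model's internal activations measured in a structured activation subspace) can be illustrated with a minimal sketch. The function below is a hypothetical SEE-style metric, not the paper's actual method: it extracts a low-rank subspace from per-frame embeddings via SVD and reports the fraction of activation energy that subspace captures. The subspace dimension `k` and the normalization are assumptions for illustration.

```python
import numpy as np

def signal_embedding_energy(embeddings: np.ndarray, k: int = 8) -> float:
    """Hypothetical SEE-style metric: fraction of activation energy
    captured by a low-rank 'signal' subspace.

    embeddings: (T, D) matrix of per-frame model activations.
    k: assumed dimension of the structured activation subspace.
    """
    # Center the activations before extracting the subspace.
    X = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Top-k right singular vectors span the dominant activation directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:k]                       # (k, D)
    # Energy of the projection onto the subspace, normalized by total energy.
    projected = X @ basis.T              # (T, k)
    return float(np.sum(projected ** 2) / np.sum(X ** 2))
```

Under this reading, heavier noise would spread energy outside the dominant subspace and lower the score; the paper's mitigation strategy presumably operates on a related quantity.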
Related papers
- Real-Time Speech Enhancement via a Hybrid ViT: A Dual-Input Acoustic-Image Feature Fusion [1.376408511310322]
Speech quality and intelligibility are significantly degraded in noisy environments. This paper presents a novel transformer-based learning framework to address the single-channel noise suppression problem.
arXiv Detail & Related papers (2025-11-14T19:27:42Z) - DOA Estimation with Lightweight Network on LLM-Aided Simulated Acoustic Scenes [46.0445214387366]
Direction-of-Arrival (DOA) estimation is critical in spatial audio and acoustic signal processing. We propose LightDOA, a lightweight DOA estimation model based on depthwise separable convolutions. Experimental results show that LightDOA achieves satisfactory accuracy and robustness across various acoustic scenes.
arXiv Detail & Related papers (2025-11-11T09:15:06Z) - On the Ability of LLMs to Handle Character-Level Perturbations: How Well and How? [22.185338324021117]
This work investigates the resilience of contemporary LLMs against frequent and structured character-level perturbations. We introduce UCC-Inj, a practical method that inserts invisible Unicode control characters into text to discourage LLM misuse. Despite strong obfuscation that fragments tokenization and reduces the signal-to-noise ratio significantly, many LLMs still maintain notable performance.
arXiv Detail & Related papers (2025-10-16T06:59:58Z) - Harnessing LLM for Noise-Robust Cognitive Diagnosis in Web-Based Intelligent Education Systems [12.91124422916318]
Large Language Models (LLMs) for cognitive diagnosis struggle with structured data and are prone to noise-induced misjudgments. We propose a diffusion-based LLM framework for noise-robust cognitive diagnosis. Our results show that our framework achieves optimal predictive performance across varying noise levels.
arXiv Detail & Related papers (2025-10-05T08:32:30Z) - Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. These models often hallucinate non-existent sound events, reducing their reliability in real-world applications. We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds.
arXiv Detail & Related papers (2025-05-20T15:44:01Z) - Disentangled Noisy Correspondence Learning [56.06801962154915]
Cross-modal retrieval is crucial in understanding latent correspondences across modalities.
DisNCL is a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning.
arXiv Detail & Related papers (2024-08-10T09:49:55Z) - Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z) - Advancing Unsupervised Low-light Image Enhancement: Noise Estimation, Illumination Interpolation, and Self-Regulation [55.07472635587852]
Low-Light Image Enhancement (LLIE) techniques have made notable advancements in preserving image details and enhancing contrast.
These approaches encounter persistent challenges in efficiently mitigating dynamic noise and accommodating diverse low-light scenarios.
We first propose a method for estimating the noise level in low light images in a quick and accurate way.
We then devise a Learnable Illumination Interpolator (LII) to satisfy general constraints between illumination and input.
arXiv Detail & Related papers (2023-05-17T13:56:48Z) - Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)