Modality-Specific Speech Enhancement and Noise-Adaptive Fusion for Acoustic and Body-Conduction Microphone Framework
- URL: http://arxiv.org/abs/2508.17336v2
- Date: Thu, 28 Aug 2025 02:26:32 GMT
- Title: Modality-Specific Speech Enhancement and Noise-Adaptive Fusion for Acoustic and Body-Conduction Microphone Framework
- Authors: Yunsik Kim, Yoonyoung Chung
- Abstract summary: Body-conduction microphone signals (BMS) bypass airborne sound, providing strong noise resistance.
We propose a novel multi-modal framework that combines BMS and acoustic microphone signals (AMS) to achieve both noise suppression and high-frequency reconstruction.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Body-conduction microphone signals (BMS) bypass airborne sound, providing strong noise resistance. However, a complementary modality is required to compensate for the inherent loss of high-frequency information. In this study, we propose a novel multi-modal framework that combines BMS and acoustic microphone signals (AMS) to achieve both noise suppression and high-frequency reconstruction. Unlike conventional multi-modal approaches that simply merge features, our method employs two specialized networks: a mapping-based model to enhance BMS and a masking-based model to denoise AMS. These networks are integrated through a dynamic fusion mechanism that adapts to local noise conditions, ensuring the optimal use of each modality's strengths. We performed evaluations on the TAPS dataset, augmented with DNS-2023 noise clips, using objective speech quality metrics. The results clearly demonstrate that our approach outperforms single-modal solutions in a wide range of noisy environments.
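The abstract describes a two-branch architecture with noise-adaptive fusion. Below is a minimal PyTorch sketch of that structure, assuming magnitude-spectrogram inputs, simple MLP/GRU branches, and a learned sigmoid gate; all module names, layer sizes, and the gating form are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class MappingNet(nn.Module):
    """Mapping-based BMS enhancer: regresses enhanced spectra directly,
    which suits reconstructing high frequencies absent from the input."""

    def __init__(self, dim: int = 257, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, bms_spec):               # (batch, frames, dim)
        return self.net(bms_spec)


class MaskingNet(nn.Module):
    """Masking-based AMS denoiser: predicts a [0, 1] time-frequency mask."""

    def __init__(self, dim: int = 257, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, dim)

    def forward(self, ams_spec):               # (batch, frames, dim)
        h, _ = self.rnn(ams_spec)
        return torch.sigmoid(self.out(h)) * ams_spec


class NoiseAdaptiveFusion(nn.Module):
    """Per-frame, per-bin gate that weights the two enhanced branches."""

    def __init__(self, dim: int = 257):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, bms_out, ams_out):
        w = torch.sigmoid(self.gate(torch.cat([bms_out, ams_out], dim=-1)))
        return w * bms_out + (1.0 - w) * ams_out


if __name__ == "__main__":
    bms = torch.randn(1, 100, 257)             # body-conduction spectrogram
    ams = torch.randn(1, 100, 257)             # acoustic spectrogram
    fused = NoiseAdaptiveFusion()(MappingNet()(bms), MaskingNet()(ams))
    print(fused.shape)                         # torch.Size([1, 100, 257])
```

The gate stands in for the dynamic fusion mechanism: per time-frequency weights let the model lean on the BMS branch where the acoustic channel is noisy and on the AMS branch where it is clean.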
Related papers
- Learning to Separate RF Signals Under Uncertainty: Detect-Then-Separate vs. Unified Joint Models [53.79667447811139]
We show that a single deep neural architecture learns to jointly detect and separate when applied directly to the received signal.
These findings highlight UJM as a scalable and practical alternative to DTS, while opening new directions for unified separation under broader estimation uncertainty.
arXiv Detail & Related papers (2026-02-04T15:25:02Z)
- Test-time Adaptive Hierarchical Co-enhanced Denoising Network for Reliable Multimodal Classification [55.56234913868664]
We propose the Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD) for reliable learning on multimodal data.
The proposed method achieves superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.
arXiv Detail & Related papers (2026-01-12T03:14:12Z)
- Explainable Transformer-CNN Fusion for Noise-Robust Speech Emotion Recognition [2.0391237204597363]
Speech Emotion Recognition systems often degrade in performance when exposed to unpredictable acoustic interference.
We propose a Hybrid Transformer-CNN framework that unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of 1D-Convolutional Neural Networks.
arXiv Detail & Related papers (2025-12-20T10:05:58Z)
- Real Noise Decoupling for Hyperspectral Image Denoising [14.247569090609828]
Hyperspectral image (HSI) denoising is a crucial step in enhancing the quality of HSIs.
Noise modeling methods can fit noise distributions to generate synthetic HSIs to train denoising networks.
We propose a multi-stage noise-decoupling framework that decomposes complex noise into explicitly modeled and implicitly modeled components.
arXiv Detail & Related papers (2025-11-21T12:23:07Z)
- Towards Robust Image Denoising with Scale Equivariance [10.894808298340994]
We argue that incorporating scale-equivariant structures enables models to better adapt from training on spatially uniform noise to inference on spatially non-uniform degradations.
We propose a robust blind denoising framework equipped with two key components: a Heterogeneous Normalization Module (HNM) and an Interactive Gating Module (IGM).
arXiv Detail & Related papers (2025-08-05T00:06:28Z)
- Backscatter Device-aided Integrated Sensing and Communication: A Pareto Optimization Framework [59.30060797118097]
Integrated sensing and communication (ISAC) systems potentially encounter significant performance degradation in densely obstructed urban non-line-of-sight scenarios.
This paper proposes a backscatter device (BD)-assisted ISAC system, which leverages passive BDs naturally distributed in the environment for performance enhancement.
arXiv Detail & Related papers (2025-07-12T17:11:06Z)
- $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR).
MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.
To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
- FM2S: Towards Spatially-Correlated Noise Modeling in Zero-Shot Fluorescence Microscopy Image Denoising [33.383511185170214]
Fluorescence Micrograph to Self (FM2S) is a zero-shot denoiser that achieves efficient denoising through three key innovations.
Experiments across FMI datasets demonstrate FM2S's superiority: it outperforms CVF-SID by 1.4 dB PSNR on average while requiring 0.1% of AP-BSN's parameters.
arXiv Detail & Related papers (2024-12-13T10:45:25Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
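The dropout idea in the entry above is simple to reproduce. Here is a hypothetical sketch of video-modality dropout for AVSR training; the dropout rate and feature shapes are assumptions, and the MDA-KD framework itself is not shown.

```python
import torch


def drop_video_modality(video_feats: torch.Tensor, p: float = 0.3,
                        training: bool = True) -> torch.Tensor:
    """Zero the whole video stream for a random subset of utterances.

    Simulates missing video so an AVSR model cannot over-rely on it.
    p = 0.3 is a hypothetical rate, not a value from the paper.
    """
    if not training or p <= 0.0:
        return video_feats
    batch = video_feats.shape[0]
    keep = (torch.rand(batch, device=video_feats.device) > p).float()
    # Broadcast the per-utterance keep flag over frames/channels.
    return video_feats * keep.view(batch, *([1] * (video_feats.dim() - 1)))
```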
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Realistic Noise Synthesis with Diffusion Models [44.404059914652194]
Deep denoising models require extensive real-world training data, which is challenging to acquire.
We propose a novel Realistic Noise Synthesis Diffusor (RNSD) method using diffusion models to address these challenges.
arXiv Detail & Related papers (2023-05-23T12:56:01Z)
- Improve Noise Tolerance of Robust Loss via Noise-Awareness [60.34670515595074]
We propose a meta-learning method capable of adaptively learning a hyperparameter prediction function, called Noise-Aware-Robust-Loss-Adjuster (NARL-Adjuster for brevity).
We integrate four SOTA robust loss functions with our algorithm, and comprehensive experiments substantiate the general applicability and effectiveness of the proposed method in terms of both noise tolerance and performance.
arXiv Detail & Related papers (2023-01-18T04:54:58Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
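A hypothetical sketch of the frame-gating idea the CISE entry describes: a noise detector scores each frame, and two separate mask-based enhancement modules handle the noisy and clean frame sets. The module internals and the soft-routing choice are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn


class FrameGatedSE(nn.Module):
    """Route each frame between two enhancement modules by detected noise."""

    def __init__(self, dim: int = 257, hidden: int = 128):
        super().__init__()
        self.detector = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1))  # noisiness logit
        self.em_noisy = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.em_clean = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, spec: torch.Tensor) -> torch.Tensor:  # (batch, frames, dim)
        is_noisy = torch.sigmoid(self.detector(spec))        # (batch, frames, 1)
        masked_noisy = self.em_noisy(spec) * spec
        masked_clean = self.em_clean(spec) * spec
        # Soft routing keeps the detector differentiable end to end.
        return is_noisy * masked_noisy + (1 - is_noisy) * masked_clean
```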
- Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking [18.225204270240734]
We propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities.
MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively.
arXiv Detail & Related papers (2021-12-14T14:14:17Z)
- Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks [14.942060304734497]
Spatial clustering techniques can achieve significant multi-channel noise reduction across relatively arbitrary microphone configurations.
LSTM neural networks have successfully been trained to recognize speech from noise on single-channel inputs, but have difficulty taking full advantage of the information in multi-channel recordings.
This paper integrates these two approaches, training LSTM speech models to clean the masks generated by the Model-based EM Source Separation and Localization (MESSL) spatial clustering method.
arXiv Detail & Related papers (2020-12-02T22:35:00Z)
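To make the last idea concrete, here is a hypothetical sketch of an LSTM that refines a noisy time-frequency mask produced by spatial clustering. The layer sizes and the pairing of mask plus log-spectrogram inputs are assumptions, and MESSL itself is not implemented here.

```python
import torch
import torch.nn as nn


class MaskCleaner(nn.Module):
    """Refine a noisy T-F mask (e.g., from spatial clustering) with an LSTM."""

    def __init__(self, freq_bins: int = 257, hidden: int = 256):
        super().__init__()
        # Input: raw mask concatenated with the log-magnitude spectrogram,
        # so the LSTM can use spectral context to fix clustering errors.
        self.lstm = nn.LSTM(2 * freq_bins, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, freq_bins)

    def forward(self, raw_mask, log_spec):     # both (batch, frames, freq_bins)
        x = torch.cat([raw_mask, log_spec], dim=-1)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h))      # cleaned mask in [0, 1]
```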
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.