Related papers: Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition

Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition

URL: http://arxiv.org/abs/2409.12386v1
Date: Thu, 19 Sep 2024 01:02:31 GMT
Title: Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition
Authors: Chien-Chun Wang, Li-Wei Chen, Cheng-Kang Chou, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang,
Abstract summary: We propose a channel-aware data simulation method for robust automatic speech recognition training. Our method harnesses the synergistic power of channel-extractive techniques and generative adversarial networks (GANs) We evaluate our method on the challenging Hakka Across Taiwan (HAT) and Taiwanese Across Taiwan (TAT) corpora, achieving relative character error rate (CER) reductions of 20.02% and 9.64%, respectively.
Score: 23.9811164130045
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While pre-trained automatic speech recognition (ASR) systems demonstrate impressive performance on matched domains, their performance often degrades when confronted with channel mismatch stemming from unseen recording environments and conditions. To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training. Our method harnesses the synergistic power of channel-extractive techniques and generative adversarial networks (GANs). We first train a channel encoder capable of extracting embeddings from arbitrary audio. On top of this, channel embeddings are extracted using a minimal amount of target-domain data and used to guide a GAN-based speech synthesizer. This synthesizer generates speech that faithfully preserves the phonetic content of the input while mimicking the channel characteristics of the target domain. We evaluate our method on the challenging Hakka Across Taiwan (HAT) and Taiwanese Across Taiwan (TAT) corpora, achieving relative character error rate (CER) reductions of 20.02% and 9.64%, respectively, compared to the baselines. These results highlight the efficacy of our channel-aware data simulation method for bridging the gap between source- and target-domain acoustics.

Related papers

Discrete-Space Generative AI Pipeline for Semantic Transmission of Signals [0.2291770711277359]
Discernment is a semantic communication system that transmits the meaning of physical signals (baseband radio and audio) over a technical channel using GenAI models.<n>We show that Discernment maintains semantic integrity even as channel capacity severely degrades.
arXiv Detail & Related papers (2026-02-14T02:11:46Z)
Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement [24.109107195976346]
URSA-GAN is a domain-aware generative framework designed to mitigate mismatches in both noise and channel conditions.<n>We show that URSA-GAN effectively reduces character error rates in ASR and improves metrics in SE across diverse noisy and mismatched channel scenarios.
arXiv Detail & Related papers (2026-02-04T08:16:22Z)
Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation [25.410770364140856]
Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain. This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs) We introduce the notion of dynamic perturbation, which can inject controlled perturbations into the noise embeddings during inference.
arXiv Detail & Related papers (2024-09-03T02:29:01Z)
Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture [11.063156506583562]
We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy. We adopt a state-of-the-art transformer model, the HTS-AT, to handle multi-channel audio inputs. Our multi-microphone model achieves superior performance compared to single-channel baselines when tested on real-world reverberant environments.
arXiv Detail & Related papers (2024-06-05T13:50:59Z)
Histogram Layer Time Delay Neural Networks for Passive Sonar Classification [58.720142291102135]
A novel method combines a time delay neural network and histogram layer to incorporate statistical contexts for improved feature learning and underwater acoustic target classification. The proposed method outperforms the baseline model, demonstrating the utility in incorporating statistical contexts for passive sonar target recognition.
arXiv Detail & Related papers (2023-07-25T19:47:26Z)
Joint Channel Estimation and Feedback with Masked Token Transformers in Massive MIMO Systems [74.52117784544758]
This paper proposes an encoder-decoder based network that unveils the intrinsic frequency-domain correlation within the CSI matrix. The entire encoder-decoder network is utilized for channel compression. Our method outperforms state-of-the-art channel estimation and feedback techniques in joint tasks.
arXiv Detail & Related papers (2023-06-08T06:15:17Z)
Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification ( CTC)/attention neural network architecture. The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in an offline and online setup.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
Adaptive re-calibration of channel-wise features for Adversarial Audio Classification [0.0]
We propose a recalibration of features using attention feature fusion for synthetic speech detection. We compare its performance against different detection methods including End2End models and Resnet-based models. We also demonstrate that the combination of Linear frequency cepstral coefficients (LFCC) and Mel Frequency cepstral coefficients (MFCC) using the attentional feature fusion technique creates better input features representations.
arXiv Detail & Related papers (2022-10-21T04:21:56Z)
Blackbox Untargeted Adversarial Testing of Automatic Speech Recognition Systems [1.599072005190786]
Speech recognition systems are prevalent in applications for voice navigation and voice control of domestic appliances. Deep neural networks (DNNs) have been shown to be susceptible to adversarial perturbations. To help test the correctness of ASRS, we propose techniques that automatically generate blackbox.
arXiv Detail & Related papers (2021-12-03T10:21:47Z)
Time-domain Speech Enhancement with Generative Adversarial Learning [53.74228907273269]
This paper proposes a new framework called Time-domain Speech Enhancement Generative Adversarial Network (TSEGAN) TSEGAN is an extension of the generative adversarial network (GAN) in time-domain with metric evaluation to mitigate the scaling problem. In addition, we provide a new method based on objective function mapping for the theoretical analysis of the performance of Metric GAN.
arXiv Detail & Related papers (2021-03-30T08:09:49Z)
Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end. Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the lipreading sentence 2 dataset respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
Data-Driven Symbol Detection via Model-Based Machine Learning [117.58188185409904]
We review a data-driven framework to symbol detection design which combines machine learning (ML) and model-based algorithms. In this hybrid approach, well-known channel-model-based algorithms are augmented with ML-based algorithms to remove their channel-model-dependence. Our results demonstrate that these techniques can yield near-optimal performance of model-based algorithms without knowing the exact channel input-output statistical relationship.
arXiv Detail & Related papers (2020-02-14T06:58:27Z)
Robust Multi-channel Speech Recognition using Frequency Aligned Network [23.397670239950187]
We use frequency aligned network for robust automatic speech recognition. We show that our multi-channel acoustic model with a frequency aligned network shows up to 18% relative reduction in word error rate.
arXiv Detail & Related papers (2020-02-06T21:47:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.