Toward Cross-Domain Speech Recognition with End-to-End Models
- URL: http://arxiv.org/abs/2003.04194v1
- Date: Mon, 9 Mar 2020 15:19:53 GMT
- Title: Toward Cross-Domain Speech Recognition with End-to-End Models
- Authors: Thai-Son Nguyen, Sebastian Stüker, Alex Waibel
- Abstract summary: In this paper, we empirically examine the difference in behavior between hybrid acoustic models and neural end-to-end systems.
We show that for the hybrid models, supplying additional training data from other domains with mismatched acoustic conditions does not increase the performance on specific domains.
Our end-to-end models, optimized with a sequence-based criterion, generalize better than the hybrid models on diverse domains.
- Score: 18.637636841477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the area of multi-domain speech recognition, research in the past focused
on hybrid acoustic models to build cross-domain and domain-invariant speech
recognition systems. In this paper, we empirically examine the difference in
behavior between hybrid acoustic models and neural end-to-end systems when
mixing acoustic training data from several domains. For these experiments we
composed a multi-domain dataset from public sources, with the different domains
in the corpus covering a wide variety of topics and acoustic conditions such as
telephone conversations, lectures, read speech and broadcast news. We show that
for the hybrid models, supplying additional training data from other domains
with mismatched acoustic conditions does not increase the performance on
specific domains. However, our end-to-end models optimized with a
sequence-based criterion generalize better than the hybrid models on diverse
domains. In terms of word-error-rate performance, our experimental
acoustic-to-word and attention-based models trained on the multi-domain dataset
reach the performance of domain-specific long short-term memory (LSTM) hybrid
models, thus resulting in multi-domain speech recognition systems that do not
suffer in performance compared to domain-specific ones. Moreover, the use of
neural end-to-end models eliminates the need for domain-adapted language models
during recognition, which is a great advantage when the input domain is
unknown.
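The abstract's comparison is stated in word error rate (WER), i.e. the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. A minimal sketch of that metric (an illustrative helper, not the authors' evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words
    (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words:
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 = 0.333...
```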
Related papers
- Investigating the potential of Sparse Mixtures-of-Experts for multi-domain neural machine translation [59.41178047749177]
We focus on multi-domain Neural Machine Translation, with the goal of developing efficient models which can handle data from various domains seen during training and are robust to domains unseen during training.
We hypothesize that Sparse Mixture-of-Experts (SMoE) models are a good fit for this task, as they enable efficient model scaling.
We conduct a series of experiments aimed at validating the utility of SMoE for the multi-domain scenario, and find that a straightforward width scaling of Transformer is a simpler and surprisingly more efficient approach in practice, and reaches the same performance level as SMoE.
arXiv Detail & Related papers (2024-07-01T09:45:22Z)
- Benchmarking Cross-Domain Audio-Visual Deception Detection [45.342156006617394]
We present the first cross-domain audio-visual deception detection benchmark.
We compare single-to-single and multi-to-single domain generalization performance.
We propose an algorithm to enhance the generalization performance.
arXiv Detail & Related papers (2024-05-11T12:06:31Z)
- Domain Private Transformers for Multi-Domain Dialog Systems [2.7013801448234367]
This paper proposes domain privacy as a novel way to quantify how likely a conditional language model will leak across domains.
Experiments on membership inference attacks show that our proposed method has comparable resiliency to methods adapted from recent literature on differentially private language models.
arXiv Detail & Related papers (2023-05-23T16:27:12Z)
- Multi-source Domain Adaptation for Text-independent Forensic Speaker Recognition [36.83842373791537]
Adapting speaker recognition systems to new environments is a widely used technique to improve a well-performing model.
Previous studies focus on single domain adaptation, which neglects a more practical scenario where training data are collected from multiple acoustic domains.
Three novel adaptation methods are proposed to further promote adaptation performance across multiple acoustic domains.
arXiv Detail & Related papers (2022-11-17T22:11:25Z)
- Cross-domain Voice Activity Detection with Self-Supervised Representations [9.02236667251654]
Voice Activity Detection (VAD) aims at detecting speech segments on an audio signal.
Current state-of-the-art methods focus on training a neural network exploiting features directly contained in the acoustics.
We show that representations based on Self-Supervised Learning (SSL) can adapt well to different domains.
arXiv Detail & Related papers (2022-09-22T14:53:44Z)
- PILOT: Introducing Transformers for Probabilistic Sound Event Localization [107.78964411642401]
This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
arXiv Detail & Related papers (2021-06-07T18:29:19Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- DEAAN: Disentangled Embedding and Adversarial Adaptation Network for Robust Speaker Representation Learning [69.70594547377283]
We propose a novel framework to disentangle speaker-related and domain-specific features.
Our framework can effectively generate more speaker-discriminative and domain-invariant speaker representations.
arXiv Detail & Related papers (2020-12-12T19:46:56Z)
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
- Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
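The last entry's core idea, that sentence representations cluster by domain, can be illustrated with a toy nearest-centroid assignment. The 2-D vectors and domain names below are hand-made stand-ins for model embeddings, not the paper's actual method or data:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_domains(embeddings, domain_examples):
    """Assign each embedding to its nearest domain centroid.

    domain_examples: dict mapping domain name -> list of example vectors.
    Returns one domain name per input embedding.
    """
    centroids = {name: centroid(vecs) for name, vecs in domain_examples.items()}
    return [min(centroids, key=lambda name: squared_distance(e, centroids[name]))
            for e in embeddings]

# Toy "embeddings": news-like vectors near (1, 0), medical-like near (0, 1).
domains = {
    "news":    [[0.9, 0.1], [1.1, 0.0]],
    "medical": [[0.1, 0.9], [0.0, 1.1]],
}
print(assign_domains([[1.0, 0.2], [0.1, 1.0]], domains))  # ['news', 'medical']
```

In the paper's setting the example vectors would come from a pretrained language model rather than being hand-crafted, and the assignment would drive data selection for training.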
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.