Augmenting Transformer-Transducer Based Speaker Change Detection With
Token-Level Training Loss
- URL: http://arxiv.org/abs/2211.06482v1
- Date: Fri, 11 Nov 2022 21:09:58 GMT
- Title: Augmenting Transformer-Transducer Based Speaker Change Detection With
Token-Level Training Loss
- Authors: Guanlong Zhao, Quan Wang, Han Lu, Yiling Huang, Ignacio Lopez Moreno
- Abstract summary: We propose a novel token-based training strategy that improves Transformer-Transducer (T-T) based speaker change detection (SCD) performance.
Due to the sparsity of the speaker changes in the training data, the conventional T-T based SCD model loss leads to sub-optimal detection accuracy.
- Score: 15.304831835680847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work we propose a novel token-based training strategy that improves
Transformer-Transducer (T-T) based speaker change detection (SCD) performance.
The conventional T-T based SCD model loss optimizes all output tokens equally.
Due to the sparsity of the speaker changes in the training data, the
conventional T-T based SCD model loss leads to sub-optimal detection accuracy.
To mitigate this issue, we use a customized edit-distance algorithm to estimate
the token-level SCD false accept (FA) and false reject (FR) rates during
training and optimize model parameters to minimize a weighted combination of
the FA and FR, focusing the model on accurately predicting speaker changes. We
also propose a set of evaluation metrics that align better with commercial use
cases. Experiments on a group of challenging real-world datasets show that the
proposed training method can significantly improve the overall performance of
the SCD model with the same number of parameters.
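For illustration, the sketch below shows one way the token-level FA/FR estimation described in the abstract could work: a reference and a hypothesis token sequence are aligned with an edit-distance backtrace, speaker-change tokens that appear on only one side of the alignment are counted as false accepts or false rejects, and the two counts are combined with tunable weights. The "<st>" speaker-change token, the function names, and the use of a plain Levenshtein alignment (rather than the paper's customized edit-distance algorithm) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions noted above): count token-level speaker-change
# false accepts (FA) and false rejects (FR) from an edit-distance alignment.

SPEAKER_CHANGE = "<st>"  # assumed special token marking a speaker change


def align(ref, hyp):
    """Plain Levenshtein alignment; returns (ref_token, hyp_token) pairs,
    with None marking insertions/deletions."""
    n, m = len(ref), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    # Backtrace to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (
                0 if ref[i - 1] == hyp[j - 1] else 1):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))  # deletion
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # insertion
            j -= 1
    return list(reversed(pairs))


def scd_fa_fr(ref, hyp):
    """Token-level speaker-change false accepts and false rejects."""
    fa = fr = 0
    for r, h in align(ref, hyp):
        if h == SPEAKER_CHANGE and r != SPEAKER_CHANGE:
            fa += 1  # hypothesis has a change the reference does not
        elif r == SPEAKER_CHANGE and h != SPEAKER_CHANGE:
            fr += 1  # reference change missed by the hypothesis
    return fa, fr


def weighted_scd_cost(ref, hyp, fa_weight=1.0, fr_weight=1.0):
    """Weighted FA/FR combination, analogous in spirit to the objective
    described in the abstract (weights are illustrative)."""
    fa, fr = scd_fa_fr(ref, hyp)
    return fa_weight * fa + fr_weight * fr


if __name__ == "__main__":
    ref = ["hi", SPEAKER_CHANGE, "how", "are", "you"]
    hyp = ["hi", "how", "are", SPEAKER_CHANGE, "you"]
    print(scd_fa_fr(ref, hyp))           # FA/FR counts for this alignment
    print(weighted_scd_cost(ref, hyp))   # their weighted combination
```

In the actual method, such FA/FR estimates would have to feed a differentiable training loss; the counting above only illustrates the alignment-based bookkeeping.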
Related papers
- Test-time adaptation for geospatial point cloud semantic segmentation with distinct domain shifts [6.80671668491958]
Test-time adaptation (TTA) allows direct adaptation of a pre-trained model to unlabeled data during inference stage without access to source data or additional training.
We propose three domain shift paradigms: photogrammetric to airborne LiDAR, airborne to mobile LiDAR, and synthetic to mobile laser scanning.
Experimental results show our method improves classification accuracy by up to 20% mIoU, outperforming other methods.
arXiv Detail & Related papers (2024-07-08T15:40:28Z) - Challenging Gradient Boosted Decision Trees with Tabular Transformers for Fraud Detection at Booking.com [1.6702285371066043]
Transformer-based neural networks, empowered by Self-Supervised Learning (SSL), have demonstrated unprecedented performance across various domains.
In this paper, we aim to challenge GBDTs with tabular Transformers on a typical task faced in e-commerce, namely fraud detection.
Our methodology leverages the capabilities of Transformers to learn transferable representations using all available data by means of SSL.
The proposed approach outperforms heavily tuned GBDTs by a considerable margin of the Average Precision (AP) score.
arXiv Detail & Related papers (2024-05-22T14:38:48Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR [54.23941663326509]
Frequent speaker changes can make speaker change prediction difficult.
We propose boundary-aware serialized output training (BA-SOT).
Compared to original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%.
arXiv Detail & Related papers (2023-05-23T06:08:13Z) - Parameter-Efficient Learning for Text-to-Speech Accent Adaptation [58.356667204518985]
This paper presents a parameter-efficient learning (PEL) approach for developing low-resource accent adaptation in text-to-speech (TTS).
A resource-efficient adaptation from a frozen pre-trained TTS model is developed using only 1.2% to 0.8% of the original trainable parameters.
Experiment results show that the proposed methods can achieve competitive naturalness with parameter-efficient decoder fine-tuning.
arXiv Detail & Related papers (2023-05-18T22:02:59Z) - Remote Sensing Change Detection With Transformers Trained from Scratch [62.96911491252686]
Transformer-based change detection (CD) approaches either employ a model pre-trained on the large-scale ImageNet classification dataset, or rely on first pre-training on another CD dataset and then fine-tuning on the target benchmark.
We develop an end-to-end CD approach with transformers that is trained from scratch and yet achieves state-of-the-art performance on four public benchmarks.
arXiv Detail & Related papers (2023-04-13T17:57:54Z) - Fast and accurate factorized neural transducer for text adaption of
end-to-end speech recognition models [23.21666928497697]
The improved adaptation ability of the factorized neural transducer (FNT) on text-only adaptation data comes at the cost of lowered accuracy compared to the standard neural transducer model.
A combination of these approaches results in a relative word-error-rate reduction of 9.48% from the standard FNT model.
arXiv Detail & Related papers (2022-12-05T02:52:21Z) - Improving speech recognition models with small samples for air traffic
control systems [9.322392779428505]
In this work, a novel training approach based on pretraining and transfer learning is proposed to address the issue of small training samples.
Three real ATC datasets are used to validate the proposed ASR model and training strategies.
The experimental results demonstrate that the ASR performance is significantly improved on all three datasets.
arXiv Detail & Related papers (2021-02-16T08:28:52Z) - Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z) - Unsupervised neural adaptation model based on optimal transport for
spoken language identification [54.96267179988487]
Due to the mismatch of statistical distributions of acoustic speech between training and testing sets, the performance of spoken language identification (SLID) could be drastically degraded.
We propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID.
arXiv Detail & Related papers (2020-12-24T07:37:19Z)