On Training Targets and Activation Functions for Deep Representation
Learning in Text-Dependent Speaker Verification
- URL: http://arxiv.org/abs/2201.06426v1
- Date: Mon, 17 Jan 2022 14:32:51 GMT
- Title: On Training Targets and Activation Functions for Deep Representation
Learning in Text-Dependent Speaker Verification
- Authors: Achintya kr. Sarkar, Zheng-Hua Tan
- Abstract summary: Key considerations include training targets, activation functions, and loss functions.
We study a range of loss functions when speaker identity is used as the training target.
We experimentally show that GELU is able to reduce the error rates of TD-SV significantly compared to sigmoid.
- Score: 18.19207291891767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep representation learning has gained significant momentum in advancing
text-dependent speaker verification (TD-SV) systems. When designing deep neural
networks (DNN) for extracting bottleneck features, key considerations include
training targets, activation functions, and loss functions. In this paper, we
systematically study the impact of these choices on the performance of TD-SV.
For training targets, we consider speaker identity, time-contrastive learning
(TCL) and auto-regressive prediction coding with the first being supervised and
the last two being self-supervised. Furthermore, we study a range of loss
functions when speaker identity is used as the training target. With regard to
activation functions, we study the widely used sigmoid function, rectified
linear unit (ReLU), and Gaussian error linear unit (GELU). We experimentally
show that GELU is able to reduce the error rates of TD-SV significantly
compared to sigmoid, irrespective of training target. Among the three training
targets, TCL performs the best. Among the various loss functions, cross
entropy, joint-softmax and focal loss functions outperform the others. Finally,
score-level fusion of different systems is also able to reduce the error rates.
Experiments are conducted on the RedDots 2016 challenge database for TD-SV
using short utterances. For speaker classification, the well-known
Gaussian mixture model-universal background model (GMM-UBM) and i-vector
techniques are used.
Related papers
- Automatic debiasing of neural networks via moment-constrained learning [0.0]
Naively learning the regression function and taking a sample mean of the target functional results in biased estimators.
We propose moment-constrained learning as a new RR learning approach that addresses some shortcomings in automatic debiasing.
arXiv Detail & Related papers (2024-09-29T20:56:54Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- Disposable Transfer Learning for Selective Source Task Unlearning [31.020636963762836]
Transfer learning is widely used for training deep neural networks (DNN) for building a powerful representation.
Disposable transfer learning (DTL) disposes of only the source task without degrading the performance of the target task.
We show that the GC loss is an effective approach to the DTL problem: a model trained with the GC loss retains performance on the target task while significantly reducing PL accuracy.
arXiv Detail & Related papers (2023-08-19T10:13:17Z)
- Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
- Class Anchor Margin Loss for Content-Based Image Retrieval [97.81742911657497]
We propose a novel repeller-attractor loss that falls within the metric learning paradigm, yet directly optimizes the L2 metric without the need to generate pairs.
We evaluate the proposed objective in the context of few-shot and full-set training on the CBIR task, by using both convolutional and transformer architectures.
arXiv Detail & Related papers (2023-06-01T12:53:10Z)
- Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection [58.789823426981044]
We propose a novel auxiliary loss formulation that aims to align the class confidence of bounding boxes with the accuracy of the predictions.
Our results reveal that our train-time loss surpasses strong calibration baselines in reducing calibration error for both in and out-domain scenarios.
arXiv Detail & Related papers (2023-03-25T08:56:21Z)
- Learning Bayesian Sparse Networks with Full Experience Replay for Continual Learning [54.7584721943286]
Continual Learning (CL) methods aim to enable machine learning models to learn new tasks without catastrophic forgetting of those that have been previously mastered.
Existing CL approaches often keep a buffer of previously-seen samples, perform knowledge distillation, or use regularization techniques towards this goal.
We propose to only activate and select sparse neurons for learning current and past tasks at any stage.
arXiv Detail & Related papers (2022-02-21T13:25:03Z)
- Deep F-measure Maximization for End-to-End Speech Understanding [52.36496114728355]
We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation.
We perform experiments on two standard fairness datasets (Adult, and Communities and Crime), as well as on speech-to-intent detection on the ATIS dataset and speech-to-image concept classification on the Speech-COCO dataset.
In all four tasks, the F-measure objective improves micro-F1 scores by up to 8% absolute compared to models trained with the cross-entropy loss function; a minimal differentiable soft-F1 sketch is given after this list.
arXiv Detail & Related papers (2020-08-08T03:02:27Z)