DT-SV: A Transformer-based Time-domain Approach for Speaker Verification
- URL: http://arxiv.org/abs/2205.13249v1
- Date: Thu, 26 May 2022 09:36:26 GMT
- Title: DT-SV: A Transformer-based Time-domain Approach for Speaker Verification
- Authors: Nan Zhang, Jianzong Wang, Zhenhou Hong, Chendong Zhao, Xiaoyang Qu,
Jing Xiao
- Abstract summary: Speaker verification (SV) aims to determine whether the speaker identity of a test utterance matches that of a reference speech.
We propose an approach to derive utterance-level speaker embeddings via a Transformer architecture.
We also introduce a learnable mel-fbank energy feature extractor named the time-domain feature extractor.
- Score: 24.613926376221155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speaker verification (SV) aims to determine whether the speaker identity
of a test utterance matches that of a reference speech. In the past few years,
extracting speaker embeddings with deep neural networks has become the mainstream
approach in SV systems. Recently, various attention mechanisms and Transformer
networks have been explored widely in the SV field. However, applying the original
Transformer to SV directly may waste frame-level information in the output
features, which could restrict the capacity and discrimination of the speaker
embeddings. Therefore, we propose an approach that derives utterance-level speaker
embeddings via a Transformer architecture trained with a novel loss function,
named the diffluence loss, which integrates the feature information of different
Transformer layers. The diffluence loss aggregates frame-level features into an
utterance-level representation and can be integrated into the Transformer
conveniently. In addition, we introduce a learnable mel-fbank energy feature
extractor, named the time-domain feature extractor, which computes mel-fbank
features more precisely and efficiently than the standard mel-fbank extractor.
Combining the diffluence loss and the time-domain feature extractor, we propose
a novel Transformer-based time-domain SV model (DT-SV) with faster training and
higher accuracy. Experiments indicate that our proposed model achieves better
performance than other models.
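To make the aggregation idea concrete, below is a minimal PyTorch sketch of a diffluence-style loss. The abstract does not give the loss formula, so this is one plausible reading under an explicit assumption: the loss penalizes the spread of each layer's frame-level features around their utterance-level mean, so that pooled embeddings become compact. The function name and form are hypothetical, not the paper's published definition.

```python
# Hypothetical sketch: assumes the diffluence loss penalizes the spread of
# frame-level features around their utterance-level mean, per Transformer layer.
import torch


def diffluence_style_loss(layer_outputs: list) -> torch.Tensor:
    """layer_outputs: list of per-layer tensors shaped (batch, frames, dim)."""
    total = layer_outputs[0].new_zeros(())
    for frames in layer_outputs:
        utterance = frames.mean(dim=1, keepdim=True)    # utterance-level mean
        total = total + (frames - utterance).pow(2).mean()  # frame-to-mean spread
    return total / len(layer_outputs)                   # average over layers
```

In training, such a term would typically be added to the speaker classification objective with a weighting factor, e.g. `loss = ce_loss + lam * diffluence_style_loss(hidden_states)`.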
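Likewise, here is one plausible realization of a learnable mel-fbank front-end operating on the raw waveform: a power spectrogram followed by a trainable filterbank matrix initialized from standard mel filters. The class name and design details are assumptions for illustration; the paper's actual extractor may differ.

```python
# Hypothetical sketch of a learnable mel-fbank front-end: STFT power spectrum
# followed by a trainable filterbank initialized from standard mel filters.
import torch
import torchaudio


class LearnableMelFbank(torch.nn.Module):
    def __init__(self, sample_rate=16000, n_fft=512, n_mels=80):
        super().__init__()
        self.n_fft = n_fft
        # Initialize the learnable filterbank from the standard mel matrix.
        mel = torchaudio.functional.melscale_fbanks(
            n_freqs=n_fft // 2 + 1, f_min=0.0, f_max=sample_rate / 2,
            n_mels=n_mels, sample_rate=sample_rate)
        self.fbank = torch.nn.Parameter(mel)            # (n_freqs, n_mels)
        self.register_buffer("window", torch.hann_window(n_fft))

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        """wav: (batch, samples) -> log mel-fbank energies (batch, frames, n_mels)."""
        spec = torch.stft(wav, self.n_fft, hop_length=self.n_fft // 2,
                          window=self.window, return_complex=True)
        power = spec.abs().pow(2).transpose(1, 2)       # (batch, frames, n_freqs)
        return torch.log(power @ self.fbank + 1e-6)     # learnable mel projection
```

Because the filterbank is a plain `nn.Parameter`, it is updated by backpropagation along with the rest of the model, which is what makes such a front-end "learnable" compared with a fixed mel-fbank extractor.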
Related papers
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art performance on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z) - Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - U-shaped Transformer: Retain High Frequency Context in Time Series
Analysis [0.5710971447109949]
In this paper, we consider the low-pass characteristics of Transformers and try to incorporate their advantages.
We introduce patch merge and split operations to extract features at different scales and use larger datasets to make full use of the Transformer backbone.
Our experiments demonstrate that the model performs at an advanced level across multiple datasets with relatively low cost.
arXiv Detail & Related papers (2023-07-18T07:15:26Z) - A Differential Attention Fusion Model Based on Transformer for Time
Series Forecasting [4.666618110838523]
Time series forecasting is widely used in equipment life-cycle forecasting, weather forecasting, traffic flow forecasting, and other fields.
Some scholars have tried to apply the Transformer to time series forecasting because of its powerful parallel training capability.
The existing Transformer methods do not pay enough attention to the small time segments that play a decisive role in prediction.
arXiv Detail & Related papers (2022-02-23T10:33:12Z) - Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite its strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z) - Wake Word Detection with Streaming Transformers [72.66551640048405]
Our experiments on the Mobvoi wake word dataset show that the proposed Transformer model outperforms the baseline convolutional network by 25% on average in false rejection rate at the same false alarm rate.
arXiv Detail & Related papers (2021-02-08T19:14:32Z) - Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.