Cross-Lingual Speaker Identification Using Distant Supervision
- URL: http://arxiv.org/abs/2210.05780v1
- Date: Tue, 11 Oct 2022 20:49:44 GMT
- Title: Cross-Lingual Speaker Identification Using Distant Supervision
- Authors: Ben Zhou, Dian Yu, Dong Yu, Dan Roth
- Abstract summary: We propose a speaker identification framework that addresses issues such as lack of contextual reasoning and poor cross-lingual generalization.
We show that the resulting model outperforms previous state-of-the-art methods on two English speaker identification benchmarks by up to 9% in accuracy, and by up to 5% using only distant supervision.
- Score: 84.51121411280134
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker identification, determining which character said each utterance in
literary text, benefits many downstream tasks. Most existing approaches use
expert-defined rules or rule-based features to directly approach this task, but
these approaches come with significant drawbacks, such as lack of contextual
reasoning and poor cross-lingual generalization. In this work, we propose a
speaker identification framework that addresses these issues. We first extract
large-scale distant supervision signals in English via general-purpose tools
and heuristics, and then apply these weakly-labeled instances with a focus on
encouraging contextual reasoning to train a cross-lingual language model. We
show that the resulting model outperforms previous state-of-the-art methods on
two English speaker identification benchmarks by up to 9% in accuracy and 5%
with only distant supervision, as well as two Chinese speaker identification
datasets by up to 4.7%.
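The abstract does not spell out the paper's exact tools or heuristics, but the general idea of extracting weak speaker labels from raw literary text can be sketched as below. The character list, speech-verb set, and the nearest-speech-verb rule are illustrative assumptions, not the paper's actual pipeline; such weak labels are noisy by design and ambiguous instances are simply discarded.

```python
import re

# Hypothetical speech-verb inventory; the paper's actual general-purpose
# tools and heuristics are not specified in the abstract above.
SPEECH_VERBS = {"said", "asked", "replied", "shouted", "whispered"}

def distant_speaker_label(sentence, characters):
    """Attribute a quoted utterance to a character via a simple
    nearest-speech-verb heuristic, yielding a weak (possibly noisy) label."""
    # Remove the quoted span, then look for a known character name
    # adjacent to a speech verb in the remaining narration.
    narration = re.sub(r'"[^"]*"', " ", sentence)
    tokens = narration.split()
    for i, tok in enumerate(tokens):
        if tok.strip(",.;:").lower() in SPEECH_VERBS:
            for j in (i - 1, i + 1):  # check both neighbours of the verb
                if 0 <= j < len(tokens):
                    name = tokens[j].strip(",.;:")
                    if name in characters:
                        return name
    return None  # no confident weak label; instance would be discarded

print(distant_speaker_label('"Not now," said Alice, turning away.', {"Alice", "Bob"}))  # prints Alice
```

Instances labeled this way in English could then serve as training data for a cross-lingual language model, which is what lets the supervision transfer to Chinese without Chinese-specific rules.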
Related papers
- Label Aware Speech Representation Learning For Language Identification [49.197215416945596]
We propose a novel framework of combining self-supervised representation learning with the language label information for the pre-training task.
This framework, termed as Label Aware Speech Representation (LASR) learning, uses a triplet based objective function to incorporate language labels along with the self-supervised loss function.
arXiv Detail & Related papers (2023-06-07T12:14:16Z)
- Semi-supervised cross-lingual speech emotion recognition [26.544999411050036]
Cross-lingual Speech Emotion Recognition remains a challenge in real-world applications.
We propose a Semi-Supervised Learning (SSL) method for cross-lingual emotion recognition when only a few labeled examples in the target domain are available.
Our method is based on a Transformer and it adapts to the new domain by exploiting a pseudo-labeling strategy on the unlabeled utterances.
arXiv Detail & Related papers (2022-07-14T09:24:55Z)
- Cross-lingual Low Resource Speaker Adaptation Using Phonological Features [2.8080708404213373]
We train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages.
With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature.
arXiv Detail & Related papers (2021-11-17T12:33:42Z)
- Few-Shot Cross-Lingual Stance Detection with Sentiment-Based Pre-Training [32.800766653254634]
We present the most comprehensive study of cross-lingual stance detection to date.
We use 15 diverse datasets in 12 languages from 6 language families.
For our experiments, we build on pattern-exploiting training, proposing the addition of a novel label encoder.
arXiv Detail & Related papers (2021-09-13T15:20:06Z)
- It's All in the Heads: Using Attention Heads as a Baseline for Cross-Lingual Transfer in Commonsense Reasoning [4.200736775540874]
We design a simple approach to commonsense reasoning which trains a linear classifier with weights of multi-head attention as features.
The method performs competitively with recent supervised and unsupervised approaches for commonsense reasoning.
Most of the performance is given by the same small subset of attention heads for all studied languages.
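The summary above describes a linear classifier trained on multi-head attention scores as features. The sketch below illustrates that setup on synthetic data where, as in the paper's finding, only a small subset of "heads" carries signal; the feature extraction from a real multilingual encoder, the head count, and all hyperparameters here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: per example, one attention score per head
# (e.g. 12 layers x 12 heads = 144 features for a BERT-sized encoder).
n_heads, n_train = 144, 200
X = rng.normal(size=(n_train, n_heads))
w_true = np.zeros(n_heads)
w_true[:5] = 2.0  # only a small subset of heads is informative
y = (X @ w_true + rng.normal(scale=0.1, size=n_train) > 0).astype(float)

# Train a linear (logistic) classifier on the head features
# by plain batch gradient descent.
w = np.zeros(n_heads)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / n_train

acc = ((1.0 / (1.0 + np.exp(-(X @ w))) > 0.5) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

Because the classifier is linear, the learned weights directly indicate which heads matter, which is how such an analysis can show the same small subset of heads driving performance across languages.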
arXiv Detail & Related papers (2021-06-22T21:25:43Z)
- Graph-based Label Propagation for Semi-Supervised Speaker Identification [10.87690067963342]
We propose a graph-based semi-supervised learning approach for speaker identification in the household scenario.
We show that this approach makes effective use of unlabeled data and improves speaker identification accuracy compared to two state-of-the-art scoring methods.
arXiv Detail & Related papers (2021-06-15T15:10:33Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.