Large-scale learning of generalised representations for speaker
recognition
- URL: http://arxiv.org/abs/2210.10985v1
- Date: Thu, 20 Oct 2022 03:08:18 GMT
- Title: Large-scale learning of generalised representations for speaker
recognition
- Authors: Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesong Lee, Hye-jin Shim,
Youngki Kwon, Joon Son Chung, Shinji Watanabe
- Abstract summary: We develop a speaker recognition model to be used in diverse scenarios.
We investigate several new training data configurations combining a few existing datasets.
We find that MFA-Conformer with the least inductive bias generalises the best.
- Score: 52.978310296712834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of this work is to develop a speaker recognition model to be
used in diverse scenarios. We hypothesise that two components should be
adequately configured to build such a model. First, an adequate architecture
would be required. We explore several recent state-of-the-art models, including
ECAPA-TDNN and MFA-Conformer, as well as other baselines. Second, a massive
amount of data would be required. We investigate several new training data
configurations combining a few existing datasets. The most extensive
configuration includes 10.22k hours of speech from over 87k speakers. Four
evaluation protocols are adopted to measure how the trained model performs in
diverse scenarios. Through experiments, we find that MFA-Conformer, the model
with the least inductive bias, generalises the best. We also show that training
with the proposed large data configurations gives better performance. A boost in
generalisation is observed, where the average performance on the four evaluation
protocols improves by more than 20%. In addition, we demonstrate that these
models' performance can improve even further when model capacity is increased.
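To make the evaluation setup concrete, below is a minimal sketch of how a trained speaker-embedding model of this kind is typically scored: cosine similarity over trial pairs followed by equal error rate (EER) computation. The embedding extractor and the random trial scores are placeholders for illustration, not the authors' code or data.

```python
# Minimal sketch: cosine scoring of speaker-verification trials and EER.
# The trial scores below are random placeholders; in practice each score is
# the cosine similarity between embeddings from a trained encoder
# (e.g. an MFA-Conformer or ECAPA-TDNN model).
import numpy as np


def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two fixed-dimensional speaker embeddings."""
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))


def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the operating point where false-accept and false-reject rates meet."""
    order = np.argsort(-scores)                               # high score = same speaker
    labels = labels[order]
    far = np.cumsum(1 - labels) / max((1 - labels).sum(), 1)  # false-accept rate
    frr = 1.0 - np.cumsum(labels) / max(labels.sum(), 1)      # false-reject rate
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2)


rng = np.random.default_rng(0)
scores = rng.normal(size=1000)          # placeholder trial scores
labels = rng.integers(0, 2, size=1000)  # 1 = same-speaker trial, 0 = different
print(f"EER = {100 * compute_eer(scores, labels):.2f}%")
```

The paper's reported gain of more than 20% refers to the average of such per-protocol metrics over its four evaluation protocols.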
Related papers
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We build universal embedding models capable of handling a wide range of downstream tasks.
Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB.
arXiv Detail & Related papers (2024-10-07T16:14:05Z)
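The contrastive framework mentioned above boils down, in generic form, to an InfoNCE-style loss over (query, target) pairs with in-batch negatives. The sketch below illustrates that generic objective; random tensors stand in for the encoder outputs, and nothing here is VLM2Vec's actual implementation.

```python
# Generic InfoNCE-style contrastive loss over a batch of (query, target) pairs,
# the kind of objective used to turn a pretrained encoder into an embedding model.
# Illustrative sketch only; random tensors stand in for model outputs.
import torch
import torch.nn.functional as F


def info_nce_loss(q: torch.Tensor, t: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """q, t: [batch, dim]; each query's positive is its own target,
    every other target in the batch acts as an in-batch negative."""
    q = F.normalize(q, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = q @ t.T / temperature            # [batch, batch] similarity matrix
    labels = torch.arange(q.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, labels)


queries = torch.randn(8, 768)   # would normally be VLM embeddings of the queries
targets = torch.randn(8, 768)   # and of their paired targets
print(info_nce_loss(queries, targets))
```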
- Dividable Configuration Performance Learning [4.949726352498762]
We propose a model-agnostic and sparsity-robust framework for predicting configuration performance, dubbed DaL.
DaL is based on the new paradigm of dividable learning, which builds a model via "divide-and-learn".
arXiv Detail & Related papers (2024-09-11T21:23:23Z)
- An Approach to Build Zero-Shot Slot-Filling System for Industry-Grade Conversational Assistants [9.537527104259153]
Key requirements of this system include the use of smaller models to meet low-latency requirements and to enable convenient, cost-effective cloud and customer-premises deployments.
We adopt a fine-tuning approach in which a pre-trained LLM is fine-tuned into a slot-filling model using task-specific data.
Results show that our prescribed approach to building the slot-filling model yields a 6.9% relative improvement in F1 over the best baseline on a realistic benchmark, while reducing latency by 57%.
arXiv Detail & Related papers (2024-06-13T06:24:52Z)
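Fine-tuning an LLM into a slot-filling model, as described above, typically means serialising each utterance and its slots into an input/target text pair. The format below is a hypothetical illustration (prompt wording and slot names are invented), not the schema used in the paper.

```python
# Hypothetical serialisation of a slot-filling example into a text-to-text pair
# suitable for fine-tuning a pretrained LLM; field names and prompt wording
# are illustrative assumptions, not taken from the paper.
def to_training_pair(utterance: str, slots: dict[str, str]) -> dict[str, str]:
    prompt = (
        "Extract the slot values from the utterance.\n"
        f"Utterance: {utterance}\n"
        "Slots:"
    )
    target = "; ".join(f"{name} = {value}" for name, value in slots.items())
    return {"input": prompt, "target": target}


example = to_training_pair(
    "book a table for two in Boston at 7pm",
    {"party_size": "two", "city": "Boston", "time": "7pm"},
)
print(example["input"])
print(example["target"])
```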
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification on AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
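Benchmarks of this kind usually evaluate frozen representations by training only a lightweight probe per downstream task. The sketch below shows that generic linear-probe pattern; the upstream encoder is a stand-in, not an actual AV-SUPERB model.

```python
# Generic linear-probe evaluation of a frozen representation on one downstream
# task: the upstream encoder is frozen and only a linear classifier is trained.
# The encoder below is a placeholder, not a real self-supervised model.
import torch
import torch.nn as nn

feature_dim, num_classes = 512, 10
upstream = nn.Linear(80, feature_dim)        # stand-in for a frozen SSL encoder
probe = nn.Linear(feature_dim, num_classes)  # the only trainable component

for p in upstream.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
inputs = torch.randn(32, 80)                  # stand-in for audio/visual features
labels = torch.randint(0, num_classes, (32,))

with torch.no_grad():
    reps = upstream(inputs)                   # frozen representations
loss = nn.functional.cross_entropy(probe(reps), labels)
loss.backward()
optimizer.step()
print(float(loss))
```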
- What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? [24.676820488258336]
Large Language Models (LLMs) have displayed exceptional multi-modal capabilities in following open-ended instructions given images.
These models rely on design choices such as network structures, training data, and training strategies.
This paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models.
arXiv Detail & Related papers (2023-07-05T17:44:28Z)
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [43.54069813039309]
We study vision-language instruction tuning based on the pretrained BLIP-2 models.
InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets.
Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks.
arXiv Detail & Related papers (2023-05-11T00:38:10Z)
- Multitask Learning for Low Resource Spoken Language Understanding [26.106133114838215]
We train models on dual objectives with automatic speech recognition and intent classification or sentiment classification.
Our models, although modest in size, show improvements over models trained end-to-end on intent classification.
We study the performance of the models in low-resource scenarios by training them with as few as one example per class.
arXiv Detail & Related papers (2022-11-24T16:38:17Z)
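Dual-objective training of the kind described above is commonly implemented as a weighted sum of a sequence loss for the ASR branch and a classification loss for the utterance-level label. The sketch below shows that combination with placeholder tensors; the shapes, the choice of CTC, and the weighting are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch of a dual-objective (multitask) loss: CTC for the ASR branch plus
# cross-entropy for intent classification, combined with a weighting factor.
# All tensors are random placeholders for real encoder outputs and labels.
import torch
import torch.nn.functional as F

T, B, V, num_intents = 50, 4, 30, 7              # frames, batch, vocab, intents
asr_logits = torch.randn(T, B, V, requires_grad=True)
log_probs = asr_logits.log_softmax(dim=-1)       # (T, B, V) layout expected by CTC
targets = torch.randint(1, V, (B, 12))           # token targets (0 is the blank)
input_lengths = torch.full((B,), T)
target_lengths = torch.full((B,), 12)

intent_logits = torch.randn(B, num_intents, requires_grad=True)
intent_labels = torch.randint(0, num_intents, (B,))

ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
ce = F.cross_entropy(intent_logits, intent_labels)

alpha = 0.5                                      # task-weighting hyperparameter
loss = alpha * ctc + (1 - alpha) * ce
loss.backward()
print(float(loss))
```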
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results in in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
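The idea of mixing pre-training objectives can be pictured as sampling, per training example, one of several corruption configurations. The sketch below is a deliberately simplified, single-span version of that mixture; the denoiser names and rates are illustrative and not the framework's actual hyperparameters.

```python
# Simplified sketch of a mixture of denoising objectives: each example is
# corrupted under one of several randomly sampled configurations. Real span
# corruption masks several spans; one contiguous span is used here for brevity.
import random

DENOISERS = [
    {"name": "regular", "corruption_rate": 0.15},     # short-span denoising
    {"name": "sequential", "corruption_rate": 0.25},  # prefix-LM style split
    {"name": "extreme", "corruption_rate": 0.50},     # aggressive corruption
]


def corrupt(tokens: list[str]) -> tuple[str, list[str], list[str]]:
    cfg = random.choice(DENOISERS)
    if cfg["name"] == "sequential":                   # keep a prefix, predict the rest
        cut = int(len(tokens) * (1 - cfg["corruption_rate"]))
        return cfg["name"], tokens[:cut] + ["<X>"], tokens[cut:]
    n_mask = max(1, int(len(tokens) * cfg["corruption_rate"]))
    start = random.randrange(0, len(tokens) - n_mask + 1)
    inputs = tokens[:start] + ["<X>"] + tokens[start + n_mask:]
    target = tokens[start:start + n_mask]
    return cfg["name"], inputs, target


name, inputs, target = corrupt("the quick brown fox jumps over the lazy dog".split())
print(name, inputs, target)
```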
- Feeding What You Need by Understanding What You Learned [54.400455868448695]
Machine Reading Comprehension (MRC) reveals the ability to understand a given text passage and answer questions based on it.
Existing research in MRC relies heavily on large models and corpora to improve performance as evaluated by metrics such as Exact Match.
We argue that a deep understanding of model capabilities and data properties can help us feed a model with appropriate training data.
arXiv Detail & Related papers (2022-03-05T14:15:59Z)
- Visual Speech Recognition for Multiple Languages in the Wild [64.52593130370757]
We show that designing better VSR models is equally important to using larger training sets.
We propose the addition of prediction-based auxiliary tasks to a VSR model.
We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
arXiv Detail & Related papers (2022-02-26T07:21:00Z)