Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring
- URL: http://arxiv.org/abs/2109.00928v1
- Date: Mon, 30 Aug 2021 07:00:28 GMT
- Title: Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring
- Authors: Yaman Kumar Singla, Avyakt Gupta, Shaurya Bagga, Changyou Chen, Balaji
Krishnamurthy, Rajiv Ratn Shah
- Abstract summary: We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
It takes advantage of the fact that oral proficiency tests rate multiple responses for a candidate: we extract context vectors from these responses and feed them as additional speaker-specific context to our network to score a particular response.
- Score: 60.55025339250815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Speech Scoring (ASS) is the computer-assisted evaluation of a
candidate's speaking proficiency in a language. ASS systems face many
challenges like open grammar, variable pronunciations, and unstructured or
semi-structured content. Recent deep learning approaches have shown some
promise in this domain. However, most of these approaches focus on extracting
features from a single audio, making them suffer from the lack of
speaker-specific context required to model such a complex task. We propose a
novel deep learning technique for non-native ASS, called speaker-conditioned
hierarchical modeling. In our technique, we take advantage of the fact that
oral proficiency tests rate multiple responses for a candidate. We extract
context vectors from these responses and feed them as additional
speaker-specific context to our network to score a particular response. We
compare our technique with strong baselines and find that such modeling
improves the model's average performance by 6.92% (maximum = 12.86%, minimum =
4.51%). We further show both quantitative and qualitative insights into the
importance of this additional context in solving the problem of ASS.
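Neither the abstract nor this listing includes code, so the following is a minimal PyTorch sketch of the idea as the abstract states it: encode each of a candidate's responses, pool the remaining responses into a speaker-specific context vector, and concatenate it with the target response's encoding before scoring. The module names, the LSTM encoder, mean pooling, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of speaker-conditioned scoring:
# a response encoder yields one vector per response; when scoring response i,
# the candidate's remaining responses are pooled into a speaker context
# vector and concatenated with the target encoding before regression.
import torch
import torch.nn as nn

class SpeakerConditionedScorer(nn.Module):
    def __init__(self, feat_dim: int = 80, hidden: int = 256):
        super().__init__()
        # Response encoder: acoustic frames -> response vector (LSTM assumed).
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.scorer = nn.Sequential(
            nn.Linear(4 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def encode(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_responses, time, feat_dim) -> (num_responses, 2*hidden)
        _, (h, _) = self.encoder(frames)
        return torch.cat([h[-2], h[-1]], dim=-1)

    def forward(self, responses: torch.Tensor, target_idx: int) -> torch.Tensor:
        enc = self.encode(responses)                  # one vector per response
        mask = torch.ones(enc.size(0), dtype=torch.bool)
        mask[target_idx] = False                      # exclude the response being scored
        speaker_ctx = enc[mask].mean(dim=0)           # pooled speaker-specific context
        return self.scorer(torch.cat([enc[target_idx], speaker_ctx], dim=-1))

model = SpeakerConditionedScorer()
score = model(torch.randn(5, 300, 80), target_idx=2)  # 5 responses from one candidate
```

In the paper's setting the context would come from the candidate's other test responses scored in the same session; here they are simply the other rows of the input batch.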
Related papers
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
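As a rough illustration of the zero-shot LLM-as-judge pattern CLAIR-A builds on (the paper's actual prompt and scoring format may differ), a single call can elicit a numeric score plus a one-sentence justification; `ask_llm` below is a hypothetical stand-in for any chat-completion client.

```python
# Hedged sketch of the zero-shot LLM-as-judge pattern the CLAIR-A summary
# describes; the exact prompt and output format used by CLAIR-A may differ.
# `ask_llm` is a hypothetical stand-in for any chat-completion client.
import json

def score_caption(candidate: str, reference: str, ask_llm) -> dict:
    prompt = (
        "You are judging audio captions. On a scale of 0-100, how likely is it "
        "that the candidate caption describes the same audio as the reference?\n"
        f"Candidate: {candidate}\nReference: {reference}\n"
        'Answer as JSON: {"score": <int>, "reason": "<one sentence>"}'
    )
    return json.loads(ask_llm(prompt))  # e.g. {"score": 85, "reason": "..."}
```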
- SpeechCaps: Advancing Instruction-Based Universal Speech Models with Multi-Talker Speaking Style Captioning [43.71388370559826]
This paper introduces a multi-talker speaking style captioning task to enhance the understanding of speaker and prosodic information.
We used large language models to generate descriptions for multi-talker speech.
We trained our model by pre-training on this captioning task and then applying instruction tuning.
arXiv Detail & Related papers (2024-08-25T17:05:26Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech [26.533600745910437]
We propose an effective transformer pruning method, known as sparse attention, to improve the TTS model's generalization abilities.
We also propose a new differentiable pruning method that allows the model to automatically learn the thresholds.
arXiv Detail & Related papers (2023-08-28T21:25:05Z)
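To make "learnable pruning thresholds" concrete, here is an illustrative soft-thresholding sketch (not necessarily the paper's exact gating function): attention weights below a trainable threshold are pushed toward zero through a steep sigmoid, which keeps the threshold differentiable.

```python
# Illustrative sketch of differentiable attention-weight pruning with a
# learnable threshold; the paper's exact gating function may differ.
import torch
import torch.nn as nn

class ThresholdPrunedAttention(nn.Module):
    def __init__(self, temperature: float = 50.0):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(0.01))  # learned per layer/head in practice
        self.temperature = temperature

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: softmax attention weights, shape (..., query, key)
        gate = torch.sigmoid(self.temperature * (attn - self.threshold))
        pruned = attn * gate  # soft mask, differentiable w.r.t. the threshold
        return pruned / pruned.sum(-1, keepdim=True).clamp_min(1e-9)  # renormalize rows
```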
- Automatic Evaluation of Speaker Similarity [0.0]
We introduce a new automatic evaluation method for speaker similarity assessment, consistent with human perceptual scores.
Our experiments show that we can train a model to predict speaker-similarity MUSHRA scores from speaker embeddings with 0.96 accuracy and a significant correlation of up to 0.78 (Pearson) at the utterance level.
arXiv Detail & Related papers (2022-07-01T11:23:16Z)
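A minimal sketch of the kind of predictor that entry describes, assuming pairs of fixed-size speaker embeddings as input; the pair features and layer sizes are guesses, not the paper's architecture.

```python
# Hedged sketch: regress a MUSHRA-style similarity score from a pair of
# speaker embeddings (feature choice and architecture are assumptions).
import torch
import torch.nn as nn

class SimilarityRegressor(nn.Module):
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        # Pair features: both embeddings plus their elementwise absolute difference.
        self.net = nn.Sequential(
            nn.Linear(3 * emb_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([a, b, (a - b).abs()], dim=-1))  # predicted score
```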
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
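For the vector-quantization half of that recipe, here is a minimal sketch of VQ content encoding with a straight-through estimator and a commitment loss; VQMIVC's mutual-information term (its MI estimate between content, speaker, and pitch representations) is omitted, so treat this as an assumption-laden fragment rather than the paper's model.

```python
# Minimal sketch of vector-quantized content encoding with a straight-through
# estimator and commitment loss; VQMIVC additionally minimizes a mutual-
# information estimate across content/speaker/pitch codes (omitted here).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: (batch, time, dim) continuous content features
        codes = self.codebook.weight.expand(z.size(0), -1, -1)
        idx = torch.cdist(z, codes).argmin(-1)           # nearest-code assignment
        q = self.codebook(idx)
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()                         # straight-through gradient
        return q, idx, loss
```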
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high out-of-vocabulary (OOV) rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result, with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
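BPE-dropout is a subword-regularization trick: randomly dropping merge operations yields different segmentations of the same word, which in this entry's setting augments the acoustic-unit sequences seen in training. As a quick illustration (using SentencePiece's sampling API rather than the authors' pipeline, and assuming a trained BPE model file at "bpe.model"), repeated calls segment the same word differently:

```python
# Illustration of subword sampling in the spirit of BPE-dropout, via
# SentencePiece's sampling API (not the authors' exact pipeline). Assumes
# a trained BPE model at "bpe.model"; alpha is the merge-dropping probability.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bpe.model")
for _ in range(3):
    # With sampling enabled, segmentation varies between calls, exposing the
    # acoustic model to alternative subword (acoustic-unit) sequences.
    print(sp.encode("merhaba", out_type=str, enable_sampling=True, alpha=0.1))
```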
- Speaker-Aware BERT for Multi-Turn Response Selection in Retrieval-Based Chatbots [47.40380290055558]
A new model, named Speaker-Aware BERT (SA-BERT), is proposed to make the model aware of the speaker change information.
A speaker-aware disentanglement strategy is proposed to tackle the entangled dialogues.
arXiv Detail & Related papers (2020-04-07T02:08:04Z)
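A hedged sketch of SA-BERT's core idea as this summary states it: add a speaker embedding on top of BERT's token embeddings so the encoder can track speaker changes across turns. The sizes and the exact injection point are assumptions, not the paper's code.

```python
# Hedged sketch of the SA-BERT idea: a speaker(-change) embedding is added to
# BERT's token embeddings. Integration details here are illustrative only.
import torch
import torch.nn as nn

class SpeakerAwareEmbeddings(nn.Module):
    def __init__(self, vocab_size: int = 30522, hidden: int = 768, num_speakers: int = 2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.spk = nn.Embedding(num_speakers, hidden)  # speaker-change / identity embedding

    def forward(self, input_ids: torch.Tensor, speaker_ids: torch.Tensor) -> torch.Tensor:
        # speaker_ids marks which interlocutor produced each token (0/1).
        return self.tok(input_ids) + self.spk(speaker_ids)
```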
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.