L2 proficiency assessment using self-supervised speech representations
- URL: http://arxiv.org/abs/2211.08849v1
- Date: Wed, 16 Nov 2022 11:47:20 GMT
- Title: L2 proficiency assessment using self-supervised speech representations
- Authors: Stefano Bannò, Kate M. Knill, Marco Matassoni, Vyas Raina, Mark J. F. Gales
- Abstract summary: This work extends the initial analysis of a self-supervised speech-representation-based scheme, requiring no speech recognition, to a large-scale proficiency test.
The performance of the self-supervised wav2vec 2.0 system is compared to a high-performance hand-crafted assessment system and a BERT-based text system.
Though the wav2vec 2.0 based system is found to be sensitive to the nature of the response, it can be configured to yield comparable performance to systems requiring a speech transcription.
- Score: 35.70742768910494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There has been a growing demand for automated spoken language assessment
systems in recent years. A standard pipeline for this process is to start with
a speech recognition system and derive features, either hand-crafted or based
on deep learning, that exploit the transcription and audio. Though these
approaches can yield high-performance systems, they require speech recognition
systems that can be used for L2 speakers, preferably tuned to the specific
form of test being deployed. Recently, a self-supervised speech-representation-based
scheme, requiring no speech recognition, was proposed. This work extends
the initial analysis conducted on this approach to a large-scale proficiency
test, Linguaskill, that comprises multiple parts, each designed to assess
different attributes of a candidate's speaking proficiency. The performance of
the self-supervised wav2vec 2.0 system is compared to a high-performance
hand-crafted assessment system and a BERT-based text system, both of which use
speech transcriptions. Though the wav2vec 2.0 based system is found to be
sensitive to the nature of the response, it can be configured to yield
comparable performance to systems requiring a speech transcription, and yields
gains when appropriately combined with standard approaches.
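As a rough illustration of the transcription-free idea, the sketch below mean-pools frame-level wav2vec 2.0 representations into an utterance embedding and regresses a proficiency score from it. The checkpoint name, pooling, and linear head are illustrative assumptions, not the paper's grader architecture.

```python
# Minimal sketch: score an L2 response from raw audio with no ASR step.
# Assumptions: mean-pooling and a linear head stand in for the paper's
# actual grader; "facebook/wav2vec2-base" is only an example checkpoint.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

class Wav2VecGrader(nn.Module):
    def __init__(self, model_name="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_values):
        hidden = self.encoder(input_values).last_hidden_state  # (B, T, H)
        pooled = hidden.mean(dim=1)                            # (B, H)
        return self.head(pooled).squeeze(-1)                   # proficiency score

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
grader = Wav2VecGrader()
audio = torch.randn(16000 * 5)  # placeholder: 5 s of 16 kHz audio
inputs = extractor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
score = grader(inputs.input_values)
```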
Related papers
- Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment [1.0359008237358598]
Dysarthria is a disability that causes disturbances in the human speech system.
We introduce the gammatonegram as an effective method to represent audio files with discriminative details.
We convert each speech file into an image and propose an image recognition system to classify speech in different scenarios.
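A minimal sketch of the idea, assuming log-spaced centre frequencies and simple log band energies (the paper's exact filterbank settings may differ):

```python
# Hedged sketch: build a gammatonegram (a gammatone-filterbank "image")
# from a waveform. Log-spaced centre frequencies are a simplification;
# ERB spacing is more usual.
import numpy as np
from scipy import signal

def gammatonegram(x, sr, n_bands=64, frame=0.025, hop=0.010,
                  f_min=50.0, f_max=8000.0):
    centres = np.geomspace(f_min, min(f_max, sr / 2 - 1), n_bands)
    frame_len, hop_len = int(frame * sr), int(hop * sr)
    n_frames = 1 + (len(x) - frame_len) // hop_len
    gram = np.empty((n_bands, n_frames))
    for i, fc in enumerate(centres):
        b, a = signal.gammatone(fc, "iir", fs=sr)  # 4th-order IIR gammatone
        y = signal.lfilter(b, a, x)
        for t in range(n_frames):
            seg = y[t * hop_len : t * hop_len + frame_len]
            gram[i, t] = np.log(np.mean(seg ** 2) + 1e-10)  # log band energy
    return gram  # (bands, frames): treat as an image for a CNN classifier

sr = 16000
x = np.random.randn(sr * 2)  # placeholder 2 s waveform
img = gammatonegram(x, sr)
```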
arXiv Detail & Related papers (2023-07-06T21:10:50Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
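A hedged sketch of how such context windows might be assembled; the window size, data layout, and helper name `with_context` are illustrative assumptions:

```python
# Illustrative sketch of contextual-utterance training data: each sample
# pairs the current utterance's audio, padded with its neighbours, against
# the current transcript only.
from typing import List, Tuple

def with_context(utts: List[Tuple[list, str]], n_ctx: int = 1,
                 use_future: bool = True):
    """utts: ordered (audio_samples, transcript) pairs from one recording."""
    samples = []
    for i, (audio, text) in enumerate(utts):
        past = [a for a, _ in utts[max(0, i - n_ctx):i]]
        future = [a for a, _ in utts[i + 1:i + 1 + n_ctx]] if use_future else []
        # Concatenate neighbouring audio so the model can attend across
        # utterance boundaries; only the current transcript is the target.
        samples.append((sum(past, []) + audio + sum(future, []), text))
    return samples

# Dual-mode idea: use_future=True for the full-context mode,
# use_future=False (past only) for the streaming mode of the same model.
```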
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system across a broad range of datasets and domains.
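The benchmark idea reduces to running one fixed system, untouched, over every domain; a minimal sketch, assuming a generic `transcribe` callable and in-memory (audio, reference) pairs rather than the actual ESB loaders:

```python
# Hedged sketch: report WER per domain for a single ASR system with no
# per-domain tuning. The data layout is a placeholder; the ESB paper
# supplies the actual datasets and protocol.
from jiwer import wer

def evaluate_system(transcribe, domains):
    """domains: mapping name -> iterable of (audio, reference_text) pairs."""
    results = {}
    for name, data in domains.items():
        refs, hyps = [], []
        for audio, ref in data:
            refs.append(ref)
            hyps.append(transcribe(audio))  # same system for every domain
        results[name] = wer(refs, hyps)
    return results
```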
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
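A rough sketch of inducing such a pseudo language, using k-means units with run-length deduplication; the feature choice and cluster count are assumptions, and Wav2Seq additionally compresses the units with BPE:

```python
# Sketch: cluster frame-level features into discrete units and collapse
# repeats, yielding a compact token sequence a seq2seq model can be
# pre-trained to predict from the audio (a pseudo-ASR task).
import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans

def pseudo_tokens(features, n_units=100):
    """features: (n_frames, dim) array of speech frame features."""
    km = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(features)
    units = km.predict(features)
    # Run-length deduplication: [5, 5, 5, 9, 9] -> [5, 9]
    return [int(u) for u, _ in groupby(units)]

frames = np.random.randn(2000, 39)  # placeholder MFCC-like features
tokens = pseudo_tokens(frames)      # pseudo-ASR targets for pre-training
```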
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Automatic Speech Recognition for Speech Assessment of Preschool Children [4.554894288663752]
The acoustic and linguistic features of preschool speech are investigated in this study.
Wav2Vec 2.0 is a paradigm that could be used to build a robust end-to-end speech recognition system.
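A minimal sketch of the Wav2Vec 2.0 CTC recipe alluded to here, using an example pretrained checkpoint; fine-tuning on child speech would swap in a different model:

```python
# Sketch: greedy CTC decoding with a pretrained Wav2Vec 2.0 model.
# "facebook/wav2vec2-base-960h" is an example adult-speech checkpoint,
# not the paper's child-speech model.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio = torch.zeros(16000)  # placeholder 1 s of 16 kHz audio
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
ids = torch.argmax(logits, dim=-1)        # greedy CTC path
transcript = processor.batch_decode(ids)[0]
```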
arXiv Detail & Related papers (2022-03-24T07:15:24Z)
- Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR [44.181755224118696]
Transcribe-to-Diarize is a new approach to neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) system.
The proposed method achieves a significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown.
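A hedged sketch of the post-processing step implied by this design: turning speaker-attributed word-level ASR output into diarization segments. The tuple layout is an assumed interface, not the paper's exact format:

```python
# Sketch: merge consecutive same-speaker words (with a small gap tolerance)
# into speaker turns, i.e. derive diarization directly from SA-ASR output.
def words_to_segments(words, max_gap=0.5):
    """words: list of (token, speaker_id, start_sec, end_sec), time-ordered."""
    segments = []
    for token, spk, start, end in words:
        if segments and segments[-1][0] == spk and start - segments[-1][2] < max_gap:
            segments[-1][2] = end               # extend current speaker turn
        else:
            segments.append([spk, start, end])  # open a new turn
    return [(spk, s, e) for spk, s, e in segments]

out = words_to_segments([("hi", "A", 0.0, 0.4), ("there", "A", 0.5, 0.9),
                         ("hello", "B", 1.2, 1.6)])
# [("A", 0.0, 0.9), ("B", 1.2, 1.6)]
```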
arXiv Detail & Related papers (2021-10-07T02:48:49Z)
- Mixtures of Deep Neural Experts for Automated Speech Scoring [11.860560781894458]
The paper addresses the task of automatically assessing second-language proficiency from language learners' spoken responses to test prompts.
The approach relies on two separate modules: (1) an automatic speech recognition system that yields text transcripts of the spoken interactions involved, and (2) a multiple classifier system based on deep learners that ranks the transcripts into proficiency classes.
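A minimal sketch of module (2), assuming TF-IDF features and a uniform average over expert posteriors in place of the paper's deep experts and learned mixture:

```python
# Sketch: an ensemble of text classifiers scores ASR transcripts into
# proficiency classes by averaging class probabilities. The shallow
# experts here are stand-ins for the paper's deep neural experts.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

transcripts = ["i goes to school yesterday", "the weather is pleasant today"]
labels = [0, 2]  # placeholder proficiency classes, e.g. 0=low, 2=high

vec = TfidfVectorizer()
X = vec.fit_transform(transcripts)
experts = [LogisticRegression(max_iter=1000).fit(X, labels),
           MultinomialNB().fit(X, labels)]

def score(text):
    x = vec.transform([text])
    probs = np.mean([e.predict_proba(x) for e in experts], axis=0)
    return experts[0].classes_[int(np.argmax(probs))]  # predicted class
```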
arXiv Detail & Related papers (2021-06-23T15:44:50Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
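A sketch of the augmentation idea via SentencePiece's subword regularisation, which provides BPE-dropout-style sampling; `units.model` is a placeholder for a trained subword model:

```python
# Sketch: sample a different subword segmentation of the same transcript
# each epoch, so the acoustic-unit targets vary during training.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="units.model")  # placeholder model

def sample_units(transcript, alpha=0.1):
    # enable_sampling draws a random segmentation instead of the
    # deterministic best one (alpha acts as the dropout/smoothing knob).
    return sp.encode(transcript, out_type=str,
                     enable_sampling=True, alpha=alpha, nbest_size=-1)

# Two calls on the same text can yield different unit sequences, e.g.
# ['▁mer', 'ha', 'ba']  vs  ['▁m', 'erhaba']
```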
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC): first transcribe the input speech with an automatic speech recognition (ASR) model, then synthesize the transcript with a text-to-speech (TTS) model trained on the target voice.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
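A hedged sketch of the naive cascade using ESPnet's inference wrappers; the model tags are placeholders, and the actual baseline uses seq2seq models adapted for VC:

```python
# Sketch: voice conversion as ASR followed by TTS. The model tags below
# are placeholders, not the VCC 2020 baseline checkpoints.
import soundfile as sf
from espnet2.bin.asr_inference import Speech2Text
from espnet2.bin.tts_inference import Text2Speech

asr = Speech2Text.from_pretrained("espnet/asr_model_tag")  # placeholder tag
tts = Text2Speech.from_pretrained("espnet/tts_model_tag")  # placeholder tag

speech, rate = sf.read("source.wav")
text = asr(speech)[0][0]       # best-hypothesis transcript of the source
converted = tts(text)["wav"]   # re-synthesized in the TTS (target) voice
sf.write("converted.wav", converted.numpy(), tts.fs)
```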
arXiv Detail & Related papers (2020-10-06T02:27:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.