Automated Speech Scoring System Under The Lens: Evaluating and
interpreting the linguistic cues for language proficiency
- URL: http://arxiv.org/abs/2111.15156v1
- Date: Tue, 30 Nov 2021 06:28:58 GMT
- Title: Automated Speech Scoring System Under The Lens: Evaluating and
interpreting the linguistic cues for language proficiency
- Authors: Pakhi Bamdev, Manraj Singh Grover, Yaman Kumar Singla, Payman Vafaee,
Mika Hama, Rajiv Ratn Shah
- Abstract summary: We utilize classical machine learning models to formulate a speech scoring task as both a classification and a regression problem.
First, we extract linguistic features under five categories (fluency, pronunciation, content, grammar and vocabulary, and acoustic) and train models to grade responses.
We find that the regression-based models perform equivalently to or better than the classification approach.
- Score: 26.70127591966917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: English proficiency assessments have become a necessary metric for filtering
and selecting prospective candidates for both academia and industry. With the
rise in demand for such assessments, it has become increasingly important for
automated scoring to produce human-interpretable results, both to prevent
inconsistencies and to provide meaningful feedback to second language learners.
Classical feature-based approaches are more interpretable, making it easier to
understand what the scoring model learns. Therefore, in this work, we utilize
classical machine learning models to formulate the speech scoring task as both a
classification and a regression problem, followed by a thorough study of the
relation between the linguistic cues and the English proficiency level of the
speaker. First, we extract linguistic features under five categories (fluency,
pronunciation, content, grammar and vocabulary, and acoustic) and train models
to grade responses. We find that the regression-based models perform
equivalently to or better than the classification approach. Second, we
perform ablation studies to understand the impact of each feature and each
feature category on proficiency grading performance. Further, to understand
individual feature contributions, we report the importance of the top features
for the best-performing algorithm on the grading task. Third, we make use of
Partial Dependence Plots and Shapley values to explore feature importance and
conclude that the best-performing trained model learns the underlying rubrics
used for grading the dataset used in this study.
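To make the two formulations and the interpretation step concrete, here is a minimal sketch (not the authors' code) using scikit-learn and the shap library. The feature file, column names such as speaking_rate, the grade scale, and the choice of random forests are illustrative assumptions; the paper uses classical feature-based models but this summary does not tie it to a specific estimator or feature schema.
```python
# Sketch: speech scoring as classification vs. regression over hand-crafted
# linguistic features, then interpreting the fitted model with SHAP values
# and partial dependence plots. Feature names are hypothetical placeholders
# for the paper's five categories (fluency, pronunciation, content,
# grammar/vocabulary, acoustic).
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# One row per spoken response; columns are hand-crafted linguistic features.
X = pd.read_csv("speech_features.csv")   # hypothetical file
y = X.pop("proficiency_grade")           # ordinal grade, e.g. 1..5

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Formulation 1: classification over discrete proficiency grades.
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
kappa_clf = cohen_kappa_score(y_te, clf.predict(X_te), weights="quadratic")

# Formulation 2: regression on the grade, rounded back onto the grade scale.
reg = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred_reg = np.clip(np.rint(reg.predict(X_te)), y.min(), y.max()).astype(int)
kappa_reg = cohen_kappa_score(y_te, pred_reg, weights="quadratic")
print(f"QWK classification={kappa_clf:.3f}  regression={kappa_reg:.3f}")

# Interpretation: Shapley values for per-feature contributions ...
explainer = shap.TreeExplainer(reg)
shap_values = explainer.shap_values(X_te)
shap.summary_plot(shap_values, X_te, show=False)

# ... and a partial dependence plot showing how one feature (e.g. a fluency
# measure) moves the predicted grade across its range.
PartialDependenceDisplay.from_estimator(reg, X_te, features=["speaking_rate"])
```
Quadratic weighted kappa (QWK) is used here as the comparison metric because it is standard for ordinal grading tasks; the paper's exact evaluation setup may differ.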
Related papers
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- Why do you cite? An investigation on citation intents and decision-making classification processes [1.7812428873698407]
This study emphasizes the importance of reliable classification of citation intents.
We present a study utilizing advanced Ensemble Strategies for Citation Intent Classification (CIC).
One of our models sets a new state-of-the-art (SOTA) result with an 89.46% Macro-F1 score on the SciCite benchmark.
arXiv Detail & Related papers (2024-07-18T09:29:33Z)
- Holmes: A Benchmark to Assess the Linguistic Competence of Language Models [59.627729608055006]
We introduce Holmes, a new benchmark designed to assess the linguistic competence of language models (LMs).
We use computation-based probing to examine LMs' internal representations regarding distinct linguistic phenomena.
As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities.
arXiv Detail & Related papers (2024-04-29T17:58:36Z)
- Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy [27.454549324141087]
We propose a novel VQA benchmark based on well-known visual classification datasets.
We also suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category.
Our contributions aim to lay the foundation for more precise and meaningful assessments.
arXiv Detail & Related papers (2024-02-11T18:26:18Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Under the Microscope: Interpreting Readability Assessment Models for Filipino [0.0]
We dissect machine learning-based readability assessment models in Filipino by performing global and local model interpretation.
Results show that a model trained with the top features from global interpretation achieves higher performance than models using features selected by Spearman correlation.
arXiv Detail & Related papers (2021-10-01T01:27:10Z)
- Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers [0.05857406612420462]
Large-scale, pre-trained language models achieve human-level and superhuman accuracy on existing language understanding tasks.
We propose evaluating systems through a novel measure of prediction coherence.
arXiv Detail & Related papers (2021-09-10T15:04:23Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
In our technique, we take advantage of the fact that oral proficiency tests rate multiple responses per candidate. We extract context from these responses and feed it as additional speaker-specific context to our network when scoring a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
- General-Purpose Speech Representation Learning through a Self-Supervised Multi-Granularity Framework [114.63823178097402]
This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning.
Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic information at large time scales.
arXiv Detail & Related papers (2021-02-03T08:13:21Z)
- Linguistic Features for Readability Assessment [0.0]
It is unknown whether augmenting deep learning models with linguistically motivated features would improve performance further.
We find that, given sufficient training data, augmenting deep learning models with linguistically motivated features does not improve state-of-the-art performance.
Our results provide preliminary evidence for the hypothesis that the state-of-the-art deep learning models represent linguistic features of the text related to readability.
arXiv Detail & Related papers (2020-05-30T22:14:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences of its use.