TRScore: A Novel GPT-based Readability Scorer for ASR Segmentation and
Punctuation model evaluation and selection
- URL: http://arxiv.org/abs/2210.15104v1
- Date: Thu, 27 Oct 2022 01:11:32 GMT
- Title: TRScore: A Novel GPT-based Readability Scorer for ASR Segmentation and
Punctuation model evaluation and selection
- Authors: Piyush Behre, Sharman Tan, Amy Shah, Harini Kesavamoorthy, Shuangyu
Chang, Fei Zuo, Chris Basoglu, Sayan Pathak
- Abstract summary: Punctuation and segmentation are key to readability in Automatic Speech Recognition.
Human evaluation is expensive, time-consuming, and suffers from large inter-observer variability.
We present TRScore, a novel readability measure using the GPT model to evaluate different segmentation and punctuation systems.
- Score: 1.4720080476520687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Punctuation and Segmentation are key to readability in Automatic Speech
Recognition (ASR), often evaluated using F1 scores that require high-quality
human transcripts and do not reflect readability well. Human evaluation is
expensive, time-consuming, and suffers from large inter-observer variability,
especially in conversational speech devoid of strict grammatical structures.
Large pre-trained models capture a notion of grammatical structure. We present
TRScore, a novel readability measure using the GPT model to evaluate different
segmentation and punctuation systems. We validate our approach with human
experts. Additionally, our approach enables quantitative assessment of text
post-processing techniques such as capitalization, inverse text normalization
(ITN), and disfluency on overall readability, which traditional word error rate
(WER) and slot error rate (SER) metrics fail to capture. TRScore is strongly
correlated to traditional F1 and human readability scores, with Pearson's
correlation coefficients of 0.67 and 0.98, respectively. It also eliminates the
need for human transcriptions for model selection.
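As a rough illustration of the core idea of scoring readability with a pretrained GPT model, the sketch below uses GPT-2 perplexity (via Hugging Face transformers) as a readability proxy for candidate segmentation and punctuation outputs; the paper's actual TRScore formulation, GPT variant, and any fine-tuning may differ.

```python
# Minimal sketch of a GPT-based readability proxy (not the exact TRScore method):
# lower perplexity under a pretrained GPT-2 is taken as a rough signal that a
# punctuated, segmented hypothesis reads more fluently.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def readability_proxy(text: str) -> float:
    """Return exp(-mean token cross-entropy) in (0, 1]; higher suggests better readability."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean next-token cross-entropy.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(-loss).item()

# Compare two candidate segmentation/punctuation outputs for the same audio.
hyp_a = "so how was the meeting it went well we closed the deal"
hyp_b = "So, how was the meeting? It went well. We closed the deal."
print(readability_proxy(hyp_a), readability_proxy(hyp_b))
```

Agreement of such scores with human ratings or F1 (the 0.98 and 0.67 Pearson correlations reported above) could then be checked over a set of test utterances with scipy.stats.pearsonr.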
Related papers
- Analysing Zero-Shot Readability-Controlled Sentence Simplification [54.09069745799918]
We investigate how different types of contextual information affect a model's ability to generate sentences with the desired readability.
Results show that all tested models struggle to simplify sentences, due both to model limitations and to characteristics of the source sentences.
Our experiments also highlight the need for better automatic evaluation metrics tailored to RCTS.
arXiv Detail & Related papers (2024-09-30T12:36:25Z)
- Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference? [3.2528685897001455]
This paper adapts an existing unsupervised technique for measuring text fluency without the need for any reference.
Our approach leverages various word embeddings and trains language models using Recurrent Neural Network (RNN) architectures.
To assess the performance of the models, we conduct a comparative analysis across 10 Indic languages.
arXiv Detail & Related papers (2023-12-03T20:09:23Z)
- LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models [58.790604613878216]
We introduce a LibriSpeech-PC benchmark designed to assess the punctuation and capitalization prediction capabilities of end-to-end ASR models.
The benchmark includes a LibriSpeech-PC dataset with restored punctuation and capitalization, a novel evaluation metric called Punctuation Error Rate (PER) that focuses on punctuation marks, and initial baseline models.
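As a rough illustration of a punctuation-focused metric, the sketch below counts per-word punctuation mismatches against a reference; the exact PER definition used by LibriSpeech-PC may differ.

```python
# Rough sketch of a punctuation-focused error rate. Assumes reference and
# hypothesis contain the same words and differ only in the punctuation
# attached to each word.
PUNCT_MARKS = {".", ",", "?", "!"}

def punct_labels(text: str) -> list[str]:
    """Label each word with the punctuation mark that follows it ('' if none)."""
    return [tok[-1] if tok[-1] in PUNCT_MARKS else "" for tok in text.split()]

def punctuation_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = punct_labels(reference), punct_labels(hypothesis)
    assert len(ref) == len(hyp), "sketch assumes identical word sequences"
    errors = sum(r != h for r, h in zip(ref, hyp))      # substitutions, insertions, deletions
    num_slots = max(1, sum(1 for mark in ref if mark))  # reference punctuation slots
    return errors / num_slots

print(punctuation_error_rate("it went well. we closed the deal.",
                             "it went well, we closed the deal."))  # -> 0.5
```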
arXiv Detail & Related papers (2023-10-04T16:23:37Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- Assessing ASR Model Quality on Disordered Speech using BERTScore [5.489867271342724]
Word Error Rate (WER) is the primary metric used to assess automatic speech recognition (ASR) model quality.
It has been shown that ASR models tend to have much higher WER on speakers with speech impairments than on typical English speakers.
This study investigates the use of BERTScore, an evaluation metric for text generation, to provide a more informative measure of ASR model quality and usefulness.
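A minimal example of this kind of comparison, assuming the bert-score and jiwer packages and a toy reference/hypothesis pair rather than the study's data or model configuration:

```python
# Illustrative comparison of WER and BERTScore on an ASR hypothesis.
import jiwer
from bert_score import score

references = ["please schedule a follow up appointment for next tuesday"]
hypotheses = ["please schedule a follow appointment for next tuesday"]

wer = jiwer.wer(references, hypotheses)              # word error rate
P, R, F1 = score(hypotheses, references, lang="en")  # semantic similarity
print(f"WER: {wer:.3f}  BERTScore F1: {F1.mean().item():.3f}")
```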
arXiv Detail & Related papers (2022-09-21T18:33:33Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
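A sketch of sentence-level soft matching in this spirit, using a simple token-overlap F1 as the sentence matching function purely for illustration (the paper's matching functions and aggregation differ):

```python
def token_f1(a: str, b: str) -> float:
    """Simple token-overlap F1 used here as the sentence matching function."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    overlap = len(ta & tb)
    if not overlap:
        return 0.0
    p, r = overlap / len(ta), overlap / len(tb)
    return 2 * p * r / (p + r)

def smart_like(candidate_sents: list[str], reference_sents: list[str]) -> float:
    """Soft-match each candidate sentence to its best reference sentence and vice versa."""
    precision = sum(max(token_f1(c, r) for r in reference_sents)
                    for c in candidate_sents) / len(candidate_sents)
    recall = sum(max(token_f1(r, c) for c in candidate_sents)
                 for r in reference_sents) / len(reference_sents)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(smart_like(["The deal closed today.", "Everyone was pleased."],
                 ["The deal was closed today.", "The whole team was pleased."]))
```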
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- Automated Evaluation of Standardized Dementia Screening Tests [0.18472148461613155]
We report on a study that consists of a semi-standardized history taking followed by two standardized neuropsychological tests.
The tests include basic tasks such as naming objects and learning word lists, but also widely used tools such as the MMSE (Mini-Mental State Examination).
We show that using word alternatives helps to mitigate recognition errors and subsequently improves correlation with expert scores.
arXiv Detail & Related papers (2022-06-13T14:41:27Z)
- Robust Prediction of Punctuation and Truecasing for Medical ASR [18.08508027663331]
This paper proposes a conditional joint modeling framework for prediction of punctuation and truecasing.
We also present techniques for domain and task specific adaptation by fine-tuning masked language models with medical domain data.
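As a hypothetical sketch of such a setup, punctuation and truecasing can be cast as per-word token classification on top of a masked language model; the label set and base model below are illustrative, not the paper's.

```python
# Hypothetical setup for joint punctuation and truecasing as token classification
# (the paper's conditional joint modeling framework and medical-domain adaptation
# are more involved; labels and base model here are illustrative only).
from transformers import AutoTokenizer, AutoModelForTokenClassification

PUNCT = ["", ".", ",", "?"]       # punctuation predicted after each word
CASE = ["lower", "capitalized"]   # truecasing decision for each word
LABELS = [f"{c}|{p}" for c in CASE for p in PUNCT]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)
# Fine-tuning would pair lower-cased, unpunctuated ASR transcripts with per-word
# (case, punctuation) labels and use a standard token-classification training
# loop (e.g., the Hugging Face Trainer) on in-domain medical data.
```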
arXiv Detail & Related papers (2020-07-04T07:15:13Z)