Adapting the NICT-JLE Corpus for Disfluency Detection Models
- URL: http://arxiv.org/abs/2308.02482v1
- Date: Fri, 4 Aug 2023 17:54:52 GMT
- Title: Adapting the NICT-JLE Corpus for Disfluency Detection Models
- Authors: Lucy Skidmore and Roger K. Moore
- Abstract summary: This paper describes the adaptation of the NICT-JLE corpus to a format suitable for disfluency detection model training and evaluation.
Points of difference between the NICT-JLE and Switchboard corpora are explored, followed by a detailed overview of adaptations to the tag set and meta-features.
The result of this work provides a standardised train, heldout and test set for use in future research on disfluency detection for learner speech.
- Score: 9.90780328490921
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The detection of disfluencies such as hesitations, repetitions and false
starts commonly found in speech is a widely studied area of research. With a
standardised process for evaluation using the Switchboard Corpus, model
performance can be easily compared across approaches. This is not the case for
disfluency detection research on learner speech, however, where such datasets
have restricted access policies, making comparison and subsequent development
of improved models more challenging. To address this issue, this paper
describes the adaptation of the NICT-JLE corpus, containing approximately 300
hours of English learners' oral proficiency tests, to a format that is suitable
for disfluency detection model training and evaluation. Points of difference
between the NICT-JLE and Switchboard corpora are explored, followed by a
detailed overview of adaptations to the tag set and meta-features of the
NICT-JLE corpus. The result of this work provides a standardised train, heldout
and test set for use in future research on disfluency detection for learner
speech.
Related papers
- Corpus-informed Retrieval Augmented Generation of Clarifying Questions [23.123116796159717]
This study aims to develop models that generate corpus-informed clarifying questions for web search.
In current datasets, search intents are largely unsupported by the corpus, which is problematic for both training and evaluation.
We propose dataset augmentation methods that align the ground truth clarifications with the retrieval corpus.
arXiv Detail & Related papers (2024-09-27T09:20:42Z)
- Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method [108.56493934296687]
We introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection.
We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text.
arXiv Detail & Related papers (2024-09-23T07:55:35Z)
- Contextual Spelling Correction with Language Model for Low-resource Setting [0.0]
A small-scale word-based transformer LM is trained to provide the SC model with contextual understanding.
The probability of an error occurring (the error model) is extracted from the corpus.
The LM and the error model are combined to build the SC model using the well-known noisy channel framework.
arXiv Detail & Related papers (2024-04-28T05:29:35Z)
- Probing Critical Learning Dynamics of PLMs for Hate Speech Detection [39.970726250810635]
Despite widespread adoption, there is a lack of research into how various critical aspects of pretrained language models affect their performance in hate speech detection.
We take a deep dive into comparing different pretrained models, evaluating their seed robustness, finetuning settings, and the impact of pretraining data collection time.
Our analysis reveals early peaks for downstream tasks during pretraining, the limited benefit of employing a more recent pretraining corpus, and the significance of specific layers during finetuning.
arXiv Detail & Related papers (2024-02-03T13:23:51Z)
- The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding [8.448399308205266]
We introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects.
We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol.
arXiv Detail & Related papers (2023-11-29T10:40:52Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even those tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite being trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect samples causing oversensitivity and overstability with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- Active Learning for Sequence Tagging with Deep Pre-trained Models and Bayesian Uncertainty Estimates [52.164757178369804]
Recent advances in transfer learning for natural language processing in conjunction with active learning open the possibility to significantly reduce the necessary annotation budget.
We conduct an empirical study of various Bayesian uncertainty estimation methods and Monte Carlo dropout options for deep pre-trained models in the active learning framework.
We also demonstrate that to acquire instances during active learning, a full-size Transformer can be substituted with a distilled version, which yields better computational performance.
arXiv Detail & Related papers (2021-01-20T13:59:25Z)
- Unsupervised neural adaptation model based on optimal transport for spoken language identification [54.96267179988487]
Due to the mismatch of statistical distributions of acoustic speech between training and testing sets, the performance of spoken language identification (SLID) could be drastically degraded.
We propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID.
arXiv Detail & Related papers (2020-12-24T07:37:19Z)
- End-to-End Speech Recognition and Disfluency Removal [15.910282983166024]
This paper investigates the task of end-to-end speech recognition and disfluency removal.
We show that end-to-end models do learn to directly generate fluent transcripts.
We propose two new metrics that can be used for evaluating integrated ASR and disfluency models.
arXiv Detail & Related papers (2020-09-22T03:11:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.