Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist
- URL: http://arxiv.org/abs/2510.21584v1
- Date: Fri, 24 Oct 2025 15:51:10 GMT
- Title: Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist
- Authors: Kellen Parker van Dam, Abishek Stephen
- Abstract summary: Unsupervised anomaly detection methods are applied to a multilingual dataset of Kokborok varieties with Bangla. Character-level and syllable-level phonotactic features are used to identify potential transcription errors and borrowings. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lexical data collection in language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.
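The abstract describes scoring wordlist entries by how well their phonotactics fit the rest of the data. As a minimal sketch of the character-level baseline (not the authors' implementation; the smoothing scheme and the toy wordlist below are invented for illustration), one can train a bigram model over the wordlist itself and flag the entries whose character sequences are least probable:

```python
from collections import Counter
from math import log

def char_bigrams(word):
    """Character bigrams with boundary markers, e.g. 'ka' -> ['#k', 'ka', 'a#']."""
    w = f"#{word}#"
    return [w[i:i + 2] for i in range(len(w) - 1)]

def anomaly_scores(wordlist):
    """Score each word by the mean negative log-probability of its bigrams
    under an add-one-smoothed bigram model trained on the whole list;
    higher scores mean less typical phonotactics."""
    counts = Counter(bg for w in wordlist for bg in char_bigrams(w))
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen bigrams

    def p(bg):
        return (counts[bg] + 1) / (total + vocab)

    return {w: -sum(log(p(bg)) for bg in char_bigrams(w)) / len(char_bigrams(w))
            for w in wordlist}

# Toy wordlist: mostly CV(C)-shaped forms plus one phonotactically implausible entry.
words = ["bala", "kami", "tala", "naka", "bak", "xzqw"]
scores = anomaly_scores(words)
flagged = max(scores, key=scores.get)  # the entry to queue for verification
```

A high-recall workflow would flag the top-k scoring entries for manual review rather than only the single maximum; the paper's syllable-aware features would replace `char_bigrams` with syllable-level units.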
Related papers
- What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models [0.19116784879310025]
Cross-lingual information retrieval is challenging due to disparities in resources, scripts, and weak cross-lingual semantic alignment in embedding models. Existing pipelines often rely on translation and monolingual retrieval, which add computational overhead and noise, degrading performance. This work systematically evaluates four intervention types, namely document translation, multilingual dense retrieval with pretrained encoders, contrastive learning at word, phrase, and query-document levels, and cross-encoder re-ranking, across three benchmark datasets.
arXiv Detail & Related papers (2025-11-24T17:17:40Z) - Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition [61.601626186678146]
We propose a method which allows corrections of substitution errors to improve the recognition accuracy of challenging words. We show that with this method we get a relative improvement in biased word error rate of up to 8%, while maintaining a competitive overall word error rate.
arXiv Detail & Related papers (2025-06-23T14:42:03Z) - Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. It remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech.
arXiv Detail & Related papers (2025-05-26T07:21:20Z) - TGEA: An Error-Annotated Dataset and Benchmark Tasks for Text Generation from Pretrained Language Models [57.758735361535486]
TGEA is an error-annotated dataset for text generation from pretrained language models (PLMs). We create an error taxonomy to cover 24 types of errors occurring in PLM-generated sentences. This is the first dataset with comprehensive annotations for PLM-generated texts.
arXiv Detail & Related papers (2025-03-06T09:14:02Z) - Localizing Factual Inconsistencies in Attributable Text Generation [74.11403803488643]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation. We show that QASemConsistency yields factual consistency scores that correlate well with human judgments.
arXiv Detail & Related papers (2024-10-09T22:53:48Z) - Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset [13.041053110012246]
We introduce a statistical test, the Preference Proportion Test, for identifying such unreliable subsets.
We find that filtering this low-quality data out when training models for the downstream task of phonetic transcription brings substantial benefits.
arXiv Detail & Related papers (2024-10-05T21:41:49Z) - Block the Label and Noise: An N-Gram Masked Speller for Chinese Spell Checking [0.0]
This paper proposes an n-gram masking layer that masks current and/or surrounding tokens to avoid label leakage and error disturbance.
Experiments on SIGHAN datasets have demonstrated that the pluggable n-gram masking mechanism can improve the performance of prevalent CSC models.
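The masking idea can be sketched concretely. The function below is a hypothetical illustration of the mechanism described (the function name, signature, and mask token are invented, not the paper's code): it hides the current token and its neighbours so a spell-checking model can neither copy a possibly wrong input character nor be disturbed by adjacent errors.

```python
def ngram_mask(tokens, i, window=1, mask="[MASK]"):
    """Return a copy of `tokens` with position i and its `window`
    neighbours on each side replaced by the mask token."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [mask if lo <= j < hi else t for j, t in enumerate(tokens)]

# Masking position 1 with a one-token window hides positions 0-2.
masked = ngram_mask(["a", "b", "c", "d"], 1)
```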
arXiv Detail & Related papers (2023-05-05T06:43:56Z) - Detecting Label Errors using Pre-Trained Language Models [37.82128817976385]
We show that large pre-trained language models are extremely capable of identifying label errors in datasets.
We contribute a novel method to produce highly realistic, human-originated label noise from crowdsourced data, and demonstrate the effectiveness of this method on TweetNLP.
arXiv Detail & Related papers (2022-05-25T11:59:39Z) - Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z) - Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z) - A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z) - On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages with different word orders. We investigate whether making models insensitive to the word order of the source language can improve adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.