Design and Implementation of a Tool for Extracting Uzbek Syllables
- URL: http://arxiv.org/abs/2312.15779v1
- Date: Mon, 25 Dec 2023 17:46:58 GMT
- Authors: Ulugbek Salaev, Elmurod Kuriyozov, Gayrat Matlatipov
- Abstract summary: Syllabification is a versatile linguistic tool with applications in linguistic research, language technology, education, and various fields.
We present a comprehensive approach to syllabification for the Uzbek language, including rule-based techniques and machine learning algorithms.
The results of our experiments show that both approaches achieved a high level of accuracy, exceeding 99%.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The accurate syllabification of words plays a vital role in various Natural
Language Processing applications. Syllabification is a versatile linguistic
tool with applications in linguistic research, language technology, education,
and various fields where understanding and processing language is essential. In
this paper, we present a comprehensive approach to syllabification for the
Uzbek language, including rule-based techniques and machine learning
algorithms. Our rule-based approach uses advanced methods to divide words into syllables, generate hyphenations for line breaks, and count syllables. Additionally, we collected a dataset of word-syllable mappings, hyphenations, and syllable counts, used both to train machine learning algorithms to predict syllable counts and to evaluate the proposed model. Our results demonstrate the effectiveness and
efficiency of both approaches in achieving accurate syllabification. The
results of our experiments show that both approaches achieved a high level of
accuracy, exceeding 99%. This study provides valuable insights and
recommendations for future research on syllabification and related areas, not only for the Uzbek language itself but also for other closely related, low-resource Turkic languages.
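To make the rule-based side concrete, here is a minimal, hypothetical sketch of a vowel-driven syllabifier for Uzbek in Latin script. It is not the authors' implementation; it only encodes the common Turkic pattern that every syllable carries exactly one vowel, a single consonant before a vowel opens the next syllable (V-CV), and in a consonant cluster only the last consonant moves to the next syllable (VC-CV, VCC-CV). The digraph and apostrophe handling is a deliberate simplification.

```python
# Illustrative sketch only, NOT the paper's tool: a rule-based Uzbek
# syllabifier based on the one-vowel-per-syllable principle.

DIGRAPHS = {"sh", "ch", "ng"}             # treated as single consonants
APOSTROPHES = ("'", "\u02bb", "\u2019")   # variants used in o', g'
VOWELS = set("aeiou")

def to_units(word):
    """Split a word into letter units, keeping digraphs and o'/g' together."""
    units, i, w = [], 0, word.lower()
    while i < len(w):
        if w[i:i + 2] in DIGRAPHS:
            units.append(w[i:i + 2]); i += 2
        elif w[i + 1:i + 2] in APOSTROPHES:
            units.append(w[i:i + 2]); i += 2
        else:
            units.append(w[i]); i += 1
    return units

def syllabify(word):
    """One vowel per syllable; only the consonant adjacent to the next
    vowel moves to the next syllable (V-CV, VC-CV, VCC-CV)."""
    units = to_units(word)
    vowel_idx = [i for i, u in enumerate(units) if u[0] in VOWELS]
    if not vowel_idx:
        return [word]                     # no vowel: leave the token whole
    syllables, start = [], 0
    for a, b in zip(vowel_idx, vowel_idx[1:]):
        # cut right after the vowel (hiatus) or just before the consonant
        # that precedes the next vowel
        cut = max(a + 1, b - 1)
        syllables.append("".join(units[start:cut]))
        start = cut
    syllables.append("".join(units[start:]))
    return syllables

print(syllabify("maktab"))    # ['mak', 'tab']
print(syllabify("kitob"))     # ['ki', 'tob']
print(len(syllabify("oila"))) # 3 syllables: o-i-la
```

The same function doubles as a syllable counter (`len(syllabify(word))`) and, by joining with a hyphen, as a naive line-break hyphenator, mirroring the three outputs the abstract describes.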
Related papers
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili [29.252250069388687]
Tokenization allows words to be split based on characters or subwords, creating word embeddings that best represent the structure of the language.
We propose a syllable tokenizer and adopt an experiment-centric approach to validate the proposed tokenizer based on the Swahili language.
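Since Swahili syllables are predominantly open (consonant plus vowel), the syllable-tokenization idea can be sketched with a short regex. This is a hedged illustration, not the paper's tokenizer: the onset list is a simplification, and syllabic nasals (e.g. the initial "m" in "mtoto") are ignored.

```python
import re

# Illustrative CV syllable splitter for Swahili; NOT the paper's tokenizer.
# Digraph/prenasalized onsets are matched before single consonants.
ONSETS = r"ng'|ch|sh|ny|th|dh|gh|mb|nd|nj|nz|[bcdfghjklmnpqrstvwyz]"
SYLLABLE = re.compile(rf"(?:{ONSETS})?[aeiou]")

def syllable_tokenize(word):
    """Return the CV-like syllables found in a Swahili word."""
    return SYLLABLE.findall(word.lower())

print(syllable_tokenize("habari"))  # ['ha', 'ba', 'ri']
print(syllable_tokenize("ndege"))   # ['nde', 'ge']
```

Such syllable units sit between characters and subwords: the vocabulary stays small, while each token remains a pronounceable, linguistically meaningful piece of the word.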
arXiv Detail & Related papers (2024-03-26T17:26:50Z)
- MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning [0.76146285961466]
We present a methodology for linguistic feature extraction, focusing on automatically syllabifying words in multiple languages.
In both the textual and phonetic domains, our method focuses on the extraction of phonetic transcriptions from text, stress marks, and a unified automatic syllabification.
The system was built with open-source components and resources.
arXiv Detail & Related papers (2023-10-17T19:27:23Z)
- Revisiting Syllables in Language Modelling and their Application on Low-Resource Machine Translation [1.2617078020344619]
Syllables provide shorter sequences than characters, require less-specialised extracting rules than morphemes, and their segmentation is not impacted by the corpus size.
We first explore the potential of syllables for open-vocabulary language modelling in 21 languages.
We use rule-based syllabification methods for six languages and address the rest with hyphenation, which works as a syllabification proxy.
arXiv Detail & Related papers (2022-10-05T18:55:52Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Unsupervised Multimodal Word Discovery based on Double Articulation Analysis with Co-occurrence cues [7.332652485849632]
Human infants acquire their verbal lexicon with minimal prior knowledge of language.
This study proposes a novel fully unsupervised learning method for discovering speech units.
The proposed method can acquire words and phonemes from speech signals using unsupervised learning.
arXiv Detail & Related papers (2022-01-18T07:31:59Z)
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and language Models for Intent Classification [81.80311855996584]
We propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model.
We achieve 90.86% and 99.07% accuracy on ATIS and Fluent speech corpus, respectively.
arXiv Detail & Related papers (2021-02-15T07:20:06Z)
- A Hybrid Approach to Dependency Parsing: Combining Rules and Morphology with Deep Learning [0.0]
We propose two approaches to dependency parsing, especially for languages with a restricted amount of training data.
Our first approach combines a state-of-the-art deep learning-based approach with a rule-based one, and the second incorporates morphological information into the network.
The proposed methods are developed for Turkish, but can be adapted to other languages as well.
arXiv Detail & Related papers (2020-02-24T08:34:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.