Simple or Complex? Learning to Predict Readability of Bengali Texts
- URL: http://arxiv.org/abs/2012.07701v1
- Date: Wed, 9 Dec 2020 01:41:35 GMT
- Title: Simple or Complex? Learning to Predict Readability of Bengali Texts
- Authors: Susmoy Chakraborty, Mir Tafseer Nayeem, Wasi Uddin Ahmad
- Abstract summary: We present a readability analysis tool capable of analyzing text written in the Bengali language.
Despite being the 7th most spoken language in the world with 230 million native speakers, Bengali suffers from a lack of fundamental resources for natural language processing.
- Score: 6.860272388539321
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Determining the readability of a text is the first step to its
simplification. In this paper, we present a readability analysis tool capable
of analyzing text written in the Bengali language to provide in-depth
information on its readability and complexity. Despite being the 7th most
spoken language in the world with 230 million native speakers, Bengali suffers
from a lack of fundamental resources for natural language processing.
Readability-related research on the Bengali language has so far been narrow and
sometimes faulty due to this lack of resources. We therefore adapt
document-level readability formulas, traditionally used in the U.S.
education system, to the Bengali language with a proper age-to-age
comparison. Due to the unavailability of large-scale human-annotated corpora,
we further divide the document-level task into sentence-level and experiment
with neural architectures, which will serve as a baseline for future work on
Bengali readability prediction. In the process, we present several
human-annotated corpora and dictionaries such as a document-level dataset
comprising 618 documents with 12 different grade levels, a large-scale
sentence-level dataset comprising more than 96K sentences with simple and
complex labels, a consonant conjunct count algorithm and a corpus of 341 words
to validate the effectiveness of the algorithm, a list of 3,396 easy words, and
an updated pronunciation dictionary with more than 67K words. These resources
can be useful for several other tasks of this low-resource language. We make
our code and dataset publicly available at
https://github.com/tafseer-nayeem/BengaliReadability for reproducibility.
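The abstract mentions a consonant conjunct count algorithm among the released resources. As a rough illustration of the idea (not the paper's actual algorithm), the sketch below counts Bengali conjuncts by locating the hasanta/virama sign (U+09CD) joined on both sides by consonants; the function names and the exact consonant ranges are assumptions for this sketch.

```python
# Hypothetical sketch of a consonant-conjunct counter for Bengali text.
# A conjunct (juktakkhor) is formed when consonants are joined by the
# hasanta/virama sign (U+09CD). This counts virama occurrences that sit
# between two Bengali consonants; the paper's released algorithm may
# handle additional cases (e.g. conjunct chains) differently.

HASANTA = "\u09cd"  # Bengali sign virama (hasanta)

def is_bengali_consonant(ch: str) -> bool:
    # Approximate check: core consonant block KA..HA (U+0995..U+09B9)
    # plus the nukta forms RRA, RHA, YYA (U+09DC, U+09DD, U+09DF).
    return "\u0995" <= ch <= "\u09b9" or ch in "\u09dc\u09dd\u09df"

def count_conjuncts(text: str) -> int:
    count = 0
    for i, ch in enumerate(text):
        if (ch == HASANTA
                and 0 < i < len(text) - 1
                and is_bengali_consonant(text[i - 1])
                and is_bengali_consonant(text[i + 1])):
            count += 1
    return count
```

For example, the word শিক্ষা contains one conjunct (ক + ্ + ষ), while বাংলা contains none, since its nasal sign (anusvara) is not joined by a hasanta.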
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Initial Decoding with Minimally Augmented Language Model for Improved Lattice Rescoring in Low Resource ASR [0.532018200832244]
This paper addresses the problem of improving speech recognition accuracy with lattice rescoring in low-resource languages.
We minimally augment the baseline language model with word unigram counts that are present in a larger text corpus of the target language but absent in the baseline.
We obtain 21.8% (Telugu) and 41.8% (Kannada) relative word error reduction with our proposed method.
arXiv Detail & Related papers (2024-03-16T14:34:31Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- Bengali Handwritten Grapheme Classification: Deep Learning Approach [0.0]
We participate in a Kaggle competition where the challenge is to classify the three constituent elements of a Bengali grapheme in an image.
We explore the performance of existing neural network models such as the Multi-Layer Perceptron (MLP) and the state-of-the-art ResNet50.
We propose our own convolutional neural network (CNN) model for Bengali grapheme classification with validation grapheme-root accuracy of 95.32%, vowel accuracy of 98.61%, and consonant accuracy of 98.76%.
arXiv Detail & Related papers (2021-11-16T06:14:59Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embeddings for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Sentiment analysis in Bengali via transfer learning using multi-lingual BERT [0.9883261192383611]
In this paper, we present manually tagged 2-class and 3-class SA datasets in Bengali.
We also demonstrate that the multi-lingual BERT model with relevant extensions can be trained via transfer learning.
This deep learning model achieves an accuracy of 71% for 2-class sentiment classification compared to the current state-of-the-art accuracy of 68%.
arXiv Detail & Related papers (2020-12-03T10:21:11Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
It provides aligned segments and estimated phoneme-level labels for 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.