MultiAzterTest: a Multilingual Analyzer on Multiple Levels of Language
for Readability Assessment
- URL: http://arxiv.org/abs/2109.04870v1
- Date: Fri, 10 Sep 2021 13:34:52 GMT
- Title: MultiAzterTest: a Multilingual Analyzer on Multiple Levels of Language
for Readability Assessment
- Authors: Kepa Bengoetxea and Itziar Gonzalez-Dios
- Abstract summary: MultiAzterTest is an open source NLP tool that analyzes texts on over 125 measures of cohesion, language, and readability for English, Spanish and Basque.
Using cross-lingual features, MultiAzterTest also obtains competitive results, above all in the complex vs. simple distinction.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Readability assessment is the task of determining how difficult
or easy a text is, or which level/grade it corresponds to. Traditionally,
language-dependent readability formulae have been used, but these formulae
take few text characteristics into account. Natural Language Processing (NLP)
tools that assess the complexity of texts, however, can measure a much wider
range of features and can be adapted to different languages. In this paper,
we present the MultiAzterTest tool: (i) an open source NLP tool that analyzes
texts on over 125 measures of cohesion, language, and readability for
English, Spanish and Basque, and whose architecture is designed to be easily
adapted to other languages; (ii) readability assessment classifiers that
improve on the performance of Coh-Metrix in English, Coh-Metrix-Esp in
Spanish and ErreXail in Basque; and (iii) a web tool. Using an SMO
classifier, MultiAzterTest obtains 90.09% accuracy when classifying into
three reading levels (elementary, intermediate, and advanced) in English,
and 95.50% in Basque and 90% in Spanish when classifying into two reading
levels (simple and complex). Using cross-lingual features, MultiAzterTest
also obtains competitive results, above all in the complex vs. simple
distinction.
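As a concrete illustration of the approach described in the abstract, the sketch below trains a three-level readability classifier on a few toy surface features. The features and training texts are illustrative stand-ins for MultiAzterTest's 125+ measures, and scikit-learn's SVC (whose libsvm backend uses an SMO-type solver) is only an assumed analog of the Weka SMO classifier reported in the paper.

```python
# Minimal sketch of feature-based readability classification in the spirit of
# MultiAzterTest. The three features below are illustrative stand-ins for the
# tool's 125+ measures; SVC is an assumed analog of the Weka SMO classifier.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def extract_features(text: str) -> list[float]:
    """Toy complexity measures: sentence length, word length, lexical diversity."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    words = text.split()
    avg_sent_len = len(words) / max(len(sentences), 1)             # words per sentence
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)  # characters per word
    ttr = len(set(w.lower() for w in words)) / max(len(words), 1)   # type-token ratio
    return [avg_sent_len, avg_word_len, ttr]

# Hypothetical training data: (text, level) pairs with the paper's three levels.
texts = [
    "The cat sat on the mat. It was happy.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "Notwithstanding epistemological objections, the paradigm persisted unchallenged.",
]
levels = ["elementary", "intermediate", "advanced"]

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit([extract_features(t) for t in texts], levels)
print(clf.predict([extract_features("Dogs run fast. They play a lot.")]))
```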
Related papers
- Strategies for Arabic Readability Modeling [9.976720880041688]
Automatic readability assessment is relevant to building NLP applications for education, content analysis, and accessibility.
We present a set of experimental results on Arabic readability assessment using a diverse range of approaches.
arXiv Detail & Related papers (2024-07-03T11:54:11Z)
- Classification of Human- and AI-Generated Texts for English, French, German, and Spanish [0.138120109831448]
We analyze features to classify human- and AI-generated text for English, French, German and Spanish.
For the detection of AI-generated text, the combination of all proposed features performs best.
For the detection of AI-rephrased text, the systems with all features outperform systems with other features in many cases.
arXiv Detail & Related papers (2023-12-08T07:42:06Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- Around the world in 60 words: A generative vocabulary test for online research [12.91296932597502]
We present an automated pipeline to generate vocabulary tests using text from Wikipedia.
Our pipeline samples rare nouns and creates pseudowords with the same low-level statistics.
Our test, available in eight languages, can easily be extended to other languages.
arXiv Detail & Related papers (2023-02-03T09:27:12Z)
- MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing [48.216386761482525]
We present MultiSpider, the largest multilingual text-to-SQL dataset, which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese).
Experimental results under three typical settings (zero-shot, monolingual and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages.
We also propose a simple schema augmentation framework, SAVe (Schema-Augmentation-with-Verification), which boosts the overall performance by about 1.8% and closes the 29.5% performance gap across languages.
arXiv Detail & Related papers (2022-12-27T13:58:30Z)
- Advancing Multilingual Pre-training: TRIP Triangular Document-level Pre-training for Multilingual Language Models [107.83158521848372]
We present Triangular Document-level Pre-training (TRIP), which is the first in the field to accelerate the conventional monolingual and bilingual objectives into a trilingual objective with a novel method called Grafting.
TRIP achieves several strong state-of-the-art (SOTA) scores on three multilingual document-level machine translation benchmarks and one cross-lingual abstractive summarization benchmark, including consistent improvements by up to 3.11 d-BLEU points and 8.9 ROUGE-L points.
arXiv Detail & Related papers (2022-12-15T12:14:25Z)
- Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation [133.7313847857935]
Our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology.
For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text respectively.
arXiv Detail & Related papers (2022-03-17T16:48:22Z)
- Feature Selection on Noisy Twitter Short Text Messages for Language Identification [0.0]
We apply different feature selection algorithms across various learning algorithms in order to analyze each algorithm's effect on performance.
The methodology focuses on the word level language identification using a novel dataset of 6903 tweets extracted from Twitter.
arXiv Detail & Related papers (2020-07-11T09:22:01Z)
- Text Complexity Classification Based on Linguistic Information: Application to Intelligent Tutoring of ESL [0.0]
The goal of this work is to build a classifier that can identify text complexity within the context of teaching reading to English as a Second Language (ESL) learners.
Using a corpus of 6171 texts, which had already been classified into three different levels of difficulty by ESL experts, different experiments were conducted with five machine learning algorithms.
The results showed that the adopted linguistic features provide a good overall classification performance.
arXiv Detail & Related papers (2020-01-07T02:42:57Z)
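The setup described in the last entry above lends itself to a short sketch: cross-validating several classifiers on precomputed linguistic feature vectors with three difficulty labels. The five algorithms and the random stand-in feature matrix below are assumptions for illustration; the summary does not name the exact algorithms or features the paper used.

```python
# Hedged sketch of the experimental comparison described above: five machine
# learning algorithms evaluated by cross-validation on linguistic feature
# vectors labeled with three difficulty levels. The classifier choices are
# illustrative assumptions, and random data stands in for the 6171-text corpus.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))    # stand-in for linguistic feature vectors
y = rng.integers(0, 3, size=300)  # three difficulty levels

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "NaiveBayes": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "SVM": SVC(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```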