Towards Arabic Sentence Simplification via Classification and Generative
Approaches
- URL: http://arxiv.org/abs/2204.09292v1
- Date: Wed, 20 Apr 2022 08:17:33 GMT
- Title: Towards Arabic Sentence Simplification via Classification and Generative
Approaches
- Authors: Nouran Khallaf, Serge Sharoff
- Abstract summary: This paper presents an attempt to build a Modern Standard Arabic (MSA) sentence-level simplification system.
We experimented with sentence simplification using two approaches: (i) a classification approach leading to lexical simplification pipelines which use Arabic-BERT, a pre-trained contextualised model, as well as a model of fastText word embeddings; and (ii) a generative approach, a Seq2Seq technique applying the multilingual Text-to-Text Transfer Transformer (mT5).
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an attempt to build a Modern Standard Arabic (MSA)
sentence-level simplification system. We experimented with sentence
simplification using two approaches: (i) a classification approach leading to
lexical simplification pipelines which use Arabic-BERT, a pre-trained
contextualised model, as well as a model of fastText word embeddings; and (ii)
a generative approach, a Seq2Seq technique applying the multilingual
Text-to-Text Transfer Transformer (mT5). We developed our training corpus by
aligning the original and simplified sentences from the internationally
acclaimed Arabic novel "Saaq al-Bambuu". We evaluate the effectiveness of these
methods by comparing the generated simple sentences to the target simple
sentences using the BERTScore evaluation metric. The simple sentences produced
by the mT5 model achieve P 0.72, R 0.68 and F-1 0.70 via BERTScore, while
combining Arabic-BERT and fastText achieves P 0.97, R 0.97 and F-1 0.97. In
addition, we report a manual error analysis for these experiments.
\url{https://github.com/Nouran-Khallaf/Lexical_Simplification}
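The P/R/F-1 figures above come from BERTScore, which greedily matches each token of the generated sentence to its most similar token in the reference sentence (and vice versa) in contextual embedding space. A minimal sketch of that matching core, with toy embeddings standing in for the contextual representations and illustrative names (this is not the authors' code, and it omits BERTScore's IDF weighting and baseline rescaling):

```python
import numpy as np

def greedy_match_scores(cand_emb: np.ndarray, ref_emb: np.ndarray):
    """Toy BERTScore-style P/R/F-1 over token embeddings.

    cand_emb: (n_cand_tokens, dim) embeddings of the generated sentence.
    ref_emb:  (n_ref_tokens, dim) embeddings of the target simple sentence.
    """
    # Row-normalise so dot products become cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T  # (n_cand, n_ref) cosine-similarity matrix

    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A sentence compared against itself scores 1.0 on all three measures.
emb = np.array([[0.2, 0.9, 0.1], [0.8, 0.1, 0.3], [0.4, 0.4, 0.7]])
p, r, f1 = greedy_match_scores(emb, emb)
print(round(p, 2), round(r, 2), round(f1, 2))  # 1.0 1.0 1.0
```

In practice one would use the official bert_score package with a multilingual or Arabic-capable model rather than this toy version.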
Related papers
- A New Dataset and Empirical Study for Sentence Simplification in Chinese
This paper introduces CSS, a new dataset for assessing sentence simplification in Chinese.
We collect manual simplifications from human annotators and perform data analysis to show the difference between English and Chinese sentence simplifications.
In the end, we explore whether Large Language Models can serve as high-quality Chinese sentence simplification systems by evaluating them on CSS.
arXiv Detail & Related papers (2023-06-07T06:47:34Z) - SimpLex: a lexical text simplification architecture
We present SimpLex, a novel simplification architecture for generating simplified English sentences.
The proposed architecture uses either word embeddings (i.e., Word2Vec) and perplexity, or sentence transformers (i.e., BERT, RoBERTa, and GPT2) and cosine similarity.
The solution is incorporated into a user-friendly and simple-to-use software.
arXiv Detail & Related papers (2023-04-14T08:52:31Z) - NapSS: Paragraph-level Medical Text Simplification via Narrative
Prompting and Sentence-matching Summarization
We propose a summarize-then-simplify two-stage strategy, which we call NapSS.
NapSS identifies the relevant content to simplify while ensuring that the original narrative flow is preserved.
Our model performs significantly better than the seq2seq baseline on an English medical corpus.
arXiv Detail & Related papers (2023-02-11T02:20:25Z) - Classifiers are Better Experts for Controllable Text Generation
We show that the proposed method significantly outperforms recent PPLM, GeDi, and DExperts on perplexity and on the sentiment accuracy of generated texts, as measured by an external classifier.
At the same time, it is easier to implement and tune, and has significantly fewer restrictions and requirements.
arXiv Detail & Related papers (2022-05-15T12:58:35Z) - Phrase-level Active Learning for Neural Machine Translation
We propose an active learning setting where we can spend a given budget on translating in-domain data.
We select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators.
In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods.
arXiv Detail & Related papers (2021-06-21T19:20:42Z) - Automatic Difficulty Classification of Arabic Sentences
Our 3-way CEFR classification achieves F-1 scores of 0.80 and 0.75 for the Arabic-BERT and XLM-R classifiers respectively, and a Spearman correlation of 0.71 for regression.
We compare the use of sentence embeddings of different kinds (fastText, mBERT, XLM-R and Arabic-BERT) as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners.
arXiv Detail & Related papers (2021-03-07T16:02:04Z) - Hopeful_Men@LT-EDI-EACL2021: Hope Speech Detection Using Indic
Transliteration and Transformers
This paper describes the approach we used to detect hope speech in the HopeEDI dataset.
In the first approach, we used contextual embeddings to train classifiers using logistic regression, random forest, SVM, and LSTM based models.
The second approach involved using a majority voting ensemble of 11 models which were obtained by fine-tuning pre-trained transformer models.
arXiv Detail & Related papers (2021-02-24T06:01:32Z) - Unsupervised Bitext Mining and Translation via Self-trained Contextual
Embeddings
We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text.
We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training.
We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
arXiv Detail & Related papers (2020-10-15T14:04:03Z) - Neural CRF Model for Sentence Alignment in Text Simplification
We create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia.
Experiments demonstrate that our proposed approach outperforms all previous work on the monolingual sentence alignment task by more than 5 points in F1.
A Transformer-based seq2seq model trained on our datasets establishes a new state-of-the-art for text simplification in both automatic and human evaluation.
arXiv Detail & Related papers (2020-05-05T16:47:51Z) - ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification
Models with Multiple Rewriting Transformations
This paper introduces ASSET, a new dataset for assessing sentence simplification in English.
We show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.
arXiv Detail & Related papers (2020-05-01T16:44:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.