CEFR-Based Sentence Difficulty Annotation and Assessment
- URL: http://arxiv.org/abs/2210.11766v1
- Date: Fri, 21 Oct 2022 07:03:30 GMT
- Title: CEFR-Based Sentence Difficulty Annotation and Assessment
- Authors: Yuki Arase, Satoru Uchida, Tomoyuki Kajiwara
- Abstract summary: The CEFR-based Sentence Profile (CEFR-SP) corpus contains 17k English sentences annotated with levels based on the Common European Framework of Reference for Languages (CEFR).
In experiments, the proposed method achieved a macro-F1 score of 84.5% in level assessment, outperforming strong baselines employed in readability assessment.
- Score: 25.71796445061561
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Controllable text simplification is a crucial assistive technique for
language learning and teaching. One of the primary factors hindering its
advancement is the lack of a corpus annotated with sentence difficulty levels
based on language ability descriptions. To address this problem, we created the
CEFR-based Sentence Profile (CEFR-SP) corpus, containing 17k English sentences
annotated with Common European Framework of Reference for Languages (CEFR)
levels assigned by English-education professionals. In addition, we propose a
sentence-level assessment model designed to handle the unbalanced level
distribution, since sentences at the most basic and the most proficient levels
are naturally scarce. In our experiments, this method achieved a macro-F1
score of 84.5% in level assessment, outperforming strong baselines employed in
readability assessment.
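The imbalance issue the abstract describes can be made concrete. Below is a minimal sketch of a sentence-level CEFR classifier that counters label imbalance with inverse-frequency class weights; the encoder name, per-level counts, and head design are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch: a BERT-style sentence classifier for the six CEFR
# levels, trained with a class-weighted loss so that the naturally
# scarce A1 and C2 sentences are not drowned out by the common B levels.
# All names and counts below are illustrative, not the paper's setup.
import torch
import torch.nn as nn
from transformers import AutoModel

CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

class CEFRClassifier(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, len(CEFR_LEVELS))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the [CLS] token representation as the sentence embedding.
        return self.head(out.last_hidden_state[:, 0])

def inverse_frequency_weights(label_counts):
    # Rare levels get proportionally larger weights than frequent ones.
    counts = torch.tensor(label_counts, dtype=torch.float)
    return counts.sum() / (len(counts) * counts)

# Hypothetical per-level counts, skewed toward the intermediate levels.
weights = inverse_frequency_weights([300, 2000, 6000, 5500, 2500, 700])
loss_fn = nn.CrossEntropyLoss(weight=weights)
```

During training, `loss_fn(logits, gold_levels)` then penalizes mistakes on the scarce levels more heavily than plain cross-entropy would.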
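The reported metric, macro-F1, averages per-level F1 scores with equal weight, so performance on the scarce extreme levels counts as much as on the frequent intermediate ones. A toy computation with scikit-learn, using made-up labels:

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted levels, encoded A1=0 ... C2=5.
y_true = [0, 1, 2, 2, 3, 3, 4, 5]
y_pred = [0, 1, 2, 3, 3, 3, 4, 4]

# average="macro" computes F1 per level, then takes the unweighted mean.
print(f1_score(y_true, y_pred, average="macro"))
```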
Related papers
- Guidelines for Fine-grained Sentence-level Arabic Readability Annotation [9.261022921574318]
The Balanced Arabic Readability Evaluation Corpus (BAREC) project is designed to address the need for comprehensive Arabic language resources aligned with diverse readability levels.
Inspired by the Taha/Arabi21 readability reference, BAREC aims to provide a standardized reference for assessing sentence-level Arabic text readability across 19 distinct levels.
This paper focuses on our meticulous annotation guidelines, demonstrated through the analysis of 10,631 sentences/phrases (113,651 words).
arXiv Detail & Related papers (2024-10-11T09:59:46Z)
- Strategies for Arabic Readability Modeling [9.976720880041688]
Automatic readability assessment is relevant to building NLP applications for education, content analysis, and accessibility.
We present a set of experimental results on Arabic readability assessment using a diverse range of approaches.
arXiv Detail & Related papers (2024-07-03T11:54:11Z)
- ICLEval: Evaluating In-Context Learning Ability of Large Language Models [68.7494310749199]
In-Context Learning (ICL) is a critical capability of Large Language Models (LLMs) as it empowers them to comprehend and reason across interconnected inputs.
Existing evaluation frameworks primarily focus on language abilities and knowledge, often overlooking the assessment of ICL ability.
We introduce the ICLEval benchmark to evaluate the ICL abilities of LLMs, which encompasses two key sub-abilities: exact copying and rule learning.
arXiv Detail & Related papers (2024-06-21T08:06:10Z)
- On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation [71.72465617754553]
We generate "low-level" sentences that convey object-centric, three-dimensional spatial relationships, incorporate them as additional language priors, and evaluate their downstream impact on depth estimation.
Our key finding is that current language-guided depth estimators perform optimally only with scene-level descriptions.
Despite leveraging additional data, these methods are not robust to directed adversarial attacks and decline in performance with an increase in distribution shift.
arXiv Detail & Related papers (2024-04-12T15:35:20Z)
- Pragmatic Competence Evaluation of Large Language Models for the Korean Language [0.6757476692230009]
This study evaluates how well Large Language Models (LLMs) understand context-dependent expressions from a pragmatic standpoint, specifically in Korean.
We use both Multiple-Choice Questions (MCQs) for automatic evaluation and Open-Ended Questions (OEQs) assessed by human experts.
arXiv Detail & Related papers (2024-03-19T12:21:20Z)
- Analyzing and Adapting Large Language Models for Few-Shot Multilingual NLU: Are We There Yet? [82.02076369811402]
Supervised fine-tuning (SFT), supervised instruction tuning (SIT), and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning.
We present an extensive and systematic comparison of the three approaches, testing them on 6 high- and low-resource languages, three different NLU tasks, and a myriad of language and domain setups.
Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements.
arXiv Detail & Related papers (2024-03-04T10:48:13Z)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- VALUE: Understanding Dialect Disparity in NLU [50.35526025326337]
We construct rules for 11 features of African American Vernacular English (AAVE).
We recruit fluent AAVE speakers to validate each feature transformation via linguistic acceptability judgments.
Experiments show that these new dialectal features can lead to a drop in model performance.
arXiv Detail & Related papers (2022-04-06T18:30:56Z)
- Monolingual and Cross-Lingual Acceptability Judgments with the Italian CoLA corpus [2.418273287232718]
We describe the ItaCoLA corpus, containing almost 10,000 sentences with acceptability judgments.
We also present the first cross-lingual experiments, aimed at assessing whether multilingual transformer-based approaches can benefit from using sentences in two languages during fine-tuning.
arXiv Detail & Related papers (2021-09-24T16:18:53Z)
- Deep learning for sentence clustering in essay grading support [1.7259867886009057]
We introduce two datasets of undergraduate student essays in Finnish, manually annotated for salient arguments on the sentence level.
We evaluate several deep-learning embedding methods for their suitability to sentence clustering in support of essay grading; a generic embedding-and-clustering sketch follows this list.
arXiv Detail & Related papers (2021-04-23T12:32:51Z)
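As flagged in the essay-grading entry above, here is a generic embedding-and-clustering sketch. The sentence-transformers model name, the toy sentences, and the cluster count are assumptions for illustration, not that paper's Finnish-essay setup or its chosen embedding method.

```python
# Generic sketch: embed sentences, then cluster them so that sentences
# expressing similar arguments land in the same group. The model name,
# sentences, and cluster count below are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "The author claims remote work increases productivity.",
    "Working from home improves employee output.",
    "Commuting costs are a separate economic issue.",
]

# Encode each sentence into a dense vector.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

# Group sentences into a fixed number of argument clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)
print(list(zip(sentences, labels)))
```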
This list is automatically generated from the titles and abstracts of the papers on this site.