Cross-corpus Readability Compatibility Assessment for English Texts
- URL: http://arxiv.org/abs/2306.09704v2
- Date: Wed, 13 Sep 2023 09:43:28 GMT
- Title: Cross-corpus Readability Compatibility Assessment for English Texts
- Authors: Zhenzhen Li, Han Ding, Shaohong Zhang
- Abstract summary: We propose a novel evaluation framework, Cross-corpus text Readability Compatibility Assessment.
The framework encompasses three key components: Corpus: CEFR, CLEC, CLOTH, NES, OSP, and RACE.
Our findings revealed that OSP stood out as significantly different from other datasets.
- Score: 6.225179315266989
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text readability assessment has gained significant attention from researchers
in various domains. However, the lack of exploration into corpus compatibility
poses a challenge as different research groups utilize different corpora. In
this study, we propose a novel evaluation framework, Cross-corpus text
Readability Compatibility Assessment (CRCA), to address this issue. The
framework encompasses three key components: (1) Corpus: CEFR, CLEC, CLOTH, NES,
OSP, and RACE. Linguistic features, GloVe word vector representations, and
their fusion features were extracted. (2) Classification models: Machine
learning methods (XGBoost, SVM) and deep learning methods (BiLSTM,
Attention-BiLSTM) were employed. (3) Compatibility metrics: RJSD, RRNSS, and
NDCG metrics. Our findings revealed: (1) Validated corpus compatibility, with
OSP standing out as significantly different from other datasets. (2) An
adaptation effect among corpora, feature representations, and classification
methods. (3) Consistent outcomes across the three metrics, validating the
robustness of the compatibility assessment framework. The outcomes of this
study offer valuable insights into corpus selection, feature representation,
and classification methods, and it can also serve as a beginning effort for
cross-corpus transfer learning.
Related papers
- IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios [14.336896748878921]
This paper introduces the IRSC benchmark for evaluating the performance of embedding models in multilingual RAG tasks.
The benchmark encompasses five retrieval tasks: query retrieval, title retrieval, part-of-paragraph retrieval, keyword retrieval, and summary retrieval.
Our contributions include: 1) the IRSC benchmark, 2) the SSCI and RCCI metrics, and 3) insights into the cross-lingual limitations of embedding models.
arXiv Detail & Related papers (2024-09-24T05:39:53Z) - Influence of various text embeddings on clustering performance in NLP [0.0]
A clustering approach can be used to relabel the correct star ratings by grouping the text reviews into individual groups.
In this work, we explore the task of choosing different text embeddings to represent these reviews and also explore the impact the embedding choice has on the performance of various classes of clustering algorithms.
arXiv Detail & Related papers (2023-05-04T20:53:19Z) - Evaluating BERT-based Scientific Relation Classifiers for Scholarly
Knowledge Graph Construction on Digital Library Collections [5.8962650619804755]
Inferring semantic relations between related scientific concepts is a crucial step.
BERT-based pre-trained models have been popularly explored for automatic relation classification.
Existing methods are primarily evaluated on clean texts.
To address these limitations, we started by creating OCR-noisy texts.
arXiv Detail & Related papers (2023-05-03T17:32:16Z) - UniTE: Unified Translation Evaluation [63.58868113074476]
UniTE is the first unified framework engaged with abilities to handle all three evaluation tasks.
We testify our framework on WMT 2019 Metrics and WMT 2020 Quality Estimation benchmarks.
arXiv Detail & Related papers (2022-04-28T08:35:26Z) - Generalizing Cross-Document Event Coreference Resolution Across Multiple
Corpora [63.429307282665704]
Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events need to be identified and clustered throughout a collection of documents.
CDCR aims to benefit downstream multi-document applications, but improvements from applying CDCR have not been shown yet.
We make the observation that every CDCR system to date was developed, trained, and tested only on a single respective corpus.
arXiv Detail & Related papers (2020-11-24T17:45:03Z) - Weakly-Supervised Aspect-Based Sentiment Analysis via Joint
Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z) - Universal Weighting Metric Learning for Cross-Modal Matching [79.32133554506122]
Cross-modal matching has been a highlighted research topic in both vision and language areas.
We propose a simple and interpretable universal weighting framework for cross-modal matching.
arXiv Detail & Related papers (2020-10-07T13:16:45Z) - Symbiotic Adversarial Learning for Attribute-based Person Search [86.7506832053208]
We present a symbiotic adversarial learning framework, called SAL.Two GANs sit at the base of the framework in a symbiotic learning scheme.
Specifically, two different types of generative adversarial networks learn collaboratively throughout the training process.
arXiv Detail & Related papers (2020-07-19T07:24:45Z) - A Novel Attention-based Aggregation Function to Combine Vision and
Language [55.7633883960205]
We propose a novel fully-attentive reduction method for vision and language.
Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention.
We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices.
arXiv Detail & Related papers (2020-04-27T18:09:46Z) - Compass-aligned Distributional Embeddings for Studying Semantic
Differences across Corpora [14.993021283916008]
We present a framework to support cross-corpora language studies with word embeddings.
CADE is the core component of our framework and solves the key problem of aligning the embeddings generated from different corpora.
The results of our experiments suggest that CADE achieves state-of-the-art or superior performance on tasks where several competing approaches are available.
arXiv Detail & Related papers (2020-04-13T15:46:47Z) - Text Complexity Classification Based on Linguistic Information:
Application to Intelligent Tutoring of ESL [0.0]
The goal of this work is to build a classifier that can identify text complexity within the context of teaching reading to English as a Second Language ( ESL) learners.
Using a corpus of 6171 texts, which had already been classified into three different levels of difficulty by ESL experts, different experiments were conducted with five machine learning algorithms.
The results showed that the adopted linguistic features provide a good overall classification performance.
arXiv Detail & Related papers (2020-01-07T02:42:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.