Flesch or Fumble? Evaluating Readability Standard Alignment of
Instruction-Tuned Language Models
- URL: http://arxiv.org/abs/2309.05454v2
- Date: Fri, 3 Nov 2023 21:23:06 GMT
- Title: Flesch or Fumble? Evaluating Readability Standard Alignment of
Instruction-Tuned Language Models
- Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi
- Abstract summary: We select a diverse set of open and closed-source instruction-tuned language models and investigate their performance in writing story completions and simplifying narratives.
Our findings provide empirical evidence that globally recognized models such as ChatGPT can be less effective and may require more refined prompts for these generative tasks.
- Score: 4.867923281108005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Readability metrics and standards such as Flesch Kincaid Grade Level (FKGL)
and the Common European Framework of Reference for Languages (CEFR) exist to
guide teachers and educators to properly assess the complexity of educational
materials before administering them for classroom use. In this study, we select
a diverse set of open and closed-source instruction-tuned language models and
investigate their performance in writing story completions and simplifying
narratives--tasks that teachers perform--using standard-guided prompts that
control text readability. Our extensive findings provide empirical evidence
that globally recognized models such as ChatGPT can be less effective and may
require more refined prompts for these generative tasks than open-source
models such as BLOOMZ and FlanT5, which have shown promising results.
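For reference, the FKGL metric named in the abstract is a linear formula over average sentence length and average syllables per word: FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59. Below is a minimal illustrative Python sketch of that computation; the regex-based syllable counter is a rough heuristic introduced here for illustration, not the tooling used in the paper.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups, with a crude silent-"e" adjustment.
    groups = re.findall(r"[aeiouy]+", word.lower())
    count = len(groups)
    if word.lower().endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def fkgl(text: str) -> float:
    # Flesch-Kincaid Grade Level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

# Example: simple text scores near the low end of the grade-level scale.
print(round(fkgl("The cat sat on the mat. It was a sunny day."), 2))
```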
Related papers
- Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, underscoring the urgent need for high-quality models in underrepresented languages.
This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z)
- Standardize: Aligning Language Models with Expert-Defined Standards for Content Generation [3.666326242924816]
We introduce Standardize, a retrieval-style in-context learning-based framework to guide large language models to align with expert-defined standards.
Our findings show that models can gain a 45% to 100% increase in precise accuracy across the open and commercial LLMs evaluated.
arXiv Detail & Related papers (2024-02-19T23:18:18Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Controllable Speaking Styles Using a Large Language Model [13.642358232817342]
Text-to-Speech (TTS) models can generate multiple, prosodically-different renditions of the same target text.
Currently, controlling these models during inference typically requires finding an appropriate reference utterance.
Here, we give two demonstrations: control of speaking style, and prosody appropriate for a given dialogue context.
arXiv Detail & Related papers (2023-05-17T16:01:50Z)
- Pre-Training to Learn in Context [138.0745138788142]
The in-context learning ability of language models is not fully exploited because they are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and more task-generalizable than a range of baselines, outperforming larger language models with nearly 4x as many parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z)
- Fairness-guided Few-shot Prompting for Large Language Models [93.05624064699965]
In-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats.
We introduce a metric to evaluate the predictive bias of a fixed prompt against labels or given attributes.
We propose a novel search strategy based on greedy search to identify the near-optimal prompt for improving the performance of in-context learning.
arXiv Detail & Related papers (2023-03-23T12:28:25Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- Dependency Induction Through the Lens of Visual Perception [81.91502968815746]
We propose an unsupervised grammar induction model that leverages word concreteness and a structural vision-based heuristic to jointly learn constituency-structure and dependency-structure grammars.
Our experiments show that the proposed extension outperforms the current state-of-the-art visually grounded models in constituency parsing even with a smaller grammar size.
arXiv Detail & Related papers (2021-09-20T18:40:37Z)
- Prompt-Learning for Fine-Grained Entity Typing [40.983849729537795]
We investigate the application of prompt-learning on fine-grained entity typing in fully supervised, few-shot and zero-shot scenarios.
We propose a self-supervised strategy that carries out distribution-level optimization in prompt-learning to automatically summarize the information of entity types.
arXiv Detail & Related papers (2021-08-24T09:39:35Z)
- SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embeddings of scientific documents by pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)