Related papers: WRAVAL -- WRiting Assist eVALuation

URL: http://arxiv.org/abs/2601.03268v1
Date: Fri, 19 Dec 2025 09:21:27 GMT
Title: WRAVAL -- WRiting Assist eVALuation
Authors: Gabriel Benedict, Matthew Butler, Naved Merchant, Eetu Salama-Laine,
Abstract summary: Small Language Models (SLMs) typically score 3-4 times lower than Large Language Models (LLMs) on reasoning metrics. We propose an evaluation framework specifically designed to highlight SLMs' capabilities in non-reasoning tasks.
Score: 7.441391098440092
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The emergence of Large Language Models (LLMs) has shifted language model evaluation toward reasoning and problem-solving tasks as measures of general intelligence. Small Language Models (SLMs) -- defined here as models under 10B parameters -- typically score 3-4 times lower than LLMs on these metrics. However, we demonstrate that these evaluations fail to capture SLMs' effectiveness in common industrial applications, such as tone modification tasks (e.g., funny, serious, professional). We propose an evaluation framework specifically designed to highlight SLMs' capabilities in non-reasoning tasks where predefined evaluation datasets don't exist. Our framework combines novel approaches in data generation, prompt-tuning, and LLM-based evaluation to demonstrate the potential of task-specific finetuning. This work provides practitioners with tools to effectively benchmark both SLMs and LLMs for practical applications, particularly in edge and private computing scenarios. Our implementation is available at: https://github.com/amazon-science/wraval.

Related papers

Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification [4.681300232651754]
Large language models (LLMs) show notable results in natural language processing (NLP) tasks for requirements engineering (RE). In contrast, small language models (SLMs) offer a lightweight, locally deployable alternative.
arXiv Detail & Related papers (2025-10-24T13:20:30Z)
Beyond Next Word Prediction: Developing Comprehensive Evaluation Frameworks for measuring LLM performance on real world applications [3.686808512438363]
Large Language Models (LLMs) have numerous use cases and have already acquired a significant degree of enterprise adoption. This paper provides the basis for a more comprehensive evaluation framework, based upon a traditional game and tool-based architecture.
arXiv Detail & Related papers (2025-03-05T06:44:38Z)
Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs [0.464982780843177]
This work investigates the ability of open Large Language Models (LLMs) to predict citation intent through in-context learning and fine-tuning. We evaluate twelve model variations across five prominent open LLM families using zero-, one-, few-, and many-shot prompting. We then demonstrate the significant impact of task-specific adaptation by fine-tuning this model, achieving a relative F1-score improvement of 8% on the SciCite dataset and 4.3% on the ACL-ARC dataset compared to the instruction-tuned baseline.
arXiv Detail & Related papers (2025-02-20T13:45:42Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework. We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [111.51612340032052]
Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks. This paper presents the first comprehensive MLLM Evaluation benchmark MME. It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z)
