LLMCRIT: Teaching Large Language Models to Use Criteria
- URL: http://arxiv.org/abs/2403.01069v2
- Date: Tue, 4 Jun 2024 15:30:20 GMT
- Title: LLMCRIT: Teaching Large Language Models to Use Criteria
- Authors: Weizhe Yuan, Pengfei Liu, Matthias Gallé
- Abstract summary: We propose a framework that enables large language models (LLMs) to use comprehensive criteria for a task in delivering natural language feedback on task execution.
In particular, we present a model-in-the-loop framework that semi-automatically derives criteria from collected guidelines for different writing tasks and constructs in-context demonstrations for each criterion.
The results reveal the fine-grained effects of incorporating criteria and demonstrations and provide valuable insights on how to teach LLMs to use criteria more effectively.
- Score: 38.12026374220591
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Humans follow criteria when they execute tasks, and these criteria are directly used to assess the quality of task completion. Therefore, having models learn to use criteria to provide feedback can help humans or models to perform tasks better. However, existing research in this field tends to consider only a limited set of criteria or quality assessment aspects. To fill this gap, we propose a general framework that enables large language models (LLMs) to use comprehensive criteria for a task in delivering natural language feedback on task execution. In particular, we present a model-in-the-loop framework that semi-automatically derives criteria from collected guidelines for different writing tasks and constructs in-context demonstrations for each criterion. We choose three tasks from real-world scenarios to operationalize this idea: paper introduction writing, Python code writing, and Reddit post writing, and evaluate our feedback generation framework using different LLMs. The results reveal the fine-grained effects of incorporating criteria and demonstrations and provide valuable insights on how to teach LLMs to use criteria more effectively.
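The framework described above pairs each guideline-derived criterion with an in-context demonstration and asks an LLM for feedback one criterion at a time. Below is a minimal sketch of that per-criterion prompting loop; it is an illustration under assumptions, not the paper's released code, and the Criterion fields, prompt wording, and the `generate` callable are hypothetical placeholders for whatever LLM interface is used.
```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Criterion:
    name: str            # short label, e.g. "clarity" (example label, not from the paper)
    description: str     # criterion text derived (semi-automatically) from task guidelines
    demonstration: str   # an in-context example of feedback written against this criterion


def build_feedback_prompt(task: str, submission: str, criterion: Criterion) -> str:
    """Assemble one criterion-grounded feedback prompt (hypothetical wording)."""
    return (
        f"Task: {task}\n"
        f"Criterion ({criterion.name}): {criterion.description}\n\n"
        f"Example feedback for this criterion:\n{criterion.demonstration}\n\n"
        f"Submission to review:\n{submission}\n\n"
        "Write natural language feedback on how well the submission satisfies this criterion."
    )


def collect_feedback(task: str, submission: str, criteria: List[Criterion],
                     generate: Callable[[str], str]) -> Dict[str, str]:
    """Query the LLM once per criterion; `generate` is any prompt -> text callable."""
    return {c.name: generate(build_feedback_prompt(task, submission, c))
            for c in criteria}
```
For a task such as paper introduction writing, the criteria list would be populated from the guideline-derived criteria the paper describes; looping over criteria one at a time mirrors the fine-grained, per-criterion feedback the framework targets.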
Related papers
- Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA).
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)
- Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries [33.39343288446156]
We propose report cards, which are human-interpretable, natural language summaries of model behavior for specific skills or topics.
We develop a framework to evaluate report cards based on three criteria: specificity (ability to distinguish between models), faithfulness (accurate representation of model capabilities), and interpretability (clarity and relevance to humans).
arXiv Detail & Related papers (2024-09-01T21:18:14Z)
- TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot [2.186726107112913]
We propose a model-based evaluation method: TALEC.
It allows users to flexibly set their own evaluation criteria and uses in-context learning (ICL) to teach the judge model these in-house criteria.
TALEC demonstrates a strong capability to accurately reflect human preferences and achieves a correlation of over 80% with human judgments.
arXiv Detail & Related papers (2024-06-25T10:02:42Z)
- OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations.
We identify and review the varying factors in evaluation practices adopted by the community.
OLMES supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions and larger models that can utilize the original formulation.
arXiv Detail & Related papers (2024-06-12T17:37:09Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences [11.23629471911503]
EvalGen provides automated assistance to users in generating evaluation criteria and implementing assertions.
A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment.
We identify a phenomenon we dub "criteria drift": users need criteria to grade outputs, but grading outputs helps users define criteria.
arXiv Detail & Related papers (2024-04-18T15:45:27Z)
- Towards Generalist Prompting for Large Language Models by Mental Models [105.03747314550591]
Large language models (LLMs) have demonstrated impressive performance on many tasks.
To achieve optimal performance, specially designed prompting methods are still needed.
We introduce the concept of generalist prompting, which operates on the design principle of achieving optimal or near-optimal performance across a wide range of tasks without manual, task-specific prompt customization.
arXiv Detail & Related papers (2024-02-28T11:29:09Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.