LLMCRIT: Teaching Large Language Models to Use Criteria
- URL: http://arxiv.org/abs/2403.01069v2
- Date: Tue, 4 Jun 2024 15:30:20 GMT
- Title: LLMCRIT: Teaching Large Language Models to Use Criteria
- Authors: Weizhe Yuan, Pengfei Liu, Matthias Gallé
- Abstract summary: We propose a framework that enables large language models (LLMs) to use comprehensive criteria for a task in delivering natural language feedback on task execution.
In particular, we present a model-in-the-loop framework that semi-automatically derives criteria from collected guidelines for different writing tasks and constructs in-context demonstrations for each criterion.
The results reveal the fine-grained effects of incorporating criteria and demonstrations and provide valuable insights on how to teach LLMs to use criteria more effectively.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Humans follow criteria when they execute tasks, and these criteria are directly used to assess the quality of task completion. Therefore, having models learn to use criteria to provide feedback can help humans or models to perform tasks better. However, existing research in this field tends to consider only a limited set of criteria or quality assessment aspects. To fill this gap, we propose a general framework that enables large language models (LLMs) to use comprehensive criteria for a task in delivering natural language feedback on task execution. In particular, we present a model-in-the-loop framework that semi-automatically derives criteria from collected guidelines for different writing tasks and constructs in-context demonstrations for each criterion. We choose three tasks from real-world scenarios to operationalize this idea: paper introduction writing, Python code writing, and Reddit post writing, and evaluate our feedback generation framework using different LLMs. The results reveal the fine-grained effects of incorporating criteria and demonstrations and provide valuable insights on how to teach LLMs to use criteria more effectively.
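To make the framework's mechanics concrete, here is a minimal sketch of per-criterion feedback prompting under stated assumptions: the `Criterion` dataclass, the prompt wording, and the `generate` callable (a stand-in for whatever LLM API is used) are illustrative and not taken from the paper.

```python
# Illustrative sketch of criteria-grounded feedback generation; the prompt
# wording, Criterion fields, and `generate` callable are assumptions, not
# the paper's actual prompts or API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    name: str
    description: str
    demonstration: str  # an in-context example of feedback under this criterion

def build_feedback_prompt(task: str, execution: str, criterion: Criterion) -> str:
    """Assemble one criterion plus its demonstration into a feedback prompt."""
    return (
        f"Task: {task}\n"
        f"Criterion ({criterion.name}): {criterion.description}\n"
        f"Example feedback for this criterion:\n{criterion.demonstration}\n\n"
        "Give feedback on the execution below with respect to this criterion only.\n\n"
        f"Execution:\n{execution}\n\nFeedback:"
    )

def criteria_feedback(task: str, execution: str, criteria: List[Criterion],
                      generate: Callable[[str], str]) -> List[str]:
    """Query the model once per criterion and collect one feedback string each."""
    return [generate(build_feedback_prompt(task, execution, c)) for c in criteria]
```

Querying the model once per criterion, each with its own demonstration, mirrors the per-criterion in-context demonstrations described in the abstract and keeps the resulting feedback fine-grained.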
Related papers
- Large Language Models Are Active Critics in NLG Evaluation
Active-Critic is a novel evaluator that transforms large language models (LLMs) into "active critics".
Our experiments show that Active-Critic can generate nuanced, context-aware evaluation criteria.
arXiv Detail & Related papers (2024-10-14T17:04:41Z)
- Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA).
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)
- Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
We propose report cards, which are human-interpretable, natural language summaries of model behavior for specific skills or topics.
We develop a framework to evaluate report cards based on three criteria: specificity (ability to distinguish between models), faithfulness (accurate representation of model capabilities), and interpretability (clarity and relevance to humans).
arXiv Detail & Related papers (2024-09-01T21:18:14Z)
- TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot
We propose a model-based evaluation method: TALEC.
It allows users to flexibly set their own evaluation criteria and uses in-context learning (ICL) to teach the judge model these in-house criteria (a minimal sketch of this few-shot judging setup appears after this list).
TALEC demonstrates a strong capability to accurately reflect human preferences and achieves a correlation of over 80% with human judgments.
arXiv Detail & Related papers (2024-06-25T10:02:42Z)
- OLMES: A Standard for Language Model Evaluations
OLMES is a documented, practical, open standard for reproducible language model evaluations.
It supports meaningful comparisons between smaller base models, which require the unnatural "cloze" formulation of multiple-choice questions, and larger models that can use the original formulation.
OLMES includes well-considered, documented recommendations guided by results from existing literature as well as new experiments resolving open questions.
arXiv Detail & Related papers (2024-06-12T17:37:09Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans.
This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning).
We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in slight deviations from expert judgments.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
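As referenced in the TALEC entry above, here is a hedged sketch of few-shot, criteria-grounded judging in the spirit of TALEC's ICL approach; the prompt format, the 1-5 score range, and the `generate` callable are assumptions rather than the paper's actual setup.

```python
# Hedged sketch of teaching a judge model an in-house criterion via
# in-context examples; prompt wording and score parsing are assumptions.
from typing import Callable, List, Tuple

def build_judge_prompt(criterion: str, shots: List[Tuple[str, int]],
                       answer: str) -> str:
    """Few-shot prompt: scored examples teach the judge an in-house criterion."""
    demos = "\n\n".join(f"Answer: {a}\nScore: {s}" for a, s in shots)
    return (
        "You are a judge. Score the final answer from 1 to 5 on this criterion:\n"
        f"{criterion}\n\n{demos}\n\nAnswer: {answer}\nScore:"
    )

def judge_score(criterion: str, shots: List[Tuple[str, int]], answer: str,
                generate: Callable[[str], str]) -> int:
    """Return the first digit the judge model emits (crude score parsing)."""
    reply = generate(build_judge_prompt(criterion, shots, answer))
    digits = [ch for ch in reply if ch.isdigit()]
    return int(digits[0]) if digits else 0
```

In this setup, the scored examples, not the criterion text alone, are what carry the in-house criterion to the judge through in-context learning, per the summary above.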
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.