Little Giants: Exploring the Potential of Small LLMs as Evaluation
  Metrics in Summarization in the Eval4NLP 2023 Shared Task
- URL: http://arxiv.org/abs/2311.00686v1
- Date: Wed, 1 Nov 2023 17:44:35 GMT
- Title: Little Giants: Exploring the Potential of Small LLMs as Evaluation
  Metrics in Summarization in the Eval4NLP 2023 Shared Task
- Authors: Neema Kotonya and Saran Krishnasamy and Joel Tetreault and Alejandro
  Jaimes
- Abstract summary: This paper focuses on assessing the effectiveness of prompt-based techniques to empower Large Language Models to handle the task of quality estimation.
We conducted systematic experiments with various prompting techniques, including standard prompting, prompts informed by annotator instructions, and innovative chain-of-thought prompting.
Our work reveals that combining these approaches using a "small", open source model (orca_mini_v3_7B) yields competitive results.
- Score: 53.163534619649866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   This paper describes and analyzes our participation in the 2023 Eval4NLP
shared task, which focuses on assessing the effectiveness of prompt-based
techniques to empower Large Language Models to handle the task of quality
estimation, particularly in the context of evaluating machine translations and
summaries. We conducted systematic experiments with various prompting
techniques, including standard prompting, prompts informed by annotator
instructions, and innovative chain-of-thought prompting. In addition, we
integrated these approaches with zero-shot and one-shot learning methods to
maximize the efficacy of our evaluation procedures. Our work reveals that
combining these approaches using a "small", open source model (orca_mini_v3_7B)
yields competitive results.
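
The prompting variants named in the abstract (standard prompting, annotator-informed prompts, chain-of-thought, and zero-shot versus one-shot) can be illustrated with a minimal sketch. The instruction strings, the 0 to 100 scale, and the helper names `build_prompt` and `extract_score` below are hypothetical and are not taken from the paper; the sketch only shows how such options might be combined into one prompt and how a numeric score might be read back from a model's reply.

```python
import re
from typing import Optional, Tuple

# Hypothetical wording: the paper's actual prompts are not reproduced here.
BASE_INSTRUCTION = (
    "Score the summary of the source text on a scale from 0 to 100, "
    "where 100 is the best possible summary."
)
ANNOTATOR_GUIDELINES = (
    "Consider relevance, factual consistency with the source, and fluency."
)
COT_CUE = "First explain your reasoning step by step, then give the score."


def build_prompt(
    source: str,
    summary: str,
    use_guidelines: bool = False,
    use_cot: bool = False,
    example: Optional[Tuple[str, str, int]] = None,
) -> str:
    """Assemble a scoring prompt; passing `example` makes it one-shot."""
    parts = [BASE_INSTRUCTION]
    if use_guidelines:
        parts.append(ANNOTATOR_GUIDELINES)
    if use_cot:
        parts.append(COT_CUE)
    if example is not None:
        ex_source, ex_summary, ex_score = example
        parts.append(
            f"Example\nSource: {ex_source}\nSummary: {ex_summary}\nScore: {ex_score}"
        )
    # The item to be scored always comes last, ending on an open "Score:" slot.
    parts.append(f"Source: {source}\nSummary: {summary}\nScore:")
    return "\n\n".join(parts)


def extract_score(model_output: str) -> Optional[float]:
    """Return the last number in the model output, or None if there is none."""
    numbers = re.findall(r"\d+(?:\.\d+)?", model_output)
    return float(numbers[-1]) if numbers else None
```

Taking the last number in the reply is one simple assumption for score extraction; it tolerates chain-of-thought text that mentions other numbers before the final score, but it is not necessarily the extraction rule the authors used.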
 
      
Related papers
        - Teaching Language Models To Gather Information Proactively [53.85419549904644]
 Large language models (LLMs) are increasingly expected to function as collaborative partners.
In this work, we introduce a new task paradigm: proactive information gathering.
We design a scalable framework that generates partially specified, real-world tasks, masking key information.
Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information.
 arXiv  Detail & Related papers  (2025-07-28T23:50:09Z)
- Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning [63.531262595858]
 Divide-and-conquer approach breaks comprehensive evaluation task into localized scoring tasks, followed by a final global assessment.
We introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations.
Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation.
 arXiv  Detail & Related papers  (2025-05-26T16:39:41Z)
- PanguIR Technical Report for NTCIR-18 AEOLLM Task [12.061652026366591]
 Large language models (LLMs) are increasingly critical and challenging to evaluate.
Manual evaluation, while comprehensive, is often costly and resource-intensive.
Automatic evaluation offers greater scalability but is constrained by the limitations of its evaluation criteria.
 arXiv  Detail & Related papers  (2025-03-04T07:40:02Z)
- Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text [12.879551933541345]
 Large Language Models (LLMs) are capable of generating human-like conversations.
 Conventional metrics like BLEU and ROUGE are inadequate for capturing the subtle semantics and contextual richness of such generative outputs.
We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges.
 arXiv  Detail & Related papers  (2024-08-17T16:01:45Z)
- Information-Theoretic Distillation for Reference-less Summarization [67.51150817011617]
 We present a novel framework to distill a powerful summarizer based on the information-theoretic objective for summarization.
We start off from Pythia-2.8B as the teacher model, which is not yet capable of summarization.
We arrive at a compact but powerful summarizer with only 568M parameters that performs competitively against ChatGPT.
 arXiv  Detail & Related papers  (2024-03-20T17:42:08Z)
- C-ICL: Contrastive In-context Learning for Information Extraction [54.39470114243744]
 c-ICL is a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations.
Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods.
 arXiv  Detail & Related papers  (2024-02-17T11:28:08Z)
- Sharing Knowledge in Multi-Task Deep Reinforcement Learning [57.38874587065694]
 We study the benefit of sharing representations among tasks to enable the effective use of deep neural networks in Multi-Task Reinforcement Learning.
We prove this by providing theoretical guarantees that highlight the conditions under which it is convenient to share representations among tasks.
 arXiv  Detail & Related papers  (2024-01-17T19:31:21Z)
- Exploring Prompting Large Language Models as Explainable Metrics [0.0]
 We propose a zero-shot prompt-based strategy for explainable evaluation of the summarization task using Large Language Models (LLMs).
The conducted experiments demonstrate the promising potential of LLMs as evaluation metrics in Natural Language Processing (NLP).
The performance of our best provided prompts achieved a Kendall correlation of 0.477 with human evaluations in the text summarization task on the test data.
 arXiv  Detail & Related papers  (2023-11-20T06:06:22Z)
- Which is better? Exploring Prompting Strategy For LLM-based Metrics [6.681126871165601]
 This paper describes the DSBA submissions to the Prompting Large Language Models as Explainable Metrics shared task.
Traditional similarity-based metrics such as BLEU and ROUGE have been shown to misalign with human evaluation and are ill-suited for open-ended generation tasks.
 arXiv  Detail & Related papers  (2023-11-07T06:36:39Z)
- The Eval4NLP 2023 Shared Task on Prompting Large Language Models as
  Explainable Metrics [36.52897053496835]
 Generative large language models (LLMs) have shown remarkable capabilities to solve tasks with minimal or no task-related examples.
We introduce the Eval4NLP 2023 shared task that asks participants to explore prompting and score extraction for machine translation (MT) and summarization evaluation.
We present an overview of participants' approaches and evaluate them on a new reference-free test set spanning three language pairs for MT and a summarization dataset.
 arXiv  Detail & Related papers  (2023-10-30T17:55:08Z)
- Comparing Methods for Extractive Summarization of Call Centre Dialogue [77.34726150561087]
 We experimentally compare several such methods by using them to produce summaries of calls, and evaluating these summaries objectively.
We found that TopicSum and Lead-N outperform the other summarisation methods, whilst BERTSum received comparatively lower scores in both subjective and objective evaluations.
 arXiv  Detail & Related papers  (2022-09-06T13:16:02Z)
- Making Pre-trained Language Models Better Few-shot Learners [11.90626040104822]
 The recent GPT-3 model achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context.
Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient.
We present LM-BFF--better few-shot fine-tuning of language models--a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples.
 arXiv  Detail & Related papers  (2020-12-31T17:21:26Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive
  Learning [66.30909748400023]
 We propose to evaluate summary quality without reference summaries using unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
 Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
 arXiv  Detail & Related papers  (2020-10-05T05:04:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     