MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation
- URL: http://arxiv.org/abs/2410.12893v1
- Date: Wed, 16 Oct 2024 12:24:42 GMT
- Title: MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation
- Authors: Aniket Deroy, Subhankar Maity, Sudeshna Sarkar
- Abstract summary: We propose a novel system, MIRROR, to automate the evaluation process for questions generated by automated question generation systems.
We observed that the scores of human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved when using the feedback-based approach called MIRROR.
- Score: 0.4857223913212445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic question generation is a critical task that involves evaluating question quality by considering factors such as engagement, pedagogical value, and the ability to stimulate critical thinking. These aspects require human-like understanding and judgment, which automated systems currently lack. However, human evaluations are costly and impractical for large-scale samples of generated questions. Therefore, we propose a novel system, MIRROR (Multi-LLM Iterative Review and Response for Optimized Rating), which leverages large language models (LLMs) to automate the evaluation of questions produced by automated question generation systems. We experimented with several state-of-the-art LLMs, such as GPT-4, Gemini, and Llama2-70b. We observed that the scores for the human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved when using the feedback-based MIRROR approach, moving closer to the human baseline scores. Furthermore, Pearson's correlation coefficient between GPT-4 and human experts improved when using MIRROR compared to direct prompting for evaluation. Error analysis shows that MIRROR significantly helps to improve relevance and appropriateness.
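The abstract describes the pipeline only at a high level. Below is a minimal sketch of how a MIRROR-style, feedback-based rating loop and the Pearson correlation check against human scores could look. This is an illustration, not the authors' implementation: the `chat` helper, the prompt wording, the 1-5 rating scale, and the fixed two-round review loop are all assumptions, and any LLM API (GPT-4, Gemini, Llama2-70b, etc.) could back the `chat` call.

```python
from statistics import mean

# Hypothetical LLM client -- replace with a real API call (GPT-4, Gemini, Llama2-70b, ...).
def chat(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

# Human evaluation criteria listed in the abstract.
CRITERIA = ["relevance", "appropriateness", "novelty", "complexity", "grammaticality"]

def direct_rating(question: str, context: str) -> dict:
    """Baseline: one LLM call per criterion, scored on a 1-5 scale (assumed scale)."""
    scores = {}
    for criterion in CRITERIA:
        reply = chat(
            f"Context:\n{context}\n\nQuestion:\n{question}\n\n"
            f"Rate the {criterion} of the question from 1 to 5. Reply with a number only."
        )
        scores[criterion] = float(reply.strip())
    return scores

def feedback_based_rating(question: str, context: str, rounds: int = 2) -> dict:
    """MIRROR-style loop (sketch): a reviewer LLM critiques the current scores and a
    rater LLM revises them; the number of rounds is an assumption."""
    scores = direct_rating(question, context)
    for _ in range(rounds):
        feedback = chat(
            f"Question:\n{question}\n\nCurrent scores: {scores}\n\n"
            "Point out weaknesses or overlooked strengths in these scores."
        )
        reply = chat(
            f"Question:\n{question}\n\nPrevious scores: {scores}\nFeedback:\n{feedback}\n\n"
            "Revise the scores. Answer with one 'criterion: score' line per criterion."
        )
        for line in reply.splitlines():
            name, _, value = line.partition(":")
            if name.strip().lower() in scores and value.strip():
                scores[name.strip().lower()] = float(value.strip())
    return scores

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between automatic and human score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Agreement with experts, as reported in the abstract, would then amount to computing `pearson_r` between the automatic scores and the human expert scores for each criterion.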
Related papers
- The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models [53.12387628636912]
We propose an automatic evaluation framework that is validated against human annotations.
This approach was originally developed for the TREC Question Answering (QA) Track in 2003.
We observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants.
arXiv Detail & Related papers (2025-04-21T12:55:06Z)
- STRIVE: A Think & Improve Approach with Iterative Refinement for Enhancing Question Quality Estimation [0.0]
We propose a novel methodology called STRIVE using a series of Large Language Models (LLMs) for automatic question evaluation.
The method estimates question quality in an automated manner by generating multiple evaluations based on the strengths and weaknesses of the provided question.
arXiv Detail & Related papers (2025-04-08T05:34:38Z)
- UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation [36.40760924116748]
Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA).
Existing evaluation methods face limitations due to the significant human workload required to design Q&A pairs for visual images.
We propose an Unsupervised Peer review MLLM Evaluation framework, which allows models to automatically generate questions and conduct peer review assessments of answers from other models.
arXiv Detail & Related papers (2025-03-19T07:15:41Z)
- SedarEval: Automated Evaluation using Self-Adaptive Rubrics [4.97150240417381]
We propose a new evaluation paradigm based on self-adaptive rubrics.
SedarEval consists of 1,000 meticulously crafted questions, each with its own self-adaptive rubric.
We train a specialized evaluator language model (evaluator LM) to supplant human graders.
arXiv Detail & Related papers (2025-01-26T16:45:09Z)
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
- Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
- An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [29.81362106367831]
Existing evaluation methods often suffer from high costs, limited test formats, the need for human references, and systematic evaluation biases.
In contrast to previous studies that rely on human annotations, the proposed framework, Auto-PRE, selects evaluators automatically based on their inherent traits.
Experimental results indicate our Auto-PRE achieves state-of-the-art performance at a lower cost.
arXiv Detail & Related papers (2024-10-16T06:06:06Z)
- IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering [10.338962367542331]
We introduce IQA-EVAL, an automatic evaluation framework for Interactive Question Answering.
We also introduce an LLM-based Evaluation Agent (LEA) that can simulate human behaviors to generate interactions with IQA models.
We show that our evaluation framework with GPT-4 as the backbone model achieves a high correlation with human evaluations on the IQA task.
arXiv Detail & Related papers (2024-08-24T10:34:20Z)
- Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Model (LLM) evaluator preferences with human evaluations.
We employed Bayesian statistics and a t-test to quantify this token count bias and developed a recalibration procedure to adjust the GPTScorer.
The recalibrated LLM evaluator aligns significantly better with human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z)
- C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation [68.59356746305255]
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
arXiv Detail & Related papers (2023-06-27T06:58:03Z)
- Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
When asked, large language models (LLMs) claim that they can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation [6.697751970080859]
Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers.
We propose a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems, called QAScore.
arXiv Detail & Related papers (2022-10-09T19:00:39Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA requires only a small amount of pre-collected experience data, and therefore does not involve human interaction with the target policy during evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.