HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions
- URL: http://arxiv.org/abs/2502.00857v1
- Date: Sun, 02 Feb 2025 17:07:18 GMT
- Title: HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions
- Authors: Jamshid Mozafari, Bhawna Piryani, Abdelrahman Abdallah, Adam Jatowt
- Abstract summary: HintEval is a Python library that makes it easy to access diverse datasets and provides multiple approaches to generate and evaluate hints.
By reducing barriers to entry, HintEval offers a major step forward for facilitating hint generation and analysis research within the NLP/IR community.
- Score: 16.434748534272014
- Abstract: Large Language Models (LLMs) are transforming how people find information, and many users turn nowadays to chatbots to obtain answers to their questions. Despite the instant access to abundant information that LLMs offer, it is still important to promote critical thinking and problem-solving skills. Automatic hint generation is a new task that aims to support humans in answering questions by themselves by creating hints that guide users toward answers without directly revealing them. In this context, hint evaluation focuses on measuring the quality of hints, helping to improve the hint generation approaches. However, resources for hint research are currently spanning different formats and datasets, while the evaluation tools are missing or incompatible, making it hard for researchers to compare and test their models. To overcome these challenges, we introduce HintEval, a Python library that makes it easy to access diverse datasets and provides multiple approaches to generate and evaluate hints. HintEval aggregates the scattered resources into a single toolkit that supports a range of research goals and enables a clear, multi-faceted, and reliable evaluation. The proposed library also includes detailed online documentation, helping users quickly explore its features and get started. By reducing barriers to entry and encouraging consistent evaluation practices, HintEval offers a major step forward for facilitating hint generation and analysis research within the NLP/IR community.
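To make the abstract's notion of hint quality concrete, the sketch below shows a minimal, self-contained hint-evaluation check. The names used here (HintExample, leaks_answer, relevance) are hypothetical and illustrative only; they are not the HintEval API, but they show the kind of criterion the abstract describes: a good hint should guide the reader toward the answer without revealing it.

```python
# Illustrative sketch only: hypothetical names, NOT the HintEval API.
# It demonstrates two simple hint-quality checks the abstract alludes to:
# answer leakage (does the hint give the answer away?) and relevance
# (is the hint related to the question at all?).
from dataclasses import dataclass


@dataclass
class HintExample:
    question: str
    answer: str
    hint: str


def leaks_answer(example: HintExample) -> bool:
    """Crude answer-leakage check: the hint contains the answer verbatim."""
    return example.answer.lower() in example.hint.lower()


def relevance(example: HintExample) -> float:
    """Token-overlap proxy for how related the hint is to the question."""
    question_tokens = set(example.question.lower().split())
    hint_tokens = set(example.hint.lower().split())
    return len(question_tokens & hint_tokens) / max(len(hint_tokens), 1)


if __name__ == "__main__":
    ex = HintExample(
        question="Which planet is known as the Red Planet?",
        answer="Mars",
        hint="It is the fourth planet from the Sun and hosts the Olympus Mons volcano.",
    )
    print("leaks answer:", leaks_answer(ex))        # False: guides without revealing
    print("relevance:", round(relevance(ex), 2))    # nonzero overlap with the question
```

These token-level proxies are only meant to show the shape of such an evaluation; the library described in the abstract is the place to look for its actual dataset loaders and multi-faceted metrics.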
Related papers
- WikiHint: A Human-Annotated Dataset for Hint Ranking and Generation [15.144785147549713]
We first introduce a manually constructed hint dataset, WikiHint, which is based on Wikipedia and includes 5,000 hints created for 1,000 questions.
We then fine-tune open-source LLMs such as LLaMA-3.1 for hint generation in answer-aware and answer-agnostic contexts.
We assess the effectiveness of the hints with human participants who answer questions with and without the aid of hints.
arXiv Detail & Related papers (2024-12-02T15:44:19Z)
- Enhancing Answer Attribution for Faithful Text Generation with Large Language Models [5.065947993017158]
We propose new methods for producing more independent and contextualized claims for better retrieval and attribution.
New methods are evaluated and shown to improve the performance of answer attribution components.
arXiv Detail & Related papers (2024-10-22T15:37:46Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
We present PuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales.
A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing less supervised data to boost LLMs' inference capabilities.
arXiv Detail & Related papers (2024-05-07T07:39:15Z)
- TriviaHG: A Dataset for Automatic Hint Generation from Factoid Questions [20.510164413931577]
We introduce a framework for automatic hint generation for factoid questions.
We construct a novel large-scale dataset featuring 160,230 hints corresponding to 16,645 questions from the TriviaQA dataset.
To evaluate the TriviaHG dataset and the proposed evaluation method, we enlisted 10 individuals to annotate 2,791 hints and tasked 6 humans with answering questions using the provided hints.
arXiv Detail & Related papers (2024-03-27T10:27:28Z)
- Next-Step Hint Generation for Introductory Programming Using Large Language Models [0.8002196839441036]
Large Language Models possess skills such as answering questions, writing essays or solving programming exercises.
This work explores how LLMs can contribute to programming education by supporting students with automated next-step hints.
arXiv Detail & Related papers (2023-12-03T17:51:07Z)
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- Self-Knowledge Guided Retrieval Augmentation for Large Language Models [59.771098292611846]
Large language models (LLMs) have shown superior performance without task-specific fine-tuning.
Retrieval-based methods can offer non-parametric world knowledge and improve the performance on tasks such as question answering.
Self-Knowledge guided Retrieval augmentation (SKR) is a simple yet effective method which can let LLMs refer to the questions they have previously encountered.
arXiv Detail & Related papers (2023-10-08T04:22:33Z)
- Question Answering Survey: Directions, Challenges, Datasets, Evaluation Matrices [0.0]
The research directions of the QA field are analyzed based on the type of question, answer type, source of evidence-answer, and modeling approach.
This is followed by open challenges of the field, such as automatic question generation, similarity detection, and low resource availability for a language.
arXiv Detail & Related papers (2021-12-07T08:53:40Z)
- A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.