HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions
- URL: http://arxiv.org/abs/2502.00857v1
- Date: Sun, 02 Feb 2025 17:07:18 GMT
- Title: HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions
- Authors: Jamshid Mozafari, Bhawna Piryani, Abdelrahman Abdallah, Adam Jatowt
- Abstract summary: HintEval is a Python library that makes it easy to access diverse datasets and provides multiple approaches to generate and evaluate hints.
By reducing barriers to entry, HintEval offers a major step forward for facilitating hint generation and analysis research within the NLP/IR community.
- Score: 16.434748534272014
- Abstract: Large Language Models (LLMs) are transforming how people find information, and many users turn nowadays to chatbots to obtain answers to their questions. Despite the instant access to abundant information that LLMs offer, it is still important to promote critical thinking and problem-solving skills. Automatic hint generation is a new task that aims to support humans in answering questions by themselves by creating hints that guide users toward answers without directly revealing them. In this context, hint evaluation focuses on measuring the quality of hints, helping to improve the hint generation approaches. However, resources for hint research are currently spanning different formats and datasets, while the evaluation tools are missing or incompatible, making it hard for researchers to compare and test their models. To overcome these challenges, we introduce HintEval, a Python library that makes it easy to access diverse datasets and provides multiple approaches to generate and evaluate hints. HintEval aggregates the scattered resources into a single toolkit that supports a range of research goals and enables a clear, multi-faceted, and reliable evaluation. The proposed library also includes detailed online documentation, helping users quickly explore its features and get started. By reducing barriers to entry and encouraging consistent evaluation practices, HintEval offers a major step forward for facilitating hint generation and analysis research within the NLP/IR community.
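To make the abstract's notion of hint quality concrete, the sketch below shows a minimal, self-contained hint-evaluation check. The names used here (HintExample, leaks_answer, relevance) are hypothetical and illustrative only; they are not the HintEval API, but they show the kind of criterion the abstract describes: a good hint should guide the reader toward the answer without revealing it.

```python
# Illustrative sketch only: hypothetical names, NOT the HintEval API.
# It demonstrates two simple hint-quality checks the abstract alludes to:
# answer leakage (does the hint give the answer away?) and relevance
# (is the hint related to the question at all?).
from dataclasses import dataclass


@dataclass
class HintExample:
    question: str
    answer: str
    hint: str


def leaks_answer(example: HintExample) -> bool:
    """Crude answer-leakage check: the hint contains the answer verbatim."""
    return example.answer.lower() in example.hint.lower()


def relevance(example: HintExample) -> float:
    """Token-overlap proxy for how related the hint is to the question."""
    question_tokens = set(example.question.lower().split())
    hint_tokens = set(example.hint.lower().split())
    return len(question_tokens & hint_tokens) / max(len(hint_tokens), 1)


if __name__ == "__main__":
    ex = HintExample(
        question="Which planet is known as the Red Planet?",
        answer="Mars",
        hint="It is the fourth planet from the Sun and hosts the Olympus Mons volcano.",
    )
    print("leaks answer:", leaks_answer(ex))        # False: guides without revealing
    print("relevance:", round(relevance(ex), 2))    # nonzero overlap with the question
```

These token-level proxies are only meant to show the shape of such an evaluation; the library described in the abstract is the place to look for its actual dataset loaders and multi-faceted metrics.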
Related papers
- WikiHint: A Human-Annotated Dataset for Hint Ranking and Generation [15.144785147549713]
We first introduce a manually constructed hint dataset, WikiHint, which is based on Wikipedia and includes 5,000 hints created for 1,000 questions.
We then fine-tune open-source LLMs such as LLaMA-3.1 for hint generation in answer-aware and answer-agnostic contexts.
We assess the effectiveness of the hints with human participants who answer questions with and without the aid of hints.
arXiv Detail & Related papers (2024-12-02T15:44:19Z)
- Enhancing Answer Attribution for Faithful Text Generation with Large Language Models [5.065947993017158]
We propose new methods for producing more independent and contextualized claims for better retrieval and attribution.
New methods are evaluated and shown to improve the performance of answer attribution components.
arXiv Detail & Related papers (2024-10-22T15:37:46Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
We present PuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales.
A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing less supervised data to boost LLMs' inference capabilities.
arXiv Detail & Related papers (2024-05-07T07:39:15Z)
- TriviaHG: A Dataset for Automatic Hint Generation from Factoid Questions [20.510164413931577]
We introduce a framework for automatic hint generation for factoid questions.
We construct a novel large-scale dataset featuring 160,230 hints corresponding to 16,645 questions from the TriviaQA dataset.
To evaluate the TriviaHG dataset and the proposed evaluation method, we enlisted 10 individuals to annotate 2,791 hints and tasked 6 humans with answering questions using the provided hints.
arXiv Detail & Related papers (2024-03-27T10:27:28Z)
- Next-Step Hint Generation for Introductory Programming Using Large Language Models [0.8002196839441036]
Large Language Models possess skills such as answering questions, writing essays or solving programming exercises.
This work explores how LLMs can contribute to programming education by supporting students with automated next-step hints.
arXiv Detail & Related papers (2023-12-03T17:51:07Z)
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- Self-Knowledge Guided Retrieval Augmentation for Large Language Models [59.771098292611846]
Large language models (LLMs) have shown superior performance without task-specific fine-tuning.
Retrieval-based methods can offer non-parametric world knowledge and improve the performance on tasks such as question answering.
Self-Knowledge guided Retrieval augmentation (SKR) is a simple yet effective method which can let LLMs refer to the questions they have previously encountered.
arXiv Detail & Related papers (2023-10-08T04:22:33Z)
- Question Answering Survey: Directions, Challenges, Datasets, Evaluation Matrices [0.0]
The research directions of the QA field are analyzed based on the type of question, answer type, source of evidence-answer, and modeling approach.
This is followed by open challenges of the field, such as automatic question generation, similarity detection, and low resource availability for a language.
arXiv Detail & Related papers (2021-12-07T08:53:40Z)
- A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.