A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task
- URL: http://arxiv.org/abs/2409.06883v1
- Date: Tue, 10 Sep 2024 21:54:46 GMT
- Title: A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task
- Authors: Yuya Fujisaki, Shiro Takagi, Hideki Asoh, Wataru Kumagai,
- Abstract summary: This dataset consists of machine learning papers, RQ extracted from research papers by GPT-4, and human evaluations of the extracted RQ.
Using this dataset, we systematically compared recently proposed LLM-based evaluation functions for summarizations.
We found that none of the functions showed sufficiently high correlations with human evaluations.
- Score: 6.757249766769395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The progress in text summarization techniques has been remarkable. However the task of accurately extracting and summarizing necessary information from highly specialized documents such as research papers has not been sufficiently investigated. We are focusing on the task of extracting research questions (RQ) from research papers and construct a new dataset consisting of machine learning papers, RQ extracted from these papers by GPT-4, and human evaluations of the extracted RQ from multiple perspectives. Using this dataset, we systematically compared recently proposed LLM-based evaluation functions for summarizations, and found that none of the functions showed sufficiently high correlations with human evaluations. We expect our dataset provides a foundation for further research on developing better evaluation functions tailored to the RQ extraction task, and contribute to enhance the performance of the task. The dataset is available at https://github.com/auto-res/PaperRQ-HumanAnno-Dataset.
Related papers
- Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers.
We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers.
Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z) - Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers [0.0]
This paper presents a machine learning framework that automates dataset mention detection across research domains.
We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset.
At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall.
arXiv Detail & Related papers (2025-02-14T16:16:02Z) - MIMDE: Exploring the Use of Synthetic vs Human Data for Evaluating Multi-Insight Multi-Document Extraction Tasks [0.0]
We define a set of tasks, Multi-Insight Multi-Document Extraction (MIMDE) tasks.
This task is fundamental to many practical applications, from analyzing survey responses to processing medical records.
We introduce a novel set of complementary human and synthetic datasets to examine the potential of synthetic data for evaluation.
arXiv Detail & Related papers (2024-11-29T13:24:10Z) - CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets.
We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
arXiv Detail & Related papers (2024-09-03T17:54:40Z) - DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than SFT model in 57.72% cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z) - Can LLMs Augment Low-Resource Reading Comprehension Datasets? Opportunities and Challenges [3.130575840003799]
GPT-4 can be used to augment existing reading comprehension datasets.
This work serves to be the first analysis of LLMs as synthetic data augmenters for QA systems.
arXiv Detail & Related papers (2023-09-21T18:48:02Z) - QontSum: On Contrasting Salient Content for Query-focused Summarization [22.738731393540633]
Query-focused summarization (QFS) is a challenging task in natural language processing that generates summaries to address specific queries.
This paper highlights the role of QFS in Grounded Answer Generation (GAR)
We propose QontSum, a novel approach for QFS that leverages contrastive learning to help the model attend to the most relevant regions of the input document.
arXiv Detail & Related papers (2023-07-14T19:25:35Z) - QuOTeS: Query-Oriented Technical Summarization [0.2936007114555107]
We propose QuOTeS, an interactive system designed to retrieve sentences related to a summary of the research from a collection of potential references.
QuOTeS integrates techniques from Query-Focused Extractive Summarization and High-Recall Information Retrieval to provide Interactive Query-Focused Summarization of scientific documents.
The results show that QuOTeS provides a positive user experience and consistently provides query-focused summaries that are relevant, concise, and complete.
arXiv Detail & Related papers (2023-06-20T18:43:24Z) - ReSel: N-ary Relation Extraction from Scientific Text and Tables by
Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z) - Incorporating Relevance Feedback for Information-Seeking Retrieval using
Few-Shot Document Re-Ranking [56.80065604034095]
We introduce a kNN approach that re-ranks documents based on their similarity with the query and the documents the user considers relevant.
To evaluate our different integration strategies, we transform four existing information retrieval datasets into the relevance feedback scenario.
arXiv Detail & Related papers (2022-10-19T16:19:37Z) - Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z) - Abstractive Query Focused Summarization with Query-Free Resources [60.468323530248945]
In this work, we consider the problem of leveraging only generic summarization resources to build an abstractive QFS system.
We propose Marge, a Masked ROUGE Regression framework composed of a novel unified representation for summaries and queries.
Despite learning from minimal supervision, our system achieves state-of-the-art results in the distantly supervised setting.
arXiv Detail & Related papers (2020-12-29T14:39:35Z) - QBSUM: a Large-Scale Query-Based Document Summarization Dataset from
Real-world Applications [20.507631900617817]
We present QBSUM, a high-quality large-scale dataset consisting of 49,000+ data samples for the task of Chinese query-based document summarization.
We also propose multiple unsupervised and supervised solutions to the task and demonstrate their high-speed inference and superior performance via both offline experiments and online A/B tests.
arXiv Detail & Related papers (2020-10-27T07:30:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.