Evaluating and Enhancing Large Language Models for Novelty Assessment in Scholarly Publications
- URL: http://arxiv.org/abs/2409.16605v1
- Date: Wed, 25 Sep 2024 04:12:38 GMT
- Title: Evaluating and Enhancing Large Language Models for Novelty Assessment in Scholarly Publications
- Authors: Ethan Lin, Zhiyuan Peng, Yi Fang
- Abstract summary: We introduce a scholarly novelty benchmark (SchNovel) to evaluate large language models' ability to assess novelty in scholarly papers.
SchNovel consists of 15000 pairs of papers across six fields sampled from the arXiv dataset with publication dates spanning 2 to 10 years apart.
RAG-Novelty simulates the review process taken by human reviewers by leveraging the retrieval of similar papers to assess novelty.
- Score: 12.183473842592567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have evaluated the creativity/novelty of large language models (LLMs) primarily from a semantic perspective, using benchmarks from cognitive science. However, assessing the novelty in scholarly publications is a largely unexplored area in evaluating LLMs. In this paper, we introduce a scholarly novelty benchmark (SchNovel) to evaluate LLMs' ability to assess novelty in scholarly papers. SchNovel consists of 15,000 pairs of papers across six fields sampled from the arXiv dataset with publication dates spanning 2 to 10 years apart. In each pair, the more recently published paper is assumed to be more novel. Additionally, we propose RAG-Novelty, which simulates the review process taken by human reviewers by leveraging the retrieval of similar papers to assess novelty. Extensive experiments provide insights into the capabilities of different LLMs to assess novelty and demonstrate that RAG-Novelty outperforms recent baseline models.
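As a rough illustration of the retrieval-augmented idea, the sketch below wires up a pairwise novelty judgment in the spirit of RAG-Novelty: retrieve the abstracts most similar to each candidate paper, then ask an LLM which of the pair is more novel given that context. The bag-of-words retriever, the prompt wording, and the `llm` callable are assumptions made for illustration only, not the paper's actual implementation.

```python
# Illustrative sketch only: a pairwise novelty check with retrieved context,
# loosely following the RAG-Novelty idea (retrieve similar papers, then judge).
# The retriever, prompt, and `llm` callable are assumptions, not the paper's code.
from collections import Counter
from math import sqrt


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_similar(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k corpus abstracts most similar to the query abstract."""
    q = Counter(query.lower().split())
    ranked = sorted(corpus, key=lambda d: _cosine(q, Counter(d.lower().split())), reverse=True)
    return ranked[:k]


def judge_pair(abstract_a: str, abstract_b: str, corpus: list[str], llm) -> str:
    """Ask an LLM which abstract is more novel, given retrieved related work.

    `llm` is any callable mapping a prompt string to a response string,
    e.g. a thin wrapper around a chat-completion API.
    """
    context = "\n".join(retrieve_similar(abstract_a, corpus) + retrieve_similar(abstract_b, corpus))
    prompt = (
        "Related abstracts retrieved from the corpus:\n"
        f"{context}\n\n"
        "Considering the related work above, which paper is more novel?\n"
        f"Paper A: {abstract_a}\n"
        f"Paper B: {abstract_b}\n"
        "Answer with exactly 'A' or 'B'."
    )
    return llm(prompt).strip()
```

Under the SchNovel setup, a prediction would count as correct when the model selects the more recently published paper of the pair, with the sampled arXiv abstracts serving as the retrieval corpus.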
Related papers
- Modelling and Classifying the Components of a Literature Review [0.0]
We present a novel benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using large language models (LLMs). The experiments yield several novel insights that advance the state of the art in this challenging domain.
arXiv Detail & Related papers (2025-08-06T11:30:07Z)
- Automated Novelty Evaluation of Academic Paper: A Collaborative Approach Integrating Human and Large Language Model Knowledge [9.208744138848765]
One of the most common types of novelty in academic papers is the introduction of new methods. In this paper, we propose leveraging human knowledge and LLMs to assist pretrained language models (PLMs) in predicting the method novelty of papers.
arXiv Detail & Related papers (2025-07-15T14:03:55Z)
- Literature-Grounded Novelty Assessment of Scientific Ideas [23.481266336046833]
We propose the Idea Novelty Checker, an LLM-based retrieval-augmented generation framework. Our experiments demonstrate that our novelty checker achieves approximately 13% higher agreement than existing approaches.
arXiv Detail & Related papers (2025-06-27T08:47:28Z)
- Large Language Models for Automated Literature Review: An Evaluation of Reference Generation, Abstract Writing, and Review Composition [2.048226951354646]
Large language models (LLMs) have emerged as a potential solution to automate the complex processes involved in writing literature reviews.
This study introduces a framework to automatically evaluate the performance of LLMs in three key tasks of literature writing.
arXiv Detail & Related papers (2024-12-18T08:42:25Z)
- LitLLMs, LLMs for Literature Review: Are we there yet? [15.785989492351684]
This paper explores the zero-shot abilities of recent Large Language Models in assisting with the writing of literature reviews based on an abstract.
For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper.
In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes steps in the plan to generate the actual review.
arXiv Detail & Related papers (2024-12-15T01:12:26Z)
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [56.08917291606421]
ResearchAgent is an AI-based system for ideation and operationalization of novel work.
ResearchAgent automatically defines novel problems, proposes methods and designs experiments, while iteratively refining them.
We experimentally validate our ResearchAgent on scientific publications across multiple disciplines.
arXiv Detail & Related papers (2024-04-11T13:36:29Z)
- Mapping the Increasing Use of LLMs in Scientific Papers [99.67983375899719]
We conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals.
Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers.
arXiv Detail & Related papers (2024-04-01T17:45:15Z)
- A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [58.6354685593418]
This paper proposes several article-level, field-normalized, and large language model-empowered bibliometric indicators to evaluate reviews.
The newly emerging AI-generated literature reviews are also appraised.
This work offers insights into the current challenges of literature reviews and envisions future directions for their development.
arXiv Detail & Related papers (2024-02-20T11:28:50Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- MERA: A Comprehensive LLM Evaluation in Russian [43.02318109348788]
We introduce an open Multimodal Evaluation of Russian-language Architectures (MERA) benchmark for evaluating foundation models.
The benchmark encompasses 21 evaluation tasks for generative models in 11 skill domains.
We propose an evaluation methodology, an open-source code base for the MERA assessment, and a leaderboard with a submission system.
arXiv Detail & Related papers (2024-01-09T12:55:21Z)
- Expanding Horizons in HCI Research Through LLM-Driven Qualitative Analysis [3.5253513747455303]
We introduce a new approach to qualitative analysis in HCI using Large Language Models (LLMs).
Our findings indicate that LLMs not only match the efficacy of traditional analysis methods but also offer unique insights.
arXiv Detail & Related papers (2024-01-07T12:39:31Z)
- L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally cannot correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
- Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark [54.782397511033345]
Crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases like annotation artifacts.
We present a new unified cross-dataset benchmark with 14 NLI datasets and re-evaluate 9 widely-used neural network-based NLI models.
Our proposed evaluation scheme and experimental baselines could provide a basis to inspire future reliable NLI research.
arXiv Detail & Related papers (2020-10-15T11:50:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.