ArxivBench: Can LLMs Assist Researchers in Conducting Research?
- URL: http://arxiv.org/abs/2504.10496v1
- Date: Sun, 06 Apr 2025 05:00:10 GMT
- Title: ArxivBench: Can LLMs Assist Researchers in Conducting Research?
- Authors: Ning Li, Jingran Zhang, Justin Cui
- Abstract summary: Large language models (LLMs) have demonstrated remarkable effectiveness in completing various tasks such as reasoning, translation, and question answering. In this study, we evaluate both proprietary and open-source LLMs on their ability to respond with relevant research papers and accurate links to articles hosted on the arXiv platform. Our findings reveal that the accuracy of LLM-generated responses varies considerably by subject, with some subjects scoring significantly lower than others.
- Score: 6.586119023242877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable effectiveness in completing various tasks such as reasoning, translation, and question answering. However, factually incorrect content in LLM-generated responses remains a persistent challenge. In this study, we evaluate both proprietary and open-source LLMs on their ability to respond with relevant research papers and accurate links to articles hosted on the arXiv platform, based on high-level prompts. To facilitate this evaluation, we introduce arXivBench, a benchmark specifically designed to assess LLM performance across eight major subject categories on arXiv and five subfields within computer science, one of the most popular categories. Our findings reveal that the accuracy of LLM-generated responses varies concerningly by subject, with some subjects scoring significantly lower than others. Notably, Claude-3.5-Sonnet exhibits a substantial advantage in generating both relevant and accurate responses. Interestingly, most LLMs achieve much higher accuracy in the Artificial Intelligence sub-field than in other sub-fields. This benchmark provides a standardized tool for evaluating the reliability of LLM-generated scientific responses, promoting more dependable use of LLMs in academic and research environments. Our code is open-sourced at https://github.com/arxivBenchLLM/arXivBench and our dataset is available on Hugging Face at https://huggingface.co/datasets/arXivBenchLLM/arXivBench.
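The core of such an evaluation is mechanical enough to sketch. Below is a minimal, illustrative check (not the authors' released code, which lives in the linked repository) that validates an LLM-suggested title/link pair against the public arXiv export API; the helper names and the exact-title matching rule are assumptions made for illustration.

```python
# Minimal sketch of an arXivBench-style link check: confirm that an LLM's
# suggested arXiv URL resolves to a real paper whose title matches the claim.
# The matching rule and helper names are illustrative assumptions.
import re
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET
from typing import Optional

ATOM = "{http://www.w3.org/2005/Atom}"

def arxiv_title_for(arxiv_id: str) -> Optional[str]:
    """Fetch a paper's title via the public arXiv export API."""
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            feed = ET.fromstring(resp.read())
    except (urllib.error.URLError, ET.ParseError):
        return None
    entry = feed.find(f"{ATOM}entry")
    title = entry.find(f"{ATOM}title") if entry is not None else None
    return title.text.strip() if title is not None and title.text else None

def check_llm_citation(claimed_title: str, claimed_url: str) -> bool:
    """True if the URL names a real arXiv paper whose title matches the claim."""
    m = re.search(r"arxiv\.org/abs/([\w.\-/]+)", claimed_url)
    if not m:
        return False  # not an arXiv abstract link at all
    actual = arxiv_title_for(m.group(1))
    if actual is None:
        return False  # the ID does not resolve: likely a hallucinated link
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return norm(actual) == norm(claimed_title)
```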
Related papers
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined.
We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery.
We develop an automated framework that extracts critical components (research questions, background surveys, inspirations, and hypotheses) from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z)
- Highlighting Case Studies in LLM Literature Review of Interdisciplinary System Science [0.18416014644193066]
Large Language Models (LLMs) were used to assist four Commonwealth Scientific and Industrial Research Organisation (CSIRO) researchers.
We evaluate the performance of LLMs for systematic literature reviews.
arXiv Detail & Related papers (2025-03-16T05:52:18Z)
- Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision [50.45597801390757]
Instruct-LF is a goal-oriented latent factor discovery system.
It integrates instruction-following ability with statistical models to handle noisy datasets.
arXiv Detail & Related papers (2025-02-21T02:03:08Z)
- Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.
LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
- GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input.
GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak).
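As a rough illustration of that observe-reflect-speak flow (the function names, prompts, and the `llm`/`retrieve_expert_facts` callables below are hypothetical stand-ins, not the paper's implementation):

```python
# Illustrative sketch of GIVE's observe-reflect-speak flow as described in the
# abstract; `llm` and `retrieve_expert_facts` are hypothetical stand-ins, and
# the prompts paraphrase the three stages rather than reproduce the paper's.
from typing import Callable, List

def give_answer(question: str,
                retrieve_expert_facts: Callable[[str], List[str]],
                llm: Callable[[str], str]) -> str:
    # Observe: select the expert data most pertinent to the query.
    facts = retrieve_expert_facts(question)
    # Reflect: query-specific divergent thinking over the observed facts.
    reflection = llm(
        "Question: " + question + "\n"
        "Known facts:\n" + "\n".join(f"- {f}" for f in facts) + "\n"
        "Extrapolate plausible connections between these facts and the question."
    )
    # Speak: synthesize observation and reflection into the final output.
    return llm(
        "Question: " + question + "\n"
        "Facts:\n" + "\n".join(facts) + "\n"
        "Reasoning: " + reflection + "\n"
        "Give the final answer."
    )
```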
arXiv Detail & Related papers (2024-10-11T03:05:06Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
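One way to picture the perturbation step, offered as a hedged sketch rather than DARG's actual procedure: grow an extracted reasoning graph by a controlled number of nodes so that complexity increases while the original structure is preserved. The attachment policy below is an assumption, and networkx is used only as the graph container.

```python
# Illustrative graph-perturbation step in the spirit of DARG: add a controlled
# number of reasoning nodes to an extracted graph, raising its complexity
# without altering the original structure. The policy here is an assumption.
import random
import networkx as nx

def perturb_reasoning_graph(g: nx.DiGraph, extra_nodes: int, seed: int = 0) -> nx.DiGraph:
    rng = random.Random(seed)
    out = g.copy()
    for i in range(extra_nodes):
        new_step = f"aug_{i}"
        # Attach each new reasoning step to a randomly chosen existing step.
        parent = rng.choice(list(out.nodes))
        out.add_edge(parent, new_step)
    return out
```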
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading [100.02175403852253]
One common use of Large Language Models (LLMs) is performing tasks on scientific topics.
Inspired by the way university students are evaluated on such tasks, we propose SciEx - a benchmark consisting of university computer science exam questions.
We evaluate the performance of various state-of-the-art LLMs on our new benchmark.
arXiv Detail & Related papers (2024-06-14T21:52:21Z)
- Large Language Models Memorize Sensor Datasets! Implications on Human Activity Recognition Research [0.23982628363233693]
We investigate whether Large Language Models (LLMs) have had access to standard Human Activity Recognition (HAR) datasets during training.
Most contemporary LLMs are trained on virtually the entire (accessible) internet -- potentially including standard HAR datasets.
For the Daphnet dataset in particular, GPT-4 is able to reproduce blocks of sensor readings.
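A probe of this kind is simple to state; the sketch below is a hypothetical reconstruction, not the paper's protocol. It prompts the model with the opening rows of a public sensor file and tests whether the continuation reproduces held-out rows verbatim (`llm_complete` is an assumed completion function).

```python
# Hedged sketch of a dataset-memorization probe: if the model reproduces
# held-out rows of a public HAR file verbatim, the file was likely seen in
# training. `llm_complete` is a hypothetical completion function.
from typing import Callable, List

def memorization_probe(dataset_lines: List[str],
                       llm_complete: Callable[[str], str],
                       n_prefix: int = 20, n_target: int = 5) -> bool:
    prefix = "\n".join(dataset_lines[:n_prefix])
    target = "\n".join(dataset_lines[n_prefix:n_prefix + n_target])
    completion = llm_complete("Continue this sensor data file exactly:\n" + prefix)
    # Verbatim reproduction of unseen rows is strong evidence of memorization.
    return target in completion
```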
arXiv Detail & Related papers (2024-06-09T19:38:27Z)
- Attribution in Scientific Literature: New Benchmark and Methods [41.64918533152914]
Large language models (LLMs) present a promising yet challenging frontier for automated source citation in scientific communication.
We introduce REASONS, a novel dataset with sentence-level annotations across 12 scientific domains from arXiv.
We conduct extensive experiments with models such as GPT-O1, GPT-4O, GPT-3.5, DeepSeek, and other smaller models like Perplexity AI (7B).
arXiv Detail & Related papers (2024-05-03T16:38:51Z)
- LLatrieval: LLM-Verified Retrieval for Verifiable Generation [67.93134176912477]
Verifiable generation aims to let the large language model (LLM) generate text with supporting documents.
We propose LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question.
Experiments show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results.
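The verify-and-update loop the abstract describes can be sketched as follows; `search`, `llm_verify`, and `refine_query` are hypothetical stand-ins for the retriever and the LLM-based components, not the paper's API:

```python
# Minimal sketch of an LLM-verified retrieval loop in the spirit of LLatrieval:
# keep refining retrieval until the LLM judges the documents sufficient.
# All three callables are hypothetical stand-ins.
from typing import Callable, List

def llm_verified_retrieve(question: str,
                          search: Callable[[str], List[str]],
                          llm_verify: Callable[[str, List[str]], bool],
                          refine_query: Callable[[str, List[str]], str],
                          max_rounds: int = 5) -> List[str]:
    query = question
    docs = search(query)
    for _ in range(max_rounds):
        # Stop once the LLM judges the documents sufficient to answer.
        if llm_verify(question, docs):
            break
        # Otherwise let the LLM steer the next retrieval round.
        query = refine_query(question, docs)
        docs = search(query)
    return docs
```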
arXiv Detail & Related papers (2023-11-14T01:38:02Z)
- SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research [11.816426823341134]
We propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues.
Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability.
Both objective and subjective questions are included in SciEval.
arXiv Detail & Related papers (2023-08-25T03:05:33Z)
- Large Language Models for Software Engineering: A Systematic Literature Review [34.12458948051519]
Large Language Models (LLMs) have significantly impacted numerous domains, including Software Engineering (SE).
We select and analyze 395 research papers from January 2017 to January 2024 to answer four key research questions (RQs).
From the answers to these RQs, we discuss the current state of the art and trends, identify gaps in existing research, and flag promising areas for future study.
arXiv Detail & Related papers (2023-08-21T10:37:49Z)