A Multi-Task Evaluation of LLMs' Processing of Academic Text Input
- URL: http://arxiv.org/abs/2508.11779v1
- Date: Fri, 15 Aug 2025 19:05:57 GMT
- Title: A Multi-Task Evaluation of LLMs' Processing of Academic Text Input
- Authors: Tianyi Li, Yu Qin, Olivia R. Liu Sheng,
- Abstract summary: How much large language models (LLMs) can aid scientific discovery, notably in assisting academic peer review is in heated debate.<n>We organize individual tasks that computer science studies employ in separate terms into a guided and robust workflow to evaluate LLMs' processing of academic text input.<n>We employ four tasks in the assessment: content reproduction/comparison/scoring/reflection, each demanding a specific role of the LLM.
- Score: 6.654906601143054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How much large language models (LLMs) can aid scientific discovery, notably in assisting academic peer review, is in heated debate. Between a literature digest and a human-comparable research assistant lies their practical application potential. We organize individual tasks that computer science studies employ in separate terms into a guided and robust workflow to evaluate LLMs' processing of academic text input. We employ four tasks in the assessment: content reproduction/comparison/scoring/reflection, each demanding a specific role of the LLM (oracle/judgmental arbiter/knowledgeable arbiter/collaborator) in assisting scholarly works, and altogether testing LLMs with questions that increasingly require intellectual capabilities towards a solid understanding of scientific texts to yield desirable solutions. We exemplify a rigorous performance evaluation with detailed instructions on the prompts. Adopting first-rate Information Systems articles at three top journals as the input texts and an abundant set of text metrics, we record a compromised performance of the leading LLM - Google's Gemini: its summary and paraphrase of academic text is acceptably reliable; using it to rank texts through pairwise text comparison is faintly scalable; asking it to grade academic texts is prone to poor discrimination; its qualitative reflection on the text is self-consistent yet hardly insightful to inspire meaningful research. This evidence against an endorsement of LLMs' text-processing capabilities is consistent across metric-based internal (linguistic assessment), external (comparing to the ground truth), and human evaluation, and is robust to the variations of the prompt. Overall, we do not recommend an unchecked use of LLMs in constructing peer reviews.
Related papers
- Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs [0.0]
Writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning.<n>The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays.<n>This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use.
arXiv Detail & Related papers (2026-03-02T19:51:01Z) - LLM-REVal: Can We Trust LLM Reviewers Yet? [70.58742663985652]
Large language models (LLMs) have inspired researchers to integrate them extensively into the academic workflow.<n>This study focuses on how the deep integration of LLMs into both peer-review and research processes may influence scholarly fairness.
arXiv Detail & Related papers (2025-10-14T10:30:20Z) - Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper [64.50822834679101]
SciIG is a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works.<n>We assess five state-of-the-art models, including open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems.<n>Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness.
arXiv Detail & Related papers (2025-08-19T21:11:11Z) - Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers.<n>We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers.<n>Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z) - Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts [19.02690795530784]
We propose a Contextualized Spoken-to-Written conversion (CoS2W) task to address ASR and grammar errors.<n>This task naturally matches the in-context learning capabilities of Large Language Models (LLMs)
arXiv Detail & Related papers (2024-08-19T03:53:48Z) - LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on the topic of LLMs assist NLP Researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z) - Through the Lens of Core Competency: Survey on Evaluation of Large
Language Models [27.271533306818732]
Large language model (LLM) has excellent performance and wide practical uses.
Existing evaluation tasks are difficult to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect corresponding ability, while new tasks can also be easily added into the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z) - PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts [76.18347405302728]
This study uses a plethora of adversarial textual attacks targeting prompts across multiple levels: character, word, sentence, and semantic.
The adversarial prompts are then employed in diverse tasks including sentiment analysis, natural language inference, reading comprehension, machine translation, and math problem-solving.
Our findings demonstrate that contemporary Large Language Models are not robust to adversarial prompts.
arXiv Detail & Related papers (2023-06-07T15:37:00Z) - Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.