STRICTA: Structured Reasoning in Critical Text Assessment for Peer Review and Beyond
- URL: http://arxiv.org/abs/2409.05367v2
- Date: Mon, 02 Jun 2025 08:18:21 GMT
- Title: STRICTA: Structured Reasoning in Critical Text Assessment for Peer Review and Beyond
- Authors: Nils Dycke, Matej Zečević, Ilia Kuznetsov, Beatrix Suess, Kristian Kersting, Iryna Gurevych
- Abstract summary: We introduce Structured Reasoning In Critical Text Assessment (STRICTA) to model text assessment as an explicit, step-wise reasoning process. STRICTA breaks down the assessment into a graph of interconnected reasoning steps drawing on causality theory. We apply STRICTA to a dataset of over 4000 reasoning steps from roughly 40 biomedical experts on more than 20 papers.
- Score: 68.47402386668846
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Critical text assessment is at the core of many expert activities, such as fact-checking, peer review, and essay grading. Yet, existing work treats critical text assessment as a black box problem, limiting interpretability and human-AI collaboration. To close this gap, we introduce Structured Reasoning In Critical Text Assessment (STRICTA), a novel specification framework to model text assessment as an explicit, step-wise reasoning process. STRICTA breaks down the assessment into a graph of interconnected reasoning steps drawing on causality theory (Pearl, 1995). This graph is populated based on expert interaction data and used to study the assessment process and facilitate human-AI collaboration. We formally define STRICTA and apply it in a study on biomedical paper assessment, resulting in a dataset of over 4000 reasoning steps from roughly 40 biomedical experts on more than 20 papers. We use this dataset to empirically study expert reasoning in critical text assessment, and investigate if LLMs are able to imitate and support experts within these workflows. The resulting tools and datasets pave the way for studying collaborative expert-AI reasoning in text assessment, in peer review and beyond.
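The core abstraction is easy to picture in code. Below is a minimal sketch, assuming a simple step-and-edge schema of our own invention (the paper's actual typology is richer): each reasoning step records which earlier steps it depends on, and a topological walk replays the assessment in dependency order.

```python
# A minimal sketch of the core STRICTA idea: an assessment as a directed
# acyclic graph of reasoning steps, where edges encode which earlier
# steps a step depends on. Step kinds and fields are illustrative, not
# the paper's actual schema.
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    step_id: str
    kind: str                    # e.g. "observation", "inference", "verdict"
    text: str                    # the expert's note for this step
    depends_on: list = field(default_factory=list)  # ids of parent steps

def topological_order(steps):
    """Return steps so every step comes after all steps it depends on."""
    by_id = {s.step_id: s for s in steps}
    visited, order = set(), []
    def visit(step):
        if step.step_id in visited:
            return
        visited.add(step.step_id)
        for pid in step.depends_on:
            visit(by_id[pid])
        order.append(step)
    for s in steps:
        visit(s)
    return order

assessment = [
    ReasoningStep("s1", "observation", "Methods section omits sample size."),
    ReasoningStep("s2", "inference", "Statistical power cannot be judged.", ["s1"]),
    ReasoningStep("s3", "verdict", "Major revision: report sample size.", ["s2"]),
]
for step in topological_order(assessment):
    print(step.step_id, step.kind, "->", step.text)
```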
Related papers
- Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach [0.7845950813414773]
This work proposes and analyses the performance of a cohesion score prediction approach based on item response theory.
The proposed approach outperforms conventional machine learning models and ensemble methods in several evaluation metrics.
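To make the IRT idea concrete, here is a hedged sketch using the standard two-parameter logistic (2PL) item response function; the rubric items and parameter values are invented, and the paper's exact model may differ.

```python
# Sketch of item response theory for scoring: each essay has a latent
# ability theta, each rubric item a discrimination a and difficulty b,
# and the 2PL model gives the probability of "passing" the item.
import math

def two_pl(theta: float, a: float, b: float) -> float:
    """2PL item response function: P(pass | theta, a, b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Expected cohesion score = sum of pass probabilities over rubric items.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]   # (a, b) per item, made up
theta = 0.3                                      # latent essay ability
score = sum(two_pl(theta, a, b) for a, b in items)
print(f"expected score: {score:.2f} / {len(items)}")
```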
arXiv Detail & Related papers (2025-07-11T11:05:27Z)
- Causality for Natural Language Processing [17.681875945732042]
Causal reasoning is a cornerstone of human intelligence and a critical capability for artificial systems.
This thesis delves into various dimensions of causal reasoning and understanding in large language models.
arXiv Detail & Related papers (2025-04-20T08:11:11Z)
- Identifying Aspects in Peer Reviews [61.374437855024844]
We develop a data-driven schema for deriving aspects from a corpus of peer reviews.
We introduce a dataset of peer reviews augmented with aspects and show how it can be used for community-level review analysis.
arXiv Detail & Related papers (2025-04-09T14:14:42Z)
- Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are increasingly used for automated evaluation in various scenarios.
Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models.
We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
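As a rough illustration of such a multi-faceted evaluator, the sketch below combines a text-based rating with a code-driven check; the stubbed `llm` call and the criteria are our assumptions, not ARJudge's actual interface.

```python
# Combine a text-based critique with a programmatic check, then
# aggregate. Everything here is a stand-in for the real framework.
def llm(prompt: str) -> str:
    return "4"  # stub: imagine a model returning a 1-5 rating

def text_analysis(criterion: str, response: str) -> int:
    rating = llm(f"Rate the response 1-5 on {criterion!r}:\n{response}")
    return int(rating)

def code_analysis(response: str) -> int:
    # A code-driven check, e.g. a length constraint, as a stand-in for
    # the synthesized analyses.
    return 5 if 10 <= len(response.split()) <= 200 else 2

response = ("The study controls for age and sex but not for comorbidities, "
            "which weakens the causal claim made in the discussion.")
criteria = ["factual grounding", "specificity"]
scores = [text_analysis(c, response) for c in criteria] + [code_analysis(response)]
print("aggregate score:", sum(scores) / len(scores))
```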
arXiv Detail & Related papers (2025-02-26T06:31:45Z)
- exHarmony: Authorship and Citations for Benchmarking the Reviewer Assignment Problem [11.763640675057076]
We develop a benchmark dataset for evaluating the reviewer assignment problem without needing explicit labels.
We benchmark various methods, including traditional lexical matching, static neural embeddings, and contextualized neural embeddings.
Our results indicate that while traditional methods perform reasonably well, contextualized embeddings trained on scholarly literature show the best performance.
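The matching step itself is simple to sketch. The toy example below ranks reviewers by cosine similarity between bag-of-words vectors; a real system would swap in the contextualized embeddings the paper finds strongest.

```python
# Rank candidate reviewers by similarity between the submission and
# each reviewer's past work. Toy bag-of-words vectors stand in for
# learned embeddings; only the matching step is illustrated.
from collections import Counter
import math

def bow(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

submission = bow("graph neural networks for molecule property prediction")
reviewers = {
    "r1": bow("molecule generation with graph neural networks"),
    "r2": bow("syntax parsing for low resource languages"),
}
ranking = sorted(reviewers, key=lambda r: cosine(submission, reviewers[r]),
                 reverse=True)
print("best match:", ranking[0])
```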
arXiv Detail & Related papers (2025-02-11T16:35:04Z)
- Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research [7.4327380079414676]
We introduce Agentic Reasoning, a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents.
Our framework introduces the Mind Map agent, which constructs a structured knowledge graph to track logical relationships.
Evaluations on PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks demonstrate that our approach significantly outperforms existing models.
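A minimal sketch of such a knowledge-graph memory follows, with the triple extraction stubbed out (the paper's agent presumably derives triples with an LLM).

```python
# Accumulate (head, relation, tail) triples as reasoning proceeds, so
# later steps can query earlier logical relationships. Names and the
# example triples are illustrative, not the paper's implementation.
from collections import defaultdict

class MindMap:
    def __init__(self):
        self.edges = defaultdict(list)   # head -> [(relation, tail)]

    def add(self, head, relation, tail):
        self.edges[head].append((relation, tail))

    def neighbors(self, node):
        return self.edges.get(node, [])

mm = MindMap()
mm.add("protein X", "inhibits", "pathway Y")
mm.add("pathway Y", "drives", "phenotype Z")

# A later reasoning step can chain relations through the graph:
for rel1, mid in mm.neighbors("protein X"):
    for rel2, end in mm.neighbors(mid):
        print(f"protein X -{rel1}-> {mid} -{rel2}-> {end}")
```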
arXiv Detail & Related papers (2025-02-07T04:08:46Z)
- LLM-Generated Heuristics for AI Planning: Do We Even Need Domain-Independence Anymore? [87.71321254733384]
Large language models (LLMs) can generate planning approaches tailored to specific planning problems.
LLMs can achieve state-of-the-art performance on some standard International Planning Competition (IPC) domains.
We discuss whether these results signify a paradigm shift and how they can complement existing planning approaches.
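The architectural point, a search loop agnostic to where its heuristic comes from, can be sketched as follows; the grid domain and the Manhattan heuristic below stand in for an LLM-generated one.

```python
# Greedy best-first search parameterized by a heuristic callable, so an
# LLM-written heuristic could be dropped in. Domain and heuristic are
# illustrative stand-ins, not the paper's setup.
import heapq

def greedy_best_first(start, goal, neighbors, h):
    frontier = [(h(start), start)]
    came_from = {start: None}
    while frontier:
        _, node = heapq.heappop(frontier)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for nxt in neighbors(node):
            if nxt not in came_from:
                came_from[nxt] = node
                heapq.heappush(frontier, (h(nxt), nxt))
    return None

goal = (3, 3)
neighbors = lambda p: [(p[0] + dx, p[1] + dy)
                       for dx, dy in [(1, 0), (0, 1), (-1, 0), (0, -1)]
                       if 0 <= p[0] + dx <= 3 and 0 <= p[1] + dy <= 3]
manhattan = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # imagine LLM-written
print(greedy_best_first((0, 0), goal, neighbors, manhattan))
```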
arXiv Detail & Related papers (2025-01-30T22:21:12Z)
- Knowledge Hierarchy Guided Biological-Medical Dataset Distillation for Domain LLM Training [10.701353329227722]
We propose a framework that automates the distillation of high-quality textual training data from the extensive scientific literature.
Our approach self-evaluates and generates questions that are more closely aligned with the biomedical domain.
Our approach substantially improves question-answering tasks compared to pre-trained models from the life sciences domain.
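A loose sketch of such a generate-then-self-evaluate loop, with both model calls stubbed (the interface is our assumption, not the paper's code):

```python
# Generate candidate QA pairs from passages, then self-evaluate and
# keep only those judged on-domain.
def generate_question(passage: str) -> str:
    return f"What does the passage say about {passage.split()[0]}?"  # stub

def self_evaluate(question: str) -> float:
    # Imagine the model scoring biomedical relevance in [0, 1].
    return 0.9 if "receptor" in question.lower() else 0.4

passages = [
    "Receptor binding affinity determines downstream signalling strength.",
    "The committee met twice to schedule the symposium.",
]
kept = []
for p in passages:
    q = generate_question(p)
    if self_evaluate(q) >= 0.5:        # keep only on-domain questions
        kept.append((q, p))
print(f"kept {len(kept)} of {len(passages)} generated QA pairs")
```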
arXiv Detail & Related papers (2025-01-25T07:20:44Z)
- Retrieval-Enhanced Machine Learning: Synthesis and Opportunities [60.34182805429511]
Retrieval enhancement can be extended to a broader spectrum of machine learning (ML).
This work introduces a formal framework for this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature across various ML domains with consistent notation, which the current literature lacks.
The goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.
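The separation REML formalizes, a predictor that issues queries to a retriever and conditions on the results, can be sketched compactly; the component names below are ours, not the framework's notation.

```python
# A prediction model that queries a retriever and conditions its output
# on the retrieved items. Toy word-overlap retrieval for illustration.
def retrieve(query: str, corpus: list, k: int = 2) -> list:
    overlap = lambda doc: len(set(query.lower().split())
                              & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def predict(query: str, corpus: list) -> str:
    evidence = retrieve(query, corpus)
    # A real model would condition on the evidence; here we just cite it.
    return f"answer conditioned on: {evidence}"

corpus = [
    "retrieval augments parametric knowledge",
    "gait analysis identifies neurodegenerative disease",
    "peer review assesses scientific papers",
]
print(predict("how does retrieval help knowledge", corpus))
```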
arXiv Detail & Related papers (2024-07-17T20:01:21Z)
- M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering [14.198330378235632]
We use Multiple Choice and Abstractive Question Answering to conduct a large-scale empirical study on 22 datasets in three generalist and three specialist biomedical sub-domains.
Our multifaceted analysis of the performance of 15 LLMs uncovers success factors such as instruction tuning that lead to improved recall and comprehension.
We show that while recently proposed domain-adapted models may lack adequate knowledge, directly fine-tuning on our collected medical knowledge datasets shows encouraging results.
We complement the quantitative results with a skill-oriented manual error analysis, which reveals a significant gap between the models' capabilities to simply recall necessary knowledge and to integrate it with the presented context.
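The evaluation harness behind such multiple-choice results is straightforward to sketch; the items and the stubbed model call below are invented, so this only illustrates the shape of the loop.

```python
# Multiple-choice QA evaluation: ask the model to pick an option,
# compare with gold, aggregate accuracy.
def model_answer(question: str, options: list) -> str:
    return options[0]  # stub: a real call would query an LLM

items = [
    {"q": "Which organ regulates blood glucose?",
     "opts": ["pancreas", "lung"], "gold": "pancreas"},
    {"q": "Which vitamin is synthesized in skin?",
     "opts": ["vitamin C", "vitamin D"], "gold": "vitamin D"},
]
correct = sum(model_answer(it["q"], it["opts"]) == it["gold"] for it in items)
print(f"accuracy: {correct}/{len(items)}")
```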
arXiv Detail & Related papers (2024-06-06T02:43:21Z)
- A Survey of Artificial Intelligence in Gait-Based Neurodegenerative Disease Diagnosis [51.07114445705692]
Neurodegenerative diseases (NDs) traditionally require extensive healthcare resources and human effort for medical diagnosis and monitoring.
As a crucial disease-related motor symptom, human gait can be exploited to characterize different NDs.
The current advances in artificial intelligence (AI) models enable automatic gait analysis for NDs identification and classification.
arXiv Detail & Related papers (2024-05-21T06:44:40Z)
- Conversational Disease Diagnosis via External Planner-Controlled Large Language Models [18.93345199841588]
This study presents an LLM-based diagnostic system that enhances planning capabilities by emulating doctors.
By utilizing real patient electronic medical record data, we constructed simulated dialogues between virtual patients and doctors.
arXiv Detail & Related papers (2024-04-04T06:16:35Z)
- A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [55.33653554387953]
Pattern Analysis and Machine Intelligence (PAMI) has led to numerous literature reviews aimed at collecting and summarizing fragmented information.
This paper presents a thorough analysis of these literature reviews within the PAMI field.
We address three core research questions: (1) What are the prevalent structural and statistical characteristics of PAMI literature reviews; (2) What strategies can researchers employ to efficiently navigate the growing corpus of reviews; and (3) What are the advantages and limitations of AI-generated reviews compared to human-authored ones.
arXiv Detail & Related papers (2024-02-20T11:28:50Z)
- The Radiation Oncology NLP Database [33.391114383354804]
We present the Radiation Oncology NLP Database (ROND), the first dedicated Natural Language Processing (NLP) dataset for radiation oncology.
ROND is specifically designed to address this resource gap in the domain of radiation oncology.
It encompasses various NLP tasks including Logic Reasoning, Text Classification, Named Entity Recognition (NER), Question Answering (QA), Text Summarization, and Patient-Clinician Conversations.
arXiv Detail & Related papers (2024-01-19T19:23:37Z)
- From Voices to Validity: Leveraging Large Language Models (LLMs) for Textual Analysis of Policy Stakeholder Interviews [14.135107583299277]
This study explores the integration of Large Language Models (LLMs) with human expertise to enhance text analysis of stakeholder interviews regarding K-12 education policy within one U.S. state.
Using a mixed-methods approach, human experts developed a codebook and coding processes as informed by domain knowledge and unsupervised topic modeling results.
Results reveal that while GPT-4 thematic coding aligned with human coding on 77.89% of specific themes, expanding to broader themes increased congruence to 96.02%, surpassing traditional Natural Language Processing (NLP) methods by over 25%.
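The two congruence figures correspond to percent agreement computed at two granularities, as the small sketch below illustrates with invented codes and an invented theme mapping.

```python
# Percent agreement between human and model codes, at the fine-grained
# level and again after mapping codes to broader themes.
human = ["funding", "staffing", "funding", "curriculum"]
model = ["staffing", "funding", "funding", "curriculum"]

def percent_agreement(a, b):
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

broader = {"funding": "resources", "staffing": "resources",
           "curriculum": "instruction"}
print(f"specific themes: {percent_agreement(human, model):.1f}%")
print(f"broad themes: {percent_agreement([broader[c] for c in human],
                                          [broader[c] for c in model]):.1f}%")
```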
arXiv Detail & Related papers (2023-12-02T18:55:14Z)
- Injecting linguistic knowledge into BERT for Dialogue State Tracking [60.42231674887294]
This paper proposes a method that extracts linguistic knowledge via an unsupervised framework.
We then utilize this knowledge to augment BERT's performance and interpretability in Dialogue State Tracking (DST) tasks.
We benchmark this framework on various DST tasks and observe a notable improvement in accuracy.
arXiv Detail & Related papers (2023-11-27T08:38:42Z)
- Unveiling A Core Linguistic Region in Large Language Models [49.860260050718516]
This paper conducts analogical research using brain localization as a prototype.
We have discovered a core region in large language models that corresponds to linguistic competence.
We observe that an improvement in linguistic competence does not necessarily accompany an elevation in the model's knowledge level.
arXiv Detail & Related papers (2023-10-23T13:31:32Z)
- Exploring the Cognitive Knowledge Structure of Large Language Models: An Educational Diagnostic Assessment Approach [50.125704610228254]
Large Language Models (LLMs) have not only exhibited exceptional performance across various tasks, but also demonstrated sparks of intelligence.
Recent studies have focused on assessing their capabilities on human exams and revealed their impressive competence in different domains.
We conduct an evaluation using MoocRadar, a meticulously annotated human test dataset based on Bloom's taxonomy.
arXiv Detail & Related papers (2023-10-12T09:55:45Z)
- ChatGPT-HealthPrompt. Harnessing the Power of XAI in Prompt-Based Healthcare Decision Support using ChatGPT [15.973406739758856]
This study presents an innovative approach to the application of large language models (LLMs) in clinical decision-making, focusing on OpenAI's ChatGPT.
Our approach introduces the use of contextual prompts, strategically designed to include task descriptions, feature descriptions, and, crucially, domain knowledge, enabling high-quality binary classification even in data-scarce scenarios.
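A hedged sketch of that prompt pattern follows, with illustrative field names and wording rather than the study's actual templates.

```python
# Assemble a task description, feature descriptions, and injected
# domain knowledge into one classification prompt.
def build_prompt(task: str, features: dict, domain_knowledge: str) -> str:
    feature_lines = "\n".join(f"- {name}: {value}"
                              for name, value in features.items())
    return (
        f"Task: {task}\n"
        f"Patient features:\n{feature_lines}\n"
        f"Domain knowledge: {domain_knowledge}\n"
        f"Answer with exactly one word: yes or no."
    )

prompt = build_prompt(
    task="Predict 30-day readmission (yes/no).",
    features={"age": 67, "diagnosis": "heart failure", "prior admissions": 2},
    domain_knowledge="Heart failure with repeat admissions raises readmission risk.",
)
print(prompt)  # this string would be sent to the chat model
```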
arXiv Detail & Related papers (2023-08-17T20:50:46Z)
- Multi-Task Training with In-Domain Language Models for Diagnostic Reasoning [5.321587036724933]
We present a comparative analysis of in-domain versus out-of-domain language models as well as multi-task versus single task training.
We demonstrate that a multi-task, clinically trained language model outperforms its general domain counterpart by a large margin.
arXiv Detail & Related papers (2023-06-07T15:55:34Z)
- Multi-Dimensional Evaluation of Text Summarization with In-Context Learning [79.02280189976562]
In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning.
Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization.
We then analyze the effects of factors such as the selection and number of in-context examples on performance.
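Such an in-context evaluator amounts to a few scored demonstrations followed by the new summary; the sketch below builds one such prompt for a single (invented) dimension.

```python
# Build an in-context-learning evaluation prompt: scored demonstrations,
# then the summary to be rated. Demonstrations are invented.
def icl_eval_prompt(dimension: str, demos: list, summary: str) -> str:
    shots = "\n".join(f"Summary: {s}\n{dimension} score: {r}"
                      for s, r in demos)
    return f"{shots}\nSummary: {summary}\n{dimension} score:"

demos = [
    ("Covers all key findings accurately.", 5),
    ("Omits the main result and adds unsupported claims.", 1),
]
prompt = icl_eval_prompt("coherence", demos,
                         "Mentions methods but skips conclusions.")
print(prompt)  # an LLM completion of this prompt yields the score
```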
arXiv Detail & Related papers (2023-06-01T23:27:49Z)
- Automated Evaluation for Student Argumentative Writing: A Survey [2.9466390764652415]
This paper surveys and organizes research works in an under-studied area, which we call automated evaluation for student argumentative writing.
Unlike traditional automated writing evaluation that focuses on holistic essay scoring, this field is more specific: it focuses on evaluating argumentative essays and offers specific feedback.
arXiv Detail & Related papers (2022-05-09T07:27:59Z)
- Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review [52.359007622096684]
Peer review is a key component of the publishing process in most fields of science.
Existing NLP studies focus on the analysis of individual texts, yet editorial assistance often requires modeling interactions between pairs of texts.
arXiv Detail & Related papers (2022-04-22T16:39:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.