TARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents
- URL: http://arxiv.org/abs/2603.01241v1
- Date: Sun, 01 Mar 2026 19:31:23 GMT
- Title: TARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents
- Authors: Junda Wang, Zonghai Tao, Hansi Zeng, Zhichao Yang, Hamed Zamani, Hong Yu
- Abstract summary: We frame clinical question answering as an agent problem with two explicit, retrievable resources. We build a skills library from guideline-style documents organized as executable decision rules. We then adapt the model on the retrieved items to reduce instance-step misalignment.
- Score: 30.35248346284844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Complex clinical decision making often fails not because a model lacks facts, but because it cannot reliably select and apply the right procedural knowledge and the right prior example at the right reasoning step. We frame clinical question answering as an agent problem with two explicit, retrievable resources: skills, reusable clinical procedures such as guidelines, protocols, and pharmacologic mechanisms; and experience, verified reasoning trajectories from previously solved cases (e.g., chain-of-thought solutions and their step-level decompositions). At test time, the agent retrieves both relevant skills and experiences from curated libraries and performs lightweight test-time adaptation to align the language model's intermediate reasoning with clinically valid logic. Concretely, we build (i) a skills library from guideline-style documents organized as executable decision rules, (ii) an experience library of exemplar clinical reasoning chains indexed by step-level transitions, and (iii) a step-aware retriever that selects the most useful skill and experience items for the current case. We then adapt the model on the retrieved items to reduce instance-step misalignment and to prevent reasoning from drifting toward unsupported shortcuts. Experiments on medical question-answering benchmarks show consistent gains over strong medical RAG baselines and prompting-only reasoning methods. Our results suggest that explicitly separating and retrieving clinical skills and experience, and then aligning the model at test time, is a practical approach to more reliable medical agents.
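The retrieval side of the pipeline described above (two separate libraries plus a step-aware retriever that picks items for the current reasoning step) can be sketched as follows. This is an illustrative stand-in, not the authors' implementation: the toy `embed` function, the library contents, and the `retrieve` helper are all assumptions for demonstration; a real system would use a trained text encoder and the paper's step-level indexing.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding; a stand-in for a real text encoder."""
    seed = int(hashlib.sha256(text.encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)  # unit norm, so dot product = cosine sim

def retrieve(step: str, library: list[str], k: int = 1) -> list[str]:
    """Step-aware retrieval: rank library items by cosine similarity
    to the current reasoning step and keep the top-k."""
    q = embed(step)
    scores = np.array([q @ embed(item) for item in library])
    top = np.argsort(scores)[::-1][:k]
    return [library[i] for i in top]

# Two separate, explicitly typed resources, as in the abstract
# (hypothetical library entries for illustration only).
skills = [
    "If eGFR < 30, avoid metformin and reassess dosing.",
    "For suspected PE, compute a Wells score before imaging.",
]
experiences = [
    "Case A: chest pain -> Wells score -> CT angiogram -> PE confirmed.",
    "Case B: hyperglycemia -> renal panel -> metformin withheld.",
]

step = "Patient with reduced renal function currently on metformin"
# One skill item plus one experience item form the adaptation context.
context = retrieve(step, skills) + retrieve(step, experiences)
assert len(context) == 2
```

The retrieved `context` would then be the input to the paper's lightweight test-time adaptation step; with a toy random-projection embedding the specific items chosen are arbitrary, but the two-library structure is the point of the sketch.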
Related papers
- Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification [60.18369393468405]
Existing verifiers usually underperform owing to a lack of domain knowledge and limited calibration. GLEAN compiles expert-curated protocols into trajectory-informed, well-calibrated correctness signals. We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset.
arXiv Detail & Related papers (2026-03-03T09:36:43Z) - Closing Reasoning Gaps in Clinical Agents with Differential Reasoning Learning [16.144050164828794]
We propose Differential Reasoning Learning (DRL), a framework that improves clinical agents by learning from reasoning discrepancies. DRL extracts reasoning graphs as directed acyclic graphs (DAGs) and performs a clinically weighted graph edit distance (GED)-based discrepancy analysis. At inference, we retrieve top-$k$ instructions to augment the agent prompt and patch likely logic gaps.
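The discrepancy analysis in this summary compares two reasoning DAGs. As a crude, uniform-weight stand-in for DRL's clinically weighted GED (the weighting scheme and graph extraction are not described here and are not reproduced), one can score the mismatch between the edge sets of an expert chain and an agent chain directly; all node names below are hypothetical:

```python
# Crude stand-in for DRL's discrepancy analysis: represent each reasoning
# trace as a DAG edge set and score mismatch by symmetric difference.
# The real method uses a clinically *weighted* graph edit distance;
# uniform edge weights here are an illustrative simplification.

def edge_discrepancy(dag_a: set[tuple[str, str]],
                     dag_b: set[tuple[str, str]]) -> int:
    """Count the edge insertions/deletions turning one DAG into the other."""
    return len(dag_a ^ dag_b)

# Reference (expert) reasoning chain vs. the agent's chain.
expert = {("symptoms", "differential"),
          ("differential", "tests"),
          ("tests", "diagnosis")}
agent = {("symptoms", "differential"),
         ("symptoms", "diagnosis")}  # shortcut skipping tests

gap = edge_discrepancy(expert, agent)
# Edges present in only one graph flag the shortcut edge and the two
# missing steps; in DRL these discrepancies would drive which corrective
# instructions are retrieved at inference time.
```

Here `gap` counts three mismatched edges (the agent's shortcut plus the two expert steps it skipped), which is the kind of signal the framework would translate into retrievable instructions.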
arXiv Detail & Related papers (2026-02-10T16:29:32Z) - AgentScore: Autoformulation of Deployable Clinical Scoring Systems [45.88028371034407]
We introduce AgentScore, which performs semantically guided optimization in unit-weighted clinical checklists. AgentScore outperforms existing score-generation methods and achieves AUC comparable to more flexible interpretable models. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.
arXiv Detail & Related papers (2026-01-29T21:11:06Z) - Timely Clinical Diagnosis through Active Test Selection [49.091903570068155]
We propose ACTMED (Adaptive Clinical Test selection via Model-based Experimental Design) to better emulate real-world diagnostic reasoning. LLMs act as flexible simulators, generating plausible patient state distributions and supporting belief updates without requiring structured, task-specific training data. We evaluate ACTMED on real-world datasets and show it can optimize test selection to improve diagnostic accuracy, interpretability, and resource use.
arXiv Detail & Related papers (2025-10-21T18:10:45Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs). Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z) - Towards Next-Generation Medical Agent: How o1 is Reshaping Decision-Making in Medical Scenarios [46.729092855387165]
We study the choice of the backbone LLM for medical AI agents, which is the foundation for the agent's overall reasoning and action generation. Our findings demonstrate o1's ability to enhance diagnostic accuracy and consistency, paving the way for smarter, more responsive AI tools.
arXiv Detail & Related papers (2024-11-16T18:19:53Z) - ArgMed-Agents: Explainable Clinical Decision Reasoning with LLM Discussion via Argumentation Schemes [7.950883198425716]
ArgMed-Agents is a framework that enables large language models (LLMs) to perform explainable clinical decision reasoning through interaction. We construct a formal model of ArgMed-Agents and present conjectures for theoretical guarantees. Experiments show that ArgMed-Agents not only improves accuracy on complex clinical decision reasoning problems compared to other prompting methods but, more importantly, provides users with decision explanations that increase their confidence.
arXiv Detail & Related papers (2024-03-10T19:47:00Z) - LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning [61.7853049843921]
Chain-of-thought (CoT) prompting is a popular in-context learning approach for large language models (LLMs). This paper introduces a new approach named Latent Reasoning Skills (LaRS) that employs unsupervised learning to create a latent space representation of rationales.
arXiv Detail & Related papers (2023-12-07T20:36:10Z) - Inverse Contextual Bandits: Learning How Behavior Evolves over Time [89.59391124399927]
We seek an approach to policy learning that provides interpretable representations of decision-making.
First, we model the behavior of learning agents in terms of contextual bandits, and formalize the problem of inverse contextual bandits (ICB).
Second, we propose two algorithms to tackle ICB, each making varying degrees of assumptions regarding the agent's learning strategy.
arXiv Detail & Related papers (2021-07-13T18:24:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.