MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology
- URL: http://arxiv.org/abs/2511.20490v1
- Date: Tue, 25 Nov 2025 16:56:25 GMT
- Title: MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology
- Authors: Kiril Vasilev, Alexandre Misrahi, Eeshaan Jain, Phil F Cheng, Petros Liakopoulos, Olivier Michielin, Michael Moor, Charlotte Bunne,
- Abstract summary: Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical reasoning.<n>We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions.<n>Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance.
- Score: 37.556090746806845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open and closed-source LLMs and show that, even at scale, they lack reliability -- frequently hallucinating, struggling with reasoning from time-resolved data, and failing to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation model-based tools that enhance multi-modal and longitudinal reasoning, leading to task-level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool-use with a focus on MTB environments in precision oncology.
Related papers
- CoMMa: Contribution-Aware Medical Multi-Agents From A Game-Theoretic Perspective [17.875369977050926]
We propose Contribution-Aware Medical Multi-Agents (CoMMa) to tackle oncology decision support tasks.<n>Specialists operate on partitioned evidence and coordinate through a game-theoretic objective for robust decision-making.<n> Evaluated on diverse oncology benchmarks, CoMMa achieves higher accuracy and more stable performance than data-centralized and role-based multi-agents baselines.
arXiv Detail & Related papers (2026-02-09T20:04:58Z) - Cross-Linguistic Persona-Driven Data Synthesis for Robust Multimodal Cognitive Decline Detection [20.599682298329213]
We introduce SynCog, a novel framework integrating controllable zero-shot multimodal data synthesis with Chain-of-Thought deduction fine-tuning.<n>This generative paradigm enables the rapid, zero-shot expansion of clinical corpora across diverse languages.<n>Experiments on the ADReSS and ADReSSo benchmarks demonstrate that augmenting limited clinical data with synthetic phenotypes yields competitive diagnostic performance.
arXiv Detail & Related papers (2026-02-08T14:10:05Z) - MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement [63.82954136824963]
Medical Vision-Language Models excel at perception tasks with complex clinical reasoning required in real-world scenarios.<n>We propose a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and guideline reinforcement.
arXiv Detail & Related papers (2026-01-16T02:32:07Z) - M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding [66.78251988482222]
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning.<n>Current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path.<n>M3CoTBench aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare.
arXiv Detail & Related papers (2026-01-13T17:42:27Z) - MMCTOP: A Multimodal Textualization and Mixture-of-Experts Framework for Clinical Trial Outcome Prediction [6.546561764099574]
We propose MMCTOP to address the challenge of multimodal data fusion in high-dimensional biomedical informatics.<n> MMCTOP couples schema-guided textualization and input-fidelity validation with modality-aware representation learning.<n>Overall, MMCTOP advances multimodal trial modeling by combining controlled narrative normalization, context-aware expert fusion, and operational safeguards.
arXiv Detail & Related papers (2025-12-26T06:56:08Z) - OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction [2.904892426557913]
Large language models (LLMs) have shown strong performance in biomedical NLP.<n>We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction.<n>Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling.
arXiv Detail & Related papers (2025-10-20T13:35:12Z) - MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation [56.87891213797931]
We present MTR-Bench for Large Language Models' Multi-Turn Reasoning evaluation.<n>Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities.<n>MTR-Bench features fully-automated framework spanning both dataset constructions and model evaluations.
arXiv Detail & Related papers (2025-05-21T17:59:12Z) - ClusMFL: A Cluster-Enhanced Framework for Modality-Incomplete Multimodal Federated Learning in Brain Imaging Analysis [28.767460351377462]
In the context of brain imaging analysis, modality incompleteness presents a significant challenge.<n>We propose ClusMFL, a novel MFL framework that leverages feature clustering for cross-institutional brain imaging analysis.<n>ClusMFL achieves state-of-the-art performance compared to various baseline methods across varying levels of modality incompleteness.
arXiv Detail & Related papers (2025-02-14T09:33:59Z) - LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment.<n>We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews.<n>Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z) - FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning [5.65203350495478]
We present Financial Cross-Modal Multi-Hop Reasoning (FCMR), a benchmark to analyze the reasoning capabilities of MLLMs.<n>FCMR is categorized into three difficulty levels-Easy, Medium, and Hard-facilitating a step-by-step evaluation.<n>Experiments on this new benchmark reveal that even state-of-the-art MLLMs struggle, with the best-performing model achieving only 30.4% accuracy on the most challenging tier.
arXiv Detail & Related papers (2024-12-17T05:50:55Z) - Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM [53.05486269607166]
multimodal large language models (MLLMs) have significantly enhanced performance across benchmarks.<n>Existing detection methods for unimodal large language models (LLMs) are inadequate for MLLMs due to multimodal data complexity and multi-phase training.<n>We analyze multimodal data contamination using our analytical framework, MM-Detect, which defines two contamination categories-unimodal and cross-modal.
arXiv Detail & Related papers (2024-11-06T10:44:15Z) - Examining Modality Incongruity in Multimodal Federated Learning for
Medical Vision and Language-based Disease Detection [7.515840210206994]
The impact of missing modality in different clients, also called modality incongruity, has been greatly overlooked.
This paper, for the first time, analyses the impact of modality incongruity and reveals its connection with data heterogeneity across participating clients.
arXiv Detail & Related papers (2024-02-07T22:16:53Z) - Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical
Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.