Related papers: Route-and-Execute: Auditable Model-Card Matching and Specialty-Level Deployment

URL: http://arxiv.org/abs/2508.16839v3
Date: Sun, 31 Aug 2025 22:39:41 GMT
Title: Route-and-Execute: Auditable Model-Card Matching and Specialty-Level Deployment
Authors: Shayan Vassef, Soorya Ram Shimegekar, Abhay Goyal, Koustuv Saha, Pi Zonooz, Navin Kumar
Abstract summary: We present a framework that uses a single vision-language model (VLM) in two complementary roles. First, the VLM acts as an aware model-card matcher that routes an incoming image to the appropriate specialist model. Second, we fine-tune the VLM on specialty-specific datasets ensuring a single model covers multiple downstream tasks.
Score: 6.7202991099968346
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Clinical workflows are fragmented as a patchwork of scripts and task-specific networks that often handle triage, task selection, and model deployment. These pipelines are rarely streamlined for data science pipeline, reducing efficiency and raising operational costs. Workflows also lack data-driven model identification (from imaging/tabular inputs) and standardized delivery of model outputs. In response, we present a practical, healthcare-first framework that uses a single vision-language model (VLM) in two complementary roles. First (Solution 1), the VLM acts as an aware model-card matcher that routes an incoming image to the appropriate specialist model via a three-stage workflow (modality -> primary abnormality -> model-card id). Checks are provided by (i) stagewise prompts that allow early exit via None/Normal/Other and (ii) a stagewise answer selector that arbitrates between the top-2 candidates at each stage, reducing the chance of an incorrect selection and aligning the workflow with clinical risk tolerance. Second (Solution 2), we fine-tune the VLM on specialty-specific datasets ensuring a single model covers multiple downstream tasks within each specialty, maintaining performance while simplifying deployment. Across gastroenterology, hematology, ophthalmology, and pathology, our single-model deployment matches or approaches specialized baselines. Compared with pipelines composed of many task-specific agents, this approach shows that one VLM can both decide and do. It may reduce effort by data scientists, shorten monitoring, increase the transparency of model selection (with per-stage justifications), and lower integration overhead.
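The three-stage routing described in the abstract (modality -> primary abnormality -> model-card id, with early exit and top-2 arbitration) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `route`, `run_stage`, `RouteResult`, the `scorer` and `selector` callables, and the layout of the option tables are all hypothetical. In practice `scorer` and `selector` would wrap VLM prompts; here they are plain callables so the control flow can be read and tested on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Labels that end routing early, taken from the abstract's "None/Normal/Other".
EARLY_EXIT = {"None", "Normal", "Other"}

# Hypothetical interfaces (assumptions, not from the paper):
# scorer(stage, image, options) -> {option: score}
# selector(stage, image, first, second) -> the chosen one of the two candidates
Scorer = Callable[[str, object, List[str]], Dict[str, float]]
Selector = Callable[[str, object, str, str], str]


@dataclass
class RouteResult:
    model_card_id: Optional[str]  # None when routing exited early
    trace: List[Tuple[str, str]] = field(default_factory=list)  # per-stage decisions


def run_stage(stage: str, image: object, options: List[str],
              scorer: Scorer, selector: Selector) -> str:
    """Rank the options, then let the selector arbitrate between the top two."""
    if not options:
        raise ValueError(f"no options configured for stage {stage!r}")
    scores = scorer(stage, image, options)
    ranked = sorted(options, key=lambda o: scores.get(o, 0.0), reverse=True)
    if len(ranked) == 1:
        return ranked[0]
    return selector(stage, image, ranked[0], ranked[1])


def route(image: object,
          modalities: List[str],
          abnormalities: Dict[str, List[str]],
          model_cards: Dict[Tuple[str, str], List[str]],
          scorer: Scorer, selector: Selector) -> RouteResult:
    """Run the three stages in order, stopping at the first early-exit label."""
    trace: List[Tuple[str, str]] = []

    modality = run_stage("modality", image, modalities, scorer, selector)
    trace.append(("modality", modality))
    if modality in EARLY_EXIT:
        return RouteResult(None, trace)

    abnormality = run_stage("abnormality", image, abnormalities[modality],
                            scorer, selector)
    trace.append(("abnormality", abnormality))
    if abnormality in EARLY_EXIT:
        return RouteResult(None, trace)

    card_id = run_stage("model_card", image,
                        model_cards[(modality, abnormality)], scorer, selector)
    trace.append(("model_card", card_id))
    return RouteResult(None if card_id in EARLY_EXIT else card_id, trace)
```

The `trace` field records the decision made at each stage, which is one simple way to keep the routing auditable; the per-stage justifications mentioned in the abstract would have to be carried alongside it.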

Related papers

Model Specific Task Similarity for Vision Language Model Selection via Layer Conductance [92.72779885657373]
We propose a framework that grounds model selection in the internal functional dynamics of the visual encoder. Our approach represents each task via layer wise conductance and derives a target-conditioned block importance distribution through entropy regularized alignment. Building on this, we introduce Directional Conductance Divergence (DCD), an asymmetric metric that quantifies how effectively a source task covers the target's salient functional blocks.
arXiv Detail & Related papers (2026-02-01T17:29:43Z)
Effortless Vision-Language Model Specialization in Histopathology without Annotation [0.4154350202907906]
Vision-Language Models (VLMs) have demonstrated impressive zero-shot classification capabilities across various tasks. Their general-purpose design may lead to suboptimal performance in specific downstream applications. This paper investigates annotation-free adaptation of VLMs through continued pretraining on domain- and task-relevant image-caption pairs.
arXiv Detail & Related papers (2025-08-11T10:39:27Z)
Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision [1.890063512530524]
MML-SurgAdapt is a unified multi-task framework to handle diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations.
arXiv Detail & Related papers (2025-07-07T14:03:10Z)
Model alignment using inter-modal bridges [0.6906005491572401]
Existing methods require extensive paired training data or are constrained to specific domains. We introduce a semi-supervised approach for model alignment via conditional flow matching. Our method provides a data-efficient solution for inter-modal model alignment with minimal supervision.
arXiv Detail & Related papers (2025-05-18T09:30:02Z)
Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning [46.51245338355645]
Model merging is a flexible and computationally tractable approach to merge single-task checkpoints into a multi-task model. We show that it qualitatively differs from the single-input-multiple-output model merging settings studied in the literature due to the existence of task-specific decoders. We propose two simple and efficient fixes for the SIMO setting to re-align the feature representation after merging.
arXiv Detail & Related papers (2025-04-15T15:10:46Z)
Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC [77.8851460746251]
We propose a simple, efficient, and general approach to fine-tune diffusion models. ONE-PIC enhances the inherited generative ability in the pretrained diffusion models without introducing additional modules. Our method is simple and efficient, which streamlines the adaptation process and achieves excellent performance with lower costs.
arXiv Detail & Related papers (2024-12-07T11:19:32Z)
DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem. To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects. In our experiments, we observe that our model outperforms the state-of-the-art on Argoverse 2 Sensor and Open dataset.
arXiv Detail & Related papers (2024-06-06T18:12:04Z)
Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization [27.472705540825316]
Action localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. We show that a single MViTv2-S architecture trained with bipartite matching to perform both tasks surpasses the same MViTv2-S when trained with RoI align on pre-computed bounding boxes.
arXiv Detail & Related papers (2023-12-29T17:08:38Z)
An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently. Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
Single-Stage Visual Relationship Learning using Conditional Queries [60.90880759475021]
TraCQ is a new formulation for scene graph generation that avoids the multi-task learning problem and the entity pair distribution. We employ a DETR-based encoder-decoder with conditional queries to significantly reduce the entity label space as well. Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods but also beats many state-of-the-art two-stage methods on the Visual Genome dataset.
arXiv Detail & Related papers (2023-06-09T06:02:01Z)
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL). Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
Scanflow: A multi-graph framework for Machine Learning workflow management, supervision, and debugging [0.0]
We propose a novel containerized directed graph framework to support end-to-end Machine Learning workflow management. The framework allows defining and deploying ML in containers, tracking their metadata, checking their behavior in production, and improving the models by using both learned and human-provided knowledge.
arXiv Detail & Related papers (2021-11-04T17:01:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.