Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications
- URL: http://arxiv.org/abs/2505.05736v1
- Date: Fri, 09 May 2025 02:28:41 GMT
- Title: Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications
- Authors: Da Wu, Zhanliang Wang, Quan Nguyen, Zhuoran Xu, Kai Wang
- Abstract summary: MINT (Multimodal Integrated kNowledge Transfer) is a framework that aligns unimodal large decoder models with domain-specific decision patterns from multimodal biomedical data. MINT leverages an upstream multimodal machine learning (MML) model trained on high-quality multimodal data to transfer domain-specific insights to downstream text-only or image-only models.
- Score: 7.751808693373747
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The scarcity of high-quality multimodal biomedical data limits the ability to effectively fine-tune pretrained Large Language Models (LLMs) for specialized biomedical tasks. To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from multimodal biomedical data through preference optimization. While MINT supports different optimization techniques, we primarily implement it with the Odds Ratio Preference Optimization (ORPO) framework as its backbone. This strategy enables the aligned LLMs to perform predictive tasks using text-only or image-only inputs while retaining knowledge learnt from multimodal data. MINT leverages an upstream multimodal machine learning (MML) model trained on high-quality multimodal data to transfer domain-specific insights to downstream text-only or image-only LLMs. We demonstrate its effectiveness through two key applications: (1) Rare genetic disease prediction from texts, where MINT uses a multimodal encoder model, trained on facial photos and clinical notes, to generate a preference dataset for aligning a lightweight Llama 3.2-3B-Instruct. Despite relying on text input only, the MINT-derived model outperforms models trained with SFT, RAG, or DPO, and even outperforms Llama 3.1-405B-Instruct. (2) Tissue type classification using cell nucleus images, where MINT uses a vision-language foundation model as the preference generator, which contains knowledge learnt from both text and histopathological images, to align downstream image-only models. The resulting MINT-derived model significantly improves the performance of Llama 3.2-Vision-11B-Instruct on tissue type classification. In summary, MINT provides an effective strategy to align unimodal LLMs with high-quality multimodal expertise through preference optimization.
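In the rare-disease application, the abstract's recipe amounts to: let the upstream multimodal model rank candidate diagnoses for each clinical note, turn its rankings into (prompt, chosen, rejected) preference pairs, and align the text-only LLM with ORPO. Below is a minimal sketch of that last alignment step using the TRL library's ORPOTrainer; the preference pair, the Beckwith-Wiedemann example, and all hyperparameters are illustrative assumptions, not the paper's actual data or settings.

```python
# Minimal ORPO alignment sketch (assumptions: TRL's ORPOTrainer, a toy preference
# pair, untuned hyperparameters; not the paper's actual data or configuration).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

# Hypothetical preference pair: "chosen" is the diagnosis favored by the upstream
# multimodal model, "rejected" is a plausible but lower-ranked alternative.
pairs = Dataset.from_list([{
    "prompt": "Clinical note: macroglossia, umbilical hernia, neonatal hypoglycemia. Most likely diagnosis?",
    "chosen": "Beckwith-Wiedemann syndrome",
    "rejected": "Congenital hypothyroidism",
}])

model_name = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

args = ORPOConfig(
    output_dir="mint-orpo",
    beta=0.1,                        # weight of the odds-ratio penalty added to the SFT loss
    per_device_train_batch_size=1,
    num_train_epochs=1,
)
trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=pairs,
    processing_class=tokenizer,      # `tokenizer=` on older TRL releases
)
trainer.train()
```

Because ORPO folds the preference penalty into a standard supervised objective, no separate reference model is needed at training time, which keeps the alignment step lightweight for a 3B-parameter student.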
Related papers
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- CLIP-IT: CLIP-based Pairing for Histology Images Classification [6.855390956571216]
We introduce CLIP-IT to train a vision backbone model to classify histology images by pairing them with privileged textual information from an external source. First, a modality pairing step relies on a CLIP-based model to match histology images with semantically relevant textual report data from external sources, creating an augmented multimodal dataset (see the pairing sketch after this entry). A parameter-efficient fine-tuning method is then used to address the misalignment between the main (image) and paired (text) modalities.
arXiv Detail & Related papers (2025-04-22T18:14:43Z)
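For concreteness, the pairing step in the CLIP-IT entry above can be approximated with an off-the-shelf CLIP checkpoint: embed a histology patch and a pool of candidate report snippets, then keep the highest-scoring report as the paired text. The checkpoint, file name, and report snippets below are illustrative assumptions, not CLIP-IT's actual setup.

```python
# CLIP-style image-report pairing sketch (assumptions: generic OpenAI CLIP checkpoint,
# hypothetical patch and report snippets; CLIP-IT's own pairing model differs).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

image = Image.open("patch.png")  # hypothetical histology patch
reports = [                      # hypothetical external report snippets
    "Moderately differentiated colorectal adenocarcinoma.",
    "Benign colonic mucosa with no evidence of dysplasia.",
]

inputs = processor(text=reports, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image  # one similarity score per candidate report
best = sims.argmax(dim=-1).item()
print("Paired report:", reports[best])       # becomes the privileged text for this image
```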
- MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks [50.98856172702256]
We propose the Modality-INformed knowledge Distillation (MIND) framework, a multimodal model compression approach. MIND transfers knowledge from ensembles of pre-trained deep neural networks of varying sizes into a smaller multimodal student (a distillation-loss sketch follows this entry). We evaluate MIND on binary and multilabel clinical prediction tasks using time series data and chest X-ray images.
arXiv Detail & Related papers (2025-02-03T08:50:00Z)
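The transfer step in a framework like MIND boils down to a distillation loss that pulls the student's logits toward the teacher's (here, an ensemble average) while still fitting the hard labels. A generic temperature-scaled version is sketched below; the temperature, blend weight, and toy tensors are assumptions, and MIND's actual modality-informed objective is more involved.

```python
# Generic knowledge-distillation loss sketch (assumptions: toy logits, arbitrary
# temperature and blend weight; not MIND's exact modality-informed objective).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term against the teacher with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                              # rescale after temperature softening
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Teacher logits could be an average over an ensemble of pre-trained unimodal networks.
teacher_logits = torch.randn(4, 2)                       # 4 samples, 2 classes
student_logits = torch.randn(4, 2, requires_grad=True)
labels = torch.tensor([0, 1, 0, 1])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```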
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework, LLM-Morph, for aligning the deep features from medical images of different modalities.
Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then we use the first adapter to adjust these tokens, and use LoRA in pre-trained LLMs to fine-tune their weights.
Third, for the alignment of tokens, we utilize four other adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task (a minimal LoRA sketch follows this entry).
arXiv Detail & Related papers (2024-08-20T09:58:30Z)
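The LoRA piece of the pipeline above is the standard trick of freezing the pretrained LLM and training small low-rank adapters inside its attention projections. A minimal sketch with the PEFT library follows; the backbone checkpoint, rank, and target modules are assumptions rather than LLM-Morph's reported configuration.

```python
# LoRA attachment sketch with PEFT (assumptions: hypothetical backbone checkpoint,
# arbitrary rank and target modules; not LLM-Morph's reported settings).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # hypothetical backbone
lora = LoraConfig(
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],   # attention projections that receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()         # only the adapter weights remain trainable
```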
- VANER: Leveraging Large Language Model for Versatile and Adaptive Biomedical Named Entity Recognition [3.4923338594757674]
Large language models (LLMs) can be used to train a model capable of extracting various types of entities.
In this paper, we utilize the open-sourced LLM LLaMA2 as the backbone model, and design specific instructions to distinguish between different types of entities and datasets.
Our model VANER, trained with a small subset of parameters, significantly outperforms previous LLM-based models and, for the first time as an LLM-based model, surpasses the majority of conventional state-of-the-art BioNER systems.
arXiv Detail & Related papers (2024-04-27T09:00:39Z)
- Residual-based Language Models are Free Boosters for Biomedical Imaging [15.154015369984572]
In this study, we uncover the unexpected efficacy of residual-based large language models (LLMs) as part of encoders for biomedical imaging tasks.
We found that these LLMs could boost performance across a spectrum of biomedical imaging applications, including both 2D and 3D visual classification tasks.
As a byproduct, we found that the proposed framework achieved superior performance, setting new state-of-the-art results on extensive, standardized datasets in MedMNIST-2D and 3D.
arXiv Detail & Related papers (2024-03-26T03:05:20Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification [14.820951153262685]
We introduce a new dataset, MELINDA, for Multimodal biomEdicaL experImeNt methoD clAssification.
The dataset is collected in a fully automated distant supervision manner, where the labels are obtained from an existing curated database.
We benchmark various state-of-the-art NLP and computer vision models, including unimodal models which only take either caption texts or images as inputs.
arXiv Detail & Related papers (2020-12-16T19:11:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.