Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model
- URL: http://arxiv.org/abs/2405.01591v1
- Date: Mon, 29 Apr 2024 13:23:33 GMT
- Title: Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model
- Authors: Seonhee Cho, Choonghan Kim, Jiho Lee, Chetan Chilkunda, Sujin Choi, Joo Heung Yoon
- Abstract summary: We introduce MID-M, a novel framework that leverages the in-context learning capabilities of a general-domain Large Language Model (LLM) to process multimodal data via image descriptions.
MID-M achieves performance comparable or superior to task-specific fine-tuned LMMs and other general-domain models, without extensive domain-specific training or pre-training on multimodal data.
The robustness of MID-M against data quality issues demonstrates its practical utility in real-world medical domain applications.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Large Multimodal Models (LMMs) have attracted interest in their generalization capability with only a few samples in the prompt. This progress is particularly relevant to the medical domain, where the quality and sensitivity of data pose unique challenges for model training and application. However, the dependency on high-quality data for effective in-context learning raises questions about the feasibility of these models when they encounter the inevitable variations and errors inherent in real-world medical data. In this paper, we introduce MID-M, a novel framework that leverages the in-context learning capabilities of a general-domain Large Language Model (LLM) to process multimodal data via image descriptions. MID-M achieves performance comparable or superior to task-specific fine-tuned LMMs and other general-domain models, without extensive domain-specific training or pre-training on multimodal data, and with significantly fewer parameters. This highlights the potential of leveraging general-domain LLMs for domain-specific tasks and offers a sustainable and cost-effective alternative to traditional LMM development. Moreover, the robustness of MID-M against data quality issues demonstrates its practical utility in real-world medical domain applications.
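The pipeline the abstract describes can be sketched end to end: each image is first converted into a textual description by an off-the-shelf image-to-text model, and the descriptions are then assembled into a few-shot prompt for a general-domain, text-only LLM. The sketch below is a minimal illustration of that idea under stated assumptions; the model names, prompt template, and helper functions are illustrative choices, not the authors' exact configuration.

```python
# Minimal sketch of an image-descriptions-to-LLM pipeline in the spirit of MID-M.
# Model choices and the prompt template are illustrative assumptions.
from transformers import pipeline

# Hypothetical off-the-shelf image captioner (the unimodal conversion step).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
# Hypothetical general-domain, text-only LLM used purely via in-context learning.
llm = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def describe(image_path: str) -> str:
    """Convert an image into a short textual description."""
    return captioner(image_path)[0]["generated_text"]

def build_prompt(exemplars, query_description: str) -> str:
    """Assemble a few-shot prompt from (description, report) pairs plus the query."""
    shots = "\n\n".join(f"Image description: {d}\nReport: {r}" for d, r in exemplars)
    return f"{shots}\n\nImage description: {query_description}\nReport:"

# Usage (file names and exemplar reports are placeholders):
# exemplars = [(describe("case1.png"), "No acute cardiopulmonary abnormality.")]
# prompt = build_prompt(exemplars, describe("query.png"))
# report = llm(prompt, max_new_tokens=128)[0]["generated_text"]
```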
Related papers
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective [53.48484062444108]
We find that the development of models and data is not two separate paths but rather interconnected.
On the one hand, larger and higher-quality data contribute to better performance of MLLMs; on the other hand, MLLMs can facilitate the development of data.
To promote the data-model co-development for MLLM community, we systematically review existing works related to MLLMs from the data-model co-development perspective.
arXiv Detail & Related papers (2024-07-11T15:08:11Z) - HEMM: Holistic Evaluation of Multimodal Foundation Models [91.60364024897653]
Multimodal foundation models can holistically process text alongside images, video, audio, and other sensory modalities.
It is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains.
arXiv Detail & Related papers (2024-07-03T18:00:48Z) - MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era [72.95901753186227]
Multi-Modal Relation Understanding (MMRel) is a comprehensive dataset for studying inter-object relations with Multi-modal Large Language Models (MLLMs).
MMRel features three distinctive attributes: (i) It includes over 15K question-answer pairs, which are sourced from three distinct domains, ensuring large scale and high diversity; (ii) It contains a subset featuring highly unusual relations, on which MLLMs often fail due to hallucinations, making it very challenging; (iii) It provides manually verified high-quality labels for inter-object relations.
arXiv Detail & Related papers (2024-06-13T13:51:59Z) - Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances in self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse [4.98050508891467]
We propose a two-stage approach for the construction of production prompts designed to yield high-quality data.
This method involves the generation of a diverse array of prompts that encompass a broad spectrum of tasks and exhibit a rich variety of expressions.
We introduce a cost-effective, multi-dimensional quality assessment framework to ensure the integrity of the generated labeling data.
arXiv Detail & Related papers (2024-03-14T08:27:32Z) - Model Composition for Multimodal Large Language Models [73.70317850267149]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z) - Cross-Modal Prototype based Multimodal Federated Learning under Severely Missing Modality [31.727012729846333]
Multimodal Federated Cross Prototype Learning (MFCPL) is a novel approach for multimodal federated learning (MFL) under severely missing modalities.
MFCPL provides diverse modality knowledge at the modality-shared level through cross-modal regularization and at the modality-specific level through a cross-modal contrastive mechanism.
Our approach introduces cross-modal alignment to provide regularization for modality-specific features, thereby enhancing overall performance.
arXiv Detail & Related papers (2024-01-25T02:25:23Z) - Multimodal Question Answering for Unified Information Extraction [15.798187192290746]
Multimodal information extraction aims to extract structured information from unstructured multimedia content.
Most current MIE models are task-specific and data-intensive.
We propose a novel multimodal question answering (MQA) framework to unify three MIE tasks.
arXiv Detail & Related papers (2023-10-04T17:58:05Z) - Combining State-of-the-Art Models with Maximal Marginal Relevance for Few-Shot and Zero-Shot Multi-Document Summarization [0.6690874707758508]
Multi-document summarization (MDS) poses many challenges to researchers beyond those posed by single-document summarization (SDS).
We propose a strategy for combining state-of-the-art models' outputs using maximal marginal relevance (MMR).
Our MMR-based approach shows improvement over some aspects of the current state-of-the-art results in both few-shot and zero-shot MDS applications (see the MMR sketch below).
arXiv Detail & Related papers (2022-11-19T21:46:31Z)
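Since the entry above centers on maximal marginal relevance, here is a minimal sketch of the standard MMR selection rule it builds on: at each step, pick the candidate that maximizes λ·sim(candidate, query) − (1−λ)·max over already-selected items of sim(candidate, selected). The similarity function and λ below are placeholders; the cited paper's specific way of combining model outputs is not reproduced here.

```python
# Minimal sketch of maximal marginal relevance (MMR) selection.
# `similarity` and `lam` are placeholders; the cited paper's exact setup is not shown.
from typing import Callable, List, Sequence

def mmr_select(
    candidates: Sequence[str],
    query: str,
    similarity: Callable[[str, str], float],
    k: int = 3,
    lam: float = 0.7,
) -> List[str]:
    """Greedily pick k items, balancing relevance to the query against redundancy."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c: str) -> float:
            relevance = similarity(c, query)
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Usage with a toy word-overlap similarity (purely illustrative):
# def overlap(a, b):
#     wa, wb = set(a.lower().split()), set(b.lower().split())
#     return len(wa & wb) / max(1, len(wa | wb))
# summary_sentences = mmr_select(candidate_sentences, query_text, overlap, k=3)
```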