On Domain-Specific Post-Training for Multimodal Large Language Models
- URL: http://arxiv.org/abs/2411.19930v1
- Date: Fri, 29 Nov 2024 18:42:28 GMT
- Title: On Domain-Specific Post-Training for Multimodal Large Language Models
- Authors: Daixuan Cheng, Shaohan Huang, Ziyu Zhu, Xintong Zhang, Wayne Xin Zhao, Zhongzhi Luan, Bo Dai, Zhenliang Zhang
- Abstract summary: We develop a visual instruction synthesizer that generates diverse visual instruction tasks from domain-specific image-caption pairs.
We apply a single-stage training pipeline to enhance task diversity for domain-specific post-training.
We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales.
- Abstract: Recent years have witnessed the rapid development of general multimodal large language models (MLLMs). However, adapting general MLLMs to specific domains, such as scientific fields and industrial applications, remains less explored. This paper systematically investigates domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation. (1) Data Synthesis: Using open-source models, we develop a visual instruction synthesizer that effectively generates diverse visual instruction tasks from domain-specific image-caption pairs. Our synthetic tasks surpass those generated by manual rules, GPT-4, and GPT-4V in enhancing the domain-specific performance of MLLMs. (2) Training Pipeline: While the two-stage training--initially on image-caption pairs followed by visual instruction tasks--is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training. (3) Task Evaluation: We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales (e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B), and then evaluating MLLM performance on various domain-specific tasks. To support further research in MLLM domain adaptation, we will open-source our implementations.
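The single-stage pipeline described in (2) can be read as folding the domain image-caption data and the synthesized visual instruction tasks into one shuffled training mixture, rather than training on them in two consecutive stages. The following is a minimal sketch of that idea; the data format, function names (caption_to_task, build_single_stage_mixture), and example records are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (not the paper's released code) of single-stage mixing:
# caption pairs and synthesized instruction tasks share one training set.
import random

def caption_to_task(image_path, caption):
    """Turn a domain image-caption pair into a captioning-style instruction."""
    return {
        "image": image_path,
        "instruction": "Describe this image.",
        "response": caption,
    }

def build_single_stage_mixture(caption_pairs, synthesized_tasks, seed=0):
    """Mix caption-derived tasks with synthesized visual instruction tasks
    so a single training run sees the full task diversity."""
    mixture = [caption_to_task(img, cap) for img, cap in caption_pairs]
    mixture.extend(synthesized_tasks)  # e.g. synthesized VQA / reasoning tasks
    random.Random(seed).shuffle(mixture)
    return mixture

# Toy usage (paths and contents are made up):
pairs = [("img_001.png", "A hematoxylin-and-eosin-stained tissue section ...")]
synth = [{"image": "img_001.png",
          "instruction": "Which staining technique is shown?",
          "response": "Hematoxylin and eosin (H&E)."}]
train_set = build_single_stage_mixture(pairs, synth)
```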
Related papers
- Way to Specialist: Closing Loop Between Specialized LLM and Evolving Domain Knowledge Graph [66.98553434041708]
Way-to-Specialist (WTS) framework synergizes retrieval-augmented generation with knowledge graphs.
"LLM$circlearrowright$KG" paradigm achieves bidirectional enhancement between specialized LLM and domain knowledge graph.
arXiv Detail & Related papers (2024-11-28T11:24:43Z)
- FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data [64.50893177169996]
Fine-tuning Multimodal Large Language Models (MLLMs) with Federated Learning (FL) allows for expanding the training data scope by including private data sources.
We introduce a benchmark for evaluating various downstream tasks in the federated fine-tuning of MLLMs within multimodal heterogeneous scenarios.
We develop a general FedMLLM framework that integrates four representative FL methods alongside two modality-agnostic strategies.
arXiv Detail & Related papers (2024-11-22T04:09:23Z)
- MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning [25.45278447786954]
We introduce a novel federated learning framework, named Multimodal Large Language Model Assisted Federated Learning (MLLM-LLaVA-FL).
Our framework is adept at harnessing the extensive, yet previously underexploited, open-source data accessible from websites and powerful server-side computational resources.
arXiv Detail & Related papers (2024-09-09T21:04:16Z)
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective [53.48484062444108]
We find that the development of models and data is not two separate paths but rather interconnected.
On the one hand, vaster and higher-quality data contribute to better performance of MLLMs; on the other hand, MLLMs can facilitate the development of data.
To promote data-model co-development in the MLLM community, we systematically review existing works related to MLLMs from the data-model co-development perspective.
arXiv Detail & Related papers (2024-07-11T15:08:11Z)
- BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models [56.89958793648104]
Large Language Models (LLMs) are versatile and capable of addressing a diverse range of tasks.
Previous approaches either conduct continuous pre-training with domain-specific data or employ retrieval augmentation to support general LLMs.
We present a novel framework named BLADE, which enhances Black-box LArge language models with small Domain-spEcific models.
arXiv Detail & Related papers (2024-03-27T08:57:21Z)
- Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks [0.8192907805418583]
We develop a method to transform domain-specific visual and vision-language datasets into a unified question answering format called Visual Question Answering Instruction (VQA-IN).
The proposed method achieves high scores on domain-specific visual tasks while also maintaining its performance on vision-language tasks in a multitask manner.
arXiv Detail & Related papers (2024-02-13T10:40:53Z)
- A Framework to Implement 1+N Multi-task Fine-tuning Pattern in LLMs Using the CGC-LORA Algorithm [7.521690071464451]
We propose a unified framework that implements a 1+N multi-task fine-tuning pattern in large language models (LLMs).
Our work aims to take advantage of both the MTL (i.e., CGC) and PEFT (i.e., LoRA) schemes.
arXiv Detail & Related papers (2024-01-22T07:58:31Z)
- Knowledge Plugins: Enhancing Large Language Models for Domain-Specific Recommendations [50.81844184210381]
We propose a general paradigm that augments large language models with DOmain-specific KnowledgE to enhance their performance on practical applications, namely DOKE.
This paradigm relies on a domain knowledge extractor, working in three steps: 1) preparing effective knowledge for the task; 2) selecting the knowledge for each specific sample; and 3) expressing the knowledge in an LLM-understandable way (a minimal illustrative sketch follows this list).
arXiv Detail & Related papers (2023-11-16T07:09:38Z)
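The three-step paradigm described in the DOKE entry above (prepare knowledge, select it per sample, express it for the LLM) can be pictured as a small pipeline. The snippet below is an illustrative reading under assumed interfaces: the word-overlap scorer stands in for whatever retriever the paper actually uses, and all function names are hypothetical.

```python
# Illustrative sketch of a three-step knowledge-augmentation pattern;
# not the DOKE paper's implementation.
def prepare_knowledge(corpus):
    """Step 1: build a pool of task-relevant knowledge snippets."""
    return [doc.strip() for doc in corpus if doc.strip()]

def select_knowledge(sample, pool, top_k=2):
    """Step 2: pick the snippets most relevant to this sample
    (word overlap is a stand-in for a real retriever)."""
    def overlap(doc):
        return len(set(sample.lower().split()) & set(doc.lower().split()))
    return sorted(pool, key=overlap, reverse=True)[:top_k]

def express_for_llm(sample, snippets):
    """Step 3: serialize the selected knowledge into the prompt."""
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Relevant domain knowledge:\n{context}\n\nQuestion: {sample}"

# Toy usage (documents and query are made up):
docs = ["Item A pairs well with item B.", "Item C is seasonal."]
query = "What pairs well with item A?"
prompt = express_for_llm(query, select_knowledge(query, prepare_knowledge(docs)))
```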