Related papers: MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models

URL: http://arxiv.org/abs/2411.10557v3
Date: Sat, 28 Jun 2025 18:24:35 GMT
Title: MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models
Authors: Jianhong Tu, Zhuohao Ni, Nicholas Crispino, Zihao Yu, Michael Bendersky, Beliz Gunel, Ruoxi Jia, Xin Liu, Lingjuan Lyu, Dawn Song, Chenguang Wang
Abstract summary: We present a novel visual instruction tuning strategy to improve the zero-shot task generalization of multimodal large language models. We find that simply increasing sufficiently diverse text-only data enables transfer of instruction following ability and domain knowledge across modalities while being more efficient than the vision-language approach.
Score: 79.0546136194314
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a novel visual instruction tuning strategy to improve the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work lacks sufficient experimentation on the importance of each modality in the instruction tuning stage, often using a majority of vision-language data while keeping text-only data limited and fixing mixtures of modalities. By incorporating diverse text-only data in the visual instruction tuning stage, we vary vision-language data in various controlled experiments to investigate the importance of modality in visual instruction tuning. Our comprehensive evaluation shows that the text-heavy instruction tuning approach is able to perform on par with traditional vision-heavy mixtures on both modalities across 12 general datasets while using as few as half the total training tokens. We find that simply increasing sufficiently diverse text-only data enables transfer of instruction following ability and domain knowledge across modalities while being more efficient than the vision-language approach.

Related papers

Beyond Language Modeling: An Exploration of Multimodal Pretraining [125.34714978184638]
We provide empirical clarity through controlled, from-scratch pretraining experiments. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language.
arXiv Detail & Related papers (2026-03-03T18:58:00Z)
TowerVision: Understanding and Improving Multilinguality in Vision-Language Models [56.775118098058506]
TowerVision is a family of open multilingual vision-language models for both image-text and video-text tasks. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches. To support further research, we publicly release all models, data, and training recipes.
arXiv Detail & Related papers (2025-10-22T17:02:48Z)
True Multimodal In-Context Learning Needs Attention to the Visual Context [69.63677595066012]
Multimodal Large Language Models (MLLMs) have enabled Multimodal In-Context Learning (MICL), that is, adapting to new tasks. Current MLLMs tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. We introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context.
arXiv Detail & Related papers (2025-07-21T17:08:18Z)
Unified Multimodal Understanding via Byte-Pair Visual Encoding [34.96534298857146]
Multimodal large language models (MLLMs) have made significant progress in vision-language understanding. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens.
arXiv Detail & Related papers (2025-06-30T09:08:08Z)
TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning [2.401993998791928]
We propose a framework that trains a lightweight vision-language mapping network to connect modalities. We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness.
arXiv Detail & Related papers (2024-08-12T16:15:32Z)
SwitchCIT: Switching for Continual Instruction Tuning [14.085371250265224]
Large language models (LLMs) and multimodal models (MMs) have exhibited impressive capabilities in various domains. Continual instruction tuning is crucial to adapt a large model to evolving tasks and domains. This work addresses the catastrophic forgetting in continual instruction learning through a mechanism for routing computations to parameter-efficient tuned models.
arXiv Detail & Related papers (2024-07-16T14:37:33Z)
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
SIMA is a framework that enhances visual and language modality alignment through self-improvement. It employs an in-context self-critic mechanism to select response pairs for preference tuning. We demonstrate that SIMA achieves superior modality alignment, outperforming previous approaches.
arXiv Detail & Related papers (2024-05-24T23:09:27Z)
Mosaic-IT: Free Compositional Data Augmentation Improves Instruction Tuning [30.82220015525281]
Mosaic Instruction Tuning (Mosaic-IT) is a human/model-free compositional data augmentation method. Mosaic-IT randomly creates rich and diverse augmentations from existing instruction tuning data. Our evaluations demonstrate a superior performance and training efficiency of Mosaic-IT.
arXiv Detail & Related papers (2024-05-22T04:08:20Z)
Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose a high-value data selection approach, TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost. Our approach, using only about 15% of the data, can achieve comparable average performance to the full-data fine-tuned model across eight benchmarks.
arXiv Detail & Related papers (2024-03-14T16:47:25Z)
Towards Robust Instruction Tuning on Multimodal Large Language Models [25.506776502317436]
In this work, we introduce an automatic instruction augmentation method named INSTRAUG in multimodal tasks. Results on two popular multimodal instruction-following benchmarks show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks.
arXiv Detail & Related papers (2024-02-22T12:35:50Z)
VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons. We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
Language Grounded QFormer for Efficient Vision Language Understanding [25.432918254523344]
We take inspiration from the Query Transformer (QFormer) approach proposed in BLIP-2 models for bridging frozen modalities. We propose a more efficient method for QFormer-based vision-language alignment.
arXiv Detail & Related papers (2023-11-13T16:30:49Z)
Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem. We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
arXiv Detail & Related papers (2023-08-17T09:32:17Z)
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [43.54069813039309]
We study vision-language instruction tuning based on the pretrained BLIP-2 models. InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks.
arXiv Detail & Related papers (2023-05-11T00:38:10Z)
ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller amount of image-text pairs. Our model has unique properties, most notably, deploying a new version with updated training samples can be done in a matter of seconds.
arXiv Detail & Related papers (2022-10-04T16:56:22Z)
Multimodal Semi-Supervised Learning for Text Recognition [10.33262222726707]
We present semi-supervised learning for multimodal text recognizers (SemiMTR) that leverages unlabeled data at each modality training phase. Our algorithm starts by pretraining the vision model through a single-stage training that unifies self-supervised learning with supervised training. In a novel setup, consistency is enforced on each modality separately.
arXiv Detail & Related papers (2022-05-08T13:55:30Z)
Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection. We propose to learn contextualized, joint representations through vision-language pre-training. The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding [73.24847320536813]
This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders. Our framework is inspired by cross-modal encoders' success in visual-language tasks while we alter the learning objective to cater to the language-heavy characteristics of NLU.
arXiv Detail & Related papers (2022-04-15T03:44:00Z)
InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training. We propose a new pre-training task based on contrastive learning. By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
DIET: Lightweight Language Understanding for Dialogue Systems [0.0]
Large-scale pre-trained language models have shown impressive results on language understanding benchmarks like GLUE and SuperGLUE. We introduce the Dual Intent and Entity Transformer (DIET) architecture, and study the effectiveness of different pre-trained representations on intent and entity prediction.
arXiv Detail & Related papers (2020-04-21T12:10:48Z)
Balancing Training for Multilingual Neural Machine Translation [130.54253367251738]
Multilingual machine translation (MT) models can translate to/from multiple languages. Standard practice is to up-sample less resourced languages to increase representation. We propose a method that instead automatically learns how to weight training data through a data scorer.
arXiv Detail & Related papers (2020-04-14T18:23:28Z)
