Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training
- URL: http://arxiv.org/abs/2412.08307v1
- Date: Wed, 11 Dec 2024 11:39:42 GMT
- Title: Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training
- Authors: Shijian Wang, Linxin Song, Jieyu Zhang, Ryotaro Shimizu, Ao Luo, Li Yao, Cunjian Chen, Julian McAuley, Hanqian Wu
- Abstract summary: We propose a programmatic instruction template generator capable of producing over 39B unique template combinations. Experiments across eight common MLMs on five benchmark datasets reveal high template sensitivities, with performance gaps of up to 29% between different templates. Models tuned on our augmented dataset achieve the best overall performance compared with same-scale MLMs tuned on up to 75 times as much data.
- Score: 27.764452541732226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current evaluation and training approaches for multimodal language models (MLMs) overlook the influence of instruction format, an elephant-in-the-room problem. Previous research addresses this by manually crafting instructions, but fails to yield significant insights due to limited diversity and scalability. In this work, we propose a programmatic instruction template generator capable of producing over 39B unique template combinations by filling randomly sampled positional synonyms into weight-sampled meta templates, enabling us to comprehensively examine MLM performance across diverse instruction templates. Our experiments across eight common MLMs on five benchmark datasets reveal that MLMs have high template sensitivities, with performance gaps of up to 29% between different templates. We further augment the instruction tuning dataset of LLaVA-1.5 with our template generator and perform instruction tuning on LLaVA-1.5-7B and LLaVA-1.5-13B. Models tuned on our augmented dataset achieve the best overall performance compared with same-scale MLMs tuned on up to 75 times as much data, highlighting the importance of instruction templates in MLM training. The code is available at https://github.com/shijian2001/TemplateMatters.
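To make the generation scheme concrete, here is a minimal sketch of the weighted meta-template sampling described above. The meta templates, slot names, synonym pools, and weights are invented for illustration; the real templates and pools live in the TemplateMatters repository.

```python
import random

# Hypothetical (meta template, sampling weight) pairs with placeholder slots.
META_TEMPLATES = [
    ("{answer_verb} the {question_noun} {based_on} the image: {q}", 0.5),
    ("{based_on} the image, {answer_verb} the following {question_noun}: {q}", 0.3),
    ("{q}\n{answer_verb} {based_on} the image above.", 0.2),
]

# Positional synonym pools: each slot is filled independently.
SYNONYMS = {
    "answer_verb": ["Answer", "Respond to", "Address"],
    "question_noun": ["question", "query", "prompt"],
    "based_on": ["based on", "according to", "given"],
}

def generate_template(question: str) -> str:
    """Weight-sample a meta template, then fill each slot with a random synonym."""
    templates, weights = zip(*META_TEMPLATES)
    meta = random.choices(templates, weights=weights, k=1)[0]
    fills = {slot: random.choice(options) for slot, options in SYNONYMS.items()}
    return meta.format(q=question, **fills)

print(generate_template("What color is the car?"))
```

Because each slot is filled independently, the number of unique templates grows multiplicatively with the pool sizes, which is how modest sets of meta templates and synonyms can reach billions of combinations.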
Related papers
- Should We Still Pretrain Encoders with Masked Language Modeling? [27.19054714197245]
Recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders. We train a total of 38 models ranging from 210 million to 1 billion parameters and conduct over 15,000 fine-tuning and evaluation runs. We find that while MLM pretraining yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability.
arXiv Detail & Related papers (2025-07-01T17:45:48Z) - Boosting Large Language Models with Mask Fine-Tuning [60.56962908455601]
We introduce Mask Fine-Tuning (MFT) to show that properly breaking the integrity of the model can surprisingly lead to improved performance.
Experiments show that MFT gains a consistent performance boost across various domains and backbones.
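As a rough illustration of one plausible reading of "breaking the integrity of the model", the sketch below zeroes out a random fraction of each weight matrix and keeps those entries zeroed during fine-tuning. The mask ratio, which tensors are masked, and the masking schedule are all assumptions, not the paper's actual recipe.

```python
import torch
import torch.nn as nn

def apply_random_weight_masks(model: nn.Module, mask_ratio: float = 0.1) -> dict:
    """Zero out a random fraction of each weight matrix and record fixed masks."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() >= 2:  # mask weight matrices, leave biases/norms intact
            mask = (torch.rand_like(param) > mask_ratio).float()
            param.data.mul_(mask)
            masks[name] = mask
    return masks

def reapply_masks(model: nn.Module, masks: dict) -> None:
    """Call after optimizer.step() so masked weights remain zero throughout tuning."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])
```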
arXiv Detail & Related papers (2025-03-27T20:17:57Z) - MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning [69.7347209018861]
We introduce MLLM-Selector, an automated approach that identifies valuable data for visual instruction tuning.
We calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance.
Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector.
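A greedy selection loop that mixes a necessity score with a diversity penalty might look like the sketch below; the necessity scores, sample embeddings, and mixing weight are placeholders for whatever MLLM-Selector actually computes.

```python
import numpy as np

def select_data(necessity: np.ndarray, embeddings: np.ndarray,
                budget: int, diversity_weight: float = 0.5) -> list[int]:
    """Greedily pick `budget` samples, trading necessity against redundancy.

    `necessity` holds per-sample value scores; `embeddings` are unit-normalized
    sample embeddings, so dot products approximate cosine similarity.
    """
    selected: list[int] = []
    max_sim = np.zeros(len(necessity))  # similarity to the closest selected sample
    for _ in range(budget):
        gain = necessity - diversity_weight * max_sim
        gain[selected] = -np.inf  # never pick the same sample twice
        idx = int(np.argmax(gain))
        selected.append(idx)
        max_sim = np.maximum(max_sim, embeddings @ embeddings[idx])
    return selected
```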
arXiv Detail & Related papers (2025-03-26T12:42:37Z) - Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation [56.75665429851673]
This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment.
Experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%.
arXiv Detail & Related papers (2024-09-27T08:20:59Z) - MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs [47.94710556156627]
MIA-Bench is a benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions.
Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions.
arXiv Detail & Related papers (2024-07-01T17:53:35Z) - Aligning Language Models with Demonstrated Feedback [58.834937450242975]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors.
We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
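DITTO's core move can be sketched as building preference pairs in which user demonstrations are preferred over the model's own samples, which then feed a standard preference-tuning objective such as DPO. The pair-construction below is a simplified assumption, not the paper's exact iterated procedure.

```python
def build_preference_pairs(demonstrations, model_sample, num_samples: int = 4):
    """demonstrations: list of (prompt, user-written output) pairs;
    model_sample(prompt) -> str draws one output from the current model."""
    pairs = []
    for prompt, demo in demonstrations:
        for _ in range(num_samples):
            rejected = model_sample(prompt)  # the model's own attempt
            pairs.append({"prompt": prompt, "chosen": demo, "rejected": rejected})
    return pairs
```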
arXiv Detail & Related papers (2024-06-02T23:13:56Z) - Learning to Decode Collaboratively with Multiple Language Models [37.31339648499042]
We propose a method to teach multiple large language models (LLM) to collaborate by interleaving their generations at the token level.
Token-level collaboration during decoding allows for a fusion of each model's expertise in a manner tailored to the specific task at hand.
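A minimal sketch of token-level interleaving, assuming Hugging Face-style causal LMs that share a tokenizer: at each step every model scores the next token, and one model's token is kept. The paper learns which model should lead at each position; the max-confidence heuristic here is only a stand-in.

```python
import torch

def collaborative_decode(models, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Interleave generations token by token, keeping the most confident model's token."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        best_prob, best_token = -1.0, None
        for model in models:
            logits = model(ids).logits[0, -1]        # next-token logits
            probs = torch.softmax(logits, dim=-1)
            prob, token = probs.max(dim=-1)
            if prob.item() > best_prob:
                best_prob, best_token = prob.item(), token
        ids = torch.cat([ids, best_token.view(1, 1)], dim=-1)
        if best_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```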
arXiv Detail & Related papers (2024-03-06T17:23:28Z) - Towards Robust Instruction Tuning on Multimodal Large Language Models [25.506776502317436]
In this work, we introduce an automatic instruction augmentation method named INSTRAUG in multimodal tasks.
Results on two popular multimodal instruction-following benchmarks show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks.
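In spirit, instruction augmentation applies rewriting rules to each instruction to multiply the dataset; the rules below are illustrative inventions, not INSTRAUG's actual rule set.

```python
import random

# Toy rewriting rules; a real augmenter would use a richer, learned or curated set.
REWRITES = [
    lambda s: s,                                # keep the original
    lambda s: f"Please {s[0].lower()}{s[1:]}",  # politeness prefix
    lambda s: f"{s} Answer concisely.",         # add an output constraint
    lambda s: f"Task: {s}",                     # declarative framing
]

def augment_instructions(instructions, factor: int = 30) -> list:
    """Expand an instruction set roughly `factor`-fold by sampling rewrites."""
    augmented = []
    for inst in instructions:
        for _ in range(factor):
            augmented.append(random.choice(REWRITES)(inst))
    return augmented
```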
arXiv Detail & Related papers (2024-02-22T12:35:50Z) - Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
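Merging LLM parameters while reusing modality encoders as-is can be sketched as below, with uniform weight averaging as one simple merge rule; NaiveMC's actual merging may differ, and this assumes all models share one architecture.

```python
import torch

def merge_llm_parameters(state_dicts: list[dict]) -> dict:
    """Uniformly average the LLM weights of several MLLMs with identical shapes.

    Modality encoders are deliberately excluded: they would be reused verbatim
    rather than merged.
    """
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged
```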
arXiv Detail & Related papers (2024-02-20T06:38:10Z) - Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements [10.687101698324897]
Large language models demonstrate a remarkable capability for learning to solve new tasks from a few examples.
The prompt template, or the way the input examples are formatted to obtain the prompt, is an important yet often overlooked aspect of in-context learning.
We show that a poor choice of the template can reduce the performance of the strongest models and inference methods to a random guess level.
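Measuring this sensitivity amounts to scoring the same model on the same data under several templates and reporting the spread; in the sketch below, `model_answer` and `dataset` are placeholders for a real model call and benchmark.

```python
def template_sensitivity(model_answer, dataset, templates) -> float:
    """Return the performance gap (max - min accuracy) across prompt templates.

    model_answer(prompt) -> str; dataset: iterable of (question, answer) pairs;
    templates: format strings containing a "{q}" slot for the question.
    """
    accuracies = []
    for template in templates:
        correct = sum(
            model_answer(template.format(q=question)).strip() == answer
            for question, answer in dataset
        )
        accuracies.append(correct / len(dataset))
    return max(accuracies) - min(accuracies)
```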
arXiv Detail & Related papers (2024-01-12T18:58:26Z) - Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning [85.55727213502402]
We focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks.
We propose Sample-specific Ensemble of Source Models (SESoM).
SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling source model outputs.
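Sample-specific ensembling can be sketched as attention-like weights over source model outputs, computed per target sample; the learned keys and the scoring rule below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sesom_ensemble(sample_embedding: torch.Tensor,
                   source_logits: torch.Tensor,
                   source_keys: torch.Tensor) -> torch.Tensor:
    """Mix source model predictions with per-sample weights.

    sample_embedding: (dim,) embedding of the target sample;
    source_logits:    (num_sources, num_classes) predictions for this sample;
    source_keys:      (num_sources, dim) learned keys, one per source task.
    """
    scores = source_keys @ sample_embedding   # (num_sources,) similarity scores
    weights = F.softmax(scores, dim=0)        # per-sample mixing weights
    return (weights.unsqueeze(1) * source_logits).sum(dim=0)
```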
arXiv Detail & Related papers (2022-10-23T01:33:16Z)