CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models
- URL: http://arxiv.org/abs/2407.20454v1
- Date: Mon, 29 Jul 2024 23:18:55 GMT
- Title: CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models
- Authors: Junda Wu, Xintong Li, Tong Yu, Yu Wang, Xiang Chen, Jiuxiang Gu, Lina Yao, Jingbo Shang, Julian McAuley
- Abstract summary: In this paper, we analyze MLLM instruction tuning from both theoretical and empirical perspectives.
Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance.
In addition, we introduce an auxiliary loss regularization method to promote updating of the generation distribution of MLLMs.
- Score: 68.64605538559312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction tuning in multimodal large language models (MLLMs) aims to smoothly integrate a backbone LLM with a pre-trained feature encoder for downstream tasks. The major challenge is how to efficiently find the synergy through cooperative learning, where LLMs adapt their reasoning abilities in downstream tasks while feature encoders adjust their encoding to provide more relevant modal information. In this paper, we analyze MLLM instruction tuning from both theoretical and empirical perspectives, where we find that unbalanced learning between the two components, i.e., the feature encoder and the LLM, can cause diminishing learning gradients that slow model convergence and often lead to sub-optimal results due to insufficient learning. Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance, based on which we further design a dynamic learning scheduler that better coordinates the learning. In addition, we introduce an auxiliary loss regularization method to promote updating of the generation distribution of MLLMs considering the learning state of each model component, which potentially prevents each component from gradient diminishing and enables a more accurate estimation of the learning balance coefficient. We conduct experiments with multiple LLM backbones and feature encoders, where our techniques are model-agnostic and can be generically integrated with various MLLM backbones. Experimental results on multiple downstream tasks and modalities in vision and audio demonstrate the proposed method's better efficiency and effectiveness in MLLM instruction tuning.
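To make the coordination idea concrete, below is a minimal PyTorch sketch of one way to balance learning between a feature encoder and an LLM backbone: estimate a balance coefficient from the two components' gradient norms and scale their learning rates accordingly. The coefficient and the scaling rule are illustrative assumptions, not the paper's exact formulation, and the auxiliary loss regularization is omitted.

```python
# Illustrative sketch of coordinated learning-rate scheduling between a
# feature encoder and an LLM backbone. The balance coefficient and scaling
# rule below are assumptions for illustration, not CoMMIT's exact formulas.
import torch


def grad_norm(module: torch.nn.Module) -> float:
    """L2 norm of all gradients currently stored in a module."""
    sq = 0.0
    for p in module.parameters():
        if p.grad is not None:
            sq += p.grad.detach().pow(2).sum().item()
    return sq ** 0.5


def balance_coefficient(encoder, llm, eps: float = 1e-8) -> float:
    """Ratio of encoder to LLM gradient magnitude; 1.0 means balanced learning."""
    return grad_norm(encoder) / (grad_norm(llm) + eps)


def coordinated_step(loss, encoder, llm, opt_enc, opt_llm, base_lr=1e-4):
    loss.backward()
    k = balance_coefficient(encoder, llm)
    # Slow down whichever component currently dominates the gradient signal.
    for g in opt_enc.param_groups:
        g["lr"] = base_lr / max(k, 1.0)
    for g in opt_llm.param_groups:
        g["lr"] = base_lr * min(k, 1.0)
    opt_enc.step()
    opt_llm.step()
    opt_enc.zero_grad()
    opt_llm.zero_grad()
    return k
```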
Related papers
- Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks? [6.7065734065794835]
We introduce a novel learning paradigm termed MLLM4WTAL.
It harnesses the potential of MLLMs to offer temporal action key semantics and complete semantic priors.
It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR).
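As a rough illustration of what a key-semantic-matching step could look like, the hypothetical snippet below scores per-snippet video features against an MLLM-provided text embedding of the action's key semantics using cosine similarity; the function name, shapes, and the use of plain cosine similarity are assumptions for illustration, not MLLM4WTAL's actual module.

```python
# Hypothetical key-semantic-matching step: compare each video snippet feature
# with an MLLM-provided key-semantic text embedding via cosine similarity.
# Names and shapes are assumptions for illustration only.
import torch
import torch.nn.functional as F


def key_semantic_matching(snippet_feats: torch.Tensor,
                          key_semantic: torch.Tensor) -> torch.Tensor:
    """snippet_feats: (T, D) per-snippet video features.
    key_semantic:  (D,) text embedding of the action's key semantics.
    Returns a (T,) temporal relevance curve in [-1, 1]."""
    snip = F.normalize(snippet_feats, dim=-1)
    key = F.normalize(key_semantic, dim=-1)
    return snip @ key  # higher score = snippet more likely inside the action
```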
arXiv Detail & Related papers (2024-11-13T09:37:24Z) - LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from l-MLLM to s-MLLM.
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
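Logit distillation of this kind is commonly implemented as a temperature-scaled KL divergence between the teacher's and student's next-token distributions; the sketch below shows that generic form, with the temperature and reduction chosen for illustration rather than taken from LLaVA-KD.

```python
# Generic logit-distillation loss between a large teacher MLLM and a small
# student MLLM. Temperature and reduction are illustrative assumptions.
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Both tensors: (batch, seq_len, vocab). Returns a scalar KD loss."""
    s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, 1)
    t = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, 1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```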
arXiv Detail & Related papers (2024-10-21T17:41:28Z) - RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning [35.446870721902904]
Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum.
We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis.
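A simple way to turn execution feedback into a training signal is to run the generated program against unit tests and reward the fraction that pass; the toy harness below illustrates that idea and is an assumption of this summary, not the RLEF paper's actual environment or reward.

```python
# Toy sketch of an execution-feedback reward for RL on code generation:
# run the candidate program against unit tests and reward the pass rate.
# This harness is an illustrative assumption, not RLEF's environment.
import os
import subprocess
import sys
import tempfile
import textwrap


def execution_reward(candidate_code: str, tests: list[str], timeout: float = 5.0) -> float:
    """Each test is a Python snippet that raises AssertionError on failure."""
    passed = 0
    for test in tests:
        program = candidate_code + "\n" + textwrap.dedent(test)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            passed += int(result.returncode == 0)
        except subprocess.TimeoutExpired:
            pass  # a timeout counts as a failed test
        finally:
            os.unlink(path)
    return passed / max(len(tests), 1)
```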
arXiv Detail & Related papers (2024-10-02T23:25:17Z) - Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning [15.919493497867567]
This study aims to evaluate the performance of Multimodal Large Language Models (MLLMs) on the VALSE benchmark.
We conducted a comprehensive assessment of state-of-the-art MLLMs, varying in model size and pretraining datasets.
arXiv Detail & Related papers (2024-07-17T11:26:47Z) - Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
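Deliberative planning over an LLM's decoding can be pictured as best-first search over partial reasoning chains ranked by a value estimate; the sketch below assumes placeholder `propose_steps` (an LLM step generator) and `q_value` (a value model) callables and is only a schematic of the idea, not Q*'s actual algorithm.

```python
# Schematic best-first search over partial reasoning chains, guided by a
# value estimate. `propose_steps` and `q_value` are placeholder callables
# (assumptions), standing in for an LLM step generator and a value model.
import heapq


def deliberative_search(question, propose_steps, q_value, max_expansions=50, beam=4):
    """propose_steps(question, chain) -> list of (step_text, is_final) candidates.
    q_value(question, chain) -> float, higher means more promising."""
    frontier = [(-q_value(question, []), [])]   # (negated priority, partial chain)
    completed = []
    for _ in range(max_expansions):
        if not frontier:
            break
        _, chain = heapq.heappop(frontier)
        for step, is_final in propose_steps(question, chain)[:beam]:
            new_chain = chain + [step]
            score = q_value(question, new_chain)
            if is_final:
                completed.append((score, new_chain))
            else:
                heapq.heappush(frontier, (-score, new_chain))
    return max(completed, key=lambda x: x[0])[1] if completed else []
```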
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.
We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems.
Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate and significantly decreases the interaction steps of agents.
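For reference, classical advantage-weighted regression, which this summary says ReAd extends, re-weights the policy's action log-likelihoods by exponentiated advantages; below is a minimal single-agent sketch, with shapes and the temperature beta chosen for illustration.

```python
# Minimal single-agent advantage-weighted regression loss. Shapes and the
# temperature beta are illustrative assumptions.
import torch


def awr_loss(log_probs: torch.Tensor, advantages: torch.Tensor, beta: float = 1.0):
    """log_probs, advantages: (batch,). Returns a scalar loss to minimize."""
    weights = torch.exp(advantages / beta).clamp(max=20.0)  # clip for stability
    return -(weights.detach() * log_probs).mean()
```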
arXiv Detail & Related papers (2024-05-23T08:33:19Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
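One common way to expose video to an LLM as a spatial-temporal sequence is to project per-frame patch features into the LLM's embedding space and flatten time and space into a single token stream; the module below sketches that generic recipe, with dimensions chosen as assumptions rather than ST-LLM's actual configuration.

```python
# Generic sketch of turning video features into a spatial-temporal token
# sequence for an LLM. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class VideoToLLMTokens(nn.Module):
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        """frame_feats: (batch, T, num_patches, vis_dim) from a frozen vision encoder.
        Returns (batch, T * num_patches, llm_dim) tokens to prepend to text embeddings."""
        b, t, p, _ = frame_feats.shape
        tokens = self.proj(frame_feats)       # (b, t, p, llm_dim)
        return tokens.reshape(b, t * p, -1)   # temporal order preserved
```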
arXiv Detail & Related papers (2024-03-30T10:11:26Z) - Towards Modeling Learner Performance with Large Language Models [7.002923425715133]
This paper investigates whether the pattern recognition and sequence modeling capabilities of LLMs can be extended to the domain of knowledge tracing.
We compare two approaches to using LLMs for this task, zero-shot prompting and model fine-tuning, with existing, non-LLM approaches to knowledge tracing.
While LLM-based approaches do not achieve state-of-the-art performance, fine-tuned LLMs surpass the performance of naive baseline models and perform on par with standard Bayesian Knowledge Tracing approaches.
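For context, the standard Bayesian Knowledge Tracing baseline mentioned above updates a per-skill mastery probability after each observed response using learn, guess, and slip parameters; a minimal implementation of that classical update:

```python
# Standard Bayesian Knowledge Tracing update: revise P(skill known) after an
# observed response, then account for learning between opportunities.
def bkt_update(p_known: float, correct: bool,
               p_learn: float = 0.1, p_guess: float = 0.2, p_slip: float = 0.1) -> float:
    if correct:
        posterior = (p_known * (1 - p_slip)) / (
            p_known * (1 - p_slip) + (1 - p_known) * p_guess)
    else:
        posterior = (p_known * p_slip) / (
            p_known * p_slip + (1 - p_known) * (1 - p_guess))
    return posterior + (1 - posterior) * p_learn
```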
arXiv Detail & Related papers (2024-02-29T14:06:34Z) - On Learning to Summarize with Large Language Models as References [101.79795027550959]
Summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets.
We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)