CaMML: Context-Aware Multimodal Learner for Large Models
- URL: http://arxiv.org/abs/2401.03149v3
- Date: Wed, 19 Jun 2024 03:29:41 GMT
- Title: CaMML: Context-Aware Multimodal Learner for Large Models
- Authors: Yixin Chen, Shuai Zhang, Boran Han, Tong He, Bo Li,
- Abstract summary: We introduce Context-Aware MultiModal Learner (CaMML) for tuning large multimodal models (LMMs)
CaMML is crafted to seamlessly integrate multimodal contextual samples into large models, empowering the model to derive knowledge from analogous, domain-specific, up-to-date information.
Based on CaMML, we have developed two multimodal models, CaMML-7B and CaMML-13B, that have shown exceptional performance across an array of benchmark datasets for multimodal tasks.
- Score: 16.30752006781618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we introduce Context-Aware MultiModal Learner (CaMML), for tuning large multimodal models (LMMs). CaMML, a lightweight module, is crafted to seamlessly integrate multimodal contextual samples into large models, thereby empowering the model to derive knowledge from analogous, domain-specific, up-to-date information and make grounded inferences. Importantly, CaMML is highly scalable and can efficiently handle lengthy multimodal context examples owing to its hierarchical design. Based on CaMML, we have developed two multimodal models, CaMML-7B and CaMML-13B, that have shown exceptional performance across an array of benchmark datasets for multimodal tasks. Remarkably, CaMML-13B achieves the state-of-the-art performance on over ten widely recognized multimodal benchmark datasets, surpassing LLaVA-1.5 (13B) with a noticeable margin, without integration of any external resources. Moreover, we have conducted extensive ablative studies to inspect the inner workings of CaMML and performed qualitative analyses to showcase its effectiveness in handling real-world challenging cases. Code and models are available at: https://github.com/amazon-science/camml.
Related papers
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization process to enhance the multimodal reasoning capabilities of MLLMs.
We develop a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance.
Our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B.
arXiv Detail & Related papers (2024-11-15T18:59:27Z) - VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models [32.10766568096317]
This paper proposes VoCoT, a multi-step Visually grounded object-centric Chain-of-Thought reasoning framework tailored for inference with LMMs.
VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representation of object concepts in a multi-modal interleaved and aligned manner.
arXiv Detail & Related papers (2024-05-27T08:12:00Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantics.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Imp: Highly Capable Large Multimodal Models for Mobile Devices [19.328141787433704]
Large language models (LLMs) have shown remarkable versatility in open-world multimodal understanding.
They are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios.
In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data.
Based on our findings, we obtain Imp -- a family of highly capable LMMs at the 2B-4B scales.
arXiv Detail & Related papers (2024-05-20T15:23:19Z) - Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z) - VL-Mamba: Exploring State Space Models for Multimodal Learning [22.701028299912398]
In this work, we propose VL-Mamba, a multimodal large language model based on state space models.
Specifically, we first replace the transformer-based backbone language model such as LLama or Vicuna with the pre-trained Mamba language model.
arXiv Detail & Related papers (2024-03-20T13:48:50Z) - Noise-powered Multi-modal Knowledge Graph Representation Framework [52.95468915728721]
The rise of Multi-modal Pre-training highlights the necessity for a unified Multi-Modal Knowledge Graph representation learning framework.
We propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking.
Our approach achieves SOTA performance across a total of ten datasets, demonstrating its versatility.
arXiv Detail & Related papers (2024-03-11T15:48:43Z) - Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z) - MM-BigBench: Evaluating Multimodal Models on Multimodal Content
Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.