AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
- URL: http://arxiv.org/abs/2405.14129v2
- Date: Sat, 23 Nov 2024 14:38:05 GMT
- Title: AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
- Authors: Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai
- Abstract summary: Multimodal Large Language Models (MLLMs) are widely regarded as crucial in the exploration of Artificial General Intelligence (AGI).
The core of MLLMs lies in their capability to achieve cross-modal alignment.
Despite their success, there are shortcomings in the modeling of alignment capabilities within these models.
- Abstract: Multimodal Large Language Models (MLLMs) are widely regarded as crucial in the exploration of Artificial General Intelligence (AGI). The core of MLLMs lies in their capability to achieve cross-modal alignment. To attain this goal, current MLLMs typically follow a two-phase training paradigm: a pre-training phase and an instruction-tuning phase. Despite their success, these models fall short in how they model alignment capability. First, during the pre-training phase, the model usually assumes that all image-text pairs are uniformly aligned, whereas in fact the degree of alignment varies across pairs. Second, the instructions currently used for fine-tuning cover a variety of tasks, and different tasks usually require different levels of alignment capability, yet previous MLLMs overlook these differentiated alignment needs. To tackle these issues, we propose AlignGPT, a new multimodal large language model. In the pre-training stage, instead of treating all image-text pairs equally, we divide them into groups according to their degree of alignment, and the model is trained to learn a representation for each alignment level. In the instruction-tuning phase, we adaptively combine these alignment-level representations to meet the dynamic alignment needs of different tasks. Extensive experimental results show that our model achieves competitive performance on 12 benchmarks.
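To make the two-phase design concrete, here is a minimal PyTorch sketch of the two ideas the abstract describes: bucketing pre-training image-text pairs by their degree of alignment, and adaptively mixing learned alignment-level representations at instruction-tuning time. It is an illustration under our own assumptions, not the authors' implementation: the CLIP-style cosine similarity used as the alignment score, the rank-based bucketing, the number of levels, and all names (`bucket_by_alignment`, `AdaptiveAlignmentMixer`, `num_levels`) are ours.

```python
# Illustrative sketch only; assumptions are noted in the comments.
import torch
import torch.nn as nn
import torch.nn.functional as F


def bucket_by_alignment(image_emb: torch.Tensor,
                        text_emb: torch.Tensor,
                        num_levels: int = 8) -> torch.Tensor:
    """Assign each image-text pair to a discrete alignment level (pre-training).

    image_emb, text_emb: (batch, dim) embeddings from a frozen scorer (assumed
    here to be CLIP-style). Returns an integer level per pair, 0 = weakest.
    """
    sim = F.cosine_similarity(image_emb, text_emb, dim=-1)          # (batch,)
    # Rank-based bucketing so every level receives roughly equal data.
    ranks = sim.argsort().argsort().float() / max(len(sim) - 1, 1)  # in [0, 1]
    return (ranks * (num_levels - 1)).round().long()                # (batch,)


class AdaptiveAlignmentMixer(nn.Module):
    """Learnable embedding per alignment level, recombined with
    instruction-dependent weights at instruction-tuning time."""

    def __init__(self, num_levels: int, hidden_dim: int):
        super().__init__()
        self.level_emb = nn.Embedding(num_levels, hidden_dim)  # learned in pre-training
        self.gate = nn.Linear(hidden_dim, num_levels)          # predicts mixing weights

    def forward(self, instruction_repr: torch.Tensor) -> torch.Tensor:
        # instruction_repr: (batch, hidden_dim) pooled representation of the instruction.
        weights = self.gate(instruction_repr).softmax(dim=-1)   # (batch, num_levels)
        return weights @ self.level_emb.weight                  # (batch, hidden_dim)
```

In the actual model the assigned level would presumably condition the LLM during pre-training (for example as a special token or prefix embedding); the mixer above only shows how level representations could be adaptively combined per instruction.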
Related papers
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
The multiple instance learning (MIL)-based framework has become the mainstream approach for processing whole slide images (WSIs).
We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- Gramian Multimodal Representation Learning and Alignment [5.793118803623239]
We present the novel Gramian Representation Alignment Measure (GRAM).
GRAM learns and aligns $n$ modalities directly in the higher-dimensional space in which modality embeddings lie.
The novel GRAM-based contrastive loss function enhances the alignment of multimodal models in the higher-dimensional embedding space.
arXiv Detail & Related papers (2024-12-16T16:41:51Z)
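The blurb above does not spell out the measure. One natural reading of a "Gramian" alignment score, sketched below, treats the n modality embeddings of one sample as vectors spanning a parallelotope and uses the Gram-determinant volume as the (mis)alignment score; the function name and the L2 normalization are our assumptions, and the paper should be consulted for the exact formulation and the contrastive loss built on it.

```python
# Hedged sketch of a Gram-determinant alignment score, not the paper's code.
import torch
import torch.nn.functional as F


def gram_volume(modality_embs: torch.Tensor) -> torch.Tensor:
    """modality_embs: (n_modalities, dim) tensor, one embedding per modality.

    Returns sqrt(det(A A^T)), the volume of the parallelotope spanned by the
    normalized embeddings: near 0 when the modalities are nearly collinear
    (well aligned), approaching 1 when they are mutually orthogonal.
    """
    a = F.normalize(modality_embs, dim=-1)                     # unit-length rows
    gram = a @ a.t()                                           # (n, n) Gram matrix
    return torch.sqrt(torch.clamp(torch.det(gram), min=0.0))   # clamp for numerics


# Example: three modalities (e.g. video, audio, text) in a shared 512-d space.
# score = gram_volume(torch.randn(3, 512))  # lower volume => better alignment
```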
- Semantic Alignment for Multimodal Large Language Models [72.10272479476161]
We introduce Semantic Alignment for Multi-modal large language models (SAM).
By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis.
arXiv Detail & Related papers (2024-08-23T06:48:46Z)
- Understanding Alignment in Multimodal LLMs: A Comprehensive Study [46.33812471516309]
We analyze each aspect of preference alignment in Multimodal Large Language Models (MLLMs).
We show that combining offline and online methods can improve the performance of the model in certain scenarios.
We introduce a novel way of creating multimodal preference data called Bias-Driven Hallucination Sampling (BDHS).
arXiv Detail & Related papers (2024-07-02T17:55:03Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Planting a SEED of Vision in Large Language Model [73.17530130368053]
We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the ability to SEE and Draw at the same time.
This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs.
arXiv Detail & Related papers (2023-07-16T13:41:39Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
- MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment [24.720485548282845]
We introduce concepts in both modalities to construct two-level semantic representations for language and vision.
We train the cross-modality model in two stages, namely, uni-modal learning and cross-modal learning.
Our model achieves state-of-the-art results on several vision and language tasks.
arXiv Detail & Related papers (2022-01-29T14:30:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.