LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset,
Framework, and Benchmark
- URL: http://arxiv.org/abs/2306.06687v3
- Date: Mon, 6 Nov 2023 07:02:19 GMT
- Title: LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset,
Framework, and Benchmark
- Authors: Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai
Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, Wanli Ouyang
- Abstract summary: We present a Language-Assisted Multi-Modal instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of 2D and 3D vision tasks.
- Score: 81.42376626294812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have emerged as a promising approach towards achieving
general-purpose AI agents. The thriving open-source LLM community has greatly
accelerated the development of agents that support human-machine dialogue
interaction through natural language processing. However, human interaction
with the world extends beyond only text as a modality, and other modalities
such as vision are also crucial. Recent works on multi-modal large language
models, such as GPT-4V and Bard, have demonstrated their effectiveness in
handling visual modalities. However, the transparency of these works is limited
and insufficient to support academic research. To the best of our knowledge, we
present one of the very first open-source endeavors in the field, LAMM,
encompassing a Language-Assisted Multi-Modal instruction tuning dataset,
framework, and benchmark. Our aim is to establish LAMM as a growing ecosystem
for training and evaluating MLLMs, with a specific focus on facilitating AI
agents capable of bridging the gap between ideas and execution, thereby
enabling seamless human-AI interaction. Our main contribution is three-fold: 1)
We present a comprehensive dataset and benchmark, which cover a wide range of
2D and 3D vision tasks. Extensive experiments validate the
effectiveness of our dataset and benchmark. 2) We outline the detailed
methodology of constructing multi-modal instruction tuning datasets and
benchmarks for MLLMs, enabling rapid scaling and extension of MLLM research to
diverse domains, tasks, and modalities. 3) We provide a preliminary but
extensible MLLM training framework optimized for modality extension. We also provide
baseline models, comprehensive experimental observations, and analysis to
accelerate future research. Our baseline model is trained within 24 A100 GPU
hours, and the framework also supports training on V100 and RTX 3090 GPUs
thanks to the open-source community.
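To make the modality-extension idea concrete, the Python sketch below shows the kind of setup such a framework implies: a hypothetical instruction-tuning sample followed by a toy model that pairs frozen per-modality encoders with small trainable projectors into an LLM's token-embedding space. All field names, class names, and dimensions here are illustrative assumptions, not LAMM's actual data schema or implementation.

import torch
import torch.nn as nn

# Hypothetical LAMM-style instruction-tuning sample (field names are assumptions,
# not the dataset's actual schema):
sample = {
    "modality": "image_2d",
    "media_path": "images/000123.jpg",
    "conversations": [
        {"from": "human", "value": "How many traffic lights are visible in the image?"},
        {"from": "gpt", "value": "There are two traffic lights, both showing red."},
    ],
}

class ModalityProjector(nn.Module):
    """Maps modality-specific features into the LLM's token-embedding space."""
    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

class ToyMLLM(nn.Module):
    """Holds one trainable projector per registered modality; the LLM and the
    frozen encoders themselves are omitted in this sketch."""
    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        self.llm_dim = llm_dim
        self.projectors = nn.ModuleDict()

    def register_modality(self, name: str, feat_dim: int) -> None:
        # Adding a new modality only requires registering a new projector.
        self.projectors[name] = ModalityProjector(feat_dim, self.llm_dim)

    def encode(self, name: str, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_tokens, feat_dim) from a frozen, pre-trained encoder
        return self.projectors[name](feats)

model = ToyMLLM()
model.register_modality("image_2d", feat_dim=1024)    # e.g. features from a 2D image encoder
model.register_modality("point_cloud", feat_dim=512)  # e.g. features from a 3D point-cloud encoder

img_tokens = model.encode("image_2d", torch.randn(1, 256, 1024))
pc_tokens = model.encode("point_cloud", torch.randn(1, 128, 512))
print(img_tokens.shape, pc_tokens.shape)  # (1, 256, 4096) and (1, 128, 4096)

In a full training loop, the projected tokens would be concatenated with the tokenized instruction text and fed to the (frozen or parameter-efficiently tuned) LLM; only the projectors and any adapter weights would receive gradients.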
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Our main findings reveal that most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts.
To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [11.786387517781328]
Vision-Language Models (VLMs) are advanced models that can tackle more intricate tasks such as image captioning and visual question answering.
Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs.
We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- ModaVerse: Efficiently Transforming Modalities with LLMs [25.49713745405194]
We introduce ModaVerse, a Multi-modal Large Language Model capable of comprehending and transforming content across various modalities.
We propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language.
arXiv Detail & Related papers (2024-01-12T06:28:54Z)
- u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z)