Making LLaMA SEE and Draw with SEED Tokenizer
- URL: http://arxiv.org/abs/2310.01218v1
- Date: Mon, 2 Oct 2023 14:03:02 GMT
- Title: Making LLaMA SEE and Draw with SEED Tokenizer
- Authors: Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang and
Ying Shan
- Abstract summary: We introduce SEED, an elaborate image tokenizer that empowers Large Language Models with the ability to SEE and Draw.
With SEED tokens, an LLM is able to perform scalable multimodal autoregression under its original training recipe.
SEED-LLaMA has exhibited compositional emergent abilities such as multi-turn in-context multimodal generation.
- Score: 69.1083058794092
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The great success of Large Language Models (LLMs) has expanded the potential
of multimodality, contributing to the gradual evolution of General Artificial
Intelligence (AGI). A true AGI agent should not only possess the capability to
perform predefined multi-tasks but also exhibit emergent abilities in an
open-world context. However, despite the considerable advancements made by
recent multimodal LLMs, they still fall short in effectively unifying
comprehension and generation tasks, let alone open-world emergent abilities. We
contend that the key to overcoming the present impasse lies in enabling text
and images to be represented and processed interchangeably within a unified
autoregressive Transformer. To this end, we introduce SEED, an elaborate image
tokenizer that empowers LLMs with the ability to SEE and Draw at the same time.
We identify two crucial design principles: (1) Image tokens should be
independent of 2D physical patch positions and instead be produced with a 1D
causal dependency, exhibiting intrinsic interdependence that aligns with the
left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens
should capture high-level semantics consistent with the degree of semantic
abstraction in words, and be optimized for both discriminativeness and
reconstruction during the tokenizer training phase. With SEED tokens, an LLM is
able to perform scalable multimodal autoregression under its original training
recipe, i.e., next-word prediction. SEED-LLaMA is therefore produced by
large-scale pretraining and instruction tuning on the interleaved textual and
visual data, demonstrating impressive performance on a broad range of
multimodal comprehension and generation tasks. More importantly, SEED-LLaMA has
exhibited compositional emergent abilities such as multi-turn in-context
multimodal generation, acting like your AI assistant.
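As a rough illustration of the two design principles above, the sketch below is a minimal PyTorch rendering, not the official SEED implementation: all module names, dimensions, and loss weights are illustrative assumptions. It produces 1D causal visual tokens from frozen patch features and trains them with a combined contrastive (discriminativeness) and reconstruction objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalVisualTokenizer(nn.Module):
    """Principle 1: image tokens with a 1D causal dependency (hypothetical sizes)."""
    def __init__(self, d_model=768, n_query=32, codebook_size=8192):
        super().__init__()
        # Learnable queries decoded left-to-right, so token i only attends to tokens < i,
        # matching the LLM's left-to-right autoregressive prediction order.
        self.queries = nn.Parameter(torch.randn(n_query, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.codebook = nn.Embedding(codebook_size, d_model)  # discrete code embeddings

    def forward(self, patch_feats):                 # (B, P, d) frozen ViT patch features
        B, (Q, d) = patch_feats.size(0), self.queries.shape
        causal = torch.triu(torch.full((Q, Q), float("-inf")), diagonal=1)
        h = self.decoder(self.queries.expand(B, Q, d), patch_feats, tgt_mask=causal)
        # Nearest-code quantization with a straight-through gradient.
        dist = (h.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, Q, K)
        ids = dist.argmin(-1)                       # discrete visual codes
        z = self.codebook(ids)
        return ids, h + (z - h).detach()

def tokenizer_loss(z, text_emb, recon, target):
    """Principle 2: optimize for discriminativeness AND reconstruction."""
    img = F.normalize(z.mean(dim=1), dim=-1)        # pooled visual embedding
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / 0.07                   # image-text contrastive logits
    labels = torch.arange(img.size(0), device=logits.device)
    contrastive = F.cross_entropy(logits, labels)   # keeps tokens semantically aligned with text
    reconstruction = F.mse_loss(recon, target)      # keeps tokens decodable back to images
    return contrastive + reconstruction
```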
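A second sketch, again under assumed token IDs and vocabulary sizes rather than the released SEED-LLaMA code, shows how such discrete visual codes can be interleaved with text so that a decoder-only LLM keeps its original next-word-prediction recipe.

```python
import torch
import torch.nn.functional as F

TEXT_VOCAB = 32000          # e.g. a LLaMA-style text vocabulary (assumed)
IMG_CODEBOOK = 8192         # number of discrete visual codes (assumed)
BOI, EOI = TEXT_VOCAB + IMG_CODEBOOK, TEXT_VOCAB + IMG_CODEBOOK + 1  # <img>, </img> markers
UNIFIED_VOCAB = TEXT_VOCAB + IMG_CODEBOOK + 2

def interleave(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Shift visual codes past the text vocabulary and wrap them in markers,
    producing one flat sequence the LLM treats like ordinary tokens."""
    shifted = image_codes + TEXT_VOCAB
    return torch.cat([text_ids, torch.tensor([BOI]), shifted, torch.tensor([EOI])])

def next_token_loss(logits: torch.Tensor, sequence: torch.Tensor) -> torch.Tensor:
    """Plain next-word prediction: the same objective covers text and image positions."""
    return F.cross_entropy(logits[:-1], sequence[1:])

# Toy usage: random "caption" tokens followed by 32 visual codes.
seq = interleave(torch.randint(0, TEXT_VOCAB, (16,)),
                 torch.randint(0, IMG_CODEBOOK, (32,)))
logits = torch.randn(seq.numel(), UNIFIED_VOCAB)   # stand-in for LLM outputs
print(seq.shape, next_token_loss(logits, seq).item())
```

Because nothing in this loop is image-specific, the same objective scales to large interleaved image-text corpora, which is the basis for the pretraining and instruction tuning described in the abstract.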
Related papers
- MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens.
MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z)
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining [48.98105914356609]
Lumina-mGPT is a family of multimodal autoregressive models capable of various vision and language tasks.
We introduce Ominiponent Supervised Finetuning, transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification.
arXiv Detail & Related papers (2024-08-05T17:46:53Z)
- MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis [18.876109299162138]
We introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE).
This innovative component integrates pre-trained LLMs by independently processing linguistic and visual information, freezing the textual component while fine-tuning the visual component.
MARS requires only 9% of the GPU days needed by SD1.5, yet it achieves remarkable results across a variety of benchmarks.
arXiv Detail & Related papers (2024-07-10T12:52:49Z)
- ViT-Lens: Towards Omni-modal Representations [64.66508684336614]
ViT-Lens-2 is a framework for representation learning of increasing modalities.
We show that ViT-Lens-2 can learn representations for 3D point cloud, depth, audio, tactile and EEG.
By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
arXiv Detail & Related papers (2023-11-27T18:52:09Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- Planting a SEED of Vision in Large Language Model [73.17530130368053]
We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the ability to SEE and Draw at the same time.
This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs.
arXiv Detail & Related papers (2023-07-16T13:41:39Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.