Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
- URL: http://arxiv.org/abs/2309.04669v3
- Date: Fri, 22 Mar 2024 05:41:55 GMT
- Title: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
- Authors: Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, Di Zhang, Wenwu Ou, Kun Gai, Yadong Mu,
- Abstract summary: We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
- Score: 52.935150075484074
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the remarkable advance of the Large Language Model (LLM) has inspired researchers to transfer its extraordinary reasoning capability to both vision and language data. However, the prevailing approaches primarily regard the visual input as a prompt and focus exclusively on optimizing the text generation process conditioned upon vision content by a frozen LLM. Such an inequitable treatment of vision and language heavily constrains the model's potential. In this paper, we break through this limitation by representing both vision and language in a unified form. Specifically, we introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language that LLM can read. The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image. Coped with this tokenizer, the presented foundation model called LaVIT can handle both image and text indiscriminately under the same generative learning paradigm. This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously. Extensive experiments further showcase that it outperforms the existing models by a large margin on massive vision-language tasks. Our code and models are available at https://github.com/jy0205/LaVIT.
Related papers
- HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding [91.0552157725366]
This paper presents a novel high-performance monolithic VLM named HoVLE.
It converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts.
Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks.
arXiv Detail & Related papers (2024-12-20T18:59:59Z) - Visual Lexicon: Rich Image Features in Language Space [99.94214846451347]
ViLex simultaneously captures rich semantic content and fine visual details.
ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model.
As an image embedding in the language space, ViLex tokens leverage the compositionality of natural languages.
arXiv Detail & Related papers (2024-12-09T18:57:24Z) - Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok)
SeTok groups visual features into semantic units via a dynamic clustering algorithm.
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z) - Beyond Text: Frozen Large Language Models in Visual Signal Comprehension [34.398976855955404]
Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, transforms an image into a foreign language'' with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model.
We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration.
arXiv Detail & Related papers (2024-03-12T17:59:51Z) - ViLTA: Enhancing Vision-Language Pre-training through Textual
Augmentation [35.05755930636518]
We propose ViLTA, comprising of two components to further facilitate the model to learn fine-grained representations among image-text pairs.
For Masked Language Modeling (MLM), we propose a cross-distillation method to generate soft labels to enhance the robustness of model.
For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of language input.
arXiv Detail & Related papers (2023-08-31T12:46:36Z) - DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z) - Augmenting Vision Language Pretraining by Learning Codebook with Visual
Semantics [29.393661499333284]
We propose to "discretize" the visual representation by joint learning a codebook that imbues each visual token a semantic.
We then utilize these discretized visual semantics as self-supervised ground-truths for building our Masked Image Modeling objective.
Experiments validate the effectiveness of our approach across common vision-language benchmarks.
arXiv Detail & Related papers (2022-07-31T17:36:09Z) - Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.