What Makes for Good Visual Tokenizers for Large Language Models?
- URL: http://arxiv.org/abs/2305.12223v2
- Date: Tue, 23 May 2023 10:35:35 GMT
- Title: What Makes for Good Visual Tokenizers for Large Language Models?
- Authors: Guangzhi Wang, Yixiao Ge, Xiaohan Ding, Mohan Kankanhalli, Ying Shan
- Abstract summary: We investigate proper pre-training methods to build good visual tokenizers, making Large Language Models (LLMs) powerful Multimodal Large Language Models (MLLMs).
We discuss different visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, DINO).
We obtain a new MLLM equipped with a tailored Good Visual Tokenizer (GVT), which exhibits strong visual comprehension capability at multiple scales.
- Score: 26.488269091290597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We empirically investigate proper pre-training methods to build good visual
tokenizers, making Large Language Models (LLMs) powerful Multimodal Large
Language Models (MLLMs). In our benchmark, which is curated to evaluate MLLMs'
visual semantic understanding and fine-grained perception capabilities, we
discuss different visual tokenizers pre-trained with dominant methods (i.e.,
DeiT, CLIP, MAE, DINO), and observe that: i) Fully/weakly supervised models
capture more semantics than self-supervised models, but the gap is narrowed by
scaling up the pre-training dataset. ii) Self-supervised models are better at
fine-grained perception, where patch-level supervision is particularly
effective. iii) Tuning the visual tokenizer leads to the loss of semantics
obtained from large-scale pretraining, which is unfavorable with a relatively
small-scale instruction-tuning dataset. Given the findings, we review methods
that attempt to unify semantics and fine-grained visual understanding, e.g.,
patch-level feature distillation with semantically-rich targets. We obtain an
intriguing insight: mask-based strategies that were once all the rage may not be
applicable for obtaining good visual tokenizers. Based on this critical
observation, we obtain a new MLLM equipped with a tailored Good Visual
Tokenizer (GVT), which exhibits strong visual comprehension capability at
multiple scales. In particular, without introducing extra parameters and
task-specific fine-tuning, GVT achieves superior performance on visual question
answering, image captioning, and other fine-grained visual understanding tasks
such as object counting and multi-class identification.
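The abstract mentions patch-level feature distillation with semantically-rich targets but does not spell out the objective. The following is a minimal sketch of one plausible form, assuming a student tokenizer and a frozen teacher that each emit one feature vector per image patch, and assuming a cosine-distance loss averaged over patches. The function names and the choice of loss are illustrative assumptions, not the paper's implementation. No patches are masked, in line with the abstract's remark that mask-based strategies may not help here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as carrying no directional signal
    return dot / (norm_u * norm_v)


def patch_distillation_loss(student_patches, teacher_patches):
    """Mean (1 - cosine similarity) over aligned patch features.

    Assumption: both arguments hold one feature vector per image patch,
    in the same patch order and with the same dimensionality.
    """
    if len(student_patches) != len(teacher_patches):
        raise ValueError("student and teacher must cover the same patches")
    if not student_patches:
        raise ValueError("at least one patch is required")
    per_patch = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_patches, teacher_patches)
    ]
    return sum(per_patch) / len(per_patch)


# Toy features for three patches (hypothetical values).
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0], [0.5, 0.5]]      # same directions as teacher
orthogonal = [[0.0, 1.0], [1.0, 0.0], [1.0, -1.0]]  # perpendicular to teacher

print(patch_distillation_loss(aligned, teacher))
print(patch_distillation_loss(orthogonal, teacher))
```

In this sketch the loss depends only on feature direction, so a student whose patch features point the same way as the teacher's incurs a loss near zero regardless of scale, while perpendicular features incur a loss of one.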
Related papers
- Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent [72.1517476116743]
Recent MLLMs have shown emerging visual understanding and reasoning abilities after being pre-trained on large-scale multimodal datasets.
Existing approaches, such as direct fine-tuning and continual learning methods, fail to explicitly address this issue.
We introduce a novel perspective leveraging effective rank to quantify the degradation of visual representation forgetting.
We propose a modality-decoupled gradient descent (MDGD) method that regulates gradient updates to maintain the effective rank of visual representations.
arXiv Detail & Related papers (2025-02-17T12:26:34Z)
- Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning [125.79428219851289]
Inst-IT is a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning.
Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm.
arXiv Detail & Related papers (2024-12-04T18:58:10Z)
- MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding [6.538592344967826]
We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation.
The proposed model significantly surpasses the previous state-of-the-art in various vision-language benchmarks and achieves better performance than dedicated understanding models.
arXiv Detail & Related papers (2024-11-26T03:33:52Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- Rethinking Visual Prompt Learning as Masked Visual Token Modeling [106.71983630652323]
We propose Visual Prompt learning as masked visual Token Modeling (VPTM) to transform the downstream visual classification into the pre-trained masked visual token prediction.
VPTM is the first visual prompt method on the generative pre-trained visual model, which achieves consistency between pre-training and downstream visual classification by task reformulation.
arXiv Detail & Related papers (2023-03-09T02:43:10Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.