Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction
- URL: http://arxiv.org/abs/2501.00880v1
- Date: Wed, 01 Jan 2025 15:58:51 GMT
- Title: Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction
- Authors: Teng Hu, Jiangning Zhang, Ran Yi, Jieyu Weng, Yabiao Wang, Xianfang Zeng, Zhucun Xue, Lizhuang Ma
- Abstract summary: IAR is an Improved AutoRegressive Visual Generation Method.
We propose a Codebook Rearrangement strategy that uses a balanced k-means clustering algorithm.
We also propose a Cluster-oriented Cross-entropy Loss that guides the model to correctly predict the cluster where the token is located.
- Score: 52.09472099976885
- Abstract: Employing LLMs for visual generation has recently become a research focus. However, existing methods primarily transfer the LLM architecture to visual generation and rarely investigate the fundamental differences between language and vision. This oversight may lead to suboptimal utilization of visual generation capabilities within the LLM framework. In this paper, we explore the characteristics of the visual embedding space under the LLM framework and discover that the correlation between visual embeddings can help achieve more stable and robust generation results. We present IAR, an Improved AutoRegressive Visual Generation Method that enhances the training efficiency and generation quality of LLM-based visual generation models. First, we propose a Codebook Rearrangement strategy that uses a balanced k-means clustering algorithm to rearrange the visual codebook into clusters, ensuring high similarity among visual features within each cluster. Leveraging the rearranged codebook, we propose a Cluster-oriented Cross-entropy Loss that guides the model to correctly predict the cluster where the token is located. This approach ensures that even if the model predicts the wrong token index, there is a high probability that the predicted token is located in the correct cluster, which significantly enhances generation quality and robustness. Extensive experiments demonstrate that our method consistently enhances training efficiency and performance across model sizes from 100M to 1.4B parameters, reducing the training time by half while achieving the same FID. Additionally, our approach can be applied to various LLM-based visual generation models and adheres to the scaling law, providing a promising direction for future research in LLM-based visual generation.
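The two ideas in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the greedy capacity-constrained assignment used for balancing, the choice to obtain a cluster's probability by summing the token probabilities inside it, and all names (`balanced_kmeans`, `rearrange_codebook`, `cluster_oriented_loss`, `weight`) are assumptions made for illustration only.

```python
import numpy as np

def balanced_kmeans(codebook, n_clusters, n_iters=10, seed=0):
    """Toy balanced k-means: every cluster receives exactly n / n_clusters codes.
    Balancing is done greedily (an assumption, not the paper's algorithm)."""
    n, _ = codebook.shape
    assert n % n_clusters == 0, "codebook size must be divisible by n_clusters"
    cap = n // n_clusters
    rng = np.random.default_rng(seed)
    centers = codebook[rng.choice(n, n_clusters, replace=False)]
    for _ in range(n_iters):
        # Squared distance from every code to every center, shape (n, k).
        dist = ((codebook[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.full(n, -1)
        counts = np.zeros(n_clusters, dtype=int)
        # Visit (code, cluster) pairs from closest to farthest and assign
        # a code whenever it is still free and the cluster has room.
        for flat in np.argsort(dist, axis=None):
            i, c = divmod(flat, n_clusters)
            if labels[i] == -1 and counts[c] < cap:
                labels[i] = c
                counts[c] += 1
        centers = np.stack([codebook[labels == c].mean(axis=0)
                            for c in range(n_clusters)])
    return labels

def rearrange_codebook(codebook, labels):
    """Reorder codes so that each cluster occupies a contiguous index range."""
    order = np.argsort(labels, kind="stable")
    old_to_new = np.empty(len(labels), dtype=int)
    old_to_new[order] = np.arange(len(labels))
    return codebook[order], old_to_new

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cluster_oriented_loss(logits, targets, cluster_size, weight=1.0):
    """Token-level cross-entropy plus a cluster-level cross-entropy term.
    logits: (batch, vocab); targets: (batch,) indices in the rearranged codebook."""
    batch, vocab = logits.shape
    assert vocab % cluster_size == 0
    log_probs = log_softmax(logits)
    rows = np.arange(batch)
    token_ce = -log_probs[rows, targets].mean()
    # Cluster log-probability = logsumexp of the token log-probs in that cluster.
    grouped = log_probs.reshape(batch, vocab // cluster_size, cluster_size)
    peak = grouped.max(axis=-1, keepdims=True)
    cluster_log_probs = (peak + np.log(np.exp(grouped - peak)
                                       .sum(axis=-1, keepdims=True))).squeeze(-1)
    cluster_ce = -cluster_log_probs[rows, targets // cluster_size].mean()
    return token_ce + weight * cluster_ce, token_ce, cluster_ce

# Small usage example on random data (shapes are arbitrary).
demo_rng = np.random.default_rng(0)
demo_codebook = demo_rng.normal(size=(32, 8))
demo_labels = balanced_kmeans(demo_codebook, n_clusters=4)
demo_new_codebook, demo_old_to_new = rearrange_codebook(demo_codebook, demo_labels)
demo_logits = demo_rng.normal(size=(5, 32))
demo_targets = demo_old_to_new[demo_rng.integers(0, 32, size=5)]
print(cluster_oriented_loss(demo_logits, demo_targets, cluster_size=8))
```

Because the rearranged codebook stores each cluster contiguously, the cluster of token `j` is simply `j // cluster_size`, which is what lets the cluster-level term be computed with a single reshape in this sketch.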
Related papers
- Revisiting Self-Supervised Heterogeneous Graph Learning from Spectral Clustering Perspective [52.662463893268225]
Self-supervised heterogeneous graph learning (SHGL) has shown promising potential in diverse scenarios.
Existing SHGL methods encounter two significant limitations.
We introduce a novel framework enhanced by rank and dual consistency constraints.
arXiv Detail & Related papers (2024-12-01T09:33:20Z)
- How to Make LLMs Strong Node Classifiers? [70.14063765424012]
Language Models (LMs) are challenging the dominance of domain-specific models, such as Graph Neural Networks (GNNs) and Graph Transformers (GTs).
We propose a novel approach that empowers off-the-shelf LMs to achieve performance comparable to state-of-the-art (SOTA) GNNs on node classification tasks.
arXiv Detail & Related papers (2024-10-03T08:27:54Z)
- Towards Generative Class Prompt Learning for Fine-grained Visual Recognition [5.633314115420456]
Generative Class Prompt Learning and Contrastive Multi-class Prompt Learning (CoMPLe) are presented.
Generative Class Prompt Learning improves visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts.
CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation.
arXiv Detail & Related papers (2024-09-03T12:34:21Z)
- Image Clustering Algorithm Based on Self-Supervised Pretrained Models and Latent Feature Distribution Optimization [4.39139858370436]
This paper introduces an image clustering algorithm based on self-supervised pretrained models and latent feature distribution optimization.
Our approach outperforms the latest clustering algorithms and achieves state-of-the-art clustering results.
arXiv Detail & Related papers (2024-08-04T04:08:21Z)
- CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning [17.614980614656407]
We propose Continual Generative training for Incremental prompt-Learning.
We exploit Variational Autoencoders to learn class-conditioned distributions.
We show that such a generative replay approach can adapt to new tasks while improving zero-shot capabilities.
arXiv Detail & Related papers (2024-07-22T16:51:28Z)
- One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs).
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z)
- CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted this way.
Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
- Dynamic Clustering and Cluster Contrastive Learning for Unsupervised Person Re-identification [29.167783500369442]
Unsupervised Re-ID methods aim at learning robust and discriminative features from unlabeled data.
We propose a dynamic clustering and cluster contrastive learning (DCCC) method.
Experiments on several widely used public datasets validate the effectiveness of our proposed DCCC.
arXiv Detail & Related papers (2023-03-13T01:56:53Z)
- Self-Supervised Class Incremental Learning [51.62542103481908]
Existing Class Incremental Learning (CIL) methods are based on a supervised classification framework sensitive to data labels.
When updating them based on the new class data, they suffer from catastrophic forgetting: the model cannot discern old class data clearly from the new.
In this paper, we explore the performance of Self-Supervised representation learning in Class Incremental Learning (SSCIL) for the first time.
arXiv Detail & Related papers (2021-11-18T06:58:19Z)
- Consensus Clustering With Unsupervised Representation Learning [4.164845768197489]
We study the clustering ability of Bootstrap Your Own Latent (BYOL) and observe that features learnt using BYOL may not be optimal for clustering.
We propose a novel consensus clustering based loss function, and train BYOL with the proposed loss in an end-to-end way that improves the clustering ability and outperforms similar clustering based methods.
arXiv Detail & Related papers (2020-10-03T01:16:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.