MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering
- URL: http://arxiv.org/abs/2510.04220v1
- Date: Sun, 05 Oct 2025 14:23:51 GMT
- Title: MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering
- Authors: Lixuan He, Shikang Zheng, Linfeng Zhang
- Abstract summary: We propose a principled framework that constructs a hierarchical semantic tree directly from the codebook's intrinsic structure.
MASC is designed as a plug-and-play module, and our experiments validate its effectiveness.
It accelerates training by up to 57% and significantly improves generation quality, reducing the FID of LlamaGen-XL from 2.87 to 2.58.
- Score: 7.928163920344391
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Autoregressive (AR) models have shown great promise in image generation, yet they face a fundamental inefficiency stemming from their core component: a vast, unstructured vocabulary of visual tokens. This conventional approach treats tokens as a flat vocabulary, disregarding the intrinsic structure of the token embedding space where proximity often correlates with semantic similarity. This oversight results in a highly complex prediction task, which hinders training efficiency and limits final generation quality. To resolve this, we propose Manifold-Aligned Semantic Clustering (MASC), a principled framework that constructs a hierarchical semantic tree directly from the codebook's intrinsic structure. MASC employs a novel geometry-aware distance metric and a density-driven agglomerative construction to model the underlying manifold of the token embeddings. By transforming the flat, high-dimensional prediction task into a structured, hierarchical one, MASC introduces a beneficial inductive bias that significantly simplifies the learning problem for the AR model. MASC is designed as a plug-and-play module, and our extensive experiments validate its effectiveness: it accelerates training by up to 57% and significantly improves generation quality, reducing the FID of LlamaGen-XL from 2.87 to 2.58. MASC elevates existing AR frameworks to be highly competitive with state-of-the-art methods, establishing that structuring the prediction space is as crucial as architectural innovation for scalable generative modeling.
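As a rough illustration of the clustering idea the abstract describes, the sketch below groups a toy VQ codebook into clusters using plain average-linkage agglomerative merging over cosine distance. This is a minimal sketch, not MASC's actual algorithm: the paper's geometry-aware distance metric and density-driven construction are not reproduced here, and all names (`agglomerate`, `codebook`, `token_to_cluster`) are illustrative.

```python
import math
from itertools import combinations

def cosine_dist(a, b):
    """Cosine distance between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def agglomerate(embeddings, n_clusters):
    """Average-linkage agglomerative clustering of codebook embeddings.

    Starts with each code in its own cluster and repeatedly merges the
    closest pair until n_clusters remain.
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > n_clusters:
        best = None
        for (i, ci), (j, cj) in combinations(enumerate(clusters), 2):
            # Average pairwise distance between the two clusters.
            d = sum(cosine_dist(embeddings[a], embeddings[b])
                    for a in ci for b in cj) / (len(ci) * len(cj))
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Toy codebook: two tight groups of 2-D code embeddings.
codebook = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.0),   # codes 0-2
            (0.1, 1.0), (0.0, 0.9), (0.2, 1.0)]   # codes 3-5
groups = agglomerate(codebook, n_clusters=2)
token_to_cluster = {t: c for c, members in enumerate(groups) for t in members}
```

Given such a grouping, an AR model can factor next-token prediction hierarchically, first predicting a cluster and then a token within it, which is the structured prediction task the abstract contrasts with a flat vocabulary.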
Related papers
- StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models.
We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information.
To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
arXiv Detail & Related papers (2026-03-02T11:35:05Z) - FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation [50.71369329585773]
We introduce FACE, a novel Autoregressive Autoencoder framework that generates meshes at the face level.
Our one-face-one-token strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token.
FACE achieves state-of-the-art reconstruction quality on standard benchmarks.
arXiv Detail & Related papers (2026-03-02T06:47:15Z) - MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation [20.14002849273559]
Unified multimodal models aim to integrate understanding and generation within a single framework.
We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework.
Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks.
arXiv Detail & Related papers (2025-11-23T03:25:39Z) - IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction [77.06211178777939]
IAR2 is an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process.
We show that IAR2 sets a new state-of-the-art for autoregressive image generation, achieving a FID of 1.50 on ImageNet.
arXiv Detail & Related papers (2025-10-08T12:08:21Z) - Towards Efficient General Feature Prediction in Masked Skeleton Modeling [59.46799426434277]
We propose a novel General Feature Prediction framework (GFP) for efficient masked skeleton modeling.
Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations.
arXiv Detail & Related papers (2025-09-03T18:05:02Z) - Self-Enhanced Image Clustering with Cross-Modal Semantic Consistency [57.961869351897384]
We propose a framework based on cross-modal semantic consistency for efficient image clustering.
Our framework first builds a strong foundation via Cross-Modal Semantic Consistency.
In the first stage, we train lightweight clustering heads to align with the rich semantics of the pre-trained model.
In the second stage, we introduce a Self-Enhanced fine-tuning strategy.
arXiv Detail & Related papers (2025-08-02T08:12:57Z) - Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate [0.0]
This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings.
We show that specialist models trained on disparate datasets can be merged into a single, more capable Mixture-of-Experts model.
We introduce a layer-wise constructive training methodology, where a deep Transformer is "grown" by progressively stacking and training one layer at a time.
arXiv Detail & Related papers (2025-07-08T20:01:15Z) - HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling [52.58723853697152]
We propose a Hybrid Architecture Distillation (HAD) approach for DNA sequence modeling.
We employ the NTv2-500M as the teacher model and devise a grouping masking strategy.
Compared to models with similar parameters, our model achieved excellent performance.
arXiv Detail & Related papers (2025-05-27T07:57:35Z) - Explaining the role of Intrinsic Dimensionality in Adversarial Training [31.495803865226158]
We show that off-manifold adversarial examples (AEs) enhance robustness, while on-manifold AEs improve generalization.
We introduce SMAAT, which improves the scalability of AT for encoder-based models by perturbing the layer with the lowest intrinsic dimensionality.
We validate SMAAT across multiple tasks, including text generation, sentiment classification, safety filtering, and retrieval-augmented generation setups.
arXiv Detail & Related papers (2024-05-27T12:48:30Z) - UGMAE: A Unified Framework for Graph Masked Autoencoders [67.75493040186859]
We propose UGMAE, a unified framework for graph masked autoencoders.
We first develop an adaptive feature mask generator to account for the unique significance of nodes.
We then design a ranking-based structure reconstruction objective joint with feature reconstruction to capture holistic graph information.
arXiv Detail & Related papers (2024-02-12T19:39:26Z) - Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments on the three well-known downstream natural language datasets based on GPT2 show improved performance and efficiency in increasing model capacity.
arXiv Detail & Related papers (2022-03-02T13:44:49Z) - MGAE: Masked Autoencoders for Self-Supervised Learning on Graphs [55.66953093401889]
We propose a masked graph autoencoder (MGAE) framework to perform effective learning on graph-structured data.
Taking insights from self-supervised learning, we randomly mask a large proportion of edges and try to reconstruct these missing edges during training.
arXiv Detail & Related papers (2022-01-07T16:48:07Z) - SSA: Semantic Structure Aware Inference for Weakly Pixel-Wise Dense Predictions without Cost [36.27226683586425]
Semantic structure aware inference (SSA) is proposed to exploit the semantic structure information hidden in different stages of a CNN-based network, generating high-quality CAMs during model inference.
The proposed method introduces no parameters and requires no training, so it can be applied to a wide range of weakly-supervised pixel-wise dense prediction tasks.
arXiv Detail & Related papers (2021-11-05T11:07:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.