Related papers: Concept-Centric Token Interpretation for Vector-Quantized Generative Models

URL: http://arxiv.org/abs/2506.00698v1
Date: Sat, 31 May 2025 20:11:32 GMT
Title: Concept-Centric Token Interpretation for Vector-Quantized Generative Models
Authors: Tianze Yang, Yucheng Shi, Mengnan Du, Xuansheng Wu, Qiaoyu Tan, Jin Sun, Ninghao Liu
Abstract summary: Concept-Oriented Token Explanation (CORTEX) is a novel approach for interpreting Vector-Quantized Generative Models (VQGMs). Our framework employs two methods: (1) a sample-level explanation method that analyzes token importance scores in individual images, and (2) a codebook-level explanation method that explores the entire codebook to find globally relevant tokens.
Score: 41.39170053556796
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vector-Quantized Generative Models (VQGMs) have emerged as powerful tools for image generation. However, the key component of VQGMs -- the codebook of discrete tokens -- is still not well understood, e.g., which tokens are critical to generate an image of a certain concept? This paper introduces Concept-Oriented Token Explanation (CORTEX), a novel approach for interpreting VQGMs by identifying concept-specific token combinations. Our framework employs two methods: (1) a sample-level explanation method that analyzes token importance scores in individual images, and (2) a codebook-level explanation method that explores the entire codebook to find globally relevant tokens. Experimental results demonstrate CORTEX's efficacy in providing clear explanations of token usage in the generative process, outperforming baselines across multiple pretrained VQGMs. Besides enhancing VQGM transparency, CORTEX is useful in applications such as targeted image editing and shortcut feature detection. Our code is available at https://github.com/YangTianze009/CORTEX.

Related papers

Vector Quantized Latent Concepts: A Scalable Alternative to Clustering-Based Concept Discovery [10.6686798314267]
We propose the vector quantized latent concept (VQLC) method, a framework built upon the vector quantized-variational autoencoder (VQ-VAE) architecture. We show that VQLC improves scalability while maintaining comparable quality of human-understandable explanations.
arXiv Detail & Related papers (2026-02-02T19:43:20Z)
TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement [87.82338951215131]
TokenAR is a simple but effective token-level enhancement mechanism to address the reference identity confusion problem. Instruct Token Injection acts as an extra visual feature container to inject detailed and complementary priors for reference tokens. The identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity.
arXiv Detail & Related papers (2025-10-18T03:36:26Z)
Hita: Holistic Tokenizer for Autoregressive Image Generation [56.81871174745175]
We introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens.
arXiv Detail & Related papers (2025-07-03T06:44:26Z)
Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction [4.900334213807624]
We show how to enjoy the benefits of large codebooks without making autoregressive modeling more difficult. Our framework consists of two stages: (1) an autoregressive model that sequentially predicts coarse labels for each token in the sequence, and (2) an auxiliary model that simultaneously predicts fine-grained labels for all tokens conditioned on their coarse labels.
arXiv Detail & Related papers (2025-03-20T14:41:29Z)
GaussianToken: An Effective Image Tokenizer with 2D Gaussian Splatting [64.84383010238908]
We propose an effective image tokenizer with 2D Gaussian Splatting as a solution. In general, our framework integrates the local influence of 2D Gaussian distribution into the discrete space. Competitive reconstruction performances on CIFAR, Mini-Net, and ImageNet-1K demonstrate the effectiveness of our framework.
arXiv Detail & Related papers (2025-01-26T17:56:11Z)
Image Understanding Makes for A Good Tokenizer for Image Generation [62.875788091204626]
We introduce a token-based image generation (IG) framework, which relies on effective tokenizers to project images into token sequences. We show that tokenizers with strong image understanding (IU) capabilities achieve superior IG performance across a variety of metrics, datasets, tasks, and proposal networks.
arXiv Detail & Related papers (2024-11-07T03:55:23Z)
SGC-VQGAN: Towards Complex Scene Representation via Semantic Guided Clustering Codebook [9.993066868670283]
We introduce SGC-VQGAN, which uses a Semantic Online Clustering method to enhance token semantics through Consistent Semantic Learning. Our approach constructs a temporospatially consistent semantic codebook, addressing issues of codebook collapse and imbalanced token semantics.
arXiv Detail & Related papers (2024-09-09T23:12:43Z)
Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally. This neglects their different informativeness and leads to a significant increase in the number of image tokens. We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
Subobject-level Image Tokenization [60.80949852899857]
Patch-based image tokenization ignores the morphology of the visual world. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation. We show that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
arXiv Detail & Related papers (2024-02-22T06:47:44Z)
CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels [28.42405456691034]
We propose a two-stage strategy to facilitate a better visual representation in image re-identification tasks. The key idea is to fully exploit the cross-modal description ability in CLIP through a set of learnable text tokens for each ID. The effectiveness of the proposed strategy is validated on several datasets for the person or vehicle ReID tasks.
arXiv Detail & Related papers (2022-11-25T09:41:57Z)
Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers [51.581926074686535]
We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem. The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.