Improved Masked Image Generation with Knowledge-Augmented Token Representations
- URL: http://arxiv.org/abs/2511.12032v1
- Date: Sat, 15 Nov 2025 04:53:21 GMT
- Title: Improved Masked Image Generation with Knowledge-Augmented Token Representations
- Authors: Guotao Liang, Baoquan Zhang, Zhiyuan Wen, Zihao Han, Yunming Ye
- Abstract summary: Masked image generation (MIG) has demonstrated remarkable efficiency and high-fidelity image synthesis by enabling parallel token prediction. We propose a novel Knowledge-Augmented Masked Image Generation framework, named KA-MIG, which introduces explicit knowledge of token-level semantic dependencies. We show that our method effectively enhances the model's ability to capture semantic dependencies, leading to improved generation quality.
- Score: 25.53300917337226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked image generation (MIG) has demonstrated remarkable efficiency and high-fidelity image synthesis by enabling parallel token prediction. Existing methods typically rely solely on the model itself to learn semantic dependencies among visual token sequences. However, directly learning such semantic dependencies from data is challenging because the individual tokens lack clear semantic meanings, and these sequences are usually long. To address this limitation, we propose a novel Knowledge-Augmented Masked Image Generation framework, named KA-MIG, which introduces explicit knowledge of token-level semantic dependencies (i.e., extracted from the training data) as priors to learn richer representations for improving performance. In particular, we explore and identify three types of advantageous token knowledge graphs: two positive graphs and one negative graph (i.e., the co-occurrence graph, the semantic similarity graph, and the position-token incompatibility graph). Based on these three prior knowledge graphs, we design a graph-aware encoder to learn token and position-aware representations. After that, a lightweight fusion mechanism is introduced to integrate these enriched representations into existing MIG methods. By drawing on such prior knowledge, our method effectively enhances the model's ability to capture semantic dependencies, leading to improved generation quality. Experimental results demonstrate that our method improves upon existing MIG methods for class-conditional image generation on ImageNet.
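The abstract names three token knowledge graphs but does not say how they are built. The following is a minimal sketch of one plausible reading, not the authors' construction. It assumes images are already tokenized into fixed-length sequences of codebook ids, that co-occurrence means "appear in the same image", that similarity is cosine similarity between codebook vectors, and that incompatibility means "this token was never observed at this position". The function name `build_token_graphs` and the parameter `top_k` are hypothetical.

```python
import numpy as np

def build_token_graphs(token_seqs, codebook, top_k=1):
    """Hypothetical helper. token_seqs: (N, L) int ids; codebook: (K, D) floats."""
    n_images, seq_len = token_seqs.shape
    vocab = codebook.shape[0]

    # Co-occurrence graph (assumed definition): number of images that
    # contain both token i and token j.
    presence = np.zeros((n_images, vocab), dtype=np.int64)
    presence[np.arange(n_images)[:, None], token_seqs] = 1
    cooc = presence.T @ presence
    np.fill_diagonal(cooc, 0)

    # Semantic similarity graph (assumed definition): link each token to its
    # top_k nearest codebook vectors under cosine similarity.
    norms = np.linalg.norm(codebook, axis=1, keepdims=True)
    normed = codebook / np.maximum(norms, 1e-12)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    neighbours = np.argsort(-sim, axis=1)[:, :top_k]
    sim_adj = np.zeros((vocab, vocab), dtype=bool)
    sim_adj[np.arange(vocab)[:, None], neighbours] = True

    # Position-token incompatibility graph (assumed definition): mark
    # (position, token) pairs that never occur in the training sequences.
    positions = np.broadcast_to(np.arange(seq_len), token_seqs.shape)
    pos_tok_counts = np.zeros((seq_len, vocab), dtype=np.int64)
    np.add.at(pos_tok_counts, (positions, token_seqs), 1)
    incompat = pos_tok_counts == 0

    return cooc, sim_adj, incompat

# Toy data: 2 "images" of 3 tokens each, a 4-entry codebook of 2-d vectors.
token_seqs = np.array([[0, 1, 1],
                       [0, 2, 1]])
codebook = np.array([[1.0, 0.0],
                     [0.9, 0.1],
                     [0.0, 1.0],
                     [0.1, 0.9]])
cooc, sim_adj, incompat = build_token_graphs(token_seqs, codebook, top_k=1)
```

In this toy example, tokens 0 and 1 appear together in both images, and token 3 never appears, so it is marked incompatible with every position. A real pipeline would likely threshold or normalize the raw counts before passing the graphs to an encoder; the abstract does not specify this step.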
Related papers
- TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement [87.82338951215131]
TokenAR is a simple but effective token-level enhancement mechanism to address the reference identity confusion problem.
Instruct Token Injection acts as an extra visual feature container that injects detailed and complementary priors for reference tokens.
The identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity.
arXiv Detail & Related papers (2025-10-18T03:36:26Z)
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder.
Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z)
- Importance-Based Token Merging for Efficient Image and Video Generation [41.94334394794811]
We show that preserving high-information tokens during merging significantly improves sample quality.
We propose an importance-based token merging method that prioritizes the most critical tokens in computational resource allocation.
arXiv Detail & Related papers (2024-11-23T02:01:49Z)
- Grid Jigsaw Representation with CLIP: A New Perspective on Image Clustering [33.05984601411495]
We propose a new perspective on image clustering, the pretrain-based Grid Jigsaw Representation (pGJR).
Inspired by human jigsaw puzzle processing, we modify the traditional jigsaw learning to gain a more sequential and incremental understanding of image structure.
Our experiments demonstrate that using the pretrained model as a feature extractor can accelerate the convergence of clustering.
arXiv Detail & Related papers (2023-10-27T03:07:05Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Adversarial Graph Contrastive Learning with Information Regularization [51.14695794459399]
Contrastive learning is an effective method in graph representation learning.
Data augmentation on graphs is far less intuitive, and high-quality contrastive samples are much harder to provide.
We propose a simple but effective method, Adversarial Graph Contrastive Learning (ARIEL).
It consistently outperforms the current graph contrastive learning methods in the node classification task over various real-world datasets.
arXiv Detail & Related papers (2022-02-14T05:54:48Z)
- Joint Graph Learning and Matching for Semantic Feature Correspondence [69.71998282148762]
We propose a joint graph learning and matching network, named GLAM, to explore reliable graph structures for boosting graph matching.
The proposed method is evaluated on three popular visual matching benchmarks (Pascal VOC, Willow Object and SPair-71k).
It outperforms previous state-of-the-art graph matching methods by significant margins on all benchmarks.
arXiv Detail & Related papers (2021-09-01T08:24:02Z)
- Graph Contrastive Learning with Adaptive Augmentation [23.37786673825192]
We propose a novel graph contrastive representation learning method with adaptive augmentation.
Specifically, we design augmentation schemes based on node centrality measures to highlight important connective structures.
Our proposed method consistently outperforms existing state-of-the-art baselines and even surpasses some supervised counterparts.
arXiv Detail & Related papers (2020-10-27T15:12:21Z)
- Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953]
We present two self-supervised tasks that learn from raw text with guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.