Spanning Tree Autoregressive Visual Generation
- URL: http://arxiv.org/abs/2511.17089v1
- Date: Fri, 21 Nov 2025 09:45:17 GMT
- Title: Spanning Tree Autoregressive Visual Generation
- Authors: Sangkyu Lee, Changho Lee, Janghoon Han, Hosung Song, Tackgeun You, Hwasup Lim, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu
- Abstract summary: We present Spanning Tree Autoregressive (STAR) modeling, which can incorporate prior knowledge of images, such as center bias and locality, to maintain sampling performance.
- Score: 51.7635842702602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Spanning Tree Autoregressive (STAR) modeling, which incorporates prior knowledge about images, such as center bias and locality, to maintain sampling performance while providing sequence orders flexible enough to accommodate image editing at inference. Approaches that expose randomly permuted sequence orders to a conventional autoregressive (AR) model to obtain bidirectional context in visual generation either suffer a decline in performance or give up flexibility in the choice of sequence order at inference. Instead, STAR uses traversal orders of uniform spanning trees sampled on the lattice defined by the positions of image patches. Traversal orders are obtained through breadth-first search, and rejection sampling lets us efficiently construct a spanning tree whose traversal order places a connected partial observation of the image as a prefix of the sequence. Through this structured randomization, rather than fully random permutation, STAR preserves the capability of postfix completion while maintaining sampling performance, without significant changes to the model architecture widely adopted in language AR modeling.
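To make the traversal-order construction concrete, the following is a minimal Python sketch of the idea described in the abstract, not the authors' implementation: it samples a uniform spanning tree on the patch lattice with Wilson's algorithm, reads off a breadth-first traversal as the generation order, and uses a naive accept/reject loop to keep only orders whose prefix covers a given connected set of observed patches (the paper folds rejection into a more efficient construction). All function names, such as `sample_prefix_order`, are illustrative assumptions.

```python
import random
from collections import deque

# Hypothetical sketch (assumed names, not the authors' code): sample a uniform
# spanning tree on an H x W patch lattice, take its BFS traversal as the
# generation order, and reject samples until a given connected set of observed
# patches forms a prefix of that order.

def neighbors(v, H, W):
    """4-connected lattice neighbors of patch position v = (row, col)."""
    r, c = v
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < H and 0 <= nc < W:
            yield (nr, nc)

def uniform_spanning_tree(H, W, rng=random):
    """Wilson's algorithm: loop-erased random walks give a uniform spanning tree."""
    nodes = [(r, c) for r in range(H) for c in range(W)]
    parent = {rng.choice(nodes): None}          # root has no parent
    for start in nodes:
        if start in parent:
            continue
        succ, v = {}, start
        while v not in parent:                  # random walk until the tree is hit;
            succ[v] = rng.choice(list(neighbors(v, H, W)))
            v = succ[v]                         # overwriting succ erases loops
        v = start
        while v not in parent:                  # attach the loop-erased path
            parent[v] = succ[v]
            v = succ[v]
    return parent

def bfs_order(parent, start):
    """Breadth-first traversal of the (undirected) tree, starting from `start`."""
    adj = {}
    for v, p in parent.items():
        if p is not None:
            adj.setdefault(v, []).append(p)
            adj.setdefault(p, []).append(v)
    order, seen, queue = [], {start}, deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in adj.get(v, []):
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return order

def sample_prefix_order(H, W, observed, max_tries=2000, rng=random):
    """Naive rejection sampling: keep a BFS order that lists `observed` first."""
    observed = set(observed)
    for _ in range(max_tries):
        parent = uniform_spanning_tree(H, W, rng)
        start = rng.choice(sorted(observed)) if observed else rng.choice(list(parent))
        order = bfs_order(parent, start)
        if set(order[:len(observed)]) == observed:
            return order
    raise RuntimeError("no accepted spanning tree within max_tries")

if __name__ == "__main__":
    # Postfix completion setup: the top-left 2x2 patch block is observed,
    # so its positions must appear first in the sampled generation order.
    obs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(sample_prefix_order(4, 4, obs))
```

In this sketch the BFS root is drawn from the observed patches, which makes prefix acceptance more likely; the paper's construction avoids rejecting whole trees, so this loop should be read only as an illustration of the prefix constraint.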
Related papers
- SMKC: Sketch Based Kernel Correlation Images for Variable Cardinality Time Series Anomaly Detection [0.0]
In operational environments, monitoring systems frequently experience sensor churn. We propose SMKC, a framework that decouples the dynamic input structure from the anomaly detector. We find that a detector using random projections and nearest neighbors on the SMKC representation performs competitively with fully trained baselines.
arXiv Detail & Related papers (2026-01-28T21:15:11Z) - GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation [77.13582457917418]
We train a generative model solely on grid images comprising subsampled frames. We learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. Our method consistently outperforms SoTA in quality and inference speed (at least twice as fast) across datasets.
arXiv Detail & Related papers (2025-12-24T16:46:04Z) - Does the Manipulation Process Matter? RITA: Reasoning Composite Image Manipulations via Reversely-Ordered Incremental-Transition Autoregression [13.933194190556714]
We reformulate image manipulation localization as a conditional sequence prediction task, proposing the RITA framework. RITA predicts manipulated regions layer-by-layer in an ordered manner, using each step's prediction as the condition for the next. To enable training and evaluation, we synthesize multi-step manipulation data and construct a new benchmark, HSIM.
arXiv Detail & Related papers (2025-09-24T11:25:44Z) - Latent Beam Diffusion Models for Generating Visual Sequences [16.1012766388174]
Existing methods generate each image independently, leading to disjointed narratives. We introduce a novel beam search strategy for latent space exploration. BeamDiffusion produces full sequences with superior coherence, visual continuity, and textual alignment.
arXiv Detail & Related papers (2025-03-26T11:01:10Z) - Autoregressive Image Generation with Randomized Parallel Decoding [28.352741116124538]
We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation. ARPG achieves a more than 30x speedup in inference and a 75 percent reduction in memory consumption. On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.83 with only 32 sampling steps.
arXiv Detail & Related papers (2025-03-13T17:19:51Z) - Warmup Generations: A Task-Agnostic Approach for Guiding Sequence-to-Sequence Learning with Unsupervised Initial State Generation [34.55224347308013]
Traditional supervised fine-tuning (SFT) strategies for sequence-to-sequence tasks often train models to directly generate the target output. We introduce a task-agnostic framework that enables models to generate intermediate "warmup" sequences. We show that our approach outperforms traditional SFT methods and offers a scalable and flexible solution for sequence-to-sequence tasks.
arXiv Detail & Related papers (2025-02-17T20:23:42Z) - Unsupervised Segmentation by Diffusing, Walking and Cutting [5.6872893893453105]
We propose an unsupervised image segmentation method using features from pre-trained text-to-image diffusion models. A key insight is that self-attention probability distributions can be interpreted as a transition matrix for random walks across the image. We show that our approach surpasses all existing methods for zero-shot unsupervised segmentation, achieving state-of-the-art results on COCO-Stuff-27 and Cityscapes.
arXiv Detail & Related papers (2024-12-06T00:23:18Z) - Made to Order: Discovering monotonic temporal changes via self-supervised video ordering [89.0660110757949]
We exploit a simple proxy task of ordering a shuffled image sequence, with 'time' serving as a supervisory signal.
We introduce a transformer-based model for ordering image sequences of arbitrary length, with built-in attribution maps.
arXiv Detail & Related papers (2024-04-25T17:59:56Z) - Non-autoregressive Sequence-to-Sequence Vision-Language Models [59.445765313094434]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder. The model achieves performance on par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z) - SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood (MLE) objective does not match a downstream use-case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
arXiv Detail & Related papers (2023-06-08T17:59:58Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.