EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model
- URL: http://arxiv.org/abs/2509.26127v1
- Date: Tue, 30 Sep 2025 11:45:48 GMT
- Title: EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model
- Authors: Ruixiao Dong, Zhendong Wang, Keli Liu, Li Li, Ying Chen, Kai Li, Daowen Li, Houqiang Li
- Abstract summary: EchoGen is a pioneering framework that empowers Visual Auto-Regressive (VAR) models with subject-driven generation capabilities. We employ a semantic encoder to extract the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition. To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models.
- Score: 56.53617289548353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subject-driven generation is a critical task in creative AI; yet current state-of-the-art methods present a stark trade-off. They either rely on computationally expensive, per-subject fine-tuning, sacrificing efficiency and zero-shot capability, or employ feed-forward architectures built on diffusion models, which are inherently plagued by slow inference speeds. Visual Auto-Regressive (VAR) models are renowned for their rapid sampling speeds and strong generative quality, making them an ideal yet underexplored foundation for resolving this tension. To bridge this gap, we introduce EchoGen, a pioneering framework that empowers VAR models with subject-driven generation capabilities. The core design of EchoGen is an effective dual-path injection strategy that disentangles a subject's high-level semantic identity from its low-level fine-grained details, enabling enhanced controllability and fidelity. We employ a semantic encoder to extract the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition. Concurrently, a content encoder captures intricate visual details, which are integrated via a multi-modal attention mechanism to ensure high-fidelity texture and structural preservation. To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models. Both quantitative and qualitative results substantiate our design, demonstrating that EchoGen achieves subject fidelity and image quality comparable to state-of-the-art diffusion-based methods with significantly lower sampling latency. Code and models will be released soon.
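The semantic path described in the abstract can be illustrated with a minimal sketch of decoupled cross-attention. This is not EchoGen's released code: it assumes the commonly used form in which a shared query attends to two conditioning streams through separate key/value projections and the two results are summed. The presence of a text stream, the function names, the `scale` weight, and all shapes are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def decoupled_cross_attention(x, text_tokens, subject_tokens, params, scale=1.0):
    # One shared query projection; separate key/value projections per stream.
    q = x @ params["w_q"]
    text_out = attention(q,
                         text_tokens @ params["w_k_text"],
                         text_tokens @ params["w_v_text"])
    subj_out = attention(q,
                         subject_tokens @ params["w_k_subj"],
                         subject_tokens @ params["w_v_subj"])
    # Hypothetical weighting: scale controls how strongly the subject
    # identity steers the result relative to the text stream.
    return text_out + scale * subj_out

def init_params(dim, rng):
    # Random projections stand in for learned weights.
    names = ["w_q", "w_k_text", "w_v_text", "w_k_subj", "w_v_subj"]
    return {n: rng.standard_normal((dim, dim)) / np.sqrt(dim) for n in names}

rng = np.random.default_rng(0)
dim = 8
params = init_params(dim, rng)
x = rng.standard_normal((16, dim))     # image tokens at one scale (assumed)
text = rng.standard_normal((5, dim))   # text tokens (assumed)
subj = rng.standard_normal((4, dim))   # semantic-encoder tokens (assumed)
out = decoupled_cross_attention(x, text, subj, params, scale=0.5)
```

Setting `scale=0.0` reduces the layer to the text-only branch, which is one reason this additive form is convenient when adding a new condition to a pretrained backbone.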
Related papers
- StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models. We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
arXiv Detail & Related papers (2026-03-02T11:35:05Z)
- DreamVAR: Taming Reinforced Visual Autoregressive Model for High-Fidelity Subject-Driven Image Generation [108.71044040025374]
We present a novel framework for subject-driven image synthesis built upon a Visual Autoregressive model that employs next-scale prediction. We show that DreamVAR achieves superior appearance preservation compared to leading diffusion-based methods.
arXiv Detail & Related papers (2026-01-30T03:32:29Z)
- IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction [77.06211178777939]
IAR2 is an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. We show that IAR2 sets a new state-of-the-art for autoregressive image generation, achieving a FID of 1.50 on ImageNet.
arXiv Detail & Related papers (2025-10-08T12:08:21Z)
- Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model [87.23753533733046]
We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder.
arXiv Detail & Related papers (2025-05-29T16:15:48Z)
- DDAE++: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning [53.27049077100897]
Generative pre-training has been shown to yield discriminative representations, paving the way towards unified visual generation and understanding. This work introduces self-conditioning, a mechanism that internally leverages the rich semantics inherent in the denoising network to guide its own decoding layers. Results are compelling: our method boosts both generation FID and recognition accuracy with 1% computational overhead and generalizes across diverse diffusion architectures.
arXiv Detail & Related papers (2025-05-16T08:47:16Z)
- Boosting Generative Image Modeling via Joint Image-Feature Synthesis [15.133906625258797]
We introduce a novel generative image modeling framework that seamlessly bridges the gap by leveraging a diffusion model to jointly model low-level image latents. Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance.
arXiv Detail & Related papers (2025-04-22T17:41:42Z)
- RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning [54.07026389388881]
We present the first real-object-based retrieval-augmented generation framework (RealRAG). RealRAG augments fine-grained and unseen novel object generation by learning and retrieving real-world images to overcome the knowledge gaps of generative models. Our framework integrates fine-grained visual knowledge for the generative models, tackling the distortion problem and improving the realism for fine-grained object generation.
arXiv Detail & Related papers (2025-02-02T16:41:54Z)
- HFMF: Hierarchical Fusion Meets Multi-Stream Models for Deepfake Detection [4.908389661988192]
HFMF is a comprehensive two-stage deepfake detection framework. It integrates vision Transformers and convolutional nets through a hierarchical feature fusion mechanism. We demonstrate that our architecture achieves superior performance across diverse dataset benchmarks.
arXiv Detail & Related papers (2025-01-10T00:20:29Z)
- DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception [78.26734070960886]
Current perceptive models heavily depend on resource-intensive datasets.
We introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability.
Our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation.
arXiv Detail & Related papers (2024-03-20T04:58:03Z)
- High-Fidelity Synthesis with Disentangled Representation [60.19657080953252]
We propose an Information-Distillation Generative Adversarial Network (ID-GAN) for disentanglement learning and high-fidelity synthesis.
Our method learns disentangled representation using VAE-based models, and distills the learned representation with an additional nuisance variable to the separate GAN-based generator for high-fidelity synthesis.
Despite the simplicity, we show that the proposed method is highly effective, achieving comparable image generation quality to the state-of-the-art methods using the disentangled representation.
arXiv Detail & Related papers (2020-01-13T14:39:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.