Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
- URL: http://arxiv.org/abs/2509.18824v1
- Date: Tue, 23 Sep 2025 09:12:46 GMT
- Title: Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
- Authors: Yanzuo Lu, Xin Xia, Manlin Zhang, Huafeng Kuang, Jianbin Zheng, Yuxi Ren, Xuefeng Xiao,
- Abstract summary: Hyper-Bagel is designed to simultaneously speed up both multimodal understanding and generation tasks. For generative tasks, our resulting 6-NFE model yields a 16.67x speedup in text-to-image generation and a 22x speedup in image editing.
- Score: 19.010105652612616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unified multimodal models have recently attracted considerable attention for their remarkable abilities in jointly understanding and generating diverse content. However, as contexts integrate increasingly numerous interleaved multimodal tokens, the iterative processes of diffusion denoising and autoregressive decoding impose significant computational overhead. To address this, we propose Hyper-Bagel, a unified acceleration framework designed to simultaneously speed up both multimodal understanding and generation tasks. Our approach uses a divide-and-conquer strategy, employing speculative decoding for next-token prediction and a multi-stage distillation process for diffusion denoising. The framework delivers substantial performance gains, achieving over a 2x speedup in multimodal understanding. For generative tasks, our resulting lossless 6-NFE model yields a 16.67x speedup in text-to-image generation and a 22x speedup in image editing, all while preserving the high-quality output of the original model. We further develop a highly efficient 1-NFE model that enables near real-time interactive editing and generation. By combining advanced adversarial distillation with human feedback learning, this model achieves ultimate cost-effectiveness and responsiveness, making complex multimodal interactions seamless and instantaneous.
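The speedup figures in the abstract can be related to the number of function evaluations (NFE) with a little arithmetic. The sketch below assumes that generation cost scales linearly with NFE and that the reported speedups are ratios of baseline NFE to accelerated NFE; the abstract does not state this explicitly, so the implied baselines (about 100 and 132 NFE) are inferences rather than figures from the source. The function names are hypothetical.

```python
def implied_baseline_nfe(accelerated_nfe: int, speedup: float) -> float:
    """Baseline NFE implied by a speedup, ASSUMING cost is proportional to NFE."""
    return accelerated_nfe * speedup


def nfe_speedup(baseline_nfe: float, accelerated_nfe: int) -> float:
    """Speedup obtained by cutting NFE, under the same linear-cost assumption."""
    return baseline_nfe / accelerated_nfe


if __name__ == "__main__":
    # Figures quoted in the abstract: 6-NFE model, 16.67x (text-to-image), 22x (editing).
    t2i_baseline = implied_baseline_nfe(6, 16.67)
    edit_baseline = implied_baseline_nfe(6, 22.0)
    print(round(t2i_baseline), round(edit_baseline))

    # Sanity check in the other direction: an assumed 100-NFE baseline reduced to 6 NFE.
    print(round(nfe_speedup(100, 6), 2))
```

Under the same assumption, the 1-NFE model would need a further factor of six fewer evaluations than the 6-NFE model, although the abstract reports no speedup figure for it.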
Related papers
- LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation [35.01134463094784]
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems.
Existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this.
This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap.
arXiv Detail & Related papers (2025-12-29T16:17:36Z)
- Growing Visual Generative Capacity for Pre-Trained MLLMs [60.826355079902505]
Bridge is a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability.
We propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens.
arXiv Detail & Related papers (2025-10-02T00:40:02Z)
- Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I).
We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs.
Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z)
- Lavida-O: Elastic Masked Diffusion Models for Unified Multimodal Understanding and Generation [63.50827603618498]
Lavida-O is a unified MDM capable of image understanding and generation tasks.
It exhibits many new capabilities such as object grounding, image-editing, and high-resolution (1024px) image synthesis.
arXiv Detail & Related papers (2025-09-23T17:05:46Z)
- CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers [72.23291099555459]
Diffusion-based generative models have become dominant generators of high-fidelity images and videos but remain limited by their computationally expensive inference procedures.
This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism.
CHORDS significantly accelerates sampling across diverse large-scale image and video diffusion models, yielding up to 2.1x speedup with four cores, improving by 50% over baselines, and 2.9x speedup with eight cores, all without quality degradation.
arXiv Detail & Related papers (2025-07-21T05:48:47Z)
- LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer [36.51630912419451]
We propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model.
LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities.
Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed.
arXiv Detail & Related papers (2025-06-08T00:15:32Z)
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach.
Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance.
Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z)
- DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos.
These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z)
- Efficient Long-duration Talking Video Synthesis with Linear Diffusion Transformer under Multimodal Guidance [36.99310116405025]
Long-duration synthesis faces persistent challenges in simultaneously achieving high quality, portrait and temporal consistency, and computational efficiency.
Here, we present LetsTalk, a transformer diffusion framework that incorporates multimodal guidance and a novel memory bank mechanism.
Experiments demonstrate that LetsTalk achieves temporally coherent and realistic talking videos with enhanced diversity and liveliness, while maintaining remarkable efficiency with 8x fewer parameters than previous approaches.
arXiv Detail & Related papers (2024-11-24T04:46:00Z)
- Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures [12.703947839247693]
Diffusion models, emerging as powerful deep generative tools, excel in various applications.
However, their remarkable generative performance is hindered by slow training and sampling.
This is due to the necessity of tracking extensive forward and reverse diffusion trajectories.
We present a multi-stage framework inspired by our empirical findings to tackle these challenges.
arXiv Detail & Related papers (2023-12-14T17:48:09Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.