GRID: Omni Visual Generation
- URL: http://arxiv.org/abs/2412.10718v4
- Date: Tue, 21 Jan 2025 04:00:36 GMT
- Title: GRID: Omni Visual Generation
- Authors: Cong Wan, Xiangyang Luo, Hao Luo, Zijian Cai, Yiren Song, Yunlong Zhao, Yifan Bai, Yuhang He, Yihong Gong
- Abstract summary: Current approaches either build specialized video models from scratch with enormous computational costs or add separate motion modules to image generators. We observe that modern image generation models possess underutilized potential in handling structured layouts with implicit temporal understanding. We introduce GRID, which reformulates temporal sequences as grid layouts, enabling holistic processing of visual sequences.
- Score: 29.363916460022427
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visual generation has witnessed remarkable progress in single-image tasks, yet extending these capabilities to temporal sequences remains challenging. Current approaches either build specialized video models from scratch with enormous computational costs or add separate motion modules to image generators, both requiring learning temporal dynamics anew. We observe that modern image generation models possess underutilized potential in handling structured layouts with implicit temporal understanding. Building on this insight, we introduce GRID, which reformulates temporal sequences as grid layouts, enabling holistic processing of visual sequences while leveraging existing model capabilities. Through a parallel flow-matching training strategy with coarse-to-fine scheduling, our approach achieves up to 67× faster inference speeds while using less than 1/1000 of the computational resources compared to specialized models. Extensive experiments demonstrate that GRID not only excels in temporal tasks from Text-to-Video to 3D Editing but also preserves strong performance in image generation, establishing itself as an efficient and versatile omni-solution for visual generation.
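The key reformulation in the abstract, arranging the frames of a temporal sequence into a single grid layout so that an image model can process the whole sequence at once, can be made concrete with a short sketch. The code below is an illustrative layout transform only, assuming PyTorch tensors; the grid shape, frame count, and helper names (`frames_to_grid`, `grid_to_frames`) are assumptions for this example, not details taken from the paper.

```python
import torch

def frames_to_grid(frames: torch.Tensor, rows: int, cols: int) -> torch.Tensor:
    """Tile a temporal sequence of frames into one grid image.

    frames: tensor of shape (T, C, H, W) with T == rows * cols.
    Returns a tensor of shape (C, rows * H, cols * W) that an
    image generation model can treat as a single canvas.
    """
    t, c, h, w = frames.shape
    assert t == rows * cols, "frame count must fill the grid exactly"
    grid = frames.reshape(rows, cols, c, h, w)   # (R, Cols, C, H, W)
    grid = grid.permute(2, 0, 3, 1, 4)           # (C, R, H, Cols, W)
    return grid.reshape(c, rows * h, cols * w)

def grid_to_frames(grid: torch.Tensor, rows: int, cols: int) -> torch.Tensor:
    """Invert frames_to_grid, recovering the (T, C, H, W) sequence."""
    c, gh, gw = grid.shape
    h, w = gh // rows, gw // cols
    frames = grid.reshape(c, rows, h, cols, w)
    return frames.permute(1, 3, 0, 2, 4).reshape(rows * cols, c, h, w)

# Example: a 16-frame clip becomes a 4x4 grid "image" and back.
clip = torch.randn(16, 3, 64, 64)
canvas = frames_to_grid(clip, rows=4, cols=4)    # shape (3, 256, 256)
recovered = grid_to_frames(canvas, rows=4, cols=4)
assert torch.equal(clip, recovered)
```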
Related papers
- EndoGen: Conditional Autoregressive Endoscopic Video Generation [51.97720772069513]
We propose the first conditional endoscopic video generation framework, namely EndoGen. Specifically, we build an autoregressive model with a tailored Spatiotemporal Grid-Frame Patterning strategy. We demonstrate the effectiveness of our framework in generating high-quality, conditionally guided endoscopic content.
arXiv Detail & Related papers (2025-07-23T10:32:20Z) - LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z) - LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer [36.51630912419451]
We propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6× faster inference speed.
arXiv Detail & Related papers (2025-06-08T00:15:32Z) - Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis [12.160537328404622]
DRA-Ctrl provides new insights into reusing resource-intensive video models. DRA-Ctrl lays the foundation for future unified generative models across visual modalities.
arXiv Detail & Related papers (2025-05-29T10:34:45Z) - Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance. Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z) - RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models [22.042487298092883]
RealGeneral is a novel framework that reformulates image generation as a conditional frame prediction task.
It achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for the canny-to-image task.
arXiv Detail & Related papers (2025-03-13T14:31:52Z) - Learnable Infinite Taylor Gaussian for Dynamic View Rendering [55.382017409903305]
This paper introduces a novel approach based on a learnable Taylor Formula to model the temporal evolution of Gaussians.
The proposed method achieves state-of-the-art performance in this domain.
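The entry above summarizes a Taylor-based model of how Gaussian primitives evolve over time. As a generic illustration of that idea (not the paper's exact parameterization), a time-varying attribute such as the mean can be written as a learnable expansion in time; the coefficients $c_k$, the reference time $t_0$, and the truncation order $K$ shown here are assumptions for this sketch:

```latex
% Illustrative truncated Taylor expansion of a time-varying Gaussian mean
\mu(t) = \mu_0 + \sum_{k=1}^{K} \frac{c_k}{k!}\,(t - t_0)^{k}
```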
arXiv Detail & Related papers (2024-12-05T16:03:37Z) - VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model [34.35449902855767]
Two fundamental questions are what data to use for training and how to ensure multi-view consistency.
We propose a dense consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models.
Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-18T17:48:15Z) - Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting [9.383423119196408]
We introduce Multi-view ControlNet (MVControl), a novel neural network architecture designed to enhance existing multi-view diffusion models.
MVControl is able to offer 3D diffusion guidance for optimization-based 3D generation.
In pursuit of efficiency, we adopt 3D Gaussians as our representation instead of the commonly used implicit representations.
arXiv Detail & Related papers (2024-03-15T02:57:20Z) - RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than 5 seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object
Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
arXiv Detail & Related papers (2023-12-24T08:42:37Z) - ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z) - RenAIssance: A Survey into AI Text-to-Image Generation in the Era of
Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to the use of models that process text input and generate high-fidelity images based on text descriptions.
Diffusion models are one prominent type of generative model that produces images by systematically adding noise over repeated steps and learning to reverse the process.
In the era of large models, scaling up model size and the integration with large language models have further improved the performance of TTI models.
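The noising process referred to in this survey summary is conventionally written as a Markov chain that corrupts an image step by step; the standard DDPM-style formulation below makes the "repeating steps" explicit (this is textbook notation, not an equation from the survey itself):

```latex
% Forward (noising) process of a diffusion model with variance schedule \beta_t
q(x_t \mid x_{t-1}) = \mathcal{N}\!\bigl(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\bigr),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\bigl(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\bigr),
\quad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)
```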
arXiv Detail & Related papers (2023-09-02T03:27:20Z) - Rethinking Range View Representation for LiDAR Segmentation [66.73116059734788]
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks.
arXiv Detail & Related papers (2023-03-09T16:13:27Z) - LayoutDiffuse: Adapting Foundational Diffusion Models for
Layout-to-Image Generation [24.694298869398033]
Our method trains efficiently and generates images with both high perceptual quality and layout alignment.
Our method significantly outperforms 10 other generative models based on GANs, VQ-VAE, and diffusion models.
arXiv Detail & Related papers (2023-02-16T14:20:25Z) - TcGAN: Semantic-Aware and Structure-Preserved GANs with Individual
Vision Transformer for Fast Arbitrary One-Shot Image Generation [11.207512995742999]
One-shot image generation (OSG) with generative adversarial networks that learn from the internal patches of a given image has attracted worldwide attention.
We propose TcGAN, a novel structure-preserved method with an individual vision transformer, to overcome the shortcomings of existing one-shot image generation methods.
arXiv Detail & Related papers (2023-02-16T03:05:59Z) - Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge
Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism extending CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z) - Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinitely many, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z) - Leveraging Image-based Generative Adversarial Networks for Time Series
Generation [4.541582055558865]
We propose a two-dimensional image representation for time series, the Extended Intertemporal Return Plot (XIRP).
Our approach captures the intertemporal time series dynamics in a scale-invariant and invertible way, reducing training time and improving sample quality.
arXiv Detail & Related papers (2021-12-15T11:55:11Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - Optimization-Inspired Learning with Architecture Augmentations and
Control Mechanisms for Low-Level Vision [74.9260745577362]
This paper proposes a unified optimization-inspired learning framework to aggregate Generative, Discriminative, and Corrective (GDC) principles.
We construct three propagative modules to effectively solve the optimization models with flexible combinations.
Experiments across varied low-level vision tasks validate the efficacy and adaptability of GDC.
arXiv Detail & Related papers (2020-12-10T03:24:53Z) - DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image
Generation [8.26410341981427]
The Dual Attention Generative Adversarial Network (DTGAN) can synthesize high-quality and semantically consistent images.
The proposed model introduces channel-aware and pixel-aware attention modules that can guide the generator to focus on text-relevant channels and pixels.
A new type of visual loss is utilized to enhance the image resolution by ensuring vivid shape and perceptually uniform color distributions of generated images.
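The channel-aware attention described above can be pictured as a text-conditioned gate over feature channels; the sketch below is one plausible reading under that assumption (the module structure, dimensions, and names are illustrative, not DTGAN's published architecture):

```python
import torch
import torch.nn as nn

class TextConditionedChannelGate(nn.Module):
    """Illustrative channel-aware attention: a text embedding produces
    per-channel weights that rescale image feature maps."""

    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        self.to_gate = nn.Sequential(
            nn.Linear(text_dim, channels),
            nn.Sigmoid(),  # one weight in (0, 1) per channel
        )

    def forward(self, feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W); text_emb: (B, text_dim)
        gate = self.to_gate(text_emb)             # (B, C)
        return feats * gate[:, :, None, None]     # broadcast over H and W

# Example: gate 128-channel feature maps with a 256-d sentence embedding.
gate = TextConditionedChannelGate(channels=128, text_dim=256)
out = gate(torch.randn(2, 128, 32, 32), torch.randn(2, 256))
```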
arXiv Detail & Related papers (2020-11-05T08:57:15Z) - Concurrently Extrapolating and Interpolating Networks for Continuous
Model Generation [34.72650269503811]
We propose a simple yet effective model generation strategy to form a sequence of models that only requires a set of specific-effect label images.
We show that the proposed method is capable of producing a series of continuous models and achieves better performance than that of several state-of-the-art methods for image smoothing.
arXiv Detail & Related papers (2020-01-12T04:44:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.