Related papers: $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting

URL: http://arxiv.org/abs/2507.09144v2
Date: Sat, 02 Aug 2025 15:31:49 GMT
Title: $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting
Authors: Zhimin Liao, Ping Wei, Ruijie Zhang, Shuaijia Chen, Haoxuan Wang, Ziyang Ren,
Abstract summary: $I^{2}$-World is an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. $I^{2}$-World achieves state-of-the-art performance, outperforming existing methods by 25.1% in mIoU and 36.9% in IoU for 4D occupancy forecasting.
Score: 2.722128680610171
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy-based world models offers substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose $I^{2}$-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder-only GPT-style autoregressive models, $I^{2}$-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to enable high-level control over scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that $I^{2}$-World achieves state-of-the-art performance, outperforming existing methods by 25.1\% in mIoU and 36.9\% in IoU for 4D occupancy forecasting while exhibiting exceptional computational efficiency: it requires merely 2.9 GB of training memory and achieves real-time inference at 37.0 FPS. Our code is available on https://github.com/lzzzzzm/II-World.
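The multi-scale residual quantization mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the shared codebook, average pooling for downsampling, and nearest-neighbour upsampling are all assumptions made for illustration. Each scale quantizes what the coarser scales left unexplained, so coarse tokens capture layout and fine tokens capture detail.

```python
import numpy as np

def nearest_code(x, codebook):
    """Return the index of the closest codebook row for each row of x."""
    # x: (N, C), codebook: (K, C) -> squared distances of shape (N, K)
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def multiscale_residual_quantize(feat, codebook, scales):
    """Hypothetical multi-scale residual quantizer for a cubic feature grid.

    feat: (S, S, S, C) feature volume; every entry of `scales` must divide S.
    Returns one token grid per scale and the summed reconstruction.
    """
    S, C = feat.shape[0], feat.shape[-1]
    residual = feat.copy()
    recon = np.zeros_like(feat)
    tokens = []
    for s in scales:
        f = S // s  # pooling factor for this scale
        # Average-pool the current residual down to an (s, s, s) grid.
        pooled = residual.reshape(s, f, s, f, s, f, C).mean(axis=(1, 3, 5))
        idx = nearest_code(pooled.reshape(-1, C), codebook)
        quant = codebook[idx].reshape(s, s, s, C)
        # Nearest-neighbour upsampling back to full resolution.
        up = quant.repeat(f, axis=0).repeat(f, axis=1).repeat(f, axis=2)
        recon += up
        residual -= up  # the next scale only sees what is still unexplained
        tokens.append(idx.reshape(s, s, s))
    return tokens, recon

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 4, 4, 8))
# Assumption: the codebook contains the zero vector, so no scale can
# increase the reconstruction error.
codebook = np.vstack([np.zeros((1, 8)), rng.normal(size=(63, 8))])
tokens, recon = multiscale_residual_quantize(feat, codebook, scales=[1, 2, 4])
print([t.shape for t in tokens], float(np.linalg.norm(feat - recon)))
```

With scales 1, 2, and 4, the scene is described by 1 + 8 + 64 = 73 token indices rather than one index per voxel per stage, which is the kind of compactness a hierarchical tokenizer aims for.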

Related papers

Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception [44.7850628565891]
Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. We show that PointATA can match or even outperform strong full fine-tuning models.
arXiv Detail & Related papers (2026-02-26T14:58:59Z)
SS4D: Native 4D Generative Model via Structured Spacetime Latents [50.29500511908054]
We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. We train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency.
arXiv Detail & Related papers (2025-12-16T10:45:06Z)
P-4DGS: Predictive 4D Gaussian Splatting with 90$\times$ Compression [26.130131551764077]
3D Gaussian Splatting (3DGS) has garnered significant attention due to its superior scene representation fidelity and real-time rendering performance. Despite achieving promising results, most existing algorithms overlook the substantial temporal and spatial redundancies inherent in dynamic scenes. We propose P-4DGS, a novel dynamic 3DGS representation for compact 4D scene modeling.
arXiv Detail & Related papers (2025-10-11T05:19:41Z)
H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers [124.11648300910444]
We present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_{2}$OT). Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines.
arXiv Detail & Related papers (2025-09-08T17:59:59Z)
SAGOnline: Segment Any Gaussians Online [17.33447710659887]
3D Gaussian Splatting (3DGS) has emerged as a powerful paradigm for explicit 3D scene representation. Current methods suffer from prohibitive computational costs, limited 3D spatial reasoning, and an inability to track multiple objects simultaneously. We present Segment Any Gaussians Online (SAGOnline), a lightweight and zero-shot framework for real-time 3D segmentation in Gaussian scenes.
arXiv Detail & Related papers (2025-08-11T17:38:50Z)
Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding [24.964149224068027]
We propose Fast3D, a plug-and-play visual token pruning framework for 3D MLLMs. Global Attention Prediction (GAP) learns to predict the global attention distributions of the target model, enabling efficient token importance estimation. SAP introduces dynamic token budgets through attention-based complexity assessment, automatically adjusting layer-wise pruning ratios.
arXiv Detail & Related papers (2025-07-12T16:29:02Z)
Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos [70.07088203106443]
Existing methods rely on explicit knowledge to learn motion, resulting in suboptimal representations. Prior Masked Autoencoder (MAE) frameworks struggle to bridge the gap between low-level geometry and high-level dynamics in 4D data. We propose a novel self-disentangled MAE for learning expressive, discriminative, and transferable 4D representations.
arXiv Detail & Related papers (2025-04-07T08:47:36Z)
Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency [49.875459658889355]
Free4D is a tuning-free framework for 4D scene generation from a single image. Our key insight is to distill pre-trained foundation models for consistent 4D scene representation. The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z)
GaussRender: Learning 3D Occupancy with Gaussian Rendering [86.89653628311565]
GaussRender is a module that improves 3D occupancy learning by enforcing projective consistency. Our method penalizes 3D configurations that produce inconsistent 2D projections, thereby enforcing a more coherent 3D structure.
arXiv Detail & Related papers (2025-02-07T16:07:51Z)
Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity [59.80405282381126]
Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability. We propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.
arXiv Detail & Related papers (2025-02-03T19:29:16Z)
4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives [115.67081491747943]
Dynamic 3D scene representation and novel view synthesis are crucial for enabling AR/VR and metaverse applications. We reformulate the reconstruction of a time-varying 3D scene as approximating its underlying 4D volume. We derive several compact variants that effectively reduce the memory footprint to address its storage bottleneck.
arXiv Detail & Related papers (2024-12-30T05:30:26Z)
S4D: Streaming 4D Real-World Reconstruction with Gaussians and 3D Control Points [30.46796069720543]
We introduce a novel approach for streaming 4D real-world reconstruction utilizing discrete 3D control points. This method physically models local rays and establishes a motion-decoupling coordinate system. By effectively merging traditional graphics with learnable pipelines, it provides a robust and efficient local 6-degrees-of-freedom (6 DoF) motion representation.
arXiv Detail & Related papers (2024-08-23T12:51:49Z)
OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving [62.54220021308464]
We propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving. OccSora can generate 16s-videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes.
arXiv Detail & Related papers (2024-05-30T17:59:42Z)
Dynamic 3D Point Cloud Sequences as 2D Videos [81.46246338686478]
3D point cloud sequences serve as one of the most common and practical representation modalities of real-world environments. We propose a novel generic representation called Structured Point Cloud Videos (SPCVs). SPCVs re-organize a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points.
arXiv Detail & Related papers (2024-03-02T08:18:57Z)
Compact 3D Scene Representation via Self-Organizing Gaussian Grids [10.816451552362823]
3D Gaussian Splatting has recently emerged as a highly promising technique for modeling of static 3D scenes. We introduce a compact scene representation organizing the parameters of 3DGS into a 2D grid with local homogeneity. Our method achieves a reduction factor of 17x to 42x in size for complex scenes with no increase in training time.
arXiv Detail & Related papers (2023-12-19T20:18:29Z)
NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs. We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels. We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2
arXiv Detail & Related papers (2023-09-26T02:09:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.