Helix4D: Complex 4D Mesh Generation
Abstract Overview
Helix4D is a video-to-4D dynamic mesh generation framework that adapts the pretrained Trellis2 image-to-3D model to produce temporally consistent mesh sequences from object-centric videos. The paper targets difficult cases that prior methods struggle with, including topology changes, transparent or semi-transparent materials, thin structures, and inner surfaces. Its design combines sliding-window cross-frame attention with a first-frame anchor, first-frame conditioning from a frozen Trellis2 reconstruction, and a parameter-free 4D positional encoding that repurposes low-frequency spatial RoPE bands for time. The authors evaluate the method on ActionBench, a held-out TexVerse subset, and a new 52-video Helix4DBench emphasizing complex dynamics and materials.
Novelty
The main novelty is a systematic way to lift a strong static 3D foundation model into video-conditioned 4D mesh generation while preserving pretrained geometric and material capabilities. Technically, the paper introduces anchor-based sliding-window cross-frame attention and a parameter-free spatiotemporal RoPE that reallocates redundant low-frequency spatial bands to temporal encoding instead of adding new temporal parameters.
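The band reallocation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the band counts, the base of 10000, and the choice to hand each axis's lowest-frequency band to the time coordinate are all assumptions made for illustration. One plausible reading of "redundant" is that low-frequency bands rotate so slowly over a small spatial grid that they carry little positional signal, so reassigning them to time costs little and adds no learned parameters.

```python
import math

def band_frequencies(num_bands, base=10000.0):
    # Standard RoPE schedule: band 0 is the highest frequency,
    # the last band is the lowest.
    return [base ** (-k / num_bands) for k in range(num_bands)]

def rope_angles_4d(x, y, z, t, bands_per_axis=4, time_bands_per_axis=1):
    """Hypothetical 4D RoPE angles for a token at (x, y, z) in frame t.

    Assumption: each spatial axis gives up its lowest-frequency bands,
    which are then driven by t instead of the spatial coordinate.
    """
    freqs = band_frequencies(bands_per_axis)
    split = bands_per_axis - time_bands_per_axis
    angles = []
    for coord in (x, y, z):
        # High-frequency bands keep encoding the spatial coordinate.
        angles += [coord * f for f in freqs[:split]]
        # Low-frequency bands are reassigned to the time coordinate.
        angles += [t * f for f in freqs[split:]]
    return angles

def apply_rope(vec, angles):
    # Rotate each consecutive pair of features by its band angle.
    out = []
    for i, a in enumerate(angles):
        u, v = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(a), math.sin(a)
        out += [u * c - v * s, u * s + v * c]
    return out

def attention_score(q, k, pos_q, pos_k):
    # Dot product of rotated query and key; pos_* are (x, y, z, t) tuples.
    rq = apply_rope(q, rope_angles_4d(*pos_q))
    rk = apply_rope(k, rope_angles_4d(*pos_k))
    return sum(a * b for a, b in zip(rq, rk))
```

Because every band is a pure rotation, the score between a query and a key depends only on their relative offset in space and time, which is the property that lets the same encoding serve any frame in the sequence.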
Results
Helix4D improves CD-3D by 3.8% over ActionMesh on ActionBench. On the harder 52-video Helix4DBench it outperforms all reported baselines on every metric, improving ULIP-2 and Uni3D by 5.7% and 7.8%, respectively, over the strongest baseline. In user studies, it is preferred over the best-performing baseline in 67.9% of comparisons, and on a held-out TexVerse test set it achieves the best CD-3D and CD-4D among the compared methods. Ablations further show that first-frame conditioning, the proposed 4D rotary embedding, and sliding-window-plus-anchor attention each contribute to quality and temporal consistency.
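For readers unfamiliar with the CD metrics, the sketch below computes a symmetric Chamfer distance between two point sets. It assumes that CD-3D refers to a Chamfer distance on per-frame geometry, as the abbreviation commonly does; this summary does not define the metric, and the paper's exact convention (squared versus unsquared distances, sampling density, normalization) may differ. The function name is hypothetical.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance using mean squared nearest-neighbor distances.

    One common convention among several; brute force, so only suitable
    for small point sets.
    """
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        # Average, over src, of the squared distance to the nearest dst point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)

predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0)]
print(chamfer_distance(predicted, reference))  # 0.5
```

Lower values indicate closer geometric agreement, so an improvement in CD-3D corresponds to a smaller distance.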
Key Points
- Helix4D extends Trellis2 from single-image 3D generation to video-conditioned 4D mesh generation while retaining support for non-watertight geometry, complex materials, and inner surfaces.
- The method uses sliding-window cross-frame attention with a first-frame anchor and first-frame conditioning so later frames can inherit strong static reconstruction priors efficiently.
- Across Helix4DBench, ActionBench, and a held-out TexVerse subset, the model reports the strongest overall quantitative results among compared baselines, especially on challenging topology and material changes.
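The sliding-window attention with a first-frame anchor described in the key points can be sketched at the frame level as follows. This is an illustrative assumption rather than the paper's code: the function names, the symmetric window, and the default radius of 1 are hypothetical, and the actual method operates on per-frame tokens rather than whole-frame indices.

```python
def attended_frames(i, num_frames, radius=1, anchor=0):
    """Frames that frame i may attend to: a local window plus the anchor.

    Assumption: the window is symmetric around i with the given radius.
    """
    lo = max(0, i - radius)
    hi = min(num_frames - 1, i + radius)
    frames = set(range(lo, hi + 1))
    frames.add(anchor)  # every frame can always see the first frame
    return sorted(frames)

def frame_attention_mask(num_frames, radius=1, anchor=0):
    # mask[i][j] is True when frame i is allowed to attend to frame j.
    mask = []
    for i in range(num_frames):
        allowed = set(attended_frames(i, num_frames, radius, anchor))
        mask.append([j in allowed for j in range(num_frames)])
    return mask

print(attended_frames(4, 6))  # [0, 3, 4, 5]
```

Under these assumptions each frame attends to at most 2 * radius + 2 frames regardless of video length, which is where the efficiency comes from, while the anchor keeps every frame tied to the first-frame reconstruction.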
References
- arXiv: https://arxiv.org/abs/2605.26109v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2605.26109v1
- Hugging Face Papers: https://huggingface.co/papers/2605.26109
- Project: https://snap-research.github.io/helix4d/