Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models
- URL: http://arxiv.org/abs/2602.09713v2
- Date: Mon, 16 Feb 2026 03:48:21 GMT
- Title: Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models
- Authors: Ruisi Zhao, Haoren Zheng, Zongxin Yang, Hehe Fan, Yi Yang
- Abstract summary: Stroke3D is a novel framework that directly generates rigged meshes from user inputs: 2D drawn strokes and a descriptive text prompt. To the best of our knowledge, our work is the first to generate rigged 3D meshes conditioned on user-drawn 2D strokes.
- Score: 53.32092058519587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rigged 3D assets are fundamental to 3D deformation and animation. However, existing 3D generation methods face challenges in generating animatable geometry, while rigging techniques lack fine-grained structural control over skeleton creation. To address these limitations, we introduce Stroke3D, a novel framework that directly generates rigged meshes from user inputs: 2D drawn strokes and a descriptive text prompt. Our approach pioneers a two-stage pipeline that separates the generation into: 1) Controllable Skeleton Generation, where we employ the Skeletal Graph VAE (Sk-VAE) to encode the skeleton's graph structure into a latent space, in which the Skeletal Graph DiT (Sk-DiT) generates a skeletal embedding. The generation process is conditioned on both the text for semantics and the 2D strokes for explicit structural control, with the VAE's decoder reconstructing the final high-quality 3D skeleton; and 2) Enhanced Mesh Synthesis via TextuRig and SKA-DPO, where we then synthesize a textured mesh conditioned on the generated skeleton. For this stage, we first enhance an existing skeleton-to-mesh model by augmenting its training data with TextuRig: a dataset of textured and rigged meshes with captions, curated from Objaverse-XL. Additionally, we employ a preference optimization strategy, SKA-DPO, guided by a skeleton-mesh alignment score, to further improve geometric fidelity. Together, our framework enables a more intuitive workflow for creating ready-to-animate 3D content. To the best of our knowledge, our work is the first to generate rigged 3D meshes conditioned on user-drawn 2D strokes. Extensive experiments demonstrate that Stroke3D produces plausible skeletons and high-quality meshes.
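The two-stage pipeline described in the abstract can be sketched as plain control flow. This is a toy, self-contained illustration, not the paper's implementation: all function bodies, the latent dimensionality, the chain-shaped skeleton, and the "denoising" update are placeholder assumptions standing in for the learned Sk-DiT sampler, the Sk-VAE decoder, and the skeleton-conditioned mesh model.

```python
import numpy as np

LATENT_DIM = 64  # assumed latent size; the paper does not state one here
rng = np.random.default_rng(0)

def sk_dit_sample(stroke_points, text_prompt, steps=10):
    """Stub for the Skeletal Graph DiT: denoise a latent conditioned on
    2D strokes and text. The toy 'denoising' pulls Gaussian noise toward
    a tiled embedding of the stroke coordinates; a real DiT would run
    learned, text- and stroke-conditioned denoising steps."""
    cond = np.resize(np.asarray(stroke_points, dtype=float).ravel(), LATENT_DIM)
    z = rng.standard_normal(LATENT_DIM)
    for _ in range(steps):
        z = 0.8 * z + 0.2 * cond  # placeholder for one learned denoising step
    return z

def sk_vae_decode(z, num_joints=8):
    """Stub for the Sk-VAE decoder: map a latent to 3D joint positions
    plus a parent index per joint (a tree-structured skeletal graph)."""
    joints = np.resize(z, (num_joints, 3))
    parents = [-1] + list(range(num_joints - 1))  # simple chain skeleton
    return joints, parents

def synthesize_mesh(joints, parents):
    """Stub for stage 2 (skeleton-conditioned textured mesh synthesis)."""
    return {"vertices": joints.repeat(4, axis=0), "skeleton": (joints, parents)}

# Stage 1: strokes + text -> skeletal latent -> decoded 3D skeleton
strokes = [(0.1, 0.2), (0.4, 0.5), (0.8, 0.9)]
z = sk_dit_sample(strokes, "a running fox")
joints, parents = sk_vae_decode(z)
# Stage 2: skeleton -> mesh
mesh = synthesize_mesh(joints, parents)
print(joints.shape, len(parents))
```

The stage boundary is the point to note: the skeleton is fully decoded before mesh synthesis begins, which is what gives the user explicit structural control over the final rig.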
Related papers
- VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator [69.72818094722186]
A text-to-video generator can be combined with a 3D reconstruction system as a "decoder". We introduce VIST3A, a general framework that does just that. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models.
arXiv Detail & Related papers (2025-10-15T11:55:08Z) - End-to-End Fine-Tuning of 3D Texture Generation using Differentiable Rewards [8.953379216683732]
We propose an end-to-end differentiable, reinforcement-learning-free framework that embeds human feedback, expressed as differentiable reward functions, directly into the 3D texture pipeline. By back-propagating preference signals through both geometric and appearance modules, our method generates textures that respect the 3D geometry structure and align with desired criteria.
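The core idea of that summary, back-propagating a differentiable reward through the generation pipeline instead of using RL, can be shown in miniature. Everything here is a hypothetical toy: the "renderer" is a single `tanh`, the reward is a negative squared distance to a preferred image, and the chain-rule gradient is written out by hand.

```python
import numpy as np

def render_texture(params):
    # Toy "texture pipeline": params directly parameterize pixel values.
    return np.tanh(params)

def reward(image, target):
    # Differentiable reward: negative mean squared distance to a
    # preferred (target) image; higher is better.
    return -np.mean((image - target) ** 2)

def finetune(params, target, lr=0.5, steps=200):
    # Back-propagate the reward through the (here trivial) renderer:
    # dR/dparams = dR/dimage * dimage/dparams, then gradient *ascent*.
    for _ in range(steps):
        img = render_texture(params)
        grad = (-2.0 / img.size) * (img - target) * (1.0 - img ** 2)
        params = params + lr * grad
    return params

target = np.full(4, 0.5)           # the "preferred" output
tuned = finetune(np.zeros(4), target)
print(reward(render_texture(tuned), target))
```

With a real pipeline the hand-written gradient would be replaced by autodiff, but the structure is the same: the preference signal flows through every differentiable module, so no policy-gradient estimator is needed.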
arXiv Detail & Related papers (2025-06-23T06:24:12Z) - Text-based Animatable 3D Avatars with Morphable Model Alignment [19.523681764512357]
We propose a novel framework, Anim3D, for text-based realistic animatable 3DGS avatar generation with morphable model alignment. Our method outperforms existing approaches in terms of synthesis quality, alignment, and animation fidelity.
arXiv Detail & Related papers (2025-04-22T12:29:14Z) - RigGS: Rigging of 3D Gaussians for Modeling Articulated Objects in Videos [50.37136267234771]
RigGS is a new paradigm that leverages 3D Gaussian representation and skeleton-based motion representation to model dynamic objects. Our method can generate realistic new actions easily for objects and achieve high-quality rendering.
arXiv Detail & Related papers (2025-03-21T03:27:07Z) - HumanRig: Learning Automatic Rigging for Humanoid Character in a Large Scale Dataset [6.978870586488504]
We present HumanRig, the first large-scale dataset specifically designed for 3D humanoid character rigging. We introduce an innovative, data-driven automatic rigging framework, which overcomes the limitations of GNN-based methods. This work not only remedies the dataset deficiency in rigging research but also propels the animation industry towards more efficient and automated character rigging pipelines.
arXiv Detail & Related papers (2024-12-03T09:33:00Z) - Enhancing Single Image to 3D Generation using Gaussian Splatting and Hybrid Diffusion Priors [17.544733016978928]
3D object generation from a single image involves estimating the full 3D geometry and texture of unseen views from an unposed RGB image captured in the wild.
Recent advancements in 3D object generation have introduced techniques that reconstruct an object's 3D shape and texture.
We propose bridging the gap between 2D and 3D diffusion models to address this limitation.
arXiv Detail & Related papers (2024-10-12T10:14:11Z) - GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction [52.04103235260539]
We present a diffusion model approach based on Gaussian Splatting representation for 3D object reconstruction from a single view.
The model learns to generate 3D objects represented by sets of GS ellipsoids.
The final reconstructed objects explicitly come with high-quality 3D structure and texture, and can be efficiently rendered in arbitrary views.
arXiv Detail & Related papers (2024-07-05T03:43:08Z) - ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image
Collections [71.46546520120162]
Estimating 3D articulated shapes like animal bodies from monocular images is inherently challenging.
We propose ARTIC3D, a self-supervised framework to reconstruct per-instance 3D shapes from a sparse image collection in-the-wild.
We produce realistic animations by fine-tuning the rendered shape and texture under rigid part transformations.
arXiv Detail & Related papers (2023-06-07T17:47:50Z) - Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery from
Sparse Image Ensemble [72.3681707384754]
Hi-LASSIE performs 3D articulated reconstruction from only 20-30 online images in the wild without any user-defined shape or skeleton templates.
First, instead of relying on a manually annotated 3D skeleton, we automatically estimate a class-specific skeleton from the selected reference image.
Second, we improve the shape reconstructions with novel instance-specific optimization strategies that allow the reconstructions to fit faithfully to each instance.
arXiv Detail & Related papers (2022-12-21T14:31:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.