PromptVFX: Text-Driven Fields for Open-World 3D Gaussian Animation
- URL: http://arxiv.org/abs/2506.01091v1
- Date: Sun, 01 Jun 2025 17:22:59 GMT
- Title: PromptVFX: Text-Driven Fields for Open-World 3D Gaussian Animation
- Authors: Mert Kiray, Paul Uhlenbruck, Nassir Navab, Benjamin Busam
- Abstract summary: We reformulate 3D animation as a field prediction task and introduce a text-driven framework that infers a time-varying 4D flow field acting on 3D Gaussians. By leveraging large language models (LLMs) and vision-language models (VLMs) for function generation, our approach interprets arbitrary prompts and instantly updates color, opacity, and positions of 3D Gaussians in real time.
- Score: 49.91188543847175
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visual effects (VFX) are key to immersion in modern films, games, and AR/VR. Creating 3D effects requires specialized expertise and training in 3D animation software and can be time-consuming. Generative solutions typically rely on computationally intense methods such as diffusion models, which can be slow at 4D inference. We reformulate 3D animation as a field prediction task and introduce a text-driven framework that infers a time-varying 4D flow field acting on 3D Gaussians. By leveraging large language models (LLMs) and vision-language models (VLMs) for function generation, our approach interprets arbitrary prompts (e.g., "make the vase glow orange, then explode") and instantly updates color, opacity, and positions of 3D Gaussians in real time. This design avoids overheads such as mesh extraction and manual or physics-based simulation, and allows both novice and expert users to animate volumetric scenes with minimal effort on a consumer device, even in a web browser. Experimental results show that simple textual instructions suffice to generate compelling time-varying VFX, reducing the manual effort typically required for rigging or advanced modeling. We thus present a fast and accessible pathway to language-driven 3D content creation that can pave the way to further democratize VFX.
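To make the field-prediction idea concrete, below is a minimal sketch, assuming the Gaussians' centers, colors, and opacities live in NumPy arrays. The function names, constants, and two-phase timing are illustrative assumptions of ours (the paper does not publish this interface), mimicking what an LLM might emit for the example prompt above.

```python
import numpy as np

# Hypothetical functions an LLM could generate for the prompt
# "make the vase glow orange, then explode". Names and constants
# are illustrative, not taken from the paper.

def delta_position(xyz: np.ndarray, t: float) -> np.ndarray:
    """Radial outward displacement after t = 1 s (the 'explode' phase)."""
    if t < 1.0:
        return np.zeros_like(xyz)
    direction = xyz - xyz.mean(axis=0)  # push away from the centroid
    return direction * 2.0 * (t - 1.0)

def color(rgb: np.ndarray, t: float) -> np.ndarray:
    """Blend every Gaussian toward orange during the first second."""
    orange = np.array([1.0, 0.5, 0.0])
    w = min(t, 1.0)
    return (1.0 - w) * rgb + w * orange

def opacity(alpha: np.ndarray, t: float) -> np.ndarray:
    """Fade the Gaussians out while they fly apart."""
    return alpha * max(0.0, 1.0 - 0.5 * (t - 1.0)) if t > 1.0 else alpha

def animate_gaussians(xyz, rgb, alpha, t):
    """Evaluate the time-varying field on all Gaussians at time t."""
    return xyz + delta_position(xyz, t), color(rgb, t), opacity(alpha, t)
```

In this reading, playback is just calling `animate_gaussians` with an increasing `t` before each splat render, which is why no mesh extraction, rigging, or physics solver is involved and the update can run in real time.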
Related papers
- GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond [56.677984098204696]
Multimodal language models are driving the development of 3D Vision-Language Models (VLMs). We propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. We present the first Gaussian-splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images.
arXiv Detail & Related papers (2025-07-01T15:52:59Z) - SplatTalk: 3D VQA with Gaussian Splatting [13.211810095081159]
Language-guided 3D scene understanding is important for advancing applications in robotics, AR/VR, and human-computer interaction. We introduce SplatTalk, a novel method that uses a generalizable 3D Gaussian Splatting (3DGS) framework to produce 3D tokens suitable for direct input into a pretrained LLM.
arXiv Detail & Related papers (2025-03-08T16:31:48Z) - Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models [54.35214051961381]
3D meshes are widely used in computer vision and graphics for their efficiency in animation and minimal memory use in movies, games, AR, and VR. However, creating temporally consistent and realistic textures for meshes remains labor-intensive for professional artists. We present Tex4D, which integrates inherent geometry from mesh sequences with video diffusion models to produce consistent textures.
arXiv Detail & Related papers (2024-10-14T17:59:59Z) - Magic3DSketch: Create Colorful 3D Models From Sketch-Based 3D Modeling Guided by Text and Language-Image Pre-Training [2.9600148687385786]
Traditional methods like Computer-Aided Design (CAD) are often too labor-intensive and skill-demanding, making it challenging for novice users.
Our proposed method, Magic3DSketch, employs a novel technique that encodes sketches to predict a 3D mesh, guided by text descriptions.
Our method is also more useful and offers a higher degree of controllability compared to existing text-to-3D approaches.
arXiv Detail & Related papers (2024-07-27T09:59:13Z) - iHuman: Instant Animatable Digital Humans From Monocular Videos [16.98924995658091]
We present a fast, simple, yet effective method for creating animatable 3D digital humans from monocular videos.
This work achieves accurate 3D mesh-based modelling of the human body and illustrates why such modelling is needed.
Our method is faster by an order of magnitude (in terms of training time) than its closest competitor.
arXiv Detail & Related papers (2024-07-15T18:51:51Z) - MotionDreamer: Exploring Semantic Video Diffusion features for Zero-Shot 3D Mesh Animation [10.263762787854862]
We propose a technique for automatic re-animation of various 3D shapes based on a motion prior extracted from a video diffusion model.
We leverage an explicit mesh-based representation compatible with existing computer-graphics pipelines.
Our time-efficient zero-shot method achieves superior performance in re-animating a diverse set of 3D shapes.
arXiv Detail & Related papers (2024-05-30T15:30:38Z) - BAGS: Building Animatable Gaussian Splatting from a Monocular Video with Diffusion Priors [28.033944084753035]
We propose a method to build animatable 3D Gaussian Splatting from monocular video with diffusion priors.
The 3D Gaussian representations significantly accelerate the training and rendering process, and the diffusion priors allow the method to learn 3D models with limited viewpoints.
We perform an extensive evaluation across various real-world videos, demonstrating its superior performance compared to the current state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T02:44:46Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - 3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes [68.66237114509264]
We present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids.
We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space.
arXiv Detail & Related papers (2023-04-22T19:28:49Z) - 3D-CLFusion: Fast Text-to-3D Rendering with Contrastive Latent Diffusion [55.71215821923401]
We tackle the task of text-to-3D creation with pre-trained latent-based NeRFs (NeRFs that generate 3D objects given input latent code).
We propose a novel method named 3D-CLFusion which leverages the pre-trained latent-based NeRFs and performs fast 3D content creation in less than a minute.
arXiv Detail & Related papers (2023-03-21T15:38:26Z)