BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
- URL: http://arxiv.org/abs/2404.17672v3
- Date: Fri, 2 Aug 2024 21:33:21 GMT
- Title: BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
- Authors: Ian Huang, Guandao Yang, Leonidas Guibas
- Abstract summary: A vision-based edit generator and state evaluator work together to find the correct sequence of actions to achieve the goal.
Inspired by the role of visual imagination in the human design process, we supplement the visual reasoning capabilities of Vision-Language Models with "imagined" reference images.
- Score: 4.852796482609347
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Graphics design is important for various applications, including movie production and game design. To create a high-quality scene, designers usually need to spend hours in software like Blender, in which they might need to interleave and repeat operations, such as connecting material nodes, hundreds of times. Moreover, slightly different design goals may require completely different sequences, making automation difficult. In this paper, we propose a system that leverages Vision-Language Models (VLMs), like GPT-4V, to intelligently search the design action space to arrive at an answer that can satisfy a user's intent. Specifically, we design a vision-based edit generator and state evaluator to work together to find the correct sequence of actions to achieve the goal. Inspired by the role of visual imagination in the human design process, we supplement the visual reasoning capabilities of VLMs with "imagined" reference images from image-generation models, providing visual grounding of abstract language descriptions. In this paper, we provide empirical evidence suggesting our system can produce simple but tedious Blender editing sequences for tasks such as editing procedural materials and geometry from text and/or reference images, as well as adjusting lighting configurations for product renderings in complex scenes.
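The generator/evaluator loop described in the abstract can be pictured as follows. This is a minimal, self-contained sketch under assumed interfaces: every name here (render, propose_edits, select_best, refine) is a hypothetical stand-in, not the authors' code; the real system executes candidate edits through Blender's Python API and queries a VLM such as GPT-4V for both roles.

```python
"""Minimal sketch of a VLM edit-generator / state-evaluator loop.
All helpers are hypothetical stand-ins, not BlenderAlchemy's code."""

import random

def render(script: str) -> bytes:
    # Stand-in for executing `script` in Blender and rendering the scene.
    return script.encode()

def propose_edits(script: str, intent: str, image: bytes, n: int = 4) -> list:
    # Stand-in for the VLM edit generator: given the intent and a render
    # of the current state, return n candidate edited scripts.
    return [script + f"\n# candidate edit {i} toward: {intent}" for i in range(n)]

def select_best(images: list, intent: str) -> int:
    # Stand-in for the VLM state evaluator: pick the render closest to the
    # intent (or to an "imagined" reference from a text-to-image model).
    return random.randrange(len(images))

def refine(script: str, intent: str, depth: int = 5, width: int = 4) -> str:
    """Iteratively search the design action space for a satisfying edit."""
    for _ in range(depth):
        candidates = propose_edits(script, intent, render(script), width)
        script = candidates[select_best([render(c) for c in candidates], intent)]
    return script

edited = refine("import bpy  # initial scene script", "make the wood darker")
```

The depth and width parameters mirror the search flavor of the approach: each iteration fans out several candidate edits and keeps only the one whose render the evaluator prefers.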
Related papers
- PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions [66.92809850624118]
PixWizard is an image-to-image visual assistant designed for image generation, manipulation, and translation based on free-form language instructions.
We cast a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning dataset.
Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions.
arXiv Detail & Related papers (2024-09-23T17:59:46Z)
- Alfie: Democratising RGBA Image Generation With No $$$ [33.334956022229846]
We propose a fully-automated approach for obtaining RGBA illustrations by modifying the inference-time behavior of a pre-trained Diffusion Transformer model.
We force the generation of entire subjects without sharp croppings, whose background is easily removed for seamless integration into design projects or artistic scenes.
arXiv Detail & Related papers (2024-08-27T07:13:44Z)
- Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches [50.51643519253066]
3D Content Generation is at the heart of many computer graphics applications, including video gaming, film-making, virtual and augmented reality, etc.
This paper proposes a novel deep-learning based approach for automatically generating interactive and playable 3D game scenes.
arXiv Detail & Related papers (2024-08-08T16:27:37Z)
- Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts [76.73043724587679]
We propose a dialogue-based 3D scene editing approach, termed CE3D.
A Hash-Atlas represents 3D scene views as 2D atlas images, transferring the editing of 3D scenes onto 2D images.
Results demonstrate that CE3D effectively integrates multiple visual models to achieve diverse editing visual effects.
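The summary above implies a simple pattern: project the 3D scene into a 2D atlas, edit the atlas with ordinary 2D image models, then lift the result back. A rough, runnable sketch under that reading (all function names and the toy "edit" are hypothetical; the paper's actual Hash-Atlas is a learned mapping):

```python
"""Toy sketch of the "edit 3D via a 2D atlas" pattern; stand-ins only."""

def views_to_atlas(views: list) -> list:
    # Stand-in: flatten multi-view appearance into one 2D atlas "image".
    return [pixel for view in views for pixel in view]

def edit_atlas(atlas: list, instruction: str) -> list:
    # Stand-in: any off-the-shelf 2D editor (e.g. an instruction-guided
    # diffusion model) would modify the atlas here.
    return [min(1.0, p * 1.2) for p in atlas]  # toy "brighten" edit

def atlas_to_views(atlas: list, n_views: int) -> list:
    # Stand-in: map the edited atlas back onto each 3D scene view.
    k = len(atlas) // n_views
    return [atlas[i * k:(i + 1) * k] for i in range(n_views)]

views = [[0.2, 0.4], [0.6, 0.8]]  # two toy "renders"
edited = atlas_to_views(edit_atlas(views_to_atlas(views), "brighten"), 2)
```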
arXiv Detail & Related papers (2024-07-09T13:24:42Z)
- Generative AI in Color-Changing Systems: Re-Programmable 3D Object Textures with Material and Design Constraints [13.440729439462014]
We discuss the possibility of extending generative AI systems with material and design constraints for reprogrammable surfaces based on photochromic materials.
By constraining generative AI systems to colors and materials that can be physically realized with photochromic dyes, we can create tools that let users explore different viable patterns.
arXiv Detail & Related papers (2024-04-25T20:39:51Z)
- Re-Thinking Inverse Graphics With Large Language Models [51.333105116400205]
Inverse graphics -- inverting an image into physical variables that, when rendered, enable reproduction of the observed scene -- is a fundamental challenge in computer vision and graphics.
We propose the Inverse-Graphics Large Language Model (IG-LLM), an inverse-graphics framework centered around an LLM.
We incorporate a frozen pre-trained visual encoder and a continuous numeric head to enable end-to-end training.
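The "frozen visual encoder plus continuous numeric head" design can be illustrated in a few lines of PyTorch. This is a hedged sketch of the general idea, not IG-LLM's architecture; the dimensions, stand-in encoder, and toy loss are all assumptions:

```python
# Illustration of a frozen encoder feeding a trainable numeric head.
import torch
import torch.nn as nn

encoder = nn.Linear(3 * 224 * 224, 768)  # stand-in for a pre-trained ViT
for p in encoder.parameters():
    p.requires_grad = False               # keep the visual encoder frozen

head = nn.Sequential(                     # numeric head: regresses continuous
    nn.Linear(768, 256), nn.GELU(),       # scene parameters instead of tokens
    nn.Linear(256, 9),                    # e.g. translation, rotation, scale
)

image = torch.randn(1, 3 * 224 * 224)
params = head(encoder(image))             # differentiable end to end
loss = params.square().mean()             # toy regression loss
loss.backward()                           # gradients reach only the head
```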
arXiv Detail & Related papers (2024-04-23T16:59:02Z)
- FLARE: Fast Learning of Animatable and Relightable Mesh Avatars [64.48254296523977]
Our goal is to efficiently learn, from videos, personalized animatable 3D head avatars that are geometrically accurate, realistic, relightable, and compatible with current rendering systems.
We introduce FLARE, a technique that enables the creation of animatable and relightable avatars from a single monocular video.
arXiv Detail & Related papers (2023-10-26T16:13:00Z)
- UrbanGIRAFFE: Representing Urban Scenes as Compositional Generative Neural Feature Fields [22.180286908121946]
We propose UrbanGIRAFFE, which uses a coarse 3D panoptic prior to guide a 3D-aware generative model.
Our model is compositional and controllable as it breaks down the scene into stuff, objects, and sky.
With proper loss functions, our approach facilitates photorealistic 3D-aware image synthesis with diverse controllability.
arXiv Detail & Related papers (2023-03-24T17:28:07Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
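One common way such a dynamics model enables control is random-shooting model-predictive control: sample candidate action sequences, roll each through the learned dynamics, and execute the first action of the cheapest rollout. A generic, runnable sketch of that pattern (stand-in functions; not the paper's implementation):

```python
"""Generic latent-dynamics MPC sketch; all models are toy stand-ins."""

import random

def encode(obs: float) -> float:
    return obs                    # stand-in: 2D observation -> latent state

def dynamics(z: float, a: float) -> float:
    return z + a                  # stand-in: learned transition model

def cost(z: float, goal: float) -> float:
    return abs(z - goal)          # stand-in: task cost in latent space

def plan(z: float, goal: float, horizon: int = 5, samples: int = 64) -> float:
    """Sample action sequences, roll them through the dynamics model,
    and return the first action of the cheapest sequence."""
    best_action, best_cost = 0.0, float("inf")
    for _ in range(samples):
        seq = [random.uniform(-1.0, 1.0) for _ in range(horizon)]
        z_t = z
        for a in seq:
            z_t = dynamics(z_t, a)
        c = cost(z_t, goal)
        if c < best_cost:
            best_action, best_cost = seq[0], c
    return best_action

action = plan(encode(0.0), goal=3.0)
```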
arXiv Detail & Related papers (2021-07-08T17:49:37Z)