HiMat: DiT-based Ultra-High Resolution SVBRDF Generation
- URL: http://arxiv.org/abs/2508.07011v4
- Date: Tue, 07 Oct 2025 01:56:25 GMT
- Title: HiMat: DiT-based Ultra-High Resolution SVBRDF Generation
- Authors: Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang,
- Abstract summary: HiMat is a diffusion-based framework tailored for efficient and diverse 4K SVBRDF generation. CrossStitch is a lightweight convolutional module that enforces cross-map consistency without incurring the cost of global attention.
- Score: 26.081964370337943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Creating ultra-high-resolution spatially varying bidirectional reflectance distribution functions (SVBRDFs) is critical for photorealistic 3D content creation, as it faithfully represents the fine-scale surface details required for close-up rendering. However, achieving 4K generation faces two key challenges: (1) the need to synthesize multiple reflectance maps at full resolution, which multiplies the pixel budget and imposes prohibitive memory and computational cost, and (2) the requirement to maintain strong pixel-level alignment across maps at 4K, which is particularly difficult when adapting pretrained models designed for the RGB image domain. We introduce HiMat, a diffusion-based framework tailored for efficient and diverse 4K SVBRDF generation. To address the first challenge, HiMat performs generation in a high-compression latent space via DC-AE and employs a pretrained diffusion transformer with linear attention to improve per-map efficiency. To address the second challenge, we propose CrossStitch, a lightweight convolutional module that enforces cross-map consistency without incurring the cost of global attention. Our experiments show that HiMat achieves high-fidelity 4K SVBRDF generation with superior efficiency, structural consistency, and diversity compared to prior methods. Beyond materials, our framework also generalizes to related applications such as intrinsic decomposition.
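The abstract describes the architecture only at a high level. As a hypothetical illustration, a lightweight convolutional cross-map consistency module in the spirit of CrossStitch might look like the PyTorch sketch below, operating on per-map latents such as those produced by a DC-AE encoder. All names, shapes, and layer choices here are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a CrossStitch-style module: a small convolution mixes
# information across the SVBRDF maps (e.g. albedo, normal, roughness, metallic)
# at every spatial location, without global attention.
# NOT the authors' code; names and shapes are assumptions.
import torch
import torch.nn as nn


class CrossStitchBlock(nn.Module):
    def __init__(self, num_maps: int, channels: int):
        super().__init__()
        # Stack the latents of all maps along the channel axis and let a
        # 3x3 convolution exchange information between them; the cost stays
        # linear in spatial resolution.
        self.mix = nn.Conv2d(num_maps * channels, num_maps * channels,
                             kernel_size=3, padding=1)
        # One normalization group per map.
        self.norm = nn.GroupNorm(num_groups=num_maps,
                                 num_channels=num_maps * channels)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, num_maps, channels, h, w) -- per-map latents
        b, m, c, h, w = latents.shape
        x = latents.reshape(b, m * c, h, w)
        x = x + self.mix(self.norm(x))        # residual cross-map mixing
        return x.reshape(b, m, c, h, w)


if __name__ == "__main__":
    block = CrossStitchBlock(num_maps=4, channels=32)
    z = torch.randn(1, 4, 32, 64, 64)         # 4 maps in a compressed latent space
    print(block(z).shape)                      # torch.Size([1, 4, 32, 64, 64])
```

The key design point, as stated in the abstract, is that cross-map consistency is enforced by local convolution rather than global attention, so memory does not blow up at 4K resolution.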
Related papers
- Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing [62.94394079771687]
A burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. We propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We show that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both Text-to-Image (T2I) and image editing tasks.
arXiv Detail & Related papers (2025-12-19T18:59:57Z) - Scale-DiT: Ultra-High-Resolution Image Generation with Hierarchical Local Attention [50.391914489898774]
Scale-DiT is a new diffusion framework that introduces hierarchical local attention with low-resolution global guidance. A lightweight LoRA adaptation bridges global and local pathways during denoising, ensuring consistency across structure and detail. Experiments demonstrate that Scale-DiT achieves more than $2\times$ faster inference and lower memory usage.
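For context, the local-attention idea mentioned above can be illustrated with a minimal windowed self-attention sketch: attention is restricted to fixed-size spatial windows, so the cost grows linearly with image area rather than quadratically in the number of tokens. This is a generic illustration under assumed tensor shapes, not Scale-DiT's implementation.

```python
# Minimal windowed (local) self-attention sketch; a generic illustration of
# the idea behind hierarchical local attention, not Scale-DiT's code.
import torch
import torch.nn.functional as F


def windowed_attention(x: torch.Tensor, window: int) -> torch.Tensor:
    # x: (batch, h, w, channels); h and w are assumed divisible by `window`.
    b, h, w, c = x.shape
    # Partition into non-overlapping windows and attend within each window only.
    x = x.reshape(b, h // window, window, w // window, window, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, c)
    out = F.scaled_dot_product_attention(x, x, x)   # per-window self-attention
    # Undo the window partitioning.
    out = out.reshape(b, h // window, w // window, window, window, c)
    out = out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)
    return out


if __name__ == "__main__":
    y = windowed_attention(torch.randn(1, 64, 64, 16), window=8)
    print(y.shape)                                   # torch.Size([1, 64, 64, 16])
```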
arXiv Detail & Related papers (2025-10-18T03:15:26Z) - High-resolution Photo Enhancement in Real-time: A Laplacian Pyramid Network [73.19214585791268]
This paper introduces a pyramid network called LLF-LUT++, which integrates global and local operators through closed-form Laplacian pyramid decomposition and reconstruction. Specifically, we utilize an image-adaptive 3D LUT that capitalizes on the global tonal characteristics of downsampled images. LLF-LUT++ not only achieves a 2.64 dB improvement in PSNR on the HDR+ dataset, but also runs efficiently, processing 4K-resolution images in just 13 ms on a single GPU.
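As background, the closed-form Laplacian pyramid decomposition and reconstruction that LLF-LUT++ builds on can be sketched in a few lines. This is the textbook construction under assumed tensor shapes and resampling choices, not the paper's code.

```python
# Generic sketch of closed-form Laplacian pyramid decomposition/reconstruction
# (the classic construction the paper builds on), not the paper's code.
import torch
import torch.nn.functional as F


def laplacian_pyramid(img: torch.Tensor, levels: int):
    # img: (batch, channels, h, w) with h, w divisible by 2**levels.
    pyr, current = [], img
    for _ in range(levels):
        down = F.avg_pool2d(current, kernel_size=2)              # low-pass + subsample
        up = F.interpolate(down, scale_factor=2, mode="bilinear",
                           align_corners=False)
        pyr.append(current - up)                                  # band-pass residual
        current = down
    pyr.append(current)                                           # coarsest level
    return pyr


def reconstruct(pyr):
    # Exact inverse of the decomposition above (same upsampler is reused).
    current = pyr[-1]
    for residual in reversed(pyr[:-1]):
        current = residual + F.interpolate(current, scale_factor=2,
                                           mode="bilinear", align_corners=False)
    return current
```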
arXiv Detail & Related papers (2025-10-13T16:52:32Z) - Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers [45.701222598522456]
Pixel-Perfect Depth is a monocular depth estimation model based on pixel-space diffusion generation. Our model achieves the best performance among all published generative models across five benchmarks.
arXiv Detail & Related papers (2025-10-08T17:59:33Z) - EfficienT-HDR: An Efficient Transformer-Based Framework via Multi-Exposure Fusion for HDR Reconstruction [0.0]
This study proposes a lightweight Vision Transformer architecture designed explicitly for HDR reconstruction. It employs an Intersection-Aware Adaptive Fusion module to suppress ghosting effectively. Experimental results demonstrate that, compared to the baseline, the main version reduces FLOPs by approximately 67%.
arXiv Detail & Related papers (2025-09-24T06:01:37Z) - Look-Around Before You Leap: High-Frequency Injected Transformer for Image Restoration [46.96362010335177]
In this paper, we propose HIT, a simple yet effective High-frequency Injected Transformer for image restoration.
Specifically, we design a window-wise injection module (WIM), which incorporates abundant high-frequency details into the feature map, to provide reliable references for restoring high-quality images.
In addition, we introduce a spatial enhancement unit (SEU) to preserve essential spatial relationships that may be lost due to the computations carried out across channel dimensions in the BIM.
arXiv Detail & Related papers (2024-03-30T08:05:00Z) - VST++: Efficient and Stronger Visual Saliency Transformer [74.26078624363274]
We develop an efficient and stronger VST++ model to explore global long-range dependencies.
We evaluate our model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets.
arXiv Detail & Related papers (2023-10-18T05:44:49Z) - Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z) - Probabilistic-based Feature Embedding of 4-D Light Fields for Compressive Imaging and Denoising [62.347491141163225]
4-D light field (LF) poses great challenges in achieving efficient and effective feature embedding.
We propose a probabilistic-based feature embedding (PFE), which learns a feature embedding architecture by assembling various low-dimensional convolution patterns.
Our experiments demonstrate the significant superiority of our methods on both real-world and synthetic 4-D LF images.
arXiv Detail & Related papers (2023-06-15T03:46:40Z) - Joint Super-Resolution and Inverse Tone-Mapping: A Feature Decomposition Aggregation Network and A New Benchmark [0.0]
We propose a lightweight Feature Decomposition Aggregation Network (FDAN) to exploit the potential power of decomposition mechanism.
In particular, we design a Feature Decomposition Block (FDB) to achieve learnable separation of detail and base feature maps.
We also collect a large-scale dataset for joint SR-ITM, i.e., SRITM-4K, which provides versatile scenarios for robust model training and evaluation.
arXiv Detail & Related papers (2022-07-07T15:16:36Z) - MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z) - Improved Transformer for High-Resolution GANs [69.42469272015481]
We introduce two key ingredients to the Transformer architecture to address this challenge.
We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet $128 \times 128$ and FFHQ $256 \times 256$, respectively.
arXiv Detail & Related papers (2021-06-14T17:39:49Z) - ResT: An Efficient Transformer for Visual Recognition [5.807423409327807]
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition.
We show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone.
arXiv Detail & Related papers (2021-05-28T08:53:54Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z) - CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation [95.51455777713092]
Convolutional neural networks (CNNs) have been the de facto standard for 3D medical image segmentation.
We propose a novel framework that efficiently bridges a Convolutional neural network and a Transformer (CoTr) for accurate 3D medical image segmentation.
arXiv Detail & Related papers (2021-03-04T13:34:22Z) - Light Field Reconstruction via Deep Adaptive Fusion of Hybrid Lenses [67.01164492518481]
This paper explores the problem of reconstructing high-resolution light field (LF) images from hybrid lenses.
We propose a novel end-to-end learning-based approach, which can comprehensively utilize the specific characteristics of the input.
Our framework could potentially decrease the cost of high-resolution LF data acquisition and benefit LF data storage and transmission.
arXiv Detail & Related papers (2021-02-14T06:44:47Z) - Bayesian Image Reconstruction using Deep Generative Models [7.012708932320081]
In this work, we leverage state-of-the-art (SOTA) generative models for building powerful image priors.
Our method, called Bayesian Reconstruction through Generative Models (BRGM), uses a single pre-trained generator model to solve different image restoration tasks.
arXiv Detail & Related papers (2020-12-08T17:11:26Z)