LumiX: Structured and Coherent Text-to-Intrinsic Generation
- URL: http://arxiv.org/abs/2512.02781v1
- Date: Tue, 02 Dec 2025 13:56:02 GMT
- Title: LumiX: Structured and Coherent Text-to-Intrinsic Generation
- Authors: Xu Han, Biao Zhang, Xiangjun Tang, Xianzhi Li, Peter Wonka
- Abstract summary: We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score. It can also perform image-conditioned decomposition within the same framework.
- Score: 56.659456254026985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.
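The abstract's first contribution, Query-Broadcast Attention, can be sketched as follows. This is a minimal illustrative reading, not the paper's implementation: how the shared queries are actually formed (here, a mean over maps) and the projection weights are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_broadcast_attention(maps, wq, wk, wv):
    """Self-attention over several intrinsic maps that shares one set of
    queries across all maps (a rough sketch of Query-Broadcast Attention).

    maps:       list of (tokens, dim) arrays, one per intrinsic map
                (e.g. albedo, normal, depth)
    wq, wk, wv: (dim, dim) projection weights (shared across maps here
                purely for brevity; this is an assumption of the sketch)
    """
    # Shared queries: computed once (here from the mean over maps, an
    # illustrative choice) and broadcast to every map, which ties the
    # maps' attention patterns together for structural consistency.
    shared = np.mean(maps, axis=0)
    q = shared @ wq
    outputs = []
    for x in maps:
        k, v = x @ wk, x @ wv                       # keys/values stay map-specific
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        outputs.append(attn @ v)                    # every map attends with the same Q
    return outputs
```

Because every map's attention weights are computed from the same queries, corresponding tokens across albedo, normal, depth, etc. aggregate information with aligned attention patterns, which is the intuition behind the structural consistency claim.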
Related papers
- Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition [7.632962062462334]
Zero-shot Handwritten Chinese Character Recognition aims to recognize unseen characters by leveraging radical-based semantic compositions. We propose an Entropy-Aware Structural Alignment Network that bridges the visual-semantic gap through information-theoretic modeling. Our method establishes new state-of-the-art performance, achieving an accuracy of 55.04% on the ICDAR 2013 dataset.
arXiv Detail & Related papers (2026-02-03T16:08:40Z) - Improving LLM Reasoning with Homophily-aware Structural and Semantic Text-Attributed Graph Compression [55.51959317490934]
Large language models (LLMs) have demonstrated promising capabilities in Text-Attributed Graph (TAG) understanding. We argue that graphs inherently contain rich structural and semantic information, and that their effective exploitation can unlock gains in LLM reasoning performance. We propose Homophily-aware Structural and Semantic Compression for LLMs (HS2C), a framework centered on exploiting graph homophily.
arXiv Detail & Related papers (2026-01-13T03:35:18Z) - Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion [31.189038928192648]
Co2S is a semi-supervised RS segmentation framework that fuses priors from vision-language models and self-supervised models. An explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries. Experiments on six popular datasets demonstrate the superiority of the proposed method.
arXiv Detail & Related papers (2025-12-28T18:24:19Z) - Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers [55.15722080205737]
Edit2Perceive is a unified diffusion framework that adapts editing models for depth, normal, and matting. Our single-step deterministic inference yields faster runtime while training on relatively small datasets.
arXiv Detail & Related papers (2025-11-24T01:13:51Z) - Robust Image Stitching with Optimal Plane [39.80133570371559]
RopStitch is an unsupervised deep image stitching framework with both robustness and naturalness. RopStitch significantly outperforms existing methods, particularly in scene robustness and content naturalness.
arXiv Detail & Related papers (2025-08-07T23:53:26Z) - CorrMoE: Mixture of Experts with De-stylization Learning for Cross-Scene and Cross-Domain Correspondence Pruning [30.111296778234124]
CorrMoE is a correspondence pruning framework that enhances robustness under cross-domain and cross-scene variations. For scene diversity, we design a Bi-Fusion Mixture of Experts module that adaptively integrates multi-perspective features. Experiments on benchmark datasets demonstrate that CorrMoE achieves superior accuracy and generalization compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-07-16T01:44:01Z) - When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product [21.018675431494838]
State-of-the-art embeddings often capture distinct yet complementary discriminative features. We propose a principled approach to fuse such complementary representations through kernel multiplication. We develop RP-KrossFuse, a scalable variant that leverages random projections for efficient approximation.
arXiv Detail & Related papers (2025-06-10T09:57:58Z) - Unlocking Multi-Modal Potentials for Link Prediction on Dynamic Text-Attributed Graphs [28.533930417703715]
Dynamic Text-Attributed Graphs (DyTAGs) are a novel graph paradigm that captures evolving temporal events (edges) alongside rich textual attributes. MoMent is a multi-modal model that explicitly models, integrates, and aligns each modality to learn node representations for link prediction. Experiments show that MoMent achieves up to 17.28% accuracy improvement and up to 31x speed-up against eight baselines.
arXiv Detail & Related papers (2025-02-27T00:49:44Z) - Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z) - DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment [124.57488600605822]
Cross-modal garment synthesis and manipulation will significantly benefit the way fashion designers generate garments.
We introduce DiffCloth, a diffusion-based pipeline for cross-modal garment synthesis and manipulation.
Experiments on the CM-Fashion benchmark demonstrate that DiffCloth yields state-of-the-art garment synthesis results.
arXiv Detail & Related papers (2023-08-22T05:43:33Z) - Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z) - Syntactically Robust Training on Partially-Observed Data for Open Information Extraction [25.59133746149343]
Open Information Extraction models have shown promising results with sufficient supervision.
We propose a syntactically robust training framework that enables models to be trained on a syntactically abundant distribution.
arXiv Detail & Related papers (2023-01-17T12:39:13Z) - Image Synthesis via Semantic Composition [74.68191130898805]
We present a novel approach to synthesize realistic images based on their semantic layouts.
It hypothesizes that objects with similar appearance share similar representations.
Our method establishes dependencies between regions according to their appearance correlation, yielding both spatially variant and associated representations.
arXiv Detail & Related papers (2021-09-15T02:26:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.