Positional Encoding Field
- URL: http://arxiv.org/abs/2510.20385v1
- Date: Thu, 23 Oct 2025 09:32:37 GMT
- Title: Positional Encoding Field
- Authors: Yunpeng Bai, Haoxiang Li, Qixing Huang
- Abstract summary: Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation. We revisit how DiTs organize visual content and discover that patch tokens exhibit a surprising degree of independence. We introduce the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field.
- Score: 44.0217294710719
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation, powering state-of-the-art image and video models. By representing images as patch tokens with positional encodings (PEs), DiTs combine Transformer scalability with spatial and temporal inductive biases. In this work, we revisit how DiTs organize visual content and discover that patch tokens exhibit a surprising degree of independence: even when PEs are perturbed, DiTs still produce globally coherent outputs, indicating that spatial coherence is primarily governed by PEs. Motivated by this finding, we introduce the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field. PE-Field incorporates depth-aware encodings for volumetric reasoning and hierarchical encodings for fine-grained sub-patch control, enabling DiTs to model geometry directly in 3D space. Our PE-Field-augmented DiT achieves state-of-the-art performance on single-image novel view synthesis and generalizes to controllable spatial image editing.
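The abstract's idea of moving positional encodings from the 2D plane to a 3D field can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain sinusoidal encodings, a hypothetical per-patch depth map as input, and invented function names (`sinusoidal_1d`, `encode_3d`, `patch_grid_encoding`). The hierarchical sub-patch encodings mentioned in the abstract are not modelled here.

```python
import math


def sinusoidal_1d(pos, dim):
    """Encode a scalar position as `dim` sinusoidal channels (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    # Geometric frequency ladder, as in the standard Transformer encoding.
    freqs = [1.0 / (10000.0 ** (i / dim)) for i in range(0, dim, 2)]
    return [math.sin(pos * f) for f in freqs] + [math.cos(pos * f) for f in freqs]


def encode_3d(x, y, z, dim_per_axis=8):
    """Concatenate per-axis encodings for a 3D position (x, y, z)."""
    return (
        sinusoidal_1d(x, dim_per_axis)
        + sinusoidal_1d(y, dim_per_axis)
        + sinusoidal_1d(z, dim_per_axis)
    )


def patch_grid_encoding(depth, dim_per_axis=8):
    """Build one encoding per patch from its grid position and depth.

    `depth` is an H x W nested list of per-patch depth values; it is an
    assumed input (for example from a depth estimator), not something the
    abstract specifies.
    """
    return [
        [encode_3d(col, row, d, dim_per_axis) for col, d in enumerate(row_depths)]
        for row, row_depths in enumerate(depth)
    ]


# Assumed example: a 2 x 3 patch grid with a per-patch depth value.
depth = [[0.0, 0.0, 2.5], [0.0, 1.0, 2.5]]
pe = patch_grid_encoding(depth)
print(len(pe), len(pe[0]), len(pe[0][0]))  # 2 3 24
```

In this sketch, two patches at the same grid position but different depths share the x and y channels and differ only in the depth channels, which is one simple way to make a token's position depth-aware.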
Related papers
- CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection [21.94827944503605]
Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. We propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones.
arXiv Detail & Related papers (2026-03-05T10:49:46Z)
- Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing [62.94394079771687]
A burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. We propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We show that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both Text-to-Image (T2I) and image editing tasks.
arXiv Detail & Related papers (2025-12-19T18:59:57Z)
- FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers [91.59069344768858]
We introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for 3D detection transformer decoder. FreqPDE combines the 2D image features and 3D position embeddings to generate 3D depth-aware features for query decoding.
arXiv Detail & Related papers (2025-10-17T07:36:54Z)
- Cameras as Relative Positional Encoding [37.675563572777136]
Multi-view transformers must use camera geometry to ground visual tokens in 3D space. We show how relative camera conditioning improves performance in feedforward novel view synthesis. We then verify that these benefits persist for different tasks, stereo depth estimation and discriminative cognition, as well as larger model sizes.
arXiv Detail & Related papers (2025-07-14T17:22:45Z)
- CrossModalityDiffusion: Multi-Modal Novel View Synthesis with Unified Intermediate Representation [0.5242869847419834]
CrossModalityDiffusion is a modular framework designed to generate images across different modalities without prior knowledge of scene geometry. We show that jointly training different modules ensures consistent geometric understanding across all modalities within the framework. We validate CrossModalityDiffusion's capabilities on the synthetic ShapeNet cars dataset.
arXiv Detail & Related papers (2025-01-16T20:56:32Z)
- Pixel-Aligned Multi-View Generation with Depth Guided Decoder [86.1813201212539]
We propose a novel method for pixel-level image-to-multi-view generation.
Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model.
Our model enables better pixel alignment across multi-view images.
arXiv Detail & Related papers (2024-08-26T04:56:41Z)
- GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers [63.41460219156508]
We argue that existing positional encoding schemes are suboptimal for 3D vision tasks.
We propose a geometry-aware attention mechanism that encodes the geometric structure of tokens as relative transformation.
We show that our attention, called Geometric Transform Attention (GTA), improves learning efficiency and performance of state-of-the-art transformer-based NVS models.
arXiv Detail & Related papers (2023-10-16T13:16:09Z)
- High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization [51.878078860524795]
We present a high-fidelity 3D generative adversarial network (GAN) inversion framework that can synthesize photo-realistic novel views.
Our approach enables high-fidelity 3D rendering from a single image, which is promising for various applications of AI-generated 3D content.
arXiv Detail & Related papers (2022-11-28T18:59:52Z)
- PSFormer: Point Transformer for 3D Salient Object Detection [8.621996554264275]
PSFormer is an encoder-decoder network that takes full advantage of transformers to model contextual information.
In the encoder, we develop a Point Context Transformer (PCT) module to capture region contextual features at the point level.
In the decoder, we develop a Scene Context Transformer (SCT) module to learn context representations at the scene level.
arXiv Detail & Related papers (2022-10-28T06:34:28Z)
- Geometry Attention Transformer with Position-aware LSTMs for Image Captioning [8.944233327731245]
This paper proposes an improved Geometry Attention Transformer (GAT) model.
In order to further leverage geometric information, two novel geometry-aware architectures are designed.
Our GAT could often outperform current state-of-the-art image captioning models.
arXiv Detail & Related papers (2021-10-01T11:57:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.