GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation
- URL: http://arxiv.org/abs/2509.01109v2
- Date: Fri, 19 Sep 2025 10:05:36 GMT
- Title: GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation
- Authors: Zhengqiang Zhang, Rongyuan Wu, Lingchen Sun, Lei Zhang
- Abstract summary: GPSToken is a novel Gaussian Parameterized Spatially-adaptive Tokenization framework. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation.
- Score: 19.94399008500357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. In this work, we propose $\textbf{GPSToken}$, a novel $\textbf{G}$aussian $\textbf{P}$arameterized $\textbf{S}$patially-adaptive $\textbf{Token}$ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively. Codes and models of GPSToken can be found at https://github.com/xtudbxk/GPSToken.
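The decoding step described in the abstract (Gaussian parameterized tokens rendered into a 2D feature map) can be illustrated with a small NumPy sketch. This is not the authors' renderer: the function name `splat_tokens`, the pixel-coordinate convention, and the normalized weighted blending are all assumptions chosen for illustration, and the real implementation in the linked repository may differ.

```python
import numpy as np

def splat_tokens(means, covs, feats, height, width, eps=1e-8):
    """Render N Gaussian tokens into an (H, W, C) feature map.

    Hypothetical helper, not taken from the GPSToken code base.
    means: (N, 2) centers as (x, y) in pixel coordinates
    covs:  (N, 2, 2) symmetric positive-definite covariance matrices
    feats: (N, C) one texture feature vector per token
    """
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float64)  # (P, 2)
    diff = grid[None, :, :] - means[:, None, :]                # (N, P, 2)
    inv = np.linalg.inv(covs)                                  # (N, 2, 2)
    # Squared Mahalanobis distance of every pixel to every Gaussian.
    maha = np.einsum("npi,nij,npj->np", diff, inv, diff)       # (N, P)
    weights = np.exp(-0.5 * maha)                              # (N, P)
    # Assumption: blend features with weights normalized per pixel.
    norm = weights.sum(axis=0, keepdims=True) + eps
    fmap = (weights / norm).T @ feats                          # (P, C)
    return fmap.reshape(height, width, -1)

# Two tokens: one isotropic, one elongated and tilted.
means = np.array([[4.0, 4.0], [12.0, 10.0]])
covs = np.array([[[9.0, 0.0], [0.0, 9.0]],
                 [[16.0, 6.0], [6.0, 4.0]]])
feats = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
fmap = splat_tokens(means, covs, feats, 16, 16)
print(fmap.shape)
```

Every operation here is differentiable with respect to the means, covariances, and features, which is the property that lets a renderer of this kind sit between the tokenizer and a standard decoder during end-to-end training.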
Related papers
- GSPN-2: Efficient Parallel Sequence Modeling [101.33780567131716]
Generalized Spatial Propagation Network (GSPN) addresses this by replacing quadratic self-attention with a line-scan propagation scheme. GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications.
arXiv Detail & Related papers (2025-11-28T07:26:45Z)
- Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting [4.2390854432099205]
Modern vision language pipelines are driven by RGB vision encoders trained on massive image text corpora. These pipelines inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy intensive and costly, and (ii) patch based tokenization explodes sequence length. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment.
arXiv Detail & Related papers (2025-09-26T17:41:57Z)
- 2D Gaussian Splatting with Semantic Alignment for Image Inpainting [46.266955851252504]
We propose the first image inpainting framework based on 2D Gaussian Splatting. For global semantic consistency, we incorporate features from a pretrained DINO model. Our method achieves competitive performance in both quantitative metrics and perceptual quality.
arXiv Detail & Related papers (2025-09-02T05:12:52Z)
- RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration [10.88046882501116]
RegGS is a 3D Gaussian registration-based framework for reconstructing unposed views. We implement an entropy-regularized Sinkhorn algorithm to efficiently solve the optimal transport Mixture 2-Wasserstein $(\text{MW}_2)$ distance. We also design a joint 3DGS registration module that integrates the $\text{MW}_2$ distance, photometric consistency, and depth geometry.
arXiv Detail & Related papers (2025-07-10T19:56:08Z)
- GViT: Representing Images as Gaussians for Visual Recognition [54.46109876668194]
We introduce GViT, a classification framework that abandons conventional pixel or patch grid input representations in favor of a compact set of learnable 2D Gaussians. We demonstrate that 2D Gaussian input representations, coupled with our GViT guidance and a relatively standard ViT architecture, closely match the performance of a traditional patch-based ViT.
arXiv Detail & Related papers (2025-06-30T05:44:14Z)
- Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation [57.56385490252605]
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. We propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste.
arXiv Detail & Related papers (2025-05-24T21:30:29Z)
- GaussianToken: An Effective Image Tokenizer with 2D Gaussian Splatting [64.84383010238908]
We propose an effective image tokenizer with 2D Gaussian Splatting as a solution. In general, our framework integrates the local influence of 2D Gaussian distribution into the discrete space. Competitive reconstruction performances on CIFAR, Mini-Net, and ImageNet-1K demonstrate the effectiveness of our framework.
arXiv Detail & Related papers (2025-01-26T17:56:11Z)
- RefineStyle: Dynamic Convolution Refinement for StyleGAN [15.230430037135017]
In StyleGAN, convolution kernels are shaped by both static parameters shared across images.
$\mathcal{W}+$ space is often used for image inversion and editing.
This paper proposes an efficient refining strategy for dynamic kernels.
arXiv Detail & Related papers (2024-10-08T15:01:30Z)
- Image-GS: Content-Adaptive Image Representation via 2D Gaussians [52.598772767324036]
We introduce Image-GS, a content-adaptive image representation based on 2D Gaussians radiance. It supports hardware-friendly rapid access for real-time usage, requiring only 0.3K MACs to decode a pixel. We demonstrate its versatility with several applications, including texture compression, semantics-aware compression, and joint image compression and restoration.
arXiv Detail & Related papers (2024-07-02T00:45:21Z)
- SG-Former: Self-guided Transformer with Evolving Token Reallocation [89.9363449724261]
We propose a novel model, termed as Self-guided Transformer, towards effective global self-attention with adaptive fine granularity.
We assign more tokens to the salient regions for achieving fine-grained attention, while allocating fewer tokens to the minor regions in exchange for efficiency and global receptive fields.
The proposed SG-Former achieves performance superior to the state of the art: our base size model achieves 84.7% Top-1 accuracy on ImageNet-1K, 51.2mAP BBAP on CoCo, 52.7mIoU
arXiv Detail & Related papers (2023-08-23T15:52:45Z)
- Near Perfect GAN Inversion [17.745342857726925]
We derive an algorithm that achieves near perfect reconstructions of photos.
We show that this approach can not only produce synthetic images that are indistinguishable from the real photos we wish to replicate, but that these images are readily editable.
arXiv Detail & Related papers (2022-02-23T23:58:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.