The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
- URL: http://arxiv.org/abs/2512.19693v1
- Date: Mon, 22 Dec 2025 18:59:57 GMT
- Title: The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
- Authors: Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu
- Abstract summary: Semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders retain high-frequency information that conveys fine-grained detail. We propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator.
- Score: 82.53463660564933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a highly inspiring and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, just like the prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.
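The spectral analysis the abstract describes (comparing how much low- vs. high-frequency energy an encoder's features carry) can be illustrated with a radially averaged power spectrum. This is a minimal sketch, not the paper's actual analysis pipeline; the toy "semantic-like" and "pixel-like" maps below are synthetic stand-ins for real encoder feature maps:

```python
import numpy as np

def radial_power_spectrum(feat):
    """Radially averaged power spectrum of a 2-D feature map.

    feat: (H, W) array, e.g. one channel of an encoder's feature map.
    Returns a 1-D array of mean power per integer radius (low -> high frequency).
    """
    f = np.fft.fftshift(np.fft.fft2(feat))
    power = np.abs(f) ** 2
    h, w = feat.shape
    y, x = np.indices((h, w))
    r = np.sqrt((y - h // 2) ** 2 + (x - w // 2) ** 2).astype(int)
    return np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())

# Toy illustration: a smooth low-frequency map vs. one with added high-frequency noise.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:32, 0:32]
smooth = np.sin(2 * np.pi * yy / 32)                   # "semantic-like": energy at low radii
noisy = smooth + 0.5 * rng.standard_normal((32, 32))   # "pixel-like": extra high-frequency energy

s_spec = radial_power_spectrum(smooth)
n_spec = radial_power_spectrum(noisy)
# The high-frequency tail carries more energy for the noisy map.
assert n_spec[8:].sum() > s_spec[8:].sum()
```

Under the Prism Hypothesis reading, a semantic encoder's spectrum would resemble `s_spec` (energy concentrated at low radii), while a pixel encoder's would retain a tail like `n_spec`.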
Related papers
- Phi-SegNet: Phase-Integrated Supervision for Medical Image Segmentation [1.76179873429447]
We propose Phi-SegNet, a CNN-based architecture that incorporates phase-aware information at both architectural and optimization levels. Phi-SegNet consistently achieved state-of-the-art performance on five public datasets spanning X-ray, US, histopathology, MRI, and colonoscopy.
arXiv Detail & Related papers (2026-01-22T16:00:41Z) - VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction [83.50898344094153]
VQRAE produces continuous semantic features for image understanding and discrete tokens for visual generation within a unified tokenizer. This design incurs negligible semantic information loss from discrete tokenization, maintaining the ability of multimodal understanding. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction.
arXiv Detail & Related papers (2025-11-28T17:26:34Z) - Hyperspectral Adapter for Semantic Segmentation with Vision Foundation Models [18.24287471339871]
Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. Variability in channel dimensionality and captured wavelengths among spectral cameras impedes the development of AI-driven methodologies. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features, and achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods.
arXiv Detail & Related papers (2025-09-24T13:32:07Z) - Conceptualizing Multi-scale Wavelet Attention and Ray-based Encoding for Human-Object Interaction Detection [15.125734989910429]
We propose a wavelet attention-like backbone and a ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from convolutional filters. Our decoder aligns query embeddings with emphasized regions of interest for accurate predictions.
arXiv Detail & Related papers (2025-07-15T04:44:54Z) - CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis [69.02751635551724]
Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding. Variability in channel dimensionality and captured wavelengths among spectral cameras impedes the development of AI-driven methodologies. We introduce CARL, a model for Camera-Agnostic Representation Learning across RGB, multispectral, and hyperspectral imaging modalities.
arXiv Detail & Related papers (2025-04-27T13:06:40Z) - Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on the Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z) - Convolutional Autoencoder for Blind Hyperspectral Image Unmixing [0.0]
Spectral unmixing is a technique to decompose a mixed pixel into two fundamental representatives: endmembers and abundances.
In this paper, a novel architecture is proposed to perform blind unmixing on hyperspectral images.
arXiv Detail & Related papers (2020-11-18T17:41:31Z)
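The endmembers-and-abundances decomposition in the last entry rests on the linear mixing model: each pixel spectrum is approximately a weighted combination of a few endmember spectra, with the weights (abundances) summing to one. The sketch below uses a plain least-squares solve on synthetic data as a stand-in; the paper itself performs blind unmixing with a convolutional autoencoder, and the matrix `E` and abundances `a_true` here are purely illustrative:

```python
import numpy as np

# Linear mixing model: pixel spectrum y ~= E @ a, where columns of E are
# endmember spectra and a holds the (nonnegative, sum-to-one) abundances.
bands = 50
rng = np.random.default_rng(1)
E = np.abs(rng.standard_normal((bands, 3)))   # 3 synthetic endmember spectra
a_true = np.array([0.6, 0.3, 0.1])            # ground-truth abundances
y = E @ a_true                                # noiseless mixed pixel

# Recover abundances by ordinary least squares (blind unmixing additionally
# has to estimate E itself, which is what the autoencoder learns).
a_est, *_ = np.linalg.lstsq(E, y, rcond=None)
assert np.allclose(a_est, a_true, atol=1e-8)
```

With noiseless data and a full-rank `E`, least squares recovers the abundances exactly; real unmixing must also handle noise and enforce the nonnegativity and sum-to-one constraints.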
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.