The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
- URL: http://arxiv.org/abs/2512.19693v1
- Date: Mon, 22 Dec 2025 18:59:57 GMT
- Title: The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
- Authors: Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu
- Abstract summary: Semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders retain high-frequency information that conveys fine-grained detail. We propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator.
- Score: 82.53463660564933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a highly inspiring and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, just like the prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.
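The spectral analysis the abstract describes (comparing how much low- vs. high-frequency energy an encoder's features carry) can be illustrated with a radially averaged power spectrum. This is a minimal sketch, not the paper's actual analysis pipeline; the toy "semantic-like" and "pixel-like" maps below are synthetic stand-ins for real encoder feature maps:

```python
import numpy as np

def radial_power_spectrum(feat):
    """Radially averaged power spectrum of a 2-D feature map.

    feat: (H, W) array, e.g. one channel of an encoder's feature map.
    Returns a 1-D array of mean power per integer radius (low -> high frequency).
    """
    f = np.fft.fftshift(np.fft.fft2(feat))
    power = np.abs(f) ** 2
    h, w = feat.shape
    y, x = np.indices((h, w))
    r = np.sqrt((y - h // 2) ** 2 + (x - w // 2) ** 2).astype(int)
    return np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())

# Toy illustration: a smooth low-frequency map vs. one with added high-frequency noise.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:32, 0:32]
smooth = np.sin(2 * np.pi * yy / 32)                   # "semantic-like": energy at low radii
noisy = smooth + 0.5 * rng.standard_normal((32, 32))   # "pixel-like": extra high-frequency energy

s_spec = radial_power_spectrum(smooth)
n_spec = radial_power_spectrum(noisy)
# The high-frequency tail carries more energy for the noisy map.
assert n_spec[8:].sum() > s_spec[8:].sum()
```

Under the Prism Hypothesis reading, a semantic encoder's spectrum would resemble `s_spec` (energy concentrated at low radii), while a pixel encoder's would retain a tail like `n_spec`.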
Related papers
- Phi-SegNet: Phase-Integrated Supervision for Medical Image Segmentation [1.76179873429447]
We propose Phi-SegNet, a CNN-based architecture that incorporates phase-aware information at both architectural and optimization levels. Phi-SegNet consistently achieved state-of-the-art performance on five public datasets spanning X-ray, US, histopathology, MRI, and colonoscopy.
arXiv Detail & Related papers (2026-01-22T16:00:41Z) - VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction [83.50898344094153]
VQRAE produces continuous semantic features for image understanding and discrete tokens for visual generation within a unified tokenizer. This design incurs negligible semantic information loss from discrete tokenization, maintaining the ability of multimodal understanding. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction.
arXiv Detail & Related papers (2025-11-28T17:26:34Z) - Hyperspectral Adapter for Semantic Segmentation with Vision Foundation Models [18.24287471339871]
Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. Variability in channel dimensionality and captured wavelengths among spectral cameras impedes the development of AI-driven methodologies. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features, and achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods.
arXiv Detail & Related papers (2025-09-24T13:32:07Z) - Conceptualizing Multi-scale Wavelet Attention and Ray-based Encoding for Human-Object Interaction Detection [15.125734989910429]
We propose a wavelet attention-like backbone and a ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from convolutional filters. Our decoder aligns query embeddings with emphasized regions of interest for accurate predictions.
arXiv Detail & Related papers (2025-07-15T04:44:54Z) - CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis [69.02751635551724]
Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding. Variability in channel dimensionality and captured wavelengths among spectral cameras impedes the development of AI-driven methodologies. We introduce CARL, a model for Camera-Agnostic Representation Learning across RGB, multispectral, and hyperspectral imaging modalities.
arXiv Detail & Related papers (2025-04-27T13:06:40Z) - Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on the Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z) - Convolutional Autoencoder for Blind Hyperspectral Image Unmixing [0.0]
Spectral unmixing is a technique to decompose a mixed pixel into two fundamental representatives: endmembers and abundances.
In this paper, a novel architecture is proposed to perform blind unmixing on hyperspectral images.
arXiv Detail & Related papers (2020-11-18T17:41:31Z)
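The endmembers-and-abundances decomposition in the last entry rests on the linear mixing model: each pixel spectrum is approximately a weighted combination of a few endmember spectra, with the weights (abundances) summing to one. The sketch below uses a plain least-squares solve on synthetic data as a stand-in; the paper itself performs blind unmixing with a convolutional autoencoder, and the matrix `E` and abundances `a_true` here are purely illustrative:

```python
import numpy as np

# Linear mixing model: pixel spectrum y ~= E @ a, where columns of E are
# endmember spectra and a holds the (nonnegative, sum-to-one) abundances.
bands = 50
rng = np.random.default_rng(1)
E = np.abs(rng.standard_normal((bands, 3)))   # 3 synthetic endmember spectra
a_true = np.array([0.6, 0.3, 0.1])            # ground-truth abundances
y = E @ a_true                                # noiseless mixed pixel

# Recover abundances by ordinary least squares (blind unmixing additionally
# has to estimate E itself, which is what the autoencoder learns).
a_est, *_ = np.linalg.lstsq(E, y, rcond=None)
assert np.allclose(a_est, a_true, atol=1e-8)
```

With noiseless data and a full-rank `E`, least squares recovers the abundances exactly; real unmixing must also handle noise and enforce the nonnegativity and sum-to-one constraints.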
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.