Perception-Oriented Latent Coding for High-Performance Compressed Domain Semantic Inference
- URL: http://arxiv.org/abs/2507.01608v1
- Date: Wed, 02 Jul 2025 11:21:38 GMT
- Title: Perception-Oriented Latent Coding for High-Performance Compressed Domain Semantic Inference
- Authors: Xu Zhang, Ming Lu, Yan Chen, Zhan Ma,
- Abstract summary: Perception-Oriented Latent Coding (POLC) is an approach that enriches the semantic content of latent features for high-performance semantic inference.<n>POLC requires only a plug-and-play adapter for fine-tuning, significantly reducing the parameter count compared to previous MSE-oriented methods.
- Score: 30.78149130760627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, compressed domain semantic inference has primarily relied on learned image coding models optimized for mean squared error (MSE). However, MSE-oriented optimization tends to yield latent spaces with limited semantic richness, which hinders effective semantic inference in downstream tasks. Moreover, achieving high performance with these models often requires fine-tuning the entire vision model, which is computationally intensive, especially for large models. To address these problems, we introduce Perception-Oriented Latent Coding (POLC), an approach that enriches the semantic content of latent features for high-performance compressed domain semantic inference. With the semantically rich latent space, POLC requires only a plug-and-play adapter for fine-tuning, significantly reducing the parameter count compared to previous MSE-oriented methods. Experimental results demonstrate that POLC achieves rate-perception performance comparable to state-of-the-art generative image coding methods while markedly enhancing performance in vision tasks, with minimal fine-tuning overhead. Code is available at https://github.com/NJUVISION/POLC.
Related papers
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference.<n>It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps.<n>Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation [18.670626228472877]
DIFFT redefines Feature Transformation as a reward-guided generative task.<n>It produces structured, discrete features, preserving intra-feature dependencies while allowing parallel inter-feature generation.<n>It consistently outperforms state-of-the-art baselines in predictive accuracy and robustness, with significantly lower training and inference times.
arXiv Detail & Related papers (2025-05-21T06:18:42Z) - Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder.<n>Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder.<n> Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z) - IPSeg: Image Posterior Mitigates Semantic Drift in Class-Incremental Segmentation [77.06177202334398]
We identify two critical challenges in CISS that contribute to semantic drift and degrade performance.<n>First, we highlight the issue of separate optimization, where different parts of the model are optimized in distinct incremental stages.<n>Second, we identify noisy semantics arising from inappropriate pseudo-labeling, which results in sub-optimal results.
arXiv Detail & Related papers (2025-02-07T12:19:37Z) - ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships.<n>Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands.<n>We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z) - Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis [9.11767497956649]
This paper proposes leveraging the language comprehension capabilities of large vision-language models to guide the optimization of the initial noisy latent.
We introduce the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency.
Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models.
arXiv Detail & Related papers (2024-11-25T15:40:47Z) - Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.<n>Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches.<n>We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - Efficient Semantic Image Synthesis via Class-Adaptive Normalization [116.63715955932174]
Class-adaptive normalization (CLADE) is a lightweight but equally-effective variant that is only adaptive to semantic class.
We introduce intra-class positional map encoding calculated from semantic layouts to modulate the normalization parameters of CLADE.
The proposed CLADE can be generalized to different SPADE-based methods while achieving comparable generation quality compared to SPADE.
arXiv Detail & Related papers (2020-12-08T18:59:32Z) - Prior Guided Feature Enrichment Network for Few-Shot Segmentation [64.91560451900125]
State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results.
Few-shot segmentation is proposed to tackle this problem by learning a model that quickly adapts to new classes with a few labeled support samples.
Theses frameworks still face the challenge of generalization ability reduction on unseen classes due to inappropriate use of high-level semantic information.
arXiv Detail & Related papers (2020-08-04T10:41:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.