Related papers: SGC-VQGAN: Towards Complex Scene Representation via Semantic Guided Clustering Codebook

SGC-VQGAN: Towards Complex Scene Representation via Semantic Guided Clustering Codebook

URL: http://arxiv.org/abs/2409.06105v1
Date: Mon, 9 Sep 2024 23:12:43 GMT
Title: SGC-VQGAN: Towards Complex Scene Representation via Semantic Guided Clustering Codebook
Authors: Chenjing Ding, Chiyu Wang, Boshi Liu, Xi Guo, Weixuan Tang, Wei Wu,
Abstract summary: We introduce SGC-VQGAN through Semantic Online Clustering method to enhance token semantics through Consistent Semantic Learning. Our approach constructs a temporospatially consistent semantic codebook, addressing issues of codebook collapse and imbalanced token semantics.
Score: 9.993066868670283
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vector quantization (VQ) is a method for deterministically learning features through discrete codebook representations. Recent works have utilized visual tokenizers to discretize visual regions for self-supervised representation learning. However, a notable limitation of these tokenizers is lack of semantics, as they are derived solely from the pretext task of reconstructing raw image pixels in an auto-encoder paradigm. Additionally, issues like imbalanced codebook distribution and codebook collapse can adversely impact performance due to inefficient codebook utilization. To address these challenges, We introduce SGC-VQGAN through Semantic Online Clustering method to enhance token semantics through Consistent Semantic Learning. Utilizing inference results from segmentation model , our approach constructs a temporospatially consistent semantic codebook, addressing issues of codebook collapse and imbalanced token semantics. Our proposed Pyramid Feature Learning pipeline integrates multi-level features to capture both image details and semantics simultaneously. As a result, SGC-VQGAN achieves SOTA performance in both reconstruction quality and various downstream tasks. Its simplicity, requiring no additional parameter learning, enables its direct application in downstream tasks, presenting significant potential.

Related papers

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction [83.50898344094153]
VQRAE produces Continuous semantic features for image understanding and tokens for visual generation within a unified tokenizer.<n>Design enables negligible semantic information for maintaining the ability of multimodal understanding, discrete tokens.<n>VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction.
arXiv Detail & Related papers (2025-11-28T17:26:34Z)
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation [80.90309237362526]
TokLIP is a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens.<n>TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics.
arXiv Detail & Related papers (2025-05-08T17:12:19Z)
Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z)
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation [73.98487014058286]
SemHiTok is a unified image tokenizer via Semantic-Guided Hierarchical codebook. We show that SemHiTok achieves excellent rFID score at 256X256resolution compared to other unified tokenizers.
arXiv Detail & Related papers (2025-03-09T20:42:34Z)
BRIDLE: Generalized Self-supervised Learning with Quantization [15.121857164574704]
Self-supervised learning has been a powerful approach for learning meaningful representations from unlabeled data across various domains. Inspired by BERT's success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio. We introduce BRIDLE, a self-supervised pretraining framework that incorporates residual quantization into the bidirectional training process.
arXiv Detail & Related papers (2025-02-04T08:54:06Z)
Leveraging Joint Predictive Embedding and Bayesian Inference in Graph Self Supervised Learning [0.0]
Graph representation learning has emerged as a cornerstone for tasks like node classification and link prediction. Current self-supervised learning (SSL) methods face challenges such as computational inefficiency, reliance on contrastive objectives, and representation collapse. We propose a novel joint embedding predictive framework for graph SSL that eliminates contrastive objectives and negative sampling while preserving semantic and structural information.
arXiv Detail & Related papers (2025-02-02T07:42:45Z)
Improving vision-language alignment with graph spiking hybrid Networks [10.88584928028832]
This paper proposes a comprehensive visual semantic representation module, necessitating the utilization of panoptic segmentation to generate fine-grained semantic features. We propose a novel Graph Spiking Hybrid Network (GSHN) that integrates the complementary advantages of Spiking Neural Networks (SNNs) and Graph Attention Networks (GATs) to encode visual semantic information.
arXiv Detail & Related papers (2025-01-31T11:55:17Z)
Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation [8.659766913542938]
We study a united perceptual and semantic token compression for all granular understanding. We propose Feature Pyramid Tokenization (PAT) to cluster and represent multi-resolution feature by learnable codebooks. Our experiments show that PAT enhances the semantic intuition of VLM feature pyramid.
arXiv Detail & Related papers (2024-12-18T18:43:21Z)
Factorized Visual Tokenization and Generation [37.56136469262736]
We introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-11-25T18:59:53Z)
Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks. We present one of the first applications of SAEs to dense text embeddings from large language models. We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
arXiv Detail & Related papers (2024-08-01T15:46:22Z)
Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting [49.87694319431288]
Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources. We propose a Comprehensive Generative (CGR) framework that restores appearance and semantic knowledge by synthesizing image-mask pairs. Experiments on incremental tasks (cardiac, fundus and prostate segmentation) show its clear advantage for alleviating concurrent appearance and semantic forgetting.
arXiv Detail & Related papers (2024-06-28T10:05:58Z)
Match me if you can: Semi-Supervised Semantic Correspondence Learning with Unpaired Images [76.47980643420375]
This paper builds on the hypothesis that there is an inherent data-hungry matter in learning semantic correspondences. We demonstrate a simple machine annotator reliably enriches paired key points via machine supervision. Our models surpass current state-of-the-art models on semantic correspondence learning benchmarks like SPair-71k, PF-PASCAL, and PF-WILLOW.
arXiv Detail & Related papers (2023-11-30T13:22:15Z)
MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks. We propose a single-stage and standalone method, MOCA, which unifies both desired properties. We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
SC-VAE: Sparse Coding-based Variational Autoencoder with Learned ISTA [0.6770292596301478]
We introduce a new VAE variant, termed sparse coding-based VAE with learned ISTA (SC-VAE), which integrates sparse coding within variational autoencoder framework. Experiments on two image datasets demonstrate that our model achieves improved image reconstruction results compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-03-29T13:18:33Z)
Learning Context-aware Classifier for Semantic Segmentation [88.88198210948426]
In this paper, contextual hints are exploited via learning a context-aware classifier. Our method is model-agnostic and can be easily applied to generic segmentation models. With only negligible additional parameters and +2% inference time, decent performance gain has been achieved on both small and large models.
arXiv Detail & Related papers (2023-03-21T07:00:35Z)
Vector Quantized Wasserstein Auto-Encoder [57.29764749855623]
We study learning deep discrete representations from the generative viewpoint. We endow discrete distributions over sequences of codewords and learn a deterministic decoder that transports the distribution over the sequences of codewords to the data distribution. We develop further theories to connect it with the clustering viewpoint of WS distance, allowing us to have a better and more controllable clustering solution.
arXiv Detail & Related papers (2023-02-12T13:51:36Z)
Leveraging Auxiliary Tasks with Affinity Learning for Weakly Supervised Semantic Segmentation [88.49669148290306]
We propose a novel weakly supervised multi-task framework called AuxSegNet to leverage saliency detection and multi-label image classification as auxiliary tasks. Inspired by their similar structured semantics, we also propose to learn a cross-task global pixel-level affinity map from the saliency and segmentation representations. The learned cross-task affinity can be used to refine saliency predictions and propagate CAM maps to provide improved pseudo labels for both tasks.
arXiv Detail & Related papers (2021-07-25T11:39:58Z)
Robust Training of Vector Quantized Bottleneck Models [21.540133031071438]
We demonstrate methods for reliable and efficient training of discrete representation using Vector-Quantized Variational Auto-Encoder models (VQ-VAEs) For unsupervised representation learning, they became viable alternatives to continuous latent variable models such as the Variational Auto-Encoder (VAE)
arXiv Detail & Related papers (2020-05-18T08:23:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.