From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning
- URL: http://arxiv.org/abs/2602.03390v1
- Date: Tue, 03 Feb 2026 11:11:58 GMT
- Title: From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning
- Authors: Hyun Seok Seong, WonJun Moon, Jae-Pil Heo
- Abstract summary: We introduce a virtuous cycle where the encoder and decoder mutually refine one another. By bridging the representational gap between the encoder and decoder, SRL achieves state-of-the-art results on video object-centric learning benchmarks.
- Score: 45.1920794546889
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Unsupervised object-centric learning models, particularly slot-based architectures, have shown great promise in decomposing complex scenes. However, their reliance on reconstruction-based training creates a fundamental conflict between the sharp, high-frequency attention maps of the encoder and the spatially consistent but blurry reconstruction maps of the decoder. We identify that this discrepancy gives rise to a vicious cycle: the noisy feature map from the encoder forces the decoder to average over possibilities and produce even blurrier outputs, while the gradient computed from blurry reconstruction maps lacks the high-frequency details necessary to supervise encoder features. To break this cycle, we introduce Synergistic Representation Learning (SRL), which establishes a virtuous cycle where the encoder and decoder mutually refine one another. SRL leverages the encoder's sharpness to deblur the semantic boundary within the decoder output, while exploiting the decoder's spatial consistency to denoise the encoder's features. This mutual refinement process is stabilized by a warm-up phase with a slot regularization objective that initially allocates distinct entities per slot. By bridging the representational gap between the encoder and decoder, SRL achieves state-of-the-art results on video object-centric learning benchmarks. Code is available at https://github.com/hynnsk/SRL.
Related papers
- Improving Reconstruction of Representation Autoencoder [52.817427902597416]
We propose LV-RAE, a representation autoencoder that augments semantic features with missing low-level information. Our experiments demonstrate that LV-RAE significantly improves reconstruction fidelity while preserving the semantic abstraction.
arXiv Detail & Related papers (2026-02-09T13:12:35Z)
- VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction [83.50898344094153]
VQRAE produces continuous semantic features for image understanding and tokens for visual generation within a unified tokenizer. Design enables negligible semantic information for maintaining the ability of multimodal understanding, discrete tokens. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction.
arXiv Detail & Related papers (2025-11-28T17:26:34Z)
- Hybrid Autoencoders for Tabular Data: Leveraging Model-Based Augmentation in Low-Label Settings [13.591018807414484]
We propose a hybrid autoencoder that combines a neural encoder with an oblivious soft decision tree (OSDT) encoder, each guided by its own gating network. Our method achieves consistent gains in low-label classification and regression across diverse datasets, outperforming deep and tree-based supervised baselines.
arXiv Detail & Related papers (2025-11-10T11:08:39Z)
- SIEDD: Shared-Implicit Encoder with Discrete Decoders [36.705337163276255]
Implicit Neural Representations (INRs) offer exceptional fidelity for video compression by learning per-video optimized functions. Existing attempts to accelerate INR encoding often sacrifice reconstruction quality or crucial coordinate-level control. We introduce SIEDD, a novel architecture that fundamentally accelerates INR encoding without these compromises.
arXiv Detail & Related papers (2025-06-29T19:39:43Z)
- A Revisit to the Decoder for Camouflaged Object Detection [34.886607866949845]
Camouflaged object detection (COD) aims to generate a fine-grained segmentation map of camouflaged objects hidden in their background. We propose a novel architecture that augments the prevalent decoding strategy in COD with Enrich Decoder and Retouch Decoder.
arXiv Detail & Related papers (2025-03-18T08:51:50Z)
- Epsilon-VAE: Denoising as Visual Decoding [61.29255979767292]
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image. By adopting iterative reconstruction through diffusion, our autoencoder, namely Epsilon-VAE, achieves high reconstruction quality.
arXiv Detail & Related papers (2024-10-05T08:27:53Z)
- More complex encoder is not all you need [0.882348769487259]
We introduce neU-Net (i.e., not complex encoder U-Net), which incorporates a novel Sub-pixel Convolution for upsampling to construct a powerful decoder.
Our model design achieves excellent results, surpassing other state-of-the-art methods on both the Synapse and ACDC datasets.
arXiv Detail & Related papers (2023-09-20T08:34:38Z)
- GAN-Based Multi-View Video Coding with Spatio-Temporal EPI Reconstruction [19.919826392704472]
We propose a novel multi-view video coding method that leverages the image generation capabilities of a Generative Adversarial Network (GAN).
At the encoder, we construct a spatio-temporal Epipolar Plane Image (EPI) and further utilize a convolutional network to extract the latent code of a GAN as Side Information (SI).
At the decoder side, we combine SI and adjacent viewpoints to reconstruct intermediate views using the GAN generator.
arXiv Detail & Related papers (2022-05-07T08:52:54Z)
- Reducing Redundancy in the Bottleneck Representation of the Autoencoders [98.78384185493624]
Autoencoders are a type of unsupervised neural network that can be used to solve various tasks.
We propose a scheme to explicitly penalize feature redundancies in the bottleneck representation.
We tested our approach across different tasks: dimensionality reduction using three different datasets, image compression using the MNIST dataset, and image denoising using Fashion-MNIST.
arXiv Detail & Related papers (2022-02-09T18:48:02Z)
- Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in terms of segmentation quality and efficiency per number of layers.
We propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content.
In order to further improve the architecture we introduce a weight function which aims to re-balance classes to increase the attention of the networks to under-represented objects.
arXiv Detail & Related papers (2020-07-19T18:44:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.