MUFASA: A Multi-Layer Framework for Slot Attention
- URL: http://arxiv.org/abs/2602.07544v1
- Date: Sat, 07 Feb 2026 13:44:56 GMT
- Title: MUFASA: A Multi-Layer Framework for Slot Attention
- Authors: Sebastian Bock, Leonie Schüßler, Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth
- Abstract summary: We introduce MUFASA, a plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation.
- Score: 16.325300304610035
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.
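The abstract describes two mechanisms: running slot attention over features from several ViT layers rather than only the last one, and fusing the per-layer slots into one representation. The paper's actual fusion strategy is not specified in this listing, so the following is a minimal, illustrative numpy sketch: a simplified slot-attention update (no learned projections or GRU) applied per layer, with the hypothetical `mufasa_fuse` helper averaging the per-layer slots as a stand-in fusion step.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, slots, n_iters=3):
    """Simplified slot attention: features (P, d) patches, slots (S, d)."""
    d = features.shape[1]
    for _ in range(n_iters):
        # Softmax over slots per patch, as in slot attention.
        attn = softmax(features @ slots.T / np.sqrt(d), axis=1)  # (P, S)
        # Normalize per slot, then take the weighted mean of patch features.
        w = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = w.T @ features  # (S, d)
    return slots

def mufasa_fuse(layer_features, init_slots, n_iters=3):
    """Run slot attention on each ViT layer's features, then fuse.

    Averaging is a placeholder for the paper's (unspecified) fusion strategy.
    """
    per_layer = [slot_attention(f, init_slots.copy(), n_iters)
                 for f in layer_features]
    return np.mean(per_layer, axis=0)
```

A real implementation would use learned key/query/value projections, layer normalization, and a GRU update; this sketch only conveys the multi-layer control flow.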
Related papers
- Wasserstein-Aligned Hyperbolic Multi-View Clustering [58.29261653100388]
This paper proposes a novel Wasserstein-Aligned Hyperbolic (WAH) framework for multi-view clustering. Our method exploits a view-specific hyperbolic encoder for each view to embed features into the Lorentz manifold for hierarchical semantic modeling.
arXiv Detail & Related papers (2025-12-10T07:56:19Z) - FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation [55.01077993490845]
Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling. We introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework.
arXiv Detail & Related papers (2025-06-20T07:46:40Z) - MetaSlot: Break Through the Fixed Number of Slots in Object-Centric Learning [17.083645139372912]
We introduce MetaSlot, a plug-and-play Slot Attention variant that adapts to variable object counts. We show that MetaSlot achieves significant performance gains and markedly interpretable slot representations, compared with existing Slot Attention variants.
arXiv Detail & Related papers (2025-05-27T06:23:03Z) - Masked Multi-Query Slot Attention for Unsupervised Object Discovery [7.613552182035413]
In this work, we consider an object-centric approach in which DINO ViT features are reconstructed via a set of queried representations called slots.
We propose a masking scheme on input features that disregards the background regions, inducing our model to focus more on salient objects during the reconstruction phase.
Our experimental results and ablations on the PASCAL-VOC 2012 dataset show the importance of each component and highlight how their combination consistently improves object localization.
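The masking scheme above restricts the reconstruction objective to salient (non-background) regions. How the mask is obtained is not detailed in this summary, so the sketch below only shows the loss side: a hypothetical `masked_reconstruction_loss` that zeroes out background patches given a precomputed foreground mask, averaging the squared error over foreground patches only.

```python
import numpy as np

def masked_reconstruction_loss(features, recon, fg_mask):
    """MSE over foreground patches only.

    features, recon: (P, d) input and reconstructed patch features.
    fg_mask: (P,) boolean, True for salient/foreground patches.
    """
    diff = (features - recon) ** 2
    # Zero out background patches so the loss focuses on salient objects.
    masked = diff * fg_mask[:, None]
    return masked.sum() / (fg_mask.sum() * features.shape[1] + 1e-8)
```

With an all-True mask this reduces to the ordinary mean squared error, so the masked loss is a strict generalization of the unmasked one.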
arXiv Detail & Related papers (2024-04-30T15:51:05Z) - Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged, targeting high-precision object segmentation from high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
Inspired by this, we model DIS as a multi-view object perception problem and propose a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z) - Multi-Object Tracking by Hierarchical Visual Representations [40.521291165765696]
We propose a new visual hierarchical representation paradigm for multi-object tracking.
It is more effective to discriminate between objects by attending to objects' compositional visual regions and contrasting with the background contextual information.
arXiv Detail & Related papers (2024-02-24T20:10:44Z) - Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and a multimodal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z) - Self-supervised Object-Centric Learning for Videos [39.02148880719576]
We propose the first fully unsupervised method for segmenting multiple objects in real-world sequences.
Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames.
Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.
arXiv Detail & Related papers (2023-10-10T18:03:41Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
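Using PCA to localize objects typically means projecting patch features onto the first principal component and thresholding, since foreground and background tend to separate along the dominant direction of variance. The paper's exact procedure is not given here; the sketch below is one common heuristic, with the sign/threshold convention and the minority-side flip being assumptions, not the paper's method.

```python
import numpy as np

def pca_foreground_mask(features):
    """Heuristic object localization from patch features via PCA.

    features: (P, d) patch features. Returns a (P,) boolean mask.
    """
    x = features - features.mean(axis=0)
    # First right-singular vector = first principal component direction.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[0]
    mask = proj > 0
    # Heuristic: objects usually cover fewer patches than background,
    # so flip the mask if it covers the majority (an assumption).
    if mask.mean() > 0.5:
        mask = ~mask
    return mask
```

In practice the projection is often smoothed or thresholded adaptively, and the sign of the component is resolved with additional cues (e.g. attention maps) rather than patch counts.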
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning [28.368429312400885]
Single-view images carry less information about how to disentangle a given scene than videos or multi-view images do.
We introduce a novel OCL framework for single-view images, SLot Attention via SHepherding (SLASH), which consists of two simple-yet-effective modules on top of Slot Attention.
Our proposed method enables consistent learning of object-centric representation and achieves strong performance across four datasets.
arXiv Detail & Related papers (2023-03-31T07:07:29Z) - De-coupling and De-positioning Dense Self-supervised Learning [65.56679416475943]
Dense Self-Supervised Learning (SSL) methods address the limitations of using image-level feature representations when handling images with multiple objects.
We show that they suffer from coupling and positional bias, which arise from the receptive field increasing with layer depth and zero-padding.
We demonstrate the benefits of our method on COCO and on a new challenging benchmark, OpenImage-MINI, for object classification, semantic segmentation, and object detection.
arXiv Detail & Related papers (2023-03-29T18:07:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.