Related papers: Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations

URL: http://arxiv.org/abs/2403.07887v1
Date: Fri, 2 Feb 2024 12:37:23 GMT
Title: Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations
Authors: Bhishma Dedhia, Niraj K. Jha
Abstract summary: We present the Neural Slot Interpreter (NSI) that learns to ground and generate object semantics via slot representations. At the core of NSI is an XML-like programming language that uses simple syntax rules to organize the object semantics of a scene into object-centric program primitives.
Score: 4.807052027638089
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Object-centric methods have seen significant progress in unsupervised decomposition of raw perception into rich object-like abstractions. However, limited ability to ground object semantics of the real world into the learned abstractions has hindered their adoption in downstream understanding applications. We present the Neural Slot Interpreter (NSI) that learns to ground and generate object semantics via slot representations. At the core of NSI is an XML-like programming language that uses simple syntax rules to organize the object semantics of a scene into object-centric program primitives. Then, an alignment model learns to ground program primitives into slots through a bi-level contrastive learning objective over a shared embedding space. Finally, we formulate the NSI program generator model to use the dense associations inferred from the alignment model to generate object-centric programs from slots. Experiments on bi-modal retrieval tasks demonstrate the efficacy of the learned alignments, surpassing set-matching-based predictors by a significant margin. Moreover, learning the program generator from grounded associations enhances the predictive power of slots. NSI generated programs demonstrate improved performance of object-centric learners on property prediction and object detection, and scale with real-world scene complexity.
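
As a rough illustration of what organizing a scene's object semantics into an XML-like, object-centric program could look like, here is a minimal Python sketch. The tag names (`scene`, `object`, `shape`, `color`), the `scene_to_program` helper, and the sample annotations are all assumptions made for this example; the abstract does not specify NSI's actual syntax, so this should not be read as the authors' language.

```python
import xml.etree.ElementTree as ET


def scene_to_program(objects):
    """Serialize per-object annotation dicts into an XML-like string.

    Hypothetical helper: the tag layout is an assumption, not NSI's syntax.
    """
    scene = ET.Element("scene")
    for obj in objects:
        # One <object> element per annotated object: the "primitive" in this sketch.
        node = ET.SubElement(scene, "object")
        for key, value in obj.items():
            # Each property becomes a child tag holding its value as text.
            ET.SubElement(node, key).text = str(value)
    return ET.tostring(scene, encoding="unicode")


# Made-up annotations for a two-object scene.
annotations = [
    {"shape": "cube", "color": "red"},
    {"shape": "sphere", "color": "blue"},
]
program = scene_to_program(annotations)
print(program)
```

Each `<object>` element here plays the role of one program primitive; in the setting the abstract describes, an alignment model would then learn to associate such primitives with slots.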

Related papers

ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting [54.92763171355442]
ObjectGS is an object-aware framework that unifies 3D scene reconstruction with semantic understanding. We show through experiments that ObjectGS outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks.
arXiv Detail & Related papers (2025-07-21T10:06:23Z)
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of these models in both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z)
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models [10.792834356227118]
Vision-Language Models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning. Inspired by the dual-pathway (ventral-dorsal) model of human vision, we investigate why VLMs fail spatial tasks despite strong object recognition capabilities.
arXiv Detail & Related papers (2025-03-21T17:51:14Z)
Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [52.478956204238315]
We study the spatial reasoning challenge from the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when confident.
arXiv Detail & Related papers (2025-03-03T17:57:03Z)
What Makes a Maze Look Like a Maze? [92.80800000328277]
We introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas--dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models.
arXiv Detail & Related papers (2024-09-12T16:41:47Z)
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding. Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
EAGLE: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation [5.476136494434766]
We introduce EiCue, a technique providing semantic and structural cues through an eigenbasis derived from the semantic similarity matrix. We guide our model to learn object-level representations with intra- and inter-image object-feature consistency. Experiments on COCO-Stuff, Cityscapes, and Potsdam-3 datasets demonstrate state-of-the-art USS results.
arXiv Detail & Related papers (2024-03-03T11:24:16Z)
ZoomNeXt: A Unified Collaborative Pyramid Network for Camouflaged Object Detection [70.11264880907652]
Recent camouflaged object detection (COD) attempts to segment objects visually blended into their surroundings, which is extremely complex and difficult in real-world scenarios. We propose an effective unified collaborative pyramid network that mimics human behavior when observing vague images and videos, i.e., zooming in and out. Our framework consistently outperforms existing state-of-the-art methods in image and video COD benchmarks.
arXiv Detail & Related papers (2023-10-31T06:11:23Z)
Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos [63.94040814459116]
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
arXiv Detail & Related papers (2023-08-19T09:12:13Z)
Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images. We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
Cycle Consistency Driven Object Discovery [75.60399804639403]
We introduce a method that explicitly optimizes the constraint that each object in a scene should be associated with a distinct slot. By integrating these consistency objectives into various existing slot-based object-centric methods, we showcase substantial improvements in object-discovery performance. Our results suggest that the proposed approach not only improves object discovery, but also provides richer features for downstream tasks.
arXiv Detail & Related papers (2023-06-03T21:49:06Z)
Spotlight Attention: Robust Object-Centric Learning With a Spatial Locality Prior [88.9319150230121]
Object-centric vision aims to construct an explicit representation of the objects in a scene. We incorporate a spatial-locality prior into state-of-the-art object-centric vision models. We obtain significant improvements in segmenting objects in both synthetic and real-world datasets.
arXiv Detail & Related papers (2023-05-31T04:35:50Z)
Hyperbolic Contrastive Learning for Visual Representations beyond Objects [30.618032825306187]
We focus on learning representations for objects and scenes that preserve the structure among them. Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure.
arXiv Detail & Related papers (2022-12-01T16:58:57Z)
Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data. We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder. We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets. We also show that it achieves competitive unsupervised object discovery performance to a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z)
Object Pursuit: Building a Space of Objects via Discriminative Weight Generation [23.85039747700698]
We propose a framework to continuously learn object-centric representations for visual learning and understanding. We leverage interactions to sample diverse variations of an object and the corresponding training signals while learning the object-centric representations. We perform an extensive study of the key features of the proposed framework and analyze the characteristics of the learned representations.
arXiv Detail & Related papers (2021-12-15T08:25:30Z)
SORNet: Spatial Object-Centric Representations for Sequential Manipulation [39.88239245446054]
Sequential manipulation tasks require a robot to perceive the state of an environment and plan a sequence of actions leading to a desired goal state. We propose SORNet, which extracts object-centric representations from RGB images conditioned on canonical views of the objects of interest.
arXiv Detail & Related papers (2021-09-08T19:36:29Z)
INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter. We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping. We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
Constellation: Learning relational abstractions over objects for compositional imagination [64.99658940906917]
We introduce Constellation, a network that learns relational abstractions of static visual scenes. This work is a first step in the explicit representation of visual relationships and using them for complex cognitive procedures.
arXiv Detail & Related papers (2021-07-23T11:59:40Z)
Language-Mediated, Object-Centric Representation Learning [21.667413971464455]
We present Language-mediated, Object-centric Representation Learning (LORL). LORL is a paradigm for learning disentangled, object-centric scene representations from vision and language. It can be integrated with various unsupervised segmentation algorithms that are language-agnostic.
arXiv Detail & Related papers (2020-12-31T18:36:07Z)
Object-Centric Learning with Slot Attention [43.684193749891506]
We present the Slot Attention module, an architectural component that interfaces with perceptual representations. Slot Attention produces task-dependent abstract representations which we call slots. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions.
arXiv Detail & Related papers (2020-06-26T15:31:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.