CTRL-O: Language-Controllable Object-Centric Visual Representation Learning
- URL: http://arxiv.org/abs/2503.21747v1
- Date: Thu, 27 Mar 2025 17:53:50 GMT
- Title: CTRL-O: Language-Controllable Object-Centric Visual Representation Learning
- Authors: Aniket Didolkar, Andrii Zadaianchuk, Rabiul Awal, Maximilian Seitzer, Efstratios Gavves, Aishwarya Agrawal
- Abstract summary: Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files". Current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. We propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions.
- Score: 30.218743514199016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. Next, we apply these controllable slot representations on two downstream vision language tasks: text-to-image generation and visual question answering. The proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.
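The core mechanism, conditioning slots on language descriptions, can be illustrated with a minimal sketch of language-conditioned slot attention. The snippet below is not the authors' implementation: the module name, dimensions, and the choice to initialize slots directly from projected text-query embeddings are illustrative assumptions layered on top of standard Slot Attention (slots act as queries, visual features as keys/values, with the softmax taken over slots).

```python
import torch
import torch.nn as nn


class LanguageConditionedSlotAttention(nn.Module):
    """Minimal sketch: slots are initialized from language-query embeddings,
    so each returned slot is steered toward the object a description refers to."""

    def __init__(self, dim=256, n_iters=3):
        super().__init__()
        self.n_iters = n_iters
        self.scale = dim ** -0.5
        self.norm_feats = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.lang_proj = nn.Linear(dim, dim)  # assumed: map text-query embeddings into slot space

    def forward(self, feats, lang_queries):
        # feats: (B, N, D) visual features, e.g. patch tokens from a frozen ViT encoder
        # lang_queries: (B, K, D) one embedding per language description ("the red mug", ...)
        feats = self.norm_feats(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.lang_proj(lang_queries)  # language conditioning replaces random slot init
        for _ in range(self.n_iters):
            q = self.to_q(self.norm_slots(slots))
            logits = torch.einsum("bkd,bnd->bkn", q, k) * self.scale
            attn = logits.softmax(dim=1)                    # each patch competes over slots
            attn = attn / attn.sum(dim=-1, keepdim=True)    # then normalize per slot over patches
            updates = torch.einsum("bkn,bnd->bkd", attn, v)
            slots = self.gru(
                updates.reshape(-1, updates.size(-1)),
                slots.reshape(-1, slots.size(-1)),
            ).view_as(slots)
        return slots  # (B, K, D): one object-centric representation per requested object
```

The resulting per-description slots could then be passed to downstream heads (e.g. for text-to-image generation or visual question answering); the actual encoders, matching objective, and training losses used by CTRL-O are described in the paper and are not reproduced here.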
Related papers
- v-CLR: View-Consistent Learning for Open-World Instance Segmentation [24.32192108470939]
Vanilla visual networks are biased toward learning appearance information, e.g. texture, to recognize objects.
This implicit bias causes the model to fail in detecting novel objects with textures unseen in the open-world setting.
We propose view-Consistent LeaRning (v-CLR), which aims to make the model learn appearance-invariant representations for robust instance segmentation.
arXiv Detail & Related papers (2025-04-02T05:52:30Z) - Object-Aware DINO (Oh-A-Dino): Enhancing Self-Supervised Representations for Multi-Object Instance Retrieval [1.4272411349249627]
Self-supervised vision models like DINO have shown emergent object understanding. DINO representations excel at capturing global object attributes but struggle with object-level details like colour. We propose a method that combines global and local features by augmenting DINO representations with object-centric latent vectors.
arXiv Detail & Related papers (2025-03-12T21:57:41Z) - ObjectRelator: Enabling Cross-View Object Relation Understanding in Ego-Centric and Exo-Centric Videos [105.40690994956667]
The Ego-Exo Object Correspondence task aims to map objects across ego-centric and exo-centric views. We introduce ObjectRelator, a novel method designed to tackle this task.
arXiv Detail & Related papers (2024-11-28T12:01:03Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - Learning Global Object-Centric Representations via Disentangled Slot Attention [38.78205074748021]
This paper introduces a novel object-centric learning method that learns a set of global object-centric representations, empowering AI systems with human-like capabilities to identify objects across scenes and to generate diverse scenes containing specific objects.
Experimental results substantiate the efficacy of the proposed method, demonstrating remarkable proficiency in global object-centric representation learning, object identification, scene generation with specific objects and scene decomposition.
arXiv Detail & Related papers (2024-10-24T14:57:00Z) - In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z) - OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning [95.6696714640357]
We propose a new task, 'open-world video instance segmentation and captioning'. It requires detecting, segmenting, tracking, and describing never-before-seen objects with rich captions. We develop an object abstractor and an object-to-text abstractor.
arXiv Detail & Related papers (2024-04-04T17:59:58Z) - Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring [27.45225442048711]
We introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts.
We design a simple and lightweight down-sampling projector to overcome the input-token constraint of Large Language Models.
Experiments demonstrate that Griffon v2 can localize any objects of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting.
arXiv Detail & Related papers (2024-03-14T12:21:37Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views [9.556376932449187]
Multi-View and Multi-Object Network (MulMON) is a method for learning accurate, object-centric representations of multi-object scenes by leveraging multiple views.
We show that MulMON resolves spatial ambiguities better than single-view methods.
arXiv Detail & Related papers (2021-11-13T13:54:28Z) - Look-into-Object: Self-supervised Structure Modeling for Object Recognition [71.68524003173219]
We propose to "look into object" (i.e., explicitly yet intrinsically modeling the object structure) by incorporating self-supervision.
We show the recognition backbone can be substantially enhanced for more robust representation learning.
Our approach achieves large performance gains on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft).
arXiv Detail & Related papers (2020-03-31T12:22:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.