Language-Mediated, Object-Centric Representation Learning
- URL: http://arxiv.org/abs/2012.15814v1
- Date: Thu, 31 Dec 2020 18:36:07 GMT
- Title: Language-Mediated, Object-Centric Representation Learning
- Authors: Ruocheng Wang, Jiayuan Mao, Samuel J. Gershman, Jiajun Wu
- Abstract summary: We present Language-mediated, Object-centric Representation Learning (LORL).
LORL is a paradigm for learning disentangled, object-centric scene representations from vision and language.
It can be integrated with various unsupervised segmentation algorithms that are language-agnostic.
- Score: 21.667413971464455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Language-mediated, Object-centric Representation Learning (LORL),
a paradigm for learning disentangled, object-centric scene representations from
vision and language. LORL builds upon recent advances in unsupervised object
segmentation, notably MONet and Slot Attention. While these algorithms learn an
object-centric representation just by reconstructing the input image, LORL
enables them to further learn to associate the learned representations to
concepts, i.e., words for object categories, properties, and spatial
relationships, from language input. These object-centric concepts derived from
language facilitate the learning of object-centric representations. LORL can be
integrated with various unsupervised segmentation algorithms that are
language-agnostic. Experiments show that the integration of LORL consistently
improves the object segmentation performance of MONet and Slot Attention on two
datasets via the help of language. We also show that concepts learned by LORL,
in conjunction with segmentation algorithms such as MONet, aid downstream tasks
such as referring expression comprehension.
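The abstract builds on Slot Attention, in which a fixed set of slots iteratively competes for input features so that each slot comes to summarize one object. As a rough illustration only (not the authors' implementation; the learned projections, GRU update, and MLP of the real model are omitted), the core competitive-attention loop can be sketched as:

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Simplified Slot Attention forward pass.

    inputs: (n, d) array of per-patch image features.
    Returns a (num_slots, d) array; each slot is a soft summary
    of the features it won in the attention competition.
    """
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    slots = rng.normal(size=(num_slots, d))  # random slot initialization
    for _ in range(iters):
        # Softmax over the *slot* axis: slots compete for each input feature.
        attn = softmax(inputs @ slots.T / np.sqrt(d), axis=1)  # (n, num_slots)
        # Normalize per slot, then take a weighted mean of assigned inputs.
        attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = attn.T @ inputs  # (num_slots, d) updated slots
    return slots
```

In LORL's setting, slot-level representations like these would additionally be supervised by language (e.g., words for categories and properties), rather than by image reconstruction alone.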
Related papers
- Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations [4.807052027638089]
We present the Neural Slot Interpreter (NSI) that learns to ground and generate object semantics via slot representations.
Object semantics are organized into object-centric program primitives via an XML-like programming language with simple syntax rules.
arXiv Detail & Related papers (2024-02-02T12:37:23Z) - Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments the image region described by a natural-language expression.
We develop an algorithm that shifts from a localization-centric design to a segmentation-centric one.
Compared to its counterparts, our method is more versatile while remaining effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z) - Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z) - Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z) - Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z) - Identifying concept libraries from language about object structure [56.83719358616503]
We leverage natural language descriptions for a diverse set of 2K procedurally generated objects to identify the parts people use.
We formalize our problem as search over a space of program libraries that contain different part concepts.
By combining naturalistic language at scale with structured program representations, we discover a fundamental information-theoretic tradeoff governing the part concepts people name.
arXiv Detail & Related papers (2022-05-11T17:49:25Z) - Self-Supervised Learning of Object Parts for Semantic Segmentation [7.99536002595393]
We argue that self-supervised learning of object parts is a solution to this issue.
Our method surpasses the state-of-the-art on three semantic segmentation benchmarks by 3%-17%.
arXiv Detail & Related papers (2022-04-27T17:55:17Z) - Integrating Language Guidance into Vision-based Deep Metric Learning [78.18860829585182]
We propose to learn metric spaces that encode semantic similarities in the embedding space.
These spaces should be transferable to classes beyond those seen during training.
Relying on class labels alone causes learned embedding spaces to encode incomplete semantic context and misrepresent the semantic relations between classes.
arXiv Detail & Related papers (2022-03-16T11:06:50Z) - Object Pursuit: Building a Space of Objects via Discriminative Weight Generation [23.85039747700698]
We propose a framework to continuously learn object-centric representations for visual learning and understanding.
We leverage interactions to sample diverse variations of an object and the corresponding training signals while learning the object-centric representations.
We perform an extensive study of the key features of the proposed framework and analyze the characteristics of the learned representations.
arXiv Detail & Related papers (2021-12-15T08:25:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.