Hyperbolic Contrastive Learning for Visual Representations beyond Objects
- URL: http://arxiv.org/abs/2212.00653v1
- Date: Thu, 1 Dec 2022 16:58:57 GMT
- Title: Hyperbolic Contrastive Learning for Visual Representations beyond Objects
- Authors: Songwei Ge, Shlok Mishra, Simon Kornblith, Chun-Liang Li, David Jacobs
- Abstract summary: We focus on learning representations for objects and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure.
- Score: 30.618032825306187
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Although self-/un-supervised methods have led to rapid progress in visual
representation learning, these methods generally treat objects and scenes using
the same lens. In this paper, we focus on learning representations for objects
and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the
representation space, we argue that the scenes and objects should instead
follow a hierarchical structure based on their compositionality. To exploit
such a structure, we propose a contrastive learning framework where a Euclidean
loss is used to learn object representations and a hyperbolic loss is used to
encourage representations of scenes to lie close to representations of their
constituent objects in a hyperbolic space. This novel hyperbolic objective
encourages the scene-object hypernymy among the representations by optimizing
the magnitude of their norms. We show that when pretraining on the COCO and
OpenImages datasets, the hyperbolic loss improves downstream performance of
several baselines across multiple datasets and tasks, including image
classification, object detection, and semantic segmentation. We also show that
the properties of the learned representations allow us to solve various vision
tasks that involve the interaction between scenes and objects in a zero-shot
fashion. Our code can be found at https://github.com/shlokk/HCL/tree/main/HCL.
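To make the two-branch objective concrete, below is a minimal sketch, assuming an InfoNCE-style formulation in which the negative Poincaré distance acts as the similarity for scene-object pairs. The exact loss, curvature, and sampling scheme used by the authors may differ (see the linked repository); all function and variable names here are illustrative.

```python
# Sketch of the two-branch objective from the abstract: a Euclidean contrastive
# loss on object crops plus a hyperbolic loss pulling each scene toward its
# constituent objects in a Poincare ball. NOT the authors' exact implementation.
import torch
import torch.nn.functional as F

def expmap0(v, c=1.0, eps=1e-6):
    """Map Euclidean encoder outputs into the Poincare ball (exponential map at 0)."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_dist(x, y, c=1.0, eps=1e-6):
    """Pairwise geodesic distances on the Poincare ball between rows of x and y."""
    x2 = x.pow(2).sum(-1, keepdim=True)          # (N, 1)
    y2 = y.pow(2).sum(-1, keepdim=True).t()      # (1, M)
    xy = x @ y.t()                               # (N, M)
    sq_dist = (x2 - 2 * xy + y2).clamp_min(0)    # ||x - y||^2
    denom = ((1 - c * x2) * (1 - c * y2)).clamp_min(eps)
    arg = 1 + 2 * c * sq_dist / denom
    return torch.acosh(arg.clamp_min(1 + eps)) / (c ** 0.5)

def euclidean_infonce(z1, z2, tau=0.2):
    """Standard contrastive (InfoNCE) loss between two views of object crops."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def hyperbolic_scene_object_loss(scene_emb, object_emb, tau=0.2, c=1.0):
    """Pull each scene toward its own constituent object (small geodesic
    distance) and away from objects of other scenes; the relative norms of
    scene and object embeddings then carry the scene-object hypernymy the
    abstract refers to."""
    s = expmap0(scene_emb, c)
    o = expmap0(object_emb, c)
    logits = -poincare_dist(s, o, c) / tau       # similarity = negative distance
    labels = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, labels)

# Hypothetical usage, where `encoder` maps images/crops to feature vectors:
# loss = euclidean_infonce(encoder(obj_view1), encoder(obj_view2)) \
#        + lam * hyperbolic_scene_object_loss(encoder(scenes), encoder(objects))
```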
Related papers
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- Spotlight Attention: Robust Object-Centric Learning With a Spatial Locality Prior [88.9319150230121]
Object-centric vision aims to construct an explicit representation of the objects in a scene.
We incorporate a spatial-locality prior into state-of-the-art object-centric vision models.
We obtain significant improvements in segmenting objects in both synthetic and real-world datasets.
arXiv Detail & Related papers (2023-05-31T04:35:50Z)
- Object-Compositional Neural Implicit Surfaces [45.274466719163925]
The neural implicit representation has shown its effectiveness in novel view synthesis and high-quality 3D reconstruction from multi-view images.
This paper proposes a novel framework, ObjectSDF, to build an object-compositional neural implicit representation with high fidelity in 3D reconstruction and object representation.
arXiv Detail & Related papers (2022-07-20T06:38:04Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- Continuous Scene Representations for Embodied AI [33.00565252990522]
Continuous Scene Representations (CSR) is a scene representation constructed by an embodied agent navigating within a space.
Our key insight is to embed pair-wise relationships between objects in a latent space.
CSR can track objects as the agent moves in a scene, update the representation accordingly, and detect changes in room configurations.
arXiv Detail & Related papers (2022-03-31T17:55:33Z)
- Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
- Unsupervised Part Discovery from Contrastive Reconstruction [90.88501867321573]
The goal of self-supervised visual representation learning is to learn strong, transferable image representations.
We propose an unsupervised approach to object part discovery and segmentation.
Our method yields semantic parts consistent across fine-grained but visually distinct categories.
arXiv Detail & Related papers (2021-11-11T17:59:42Z)
- Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner.
We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning.
arXiv Detail & Related papers (2021-07-30T19:24:07Z)
- Image Captioning with Visual Object Representations Grounded in the Textual Modality [14.797241131469486]
We explore the possibilities of a shared embedding space between textual and visual modality.
We propose an approach opposite to the current trend, grounding of the representations in the word embedding space of the captioning system.
arXiv Detail & Related papers (2020-10-19T12:21:38Z)
- Deriving Visual Semantics from Spatial Context: An Adaptation of LSA and Word2Vec to generate Object and Scene Embeddings from Images [0.0]
We develop two approaches for learning object and scene embeddings from annotated images.
In the first approach, we generate embeddings from object co-occurrences in whole images, one for objects and one for scenes.
In the second approach, rather than analyzing whole images of scenes, we focus on co-occurrences of objects within subregions of an image.
arXiv Detail & Related papers (2020-09-20T08:26:38Z)
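The first approach in the last entry above (object co-occurrences in whole images) maps naturally onto off-the-shelf word-embedding tooling. Below is a hedged sketch, assuming COCO-style per-image object labels and gensim's Word2Vec; the annotation format, hyperparameters, and the averaging step for scene embeddings are assumptions, not details taken from that paper.

```python
# Sketch: treat each image's object annotations as one "sentence" and train
# Word2Vec on these co-occurrences to obtain object embeddings.
import numpy as np
from gensim.models import Word2Vec

# Hypothetical per-image object annotations (e.g. from COCO-style labels).
images = [
    ["person", "surfboard", "sea"],
    ["person", "dog", "frisbee", "grass"],
    ["car", "traffic_light", "road", "person"],
]

# Object embeddings from co-occurrence within whole images.
model = Word2Vec(sentences=images, vector_size=64, window=10,
                 min_count=1, sg=1, epochs=50)

# One simple (assumed) scene embedding: average the vectors of its objects.
scene_vec = np.mean([model.wv[o] for o in images[0]], axis=0)
print(model.wv.most_similar("person", topn=3))
```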
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.