Seeing the Unseen: Visual Common Sense for Semantic Placement
- URL: http://arxiv.org/abs/2401.07770v1
- Date: Mon, 15 Jan 2024 15:28:30 GMT
- Title: Seeing the Unseen: Visual Common Sense for Semantic Placement
- Authors: Ram Ramrakhya, Aniruddha Kembhavi, Dhruv Batra, Zsolt Kira, Kuo-Hao
Zeng, Luca Weihs
- Abstract summary: Given an image and an object name, a vision system is asked to predict semantically meaningful regions (masks or bounding boxes) in the image where that object could be placed or is likely to be placed by humans.
We call this task Semantic Placement (SP) and believe that such common-sense visual understanding is critical for assistive robots (tidying a house) and AR devices (automatically rendering an object in the user's space).
- Score: 71.76026880991245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computer vision tasks typically involve describing what is present in an
image (e.g. classification, detection, segmentation, and captioning). We study
a visual common sense task that requires understanding what is not present.
Specifically, given an image (e.g. of a living room) and the name of an object
("cushion"), a vision system is asked to predict semantically meaningful
regions (masks or bounding boxes) in the image where that object could be
placed or is likely to be placed by humans (e.g. on the sofa). We call this task
Semantic Placement (SP) and believe that such common-sense visual understanding
is critical for assistive robots (tidying a house) and AR devices
(automatically rendering an object in the user's space). Studying the invisible
is hard. Datasets for image description are typically constructed by curating
relevant images and asking humans to annotate the contents of the image;
neither of those steps is straightforward for objects not present in the
image. We overcome this challenge by operating in the opposite direction: we
start with an image of an object in context from the web, and then remove that
object from the image via inpainting. This automated pipeline converts
unstructured web data into a dataset comprising pairs of images with/without
the object. Using this, we collect a novel dataset, with ${\sim}1.3$M images
across $9$ object categories, and train a SP prediction model called CLIP-UNet.
CLIP-UNet outperforms existing VLMs and baselines that combine semantic priors
with object detectors on real-world and simulated images. In our user studies,
we find that the SP masks predicted by CLIP-UNet are favored $43.7\%$ and
$31.3\%$ of the time when compared against the $4$ SP baselines on real and
simulated images. In addition, we demonstrate that leveraging SP mask
predictions from CLIP-UNet enables downstream applications like building tidying robots in
indoor environments.
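To make the task concrete, below is a minimal PyTorch sketch of a CLIP-UNet-style Semantic Placement predictor: spatial features from a frozen vision-language image encoder are fused with a text embedding of the object name and decoded into a placement mask. The module names, feature dimensions, and FiLM-like gating are illustrative assumptions, not the authors' released architecture.

```python
# A minimal sketch of a CLIP-UNet-style Semantic Placement predictor
# (illustrative assumptions; not the authors' released architecture).
import torch
import torch.nn as nn


class PlacementDecoder(nn.Module):
    """UNet-style decoder: upsamples fused features to a 1-channel placement mask."""

    def __init__(self, in_dim: int, widths=(256, 128, 64, 32)):
        super().__init__()
        blocks, prev = [], in_dim
        for w in widths:
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            ))
            prev = w
        self.blocks = nn.Sequential(*blocks)
        self.head = nn.Conv2d(prev, 1, kernel_size=1)

    def forward(self, x):
        return self.head(self.blocks(x))


class CLIPUNetSketch(nn.Module):
    """Fuses frozen image features with a text embedding of the object name
    (e.g. "cushion") and decodes a Semantic Placement mask."""

    def __init__(self, img_feat_dim=768, txt_feat_dim=512, fused_dim=512):
        super().__init__()
        self.img_proj = nn.Conv2d(img_feat_dim, fused_dim, kernel_size=1)
        self.txt_proj = nn.Linear(txt_feat_dim, fused_dim)
        self.decoder = PlacementDecoder(fused_dim)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, C_img, h, w) spatial features from a frozen image encoder
        # txt_feats: (B, C_txt) embedding of the object name
        x = self.img_proj(img_feats)
        t = self.txt_proj(txt_feats)[:, :, None, None]
        fused = x * torch.sigmoid(t)          # FiLM-like gating (an assumption)
        logits = self.decoder(fused)          # (B, 1, 16*h, 16*w)
        return torch.sigmoid(logits)          # per-pixel placement probability


if __name__ == "__main__":
    model = CLIPUNetSketch()
    img_feats = torch.randn(2, 768, 14, 14)   # stand-in for ViT patch features
    txt_feats = torch.randn(2, 512)           # stand-in for a text embedding
    print(model(img_feats, txt_feats).shape)  # torch.Size([2, 1, 224, 224])
```

Such a model would presumably be trained on the paper's automatically generated pairs, with the inpainted (object-removed) image as input and the removed object's region as the mask target.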
Related papers
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z) - Are These the Same Apple? Comparing Images Based on Object Intrinsics [27.43687450076182]
We measure image similarity purely based on intrinsic object properties that define object identity.
This problem has been studied in the computer vision literature as re-identification.
We propose to extend it to general object categories, exploring an image similarity metric based on object intrinsics.
arXiv Detail & Related papers (2023-11-01T18:00:03Z) - What Can Human Sketches Do for Object Detection? [127.67444974452411]
Sketches are highly expressive, inherently capturing subjective and fine-grained visual cues.
A sketch-enabled object detection framework detects based on what you sketch -- that "zebra".
We show an intuitive synergy between foundation models (e.g., CLIP) and existing sketch models built for sketch-based image retrieval (SBIR).
In particular, we first independently adapt both the sketch and photo branches of an encoder model to build highly generalisable sketch and photo encoders.
arXiv Detail & Related papers (2023-03-27T12:33:23Z) - Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z) - Hyperbolic Contrastive Learning for Visual Representations beyond
Objects [30.618032825306187]
We focus on learning representations for objects and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure.
arXiv Detail & Related papers (2022-12-01T16:58:57Z) - Complex Scene Image Editing by Scene Graph Comprehension [17.72638225034884]
We propose a two-stage method for complex scene image editing by scene graph comprehension (SGC-Net).
In the first stage, we train a Region of Interest (RoI) prediction network that uses scene graphs to predict the locations of the target objects.
The second stage uses a conditional diffusion model to edit the image based on our RoI predictions.
arXiv Detail & Related papers (2022-03-24T05:12:54Z) - Compositional Sketch Search [91.84489055347585]
We present an algorithm for searching image collections using free-hand sketches.
We exploit drawings as a concise and intuitive representation for specifying entire scene compositions.
arXiv Detail & Related papers (2021-06-15T09:38:09Z) - Action Image Representation: Learning Scalable Deep Grasping Policies
with Zero Real World Data [12.554739620645917]
Action Image represents a grasp proposal as an image and uses a deep convolutional network to infer grasp quality.
We show that this representation works on a variety of inputs, including color images (RGB), depth images (D), and combined color-depth (RGB-D).
arXiv Detail & Related papers (2020-05-13T21:40:21Z) - Self-Supervised Viewpoint Learning From Image Collections [116.56304441362994]
We propose a novel learning framework which incorporates an analysis-by-synthesis paradigm to reconstruct images in a viewpoint aware manner.
We show that our approach performs competitively to fully-supervised approaches for several object categories like human faces, cars, buses, and trains.
arXiv Detail & Related papers (2020-04-03T22:01:41Z)