2.5D Visual Relationship Detection
- URL: http://arxiv.org/abs/2104.12727v1
- Date: Mon, 26 Apr 2021 17:19:10 GMT
- Title: 2.5D Visual Relationship Detection
- Authors: Yu-Chuan Su, Soravit Changpinyo, Xiangning Chen, Sathish Thoppay,
Cho-Jui Hsieh, Lior Shapira, Radu Soricut, Hartwig Adam, Matthew Brown,
Ming-Hsuan Yang, Boqing Gong
- Abstract summary: We study 2.5D visual relationship detection (2.5VRD)
Unlike general VRD, 2.5VRD is egocentric, using the camera's viewpoint as a common reference for all 2.5D relationships.
We create a new dataset consisting of 220k human-annotated 2.5D relationships among 512K objects from 11K images.
- Score: 142.69699509655428
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual 2.5D perception involves understanding the semantics and geometry of a
scene through reasoning about object relationships with respect to the viewer
in an environment. However, existing works in visual recognition primarily
focus on the semantics. To bridge this gap, we study 2.5D visual relationship
detection (2.5VRD), in which the goal is to jointly detect objects and predict
their relative depth and occlusion relationships. Unlike general VRD, 2.5VRD is
egocentric, using the camera's viewpoint as a common reference for all 2.5D
relationships. Unlike depth estimation, 2.5VRD is object-centric and not only
focuses on depth. To enable progress on this task, we create a new dataset
consisting of 220k human-annotated 2.5D relationships among 512K objects from
11K images. We analyze this dataset and conduct extensive experiments including
benchmarking multiple state-of-the-art VRD models on this task. Our results
show that existing models largely rely on semantic cues and simple heuristics
to solve 2.5VRD, motivating further research on models for 2.5D perception. The
new dataset is available at https://github.com/google-research-datasets/2.5vrd.
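To make the task concrete, here is a minimal sketch of an egocentric depth comparison between two detected boxes using only 2D image cues, the kind of simple heuristic the benchmark shows existing VRD models fall back on. The Box and DepthRelation names, the relation labels, and the bottom-edge rule are illustrative assumptions for this sketch, not the released dataset's schema or any benchmarked model.

```python
# Hypothetical sketch: a 2D-only heuristic for egocentric depth ordering.
# All names and labels below are assumptions made for illustration; they do
# not reproduce the 2.5VRD dataset schema or any model from the paper.
from dataclasses import dataclass
from enum import Enum


class DepthRelation(Enum):
    CLOSER = "closer"      # subject is closer to the camera than the object
    FARTHER = "farther"    # subject is farther from the camera
    SAME = "same"          # roughly the same distance
    UNSURE = "unsure"      # relation cannot be determined


@dataclass
class Box:
    """Axis-aligned box in normalized image coordinates (y grows downward)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def heuristic_depth_relation(subject: Box, obj: Box, margin: float = 0.02) -> DepthRelation:
    """Guess the camera-relative depth order from 2D cues only.

    Assumption: the box whose bottom edge sits lower in the frame is often
    closer to the camera (e.g., objects resting on the ground plane). This is
    exactly the kind of shortcut 2.5VRD is designed to expose.
    """
    delta = subject.y_max - obj.y_max
    if delta > margin:
        return DepthRelation.CLOSER
    if delta < -margin:
        return DepthRelation.FARTHER
    return DepthRelation.SAME


if __name__ == "__main__":
    person = Box(0.10, 0.30, 0.40, 0.95)  # bottom edge low in the frame
    car = Box(0.50, 0.25, 0.90, 0.60)     # bottom edge higher in the frame
    print(heuristic_depth_relation(person, car))  # DepthRelation.CLOSER
```

Such a baseline ignores occlusion entirely and fails for elevated or floating objects, illustrating why reliance on semantic cues and simple heuristics motivates further research on 2.5D perception.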
Related papers
- Interpretable Action Recognition on Hard to Classify Actions [11.641926922266347]
Humans recognise complex activities in video by recognising critical spatio-temporal relations among explicitly recognised objects and parts.
To mimic this we build on a model which uses positions of objects and hands, and their motions, to recognise the activity taking place.
To improve this model, we focused on three of its most confused classes and identified the lack of 3D information as the major problem.
A state-of-the-art object detection model was fine-tuned to distinguish "Container" from "NotContainer", integrating object shape information into the existing object features.
arXiv Detail & Related papers (2024-09-19T21:23:44Z)
- Open World Object Detection in the Era of Foundation Models [53.683963161370585]
We introduce a new benchmark that includes five real-world application-driven datasets.
We introduce a novel method, Foundation Object detection Model for the Open world, or FOMO, which identifies unknown objects based on their shared attributes with the base known objects.
arXiv Detail & Related papers (2023-12-10T03:56:06Z)
- InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions [23.296139146133573]
We present a large-scale dataset, InViG, for interactive visual grounding under language ambiguity.
Our dataset comprises over 520K images accompanied by open-ended goal-oriented disambiguation dialogues.
To the best of our knowledge, the InViG dataset is the first large-scale dataset for resolving open-ended interactive visual grounding.
arXiv Detail & Related papers (2023-10-18T17:57:05Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- 4D Unsupervised Object Discovery [53.561750858325915]
We propose 4D unsupervised object discovery, jointly discovering objects from 4D data -- 3D point clouds and 2D RGB images with temporal information.
We present the first practical approach for this task by proposing a ClusterNet on 3D point clouds, which is jointly optimized with a 2D localization network.
arXiv Detail & Related papers (2022-10-10T16:05:53Z)
- Mutual Adaptive Reasoning for Monocular 3D Multi-Person Pose Estimation [45.06447187321217]
Most existing bottom-up methods treat camera-centric 3D human pose estimation as two unrelated subtasks.
We propose a unified model that leverages the mutual benefits of both these subtasks.
Our model runs much faster than existing bottom-up and top-down methods.
arXiv Detail & Related papers (2022-07-16T10:54:40Z)
- REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to sustain the modeling of relationships among objects and grasps.
Our dataset is collected in both forms of 2D images and 3D point clouds.
Users are free to import their own object models to generate as much data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.