DetGPT: Detect What You Need via Reasoning
- URL: http://arxiv.org/abs/2305.14167v2
- Date: Wed, 24 May 2023 02:51:37 GMT
- Title: DetGPT: Detect What You Need via Reasoning
- Authors: Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang,
Lewei Yao, Jianhua Han, Hang Xu, Lingpeng Kong, Tong Zhang
- Abstract summary: We introduce a new paradigm for object detection that we call reasoning-based object detection.
Unlike conventional object detection methods that rely on specific object names, our approach enables users to interact with the system using natural language instructions.
Our proposed method, called DetGPT, leverages state-of-the-art multi-modal models and open-vocabulary object detectors.
- Score: 33.00345609506097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the field of computer vision has seen significant
advancements thanks to the development of large language models (LLMs). These
models have enabled more effective and sophisticated interactions between
humans and machines, paving the way for novel techniques that blur the lines
between human and machine intelligence. In this paper, we introduce a new
paradigm for object detection that we call reasoning-based object detection.
Unlike conventional object detection methods that rely on specific object
names, our approach enables users to interact with the system using natural
language instructions, allowing for a higher level of interactivity. Our
proposed method, called DetGPT, leverages state-of-the-art multi-modal models
and open-vocabulary object detectors to perform reasoning within the context of
the user's instructions and the visual scene. This enables DetGPT to
automatically locate the object of interest based on the user's expressed
desires, even if the object is not explicitly mentioned. For instance, if a
user expresses a desire for a cold beverage, DetGPT can analyze the image,
identify a fridge, and use its knowledge of typical fridge contents to locate
the beverage. This flexibility makes our system applicable across a wide range
of fields, from robotics and automation to autonomous driving. Overall, our
proposed paradigm and DetGPT demonstrate the potential for more sophisticated
and intuitive interactions between humans and machines. We hope that our
proposed paradigm and approach will provide inspiration to the community and
open the door to more interactive and versatile object detection systems. Our
project page is available at detgpt.github.io.
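To make the pipeline concrete, below is a minimal Python sketch of the two-stage reasoning-then-detection flow the abstract describes: a multi-modal LLM first infers which objects satisfy the user's instruction, and an open-vocabulary detector then localizes them. The class names and method signatures are hypothetical placeholders for illustration only, not the released DetGPT code.

# Minimal sketch of a reasoning-based detection pipeline in the spirit of DetGPT.
# The wrappers below (MultimodalReasoner, OpenVocabDetector) are hypothetical
# stand-ins, not the authors' actual classes.
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


class MultimodalReasoner:
    """Hypothetical wrapper around a multi-modal LLM. Given an image and a
    free-form instruction, it returns the names of objects that satisfy the
    instruction."""

    def relevant_objects(self, image_path: str, instruction: str) -> List[str]:
        # A real system would prompt the multi-modal LLM with something like
        # "Instruction: {instruction}. Which objects in the image help fulfil
        # it?" and parse the answer into object names.
        raise NotImplementedError


class OpenVocabDetector:
    """Hypothetical wrapper around an open-vocabulary detector that can
    localize arbitrary text queries in an image."""

    def detect(self, image_path: str, queries: List[str]) -> List[Box]:
        raise NotImplementedError


def reason_then_detect(image_path: str, instruction: str,
                       reasoner: MultimodalReasoner,
                       detector: OpenVocabDetector) -> List[Box]:
    # Stage 1: the multi-modal LLM reasons about which objects the
    # instruction implies (e.g. "I want a cold beverage" -> ["fridge"]).
    targets = reasoner.relevant_objects(image_path, instruction)
    # Stage 2: the open-vocabulary detector localizes those objects.
    return detector.detect(image_path, targets)


# Usage (with concrete reasoner/detector implementations plugged in):
#   boxes = reason_then_detect("kitchen.jpg", "I want a cold beverage",
#                              my_reasoner, my_detector)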
Related papers
- Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection [37.57355457749918]
We introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP.
Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction.
Experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming the previous state of the art on unseen classes across various zero-shot settings.
arXiv Detail & Related papers (2024-08-05T14:05:25Z)
- Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
- Tactile-Filter: Interactive Tactile Perception for Part Mating [54.46221808805662]
Humans rely on touch and tactile sensing for many dexterous manipulation tasks.
Vision-based tactile sensors are widely used for various robotic perception and control tasks.
We present a method for interactive perception using vision-based tactile sensors for a part mating task.
arXiv Detail & Related papers (2023-03-10T16:27:37Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- Dynamic Modeling of Hand-Object Interactions via Tactile Sensing [133.52375730875696]
In this work, we employ a high-resolution tactile glove to perform four different interactive activities on a diversified set of objects.
We build our model on a cross-modal learning framework and generate the labels using a visual processing pipeline to supervise the tactile model.
This work takes a step toward dynamics modeling of hand-object interactions from dense tactile sensing.
arXiv Detail & Related papers (2021-09-09T16:04:14Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Learning Intuitive Physics with Multimodal Generative Models [24.342994226226786]
This paper presents a perception framework that fuses visual and tactile feedback to make predictions about the expected motion of objects in dynamic scenes.
We use a novel See-Through-your-Skin (STS) sensor that provides high resolution multimodal sensing of contact surfaces.
We validate through simulated and real-world experiments in which the resting state of an object is predicted from given initial conditions.
arXiv Detail & Related papers (2021-01-12T12:55:53Z)
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.