DetGPT: Detect What You Need via Reasoning
- URL: http://arxiv.org/abs/2305.14167v2
- Date: Wed, 24 May 2023 02:51:37 GMT
- Title: DetGPT: Detect What You Need via Reasoning
- Authors: Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang,
Lewei Yao, Jianhua Han, Hang Xu, Lingpeng Kong, Tong Zhang
- Abstract summary: We introduce a new paradigm for object detection that we call reasoning-based object detection.
Unlike conventional object detection methods that rely on specific object names, our approach enables users to interact with the system using natural language instructions.
Our proposed method, called DetGPT, leverages state-of-the-art multi-modal models and open-vocabulary object detectors.
- Score: 33.00345609506097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the field of computer vision has seen significant
advancements thanks to the development of large language models (LLMs). These
models have enabled more effective and sophisticated interactions between
humans and machines, paving the way for novel techniques that blur the lines
between human and machine intelligence. In this paper, we introduce a new
paradigm for object detection that we call reasoning-based object detection.
Unlike conventional object detection methods that rely on specific object
names, our approach enables users to interact with the system using natural
language instructions, allowing for a higher level of interactivity. Our
proposed method, called DetGPT, leverages state-of-the-art multi-modal models
and open-vocabulary object detectors to perform reasoning within the context of
the user's instructions and the visual scene. This enables DetGPT to
automatically locate the object of interest based on the user's expressed
desires, even if the object is not explicitly mentioned. For instance, if a
user expresses a desire for a cold beverage, DetGPT can analyze the image,
identify a fridge, and use its knowledge of typical fridge contents to locate
the beverage. This flexibility makes our system applicable across a wide range
of fields, from robotics and automation to autonomous driving. Overall, our
proposed paradigm and DetGPT demonstrate the potential for more sophisticated
and intuitive interactions between humans and machines. We hope that our
proposed paradigm and approach will provide inspiration to the community and
open the door to more interactive and versatile object detection systems. Our
project page is launched at detgpt.github.io.
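The two-stage flow described in the abstract (reason about which objects satisfy the instruction, then localize them) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in, not the authors' implementation: `TOY_KNOWLEDGE` and `reason_about_targets` replace the multi-modal model with a lookup table, `detect` replaces the open-vocabulary detector with a filter over pre-annotated boxes, and the `Box` type and scene contents are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    label: str
    xyxy: tuple  # (x1, y1, x2, y2) in pixels


# Hypothetical stand-in for the multi-modal model's knowledge: maps a
# user need to object names that could satisfy it.
TOY_KNOWLEDGE = {
    "cold beverage": ["fridge", "bottle"],
    "something to sit on": ["chair", "sofa"],
}


def reason_about_targets(instruction, visible_labels):
    """Stage 1 (toy lookup, not an LLM): pick relevant objects present in the scene."""
    targets = []
    for need, names in TOY_KNOWLEDGE.items():
        if need in instruction.lower():
            targets.extend(n for n in names if n in visible_labels)
    return targets


def detect(scene, target_names):
    """Stage 2 (toy detector): return the boxes whose label was requested."""
    wanted = set(target_names)
    return [box for box in scene if box.label in wanted]


def detect_by_instruction(instruction, scene):
    """Chain reasoning and detection; the instruction never has to name the object."""
    visible = {box.label for box in scene}
    targets = reason_about_targets(instruction, visible)
    return detect(scene, targets)


# Invented example scene: the user never says "fridge".
scene = [
    Box("fridge", (10, 20, 200, 400)),
    Box("chair", (220, 250, 330, 400)),
    Box("lamp", (350, 30, 400, 150)),
]
result = detect_by_instruction("I want a cold beverage", scene)
print([b.label for b in result])  # ['fridge']
```

In a real system the lookup table would presumably be replaced by a model that sees the image and the instruction together, and the filter by a detector that accepts free-form object names; the sketch only shows how the two stages hand off a list of names.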
Related papers
- Generating Human-Centric Visual Cues for Human-Object Interaction
Detection via Large Vision-Language Models [59.611697856666304]
Human-object interaction (HOI) detection aims at detecting human-object pairs and predicting their interactions.
We propose three prompts with VLM to generate human-centric visual cues within an image from multiple perspectives of humans.
We develop a transformer-based multimodal fusion module with multitower architecture to integrate visual cue features into the instance and interaction decoders.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
- Tactile-Filter: Interactive Tactile Perception for Part Mating [54.46221808805662]
Humans rely on touch and tactile sensing for a lot of dexterous manipulation tasks.
Vision-based tactile sensors are being widely used for various robotic perception and control tasks.
We present a method for interactive perception using vision-based tactile sensors for a part mating task.
arXiv Detail & Related papers (2023-03-10T16:27:37Z)
- Weakly-Supervised HOI Detection from Interaction Labels Only and
Language/Vision-Language Priors [36.75629570208193]
Human-object interaction (HOI) detection aims to extract interacting human-object pairs and their interaction categories from a given natural image.
In this paper, we tackle HOI detection with the weakest supervision setting in the literature, using only image-level interaction labels.
We first propose an approach to prune non-interacting human and object proposals to increase the quality of positive pairs within the bag, exploiting the grounding capability of the vision-language model.
Second, we use a large language model to query which interactions are possible between a human and a given object category, in order to force the model not to put emphasis
arXiv Detail & Related papers (2023-03-09T19:08:02Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- Dynamic Modeling of Hand-Object Interactions via Tactile Sensing [133.52375730875696]
In this work, we employ a high-resolution tactile glove to perform four different interactive activities on a diversified set of objects.
We build our model on a cross-modal learning framework and generate the labels using a visual processing pipeline to supervise the tactile model.
This work takes a step on dynamics modeling in hand-object interactions from dense tactile sensing.
arXiv Detail & Related papers (2021-09-09T16:04:14Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Learning Intuitive Physics with Multimodal Generative Models [24.342994226226786]
This paper presents a perception framework that fuses visual and tactile feedback to make predictions about the expected motion of objects in dynamic scenes.
We use a novel See-Through-your-Skin (STS) sensor that provides high resolution multimodal sensing of contact surfaces.
We validate through simulated and real-world experiments in which the resting state of an object is predicted from given initial conditions.
arXiv Detail & Related papers (2021-01-12T12:55:53Z)
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.