Related papers: DetGPT: Detect What You Need via Reasoning

DetGPT: Detect What You Need via Reasoning

URL: http://arxiv.org/abs/2305.14167v2
Date: Wed, 24 May 2023 02:51:37 GMT
Title: DetGPT: Detect What You Need via Reasoning
Authors: Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang, Lewei Yao, Jianhua Han, Hang Xu, Lingpeng Kong, Tong Zhang
Abstract summary: We introduce a new paradigm for object detection that we call reasoning-based object detection. Unlike conventional object detection methods that rely on specific object names, our approach enables users to interact with the system using natural language instructions. Our proposed method, called DetGPT, leverages state-of-the-art multi-modal models and open-vocabulary object detectors.
Score: 33.00345609506097
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, the field of computer vision has seen significant advancements thanks to the development of large language models (LLMs). These models have enabled more effective and sophisticated interactions between humans and machines, paving the way for novel techniques that blur the lines between human and machine intelligence. In this paper, we introduce a new paradigm for object detection that we call reasoning-based object detection. Unlike conventional object detection methods that rely on specific object names, our approach enables users to interact with the system using natural language instructions, allowing for a higher level of interactivity. Our proposed method, called DetGPT, leverages state-of-the-art multi-modal models and open-vocabulary object detectors to perform reasoning within the context of the user's instructions and the visual scene. This enables DetGPT to automatically locate the object of interest based on the user's expressed desires, even if the object is not explicitly mentioned. For instance, if a user expresses a desire for a cold beverage, DetGPT can analyze the image, identify a fridge, and use its knowledge of typical fridge contents to locate the beverage. This flexibility makes our system applicable across a wide range of fields, from robotics and automation to autonomous driving. Overall, our proposed paradigm and DetGPT demonstrate the potential for more sophisticated and intuitive interactions between humans and machines. We hope that our proposed paradigm and approach will provide inspiration to the community and open the door to more interative and versatile object detection systems. Our project page is launched at detgpt.github.io.

Related papers

ObjectFinder: An Open-Vocabulary Assistive System for Interactive Object Search by Blind People [42.050924675417654]
We present ObjectFinder, an open-vocabulary wearable system for interactive object search by blind people. ObjectFinder allows users to query target objects using flexible wording. It provides egocentric localization information in real-time, including distance and direction.
arXiv Detail & Related papers (2024-12-04T08:38:45Z)
Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection [37.57355457749918]
We introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP. Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction. Experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming previous state-of-the-art on unseen classes of various zero-shot settings.
arXiv Detail & Related papers (2024-08-05T14:05:25Z)
Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection. We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
Tactile-Filter: Interactive Tactile Perception for Part Mating [54.46221808805662]
Humans rely on touch and tactile sensing for a lot of dexterous manipulation tasks. vision-based tactile sensors are being widely used for various robotic perception and control tasks. We present a method for interactive perception using vision-based tactile sensors for a part mating task.
arXiv Detail & Related papers (2023-03-10T16:27:37Z)
Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics. We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention. We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
Dynamic Modeling of Hand-Object Interactions via Tactile Sensing [133.52375730875696]
In this work, we employ a high-resolution tactile glove to perform four different interactive activities on a diversified set of objects. We build our model on a cross-modal learning framework and generate the labels using a visual processing pipeline to supervise the tactile model. This work takes a step on dynamics modeling in hand-object interactions from dense tactile sensing.
arXiv Detail & Related papers (2021-09-09T16:04:14Z)
INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with human through natural language and grasps a specified object in clutter. We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping. We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
Learning Intuitive Physics with Multimodal Generative Models [24.342994226226786]
This paper presents a perception framework that fuses visual and tactile feedback to make predictions about the expected motion of objects in dynamic scenes. We use a novel See-Through-your-Skin (STS) sensor that provides high resolution multimodal sensing of contact surfaces. We validate through simulated and real-world experiments in which the resting state of an object is predicted from given initial conditions.
arXiv Detail & Related papers (2021-01-12T12:55:53Z)
Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs. Our network predicts interaction points, which directly localize and classify the inter-action. Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.