InstructDET: Diversifying Referring Object Detection with Generalized
Instructions
- URL: http://arxiv.org/abs/2310.05136v5
- Date: Mon, 11 Mar 2024 07:32:31 GMT
- Title: InstructDET: Diversifying Referring Object Detection with Generalized
Instructions
- Authors: Ronghao Dang, Jiangyan Feng, Haodong Zhang, Chongjian Ge, Lin Song,
Lijun Gong, Chengju Liu, Qijun Chen, Feng Zhu, Rui Zhao, Yibing Song
- Abstract summary: We propose a data-centric method for referring object detection (ROD) that localizes target objects based on user instructions.
For one image, we produce a large number of instructions that refer to every single object and to different combinations of multiple objects.
- Score: 39.36186258308405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose InstructDET, a data-centric method for referring object
detection (ROD) that localizes target objects based on user instructions. While
derived from referring expression comprehension (REC), the instructions we
leverage are greatly diversified to encompass common user intentions related to
object detection. For one image, we produce a large number of instructions that
refer to every single object and to different combinations of multiple objects.
Each instruction and its corresponding object bounding boxes (bbxs) constitute
one training data pair. To encompass common detection expressions, we employ
emerging vision-language models (VLMs) and large language models (LLMs) to
generate instructions guided by text prompts and object bbxs, as the
generalization ability of foundation models is effective in producing
human-like expressions (e.g., describing object property, category, and
relationship). We name our constructed dataset InDET. It contains images, bbxs,
and generalized instructions produced by foundation models. InDET is developed
from existing REC and object detection datasets, and it can be expanded
further: any image with object bbxs can be incorporated through our InstructDET
method. Using the InDET dataset, we show that a conventional ROD model
surpasses existing methods on standard REC datasets and on our InDET test set.
Our data-centric method InstructDET, with automatic data expansion by
leveraging foundation models, points to a promising direction in which ROD can
be greatly diversified to execute common object detection instructions.
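To make the data-construction recipe concrete, below is a minimal Python sketch of how instruction-bbx training pairs could be assembled in the spirit of the abstract. The names vlm_describe_regions, llm_diversify, and build_pairs are hypothetical stand-ins (the first two return canned outputs here); this is a sketch under those assumptions, not the paper's released pipeline.

```python
from itertools import combinations

# Hypothetical stand-ins for the foundation-model calls described in the
# abstract: a VLM that captions the object(s) inside given boxes, and an
# LLM that rewrites one caption into several diverse instructions.
# Both return canned outputs here purely for illustration.
def vlm_describe_regions(image_path, bbxs):
    return f"the {len(bbxs)} highlighted object(s) in {image_path}"

def llm_diversify(caption, n=3):
    return [f"find {caption}", f"locate {caption}", f"box {caption}"][:n]

def build_pairs(image_path, bbxs, max_group=2):
    """Create (instruction, bbx-set) training pairs for one image.

    Single objects and small multi-object combinations are each captioned,
    then diversified into several instructions; every instruction is paired
    with the bounding boxes it refers to.
    """
    pairs = []
    for k in range(1, max_group + 1):
        for group in combinations(range(len(bbxs)), k):
            boxes = [bbxs[i] for i in group]
            caption = vlm_describe_regions(image_path, boxes)
            for instruction in llm_diversify(caption):
                pairs.append({"image": image_path,
                              "instruction": instruction,
                              "bbxs": boxes})
    return pairs

# Toy usage: two annotated boxes yield pairs for each single box and the pair.
demo = build_pairs("street.jpg", [[10, 20, 80, 120], [200, 40, 260, 150]])
print(len(demo), demo[0])
```

Under these assumptions, an image with two annotated boxes and max_group=2 yields instruction sets for each single object and for the pair, mirroring the abstract's point about covering single objects and object combinations.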
Related papers
- YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary [12.39040757106137]
We introduce an innovative Retriever-Dictionary (RD) module to address this issue.
This architecture enables YOLO-based models to efficiently retrieve features from a Dictionary that encodes insights from the dataset.
arXiv Detail & Related papers (2024-10-20T09:38:58Z) - Learning Visual Grounding from Generative Vision and Language Model [29.2712567454021]
Visual grounding tasks aim to localize image regions based on natural language references.
We find that grounding knowledge already exists in generative VLMs and can be elicited by proper prompting.
Our results demonstrate the promise of generative VLMs for scaling up visual grounding in the real world.
arXiv Detail & Related papers (2024-07-18T20:29:49Z) - UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics.
We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z) - Learning-To-Rank Approach for Identifying Everyday Objects Using a
Physical-World Search Engine [0.8749675983608172]
We focus on the task of retrieving target objects from open-vocabulary user instructions in a human-in-the-loop setting.
We propose MultiRankIt, a novel learning-to-rank approach for the physical object retrieval task.
arXiv Detail & Related papers (2023-12-26T01:40:31Z) - Open World Object Detection in the Era of Foundation Models [53.683963161370585]
We introduce a new benchmark that includes five real-world application-driven datasets.
We introduce a novel method, Foundation Object detection Model for the Open world, or FOMO, which identifies unknown objects based on their shared attributes with the base known objects.
arXiv Detail & Related papers (2023-12-10T03:56:06Z) - Instruct and Extract: Instruction Tuning for On-Demand Information
Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users.
We present a benchmark named InstructIE, including both automatically generated training data and a human-annotated test set.
Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z) - Described Object Detection: Liberating Object Detection with Flexible
Expressions [19.392927971139652]
In this paper, we advance Open-Vocabulary object Detection (OVD) and Referring Expression Comprehension (REC) to a more practical setting, Described Object Detection (DOD), by expanding category names to flexible language expressions.
The accompanying dataset features flexible language expressions, from short category names to long descriptions, and annotates all described objects on all images without omission.
arXiv Detail & Related papers (2023-07-24T14:06:54Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions (a minimal PCA sketch appears after this list).
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR).
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's ability to express what they are looking for (a minimal embedding-fusion sketch appears after this list).
arXiv Detail & Related papers (2023-06-12T17:56:01Z)
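As a concrete illustration of the PCA-based localization idea mentioned in the unsupervised object discovery entry above, here is a minimal, self-contained sketch: the first principal component of dense per-pixel features is used as a foreground score map. The feature extractor is mocked with random data, and the function name pca_foreground_map is a hypothetical choice; this is a toy under those assumptions, not the cited paper's implementation.

```python
import numpy as np

def pca_foreground_map(features: np.ndarray) -> np.ndarray:
    """Project per-pixel features onto their first principal component.

    features: (H, W, C) array of dense semantic features for one image.
    Returns an (H, W) score map; thresholding it gives a coarse object mask.
    """
    h, w, c = features.shape
    flat = features.reshape(-1, c)
    flat = flat - flat.mean(axis=0, keepdims=True)   # center the features
    # First right-singular vector = first principal direction.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    scores = flat @ vt[0]                            # projection onto 1st PC
    # Note: the sign of a principal component is arbitrary; real methods
    # resolve it, e.g., by assuming background dominates the image borders.
    return scores.reshape(h, w)

# Toy usage with random "features" standing in for a real backbone's output.
rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 32, 64)).astype(np.float32)
score_map = pca_foreground_map(feats)
mask = score_map > score_map.mean()                  # crude binarization
print(mask.shape, int(mask.sum()))
```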
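For the zero-shot composed text-image retrieval entry, one common baseline is to fuse the reference-image embedding with the modification-text embedding and rank a gallery by cosine similarity. The sketch below uses hypothetical encode_image / encode_text functions (random unit vectors) standing in for any joint text-image encoder; it illustrates only the fusion-and-ranking logic, not the cited paper's method.

```python
import numpy as np

# Hypothetical encoders standing in for a joint text-image embedding model;
# here they simply return random unit vectors for illustration.
rng = np.random.default_rng(0)

def encode_image(path: str, dim: int = 512) -> np.ndarray:
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def encode_text(text: str, dim: int = 512) -> np.ndarray:
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def composed_query(ref_image: str, modification: str) -> np.ndarray:
    """Fuse reference-image and text embeddings into one query vector."""
    q = encode_image(ref_image) + encode_text(modification)  # simple sum fusion
    return q / np.linalg.norm(q)

def rank_gallery(query: np.ndarray, gallery_paths: list) -> list:
    """Return gallery images sorted by cosine similarity to the query."""
    embs = np.stack([encode_image(p) for p in gallery_paths])
    sims = embs @ query                  # cosine similarity (unit vectors)
    order = np.argsort(-sims)
    return [gallery_paths[i] for i in order]

# Toy usage: retrieve images matching "the same jacket but in red".
query = composed_query("reference.jpg", "the same jacket but in red")
print(rank_gallery(query, ["a.jpg", "b.jpg", "c.jpg"]))
```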