Universal Instance Perception as Object Discovery and Retrieval
- URL: http://arxiv.org/abs/2303.06674v2
- Date: Thu, 17 Aug 2023 07:50:28 GMT
- Title: Universal Instance Perception as Object Discovery and Retrieval
- Authors: Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan,
Huchuan Lu
- Abstract summary: UNINEXT reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm.
It can flexibly perceive different types of objects by simply changing the input prompts.
UNINEXT shows superior performance on 20 challenging benchmarks from 10 instance-level tasks.
- Score: 90.96031157557806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: All instance perception tasks aim at finding certain objects specified by
some queries such as category names, language expressions, and target
annotations, but this complete field has been split into multiple independent
subtasks. In this work, we present a universal instance perception model of the
next generation, termed UNINEXT. UNINEXT reformulates diverse instance
perception tasks into a unified object discovery and retrieval paradigm and can
flexibly perceive different types of objects by simply changing the input
prompts. This unified formulation brings the following benefits: (1) enormous
data from different tasks and label vocabularies can be exploited for jointly
training general instance-level representations, which is especially beneficial
for tasks lacking training data. (2) The unified model is
parameter-efficient and can save redundant computation when handling multiple
tasks simultaneously. UNINEXT shows superior performance on 20 challenging
benchmarks from 10 instance-level tasks including classical image-level tasks
(object detection and instance segmentation), vision-and-language tasks
(referring expression comprehension and segmentation), and six video-level
object tracking tasks. Code is available at
https://github.com/MasterBin-IIAU/UNINEXT.
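To make the paradigm concrete, the sketch below shows the retrieval step in miniature: a detector proposes class-agnostic instances ("discovery"), and a prompt embedding, derived from a category name, a referring expression, or a target annotation, selects the matching ones ("retrieval"). All names, shapes, and the scoring rule are illustrative assumptions, not UNINEXT's actual implementation (see the repository above for that).

```python
# A minimal sketch of the object discovery and retrieval paradigm, with
# hypothetical names and shapes; UNINEXT's real implementation lives in
# the repository linked above.
import torch
import torch.nn.functional as F

def discover_and_retrieve(instance_embeddings, instance_boxes,
                          prompt_embedding, score_threshold=0.5):
    """Match N discovered instance proposals against one prompt.

    instance_embeddings: (N, D) embeddings of class-agnostic proposals
        from a detector -- the "discovery" half.
    instance_boxes:      (N, 4) boxes for those proposals.
    prompt_embedding:    (D,) query embedding; it may come from a category
        name, a referring expression, or a target annotation, and it is
        the only thing that changes across tasks.
    Returns the boxes whose match score exceeds the threshold -- the
    "retrieval" half.
    """
    inst = F.normalize(instance_embeddings, dim=-1)
    prompt = F.normalize(prompt_embedding, dim=-1)
    scores = torch.sigmoid(inst @ prompt)        # (N,) prompt-instance match
    keep = scores > score_threshold
    return instance_boxes[keep], scores[keep]

# Toy usage: 5 proposals with 16-dim embeddings, one language-derived prompt.
boxes, scores = discover_and_retrieve(
    torch.randn(5, 16), torch.rand(5, 4), torch.randn(16))
```

In this formulation only the prompt changes across tasks, which is what lets a single model and one set of weights cover the task families listed above.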
Related papers
- UniFS: Universal Few-shot Instance Perception with Point Representations [36.943019984075065]
We propose UniFS, a universal few-shot instance perception model that unifies a wide range of instance perception tasks.
Our approach makes minimal assumptions about the tasks, yet it achieves competitive results compared to highly specialized and well-optimized specialist models.
arXiv Detail & Related papers (2024-04-30T09:47:44Z)
- DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding [7.470587868134298]
Point scene understanding is a challenging task that processes real-world scene point clouds.
The recent state-of-the-art method first segments each object and then processes each one independently, in multiple stages for the different sub-tasks.
We propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation.
arXiv Detail & Related papers (2024-03-25T05:22:34Z)
- Distribution Matching for Multi-Task Learning of Classification Tasks: a Large-Scale Study on Faces & Beyond [62.406687088097605]
Multi-Task Learning (MTL) is a framework where multiple related tasks are learned jointly and benefit from a shared representation space.
We show that MTL can be successful even for classification tasks whose annotations are sparse or entirely non-overlapping.
We propose a novel approach in which knowledge exchange between the tasks is enabled via distribution matching, as sketched below.
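The summary does not specify the exact loss, but one plausible, hedged reading of "knowledge exchange via distribution matching" is a divergence penalty that pulls two task heads' predictive distributions together on shared inputs; the sketch below assumes, purely for illustration, that both heads score the same label space.

```python
# Hypothetical sketch of knowledge exchange via distribution matching:
# a symmetric KL penalty that pulls two task heads' softened predictive
# distributions together. This is a generic formulation assumed for
# illustration, not necessarily the paper's exact loss; it also assumes
# both heads score the same label space (or a shared projection of it).
import torch
import torch.nn.functional as F

def distribution_matching_loss(logits_a, logits_b, temperature=2.0):
    """Symmetric KL between the softened predictions of two task heads."""
    log_p = F.log_softmax(logits_a / temperature, dim=-1)
    log_q = F.log_softmax(logits_b / temperature, dim=-1)
    kl_pq = F.kl_div(log_q, log_p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# Toy usage: a batch of 8 examples scored by two 10-way heads.
loss = distribution_matching_loss(torch.randn(8, 10), torch.randn(8, 10))
```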
arXiv Detail & Related papers (2024-01-02T14:18:11Z)
- Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z)
- CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection [42.2847114428716]
Task driven object detection aims to detect object instances suitable for affording a task in an image.
The challenge is that the object categories suitable for a given task are too diverse to be covered by the closed vocabulary of traditional object detection.
We propose to explore fundamental affordances rather than object categories, i.e., common attributes that enable different objects to accomplish the same task.
arXiv Detail & Related papers (2023-09-03T06:18:39Z)
- A Dynamic Feature Interaction Framework for Multi-task Visual Perception [100.98434079696268]
We devise an efficient unified framework to solve multiple common perception tasks.
These tasks include instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation.
Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception.
arXiv Detail & Related papers (2023-06-08T09:24:46Z)
- BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video [58.71785546245467]
Multiple existing benchmarks involve tracking and segmenting objects in video.
There is little interaction between them due to the use of disparate benchmark datasets and metrics.
We propose BURST, a dataset which contains thousands of diverse videos with high-quality object masks.
All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison.
arXiv Detail & Related papers (2022-09-25T01:27:35Z)
- FindIt: Generalized Localization with Natural Language Queries [43.07139534653485]
FindIt is a simple and versatile framework that unifies a variety of visual grounding and localization tasks.
Key to our architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements.
Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization or detection queries.
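As a hedged illustration of what a multi-scale fusion module of this kind might look like (the names and design below are assumptions, not FindIt's actual architecture), one can broadcast a single text-query embedding across image features at several scales, mix each scale with a 1x1 convolution, and merge everything at the finest resolution:

```python
# Hypothetical multi-scale fusion module in the spirit of the summary
# (not FindIt's actual architecture): broadcast one text-query embedding
# over image features at several scales, mix per scale with a 1x1 conv,
# then merge everything at the finest resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTextFusion(nn.Module):
    def __init__(self, dim, num_scales=3):
        super().__init__()
        self.mix = nn.ModuleList(
            [nn.Conv2d(2 * dim, dim, kernel_size=1) for _ in range(num_scales)])

    def forward(self, feats, text):
        # feats: list of (B, D, Hi, Wi), finest scale first; text: (B, D).
        fused = []
        for f, conv in zip(feats, self.mix):
            t = text[:, :, None, None].expand(-1, -1, f.shape[2], f.shape[3])
            fused.append(conv(torch.cat([f, t], dim=1)))
        h, w = fused[0].shape[2:]  # upsample all scales to the finest one
        return sum(F.interpolate(f, size=(h, w), mode="nearest") for f in fused)

# Toy usage: three feature scales, 8 channels, one query embedding.
m = MultiScaleTextFusion(dim=8)
feats = [torch.randn(1, 8, s, s) for s in (32, 16, 8)]
out = m(feats, torch.randn(1, 8))  # -> (1, 8, 32, 32)
```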
arXiv Detail & Related papers (2022-03-31T17:59:30Z)
- Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture.
Our models learn to generate labels in text based on the visual and textual inputs.
Our generative approach shows better generalization when answering questions that have rare answers.
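A minimal sketch of this interface, assuming a toy decoder rather than the pretrained transformer the paper actually builds on: image features are projected into the token stream, and every task's label is decoded as text.

```python
# Hedged sketch of the unified text-generation interface: every task's
# label (an answer, a class name, a grounded region tag, ...) is decoded
# as text conditioned on image features and a task prompt. The paper
# builds on pretrained encoder-decoder transformers; the toy GRU decoder
# and all dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class UnifiedVLGenerator(nn.Module):
    def __init__(self, vocab_size=1000, dim=64, img_feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(img_feat_dim, dim)  # visual tokens
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab_size)     # next-token logits

    def forward(self, img_feats, prompt_tokens):
        # Prefix the text tokens with projected image features so one
        # decoder serves VQA, grounding, captioning, etc. via the prompt.
        img = self.img_proj(img_feats)              # (B, R, dim)
        txt = self.embed(prompt_tokens)             # (B, T, dim)
        hidden, _ = self.decoder(torch.cat([img, txt], dim=1))
        return self.lm_head(hidden)                 # (B, R+T, vocab)

# Toy usage: 36 region features and a 5-token task prompt per example.
model = UnifiedVLGenerator()
logits = model(torch.randn(2, 36, 2048), torch.randint(0, 1000, (2, 5)))
```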
arXiv Detail & Related papers (2021-02-04T17:59:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.