FindIt: Generalized Localization with Natural Language Queries
- URL: http://arxiv.org/abs/2203.17273v1
- Date: Thu, 31 Mar 2022 17:59:30 GMT
- Title: FindIt: Generalized Localization with Natural Language Queries
- Authors: Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova
- Abstract summary: FindIt is a simple and versatile framework that unifies a variety of visual grounding and localization tasks.
Key to our architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements.
Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization or detection queries.
- Score: 43.07139534653485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose FindIt, a simple and versatile framework that unifies a variety of
visual grounding and localization tasks including referring expression
comprehension, text-based localization, and object detection. Key to our
architecture is an efficient multi-scale fusion module that unifies the
disparate localization requirements across the tasks. In addition, we discover
that a standard object detector is surprisingly effective in unifying these
tasks without a need for task-specific design, losses, or pre-computed
detections. Our end-to-end trainable framework responds flexibly and accurately
to a wide range of referring expression, localization or detection queries for
zero, one, or multiple objects. Jointly trained on these tasks, FindIt
outperforms the state of the art on both referring expression and text-based
localization, and shows competitive performance on object detection. Finally,
FindIt generalizes better to out-of-distribution data and novel categories
compared to strong single-task baselines. All of these are accomplished by a
single, unified and efficient model. The code will be released.
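To make the idea concrete, below is a minimal sketch of the kind of multi-scale text-image fusion the abstract describes: one encoded text query is fused with each level of an image feature pyramid, and the fused pyramid is then handed to an ordinary detector head. The module name, the cross-attention fusion, and all shapes are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumed design, not the FindIt release): fuse a text query with a
# multi-scale feature pyramid, then pass the result to a standard detector head.
import torch
import torch.nn as nn


class MultiScaleFusion(nn.Module):
    """Fuse one encoded text query with each level of an image feature pyramid."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pyramid, text_tokens):
        # pyramid: list of (B, dim, H_i, W_i) feature maps; text_tokens: (B, T, dim)
        fused = []
        for feat in pyramid:
            b, c, h, w = feat.shape
            tokens = feat.flatten(2).transpose(1, 2)        # (B, H*W, dim)
            attended, _ = self.attn(tokens, text_tokens, text_tokens)
            tokens = self.norm(tokens + attended)           # residual text-image fusion
            fused.append(tokens.transpose(1, 2).reshape(b, c, h, w))
        return fused  # same shapes as the input pyramid, now query-conditioned


# Usage: a referring expression and a plain detection request both become text queries;
# detection can reuse a fixed prompt such as "find all objects", so one standard
# detector head can consume the fused pyramid for every task.
fusion = MultiScaleFusion()
pyramid = [torch.randn(2, 256, s, s) for s in (64, 32, 16)]
text_tokens = torch.randn(2, 12, 256)   # stand-in for encoded query tokens
fused_pyramid = fusion(pyramid, text_tokens)
```

Because every task is expressed as a text query over the same fused features, a single detector head can answer referring expression, text-based localization, and detection requests, which is the unification the abstract emphasizes.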
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs, however, lack a fundamental cognitive ability: learning to localize objects in a scene by taking the surrounding context into account.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- GMC: A General Framework of Multi-stage Context Learning and Utilization for Visual Detection Tasks [10.840556935747784]
A general framework is proposed for multi-stage context learning and utilization, paired with various deep network architectures for different visual detection tasks.
The proposed framework provides a comprehensive and adaptable solution for context learning and utilization in visual detection scenarios.
arXiv Detail & Related papers (2024-07-08T02:54:09Z)
- CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection [42.2847114428716]
Task driven object detection aims to detect object instances suitable for affording a task in an image.
Its challenge lies in the fact that the object categories suitable for a task are too diverse to be covered by the closed vocabulary of traditional object detection.
We propose to explore fundamental affordances rather than object categories, i.e., common attributes that enable different objects to accomplish the same task.
arXiv Detail & Related papers (2023-09-03T06:18:39Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Universal Instance Perception as Object Discovery and Retrieval [90.96031157557806]
UNI reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm.
It can flexibly perceive different types of objects by simply changing the input prompts.
UNI shows superior performance on 20 challenging benchmarks from 10 instance-level tasks.
arXiv Detail & Related papers (2023-03-12T14:28:24Z)
- Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z)
- Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning [42.29650807349636]
We propose a transformer-based framework for accurate visual grounding.
We develop a visual-linguistic verification module to focus the visual features on regions relevant to the textual descriptions.
A language-guided feature encoder is also devised to aggregate the visual contexts of the target object to improve the object's distinctiveness.
arXiv Detail & Related papers (2022-04-30T13:48:15Z)
- Towards Accurate Localization by Instance Search [2.0539994999823334]
A self-paced learning framework is proposed to achieve accurate object localization on the rank list returned by instance search.
The proposed framework mines the target instance gradually from the queries and their corresponding top-ranked search results.
In addition to performing localization on instance search, the issue of few-shot object detection is also addressed under the same framework.
arXiv Detail & Related papers (2021-07-11T10:03:31Z)
- Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture.
Our models learn to generate labels in text based on the visual and textual inputs.
Our generative approach shows better generalization ability on answering questions that have rare answers.
arXiv Detail & Related papers (2021-02-04T17:59:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.