Multimodal Query-guided Object Localization
- URL: http://arxiv.org/abs/2212.00749v2
- Date: Wed, 24 Jul 2024 14:00:50 GMT
- Title: Multimodal Query-guided Object Localization
- Authors: Aditay Tripathi, Rajath R Dani, Anand Mishra, Anirban Chakraborty
- Abstract summary: We present a multimodal query-guided object localization approach under the challenging open-set setting.
In particular, we use queries from two modalities, namely, hand-drawn sketch and description of the object.
We present a novel cross-modal attention scheme that guides the region proposal network to generate object proposals.
- Score: 5.424592317916519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Consider a scenario in one-shot query-guided object localization where neither an image of the object nor the object category name is available as a query. In such a scenario, a hand-drawn sketch of the object could be a choice for a query. However, hand-drawn crude sketches alone, when used as queries, might be ambiguous for object localization, e.g., a sketch of a laptop could be confused for a sofa. On the other hand, a linguistic definition of the category, e.g., "a small portable computer small enough to use in your lap", along with the sketch query, gives better visual and semantic cues for object localization. In this work, we present a multimodal query-guided object localization approach under the challenging open-set setting. In particular, we use queries from two modalities, namely, hand-drawn sketch and description of the object (also known as gloss), to perform object localization. Multimodal query-guided object localization is a challenging task, especially when a large domain gap exists between the queries and the natural images, as well as due to the challenge of combining the complementary and minimal information present across the queries. For example, hand-drawn crude sketches contain abstract shape information of an object, while the text descriptions often capture partial semantic information about a given object category. To address the aforementioned challenges, we present a novel cross-modal attention scheme that guides the region proposal network to generate object proposals relevant to the input queries and a novel orthogonal projection-based proposal scoring technique that scores each proposal with respect to the queries, thereby yielding the final localization results. ...
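The abstract names an orthogonal projection-based proposal scoring technique but, being truncated, does not give its formulation. The following is only a minimal sketch of one plausible reading, not the authors' method: each proposal feature vector is split into a component along the query direction and an orthogonal residual, and proposals are ranked by the signed length of the parallel component. The function names `projection_score` and `rank_proposals`, the plain-list feature vectors, and the fusion of sketch and gloss by simple addition are all hypothetical choices made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def projection_score(proposal, query):
    """Split `proposal` into a part along `query` and an orthogonal residual.

    Returns (parallel_length, residual_norm). This decomposition is an
    assumption for illustration, not the paper's stated formula.
    """
    q_norm = math.sqrt(dot(query, query))
    if q_norm == 0.0:
        raise ValueError("query vector must be non-zero")
    unit = [x / q_norm for x in query]
    parallel_length = dot(proposal, unit)
    residual = [p - parallel_length * u for p, u in zip(proposal, unit)]
    return parallel_length, math.sqrt(dot(residual, residual))


def rank_proposals(proposals, sketch_vec, gloss_vec):
    """Rank proposal indices, best first.

    Hypothetical fusion: the sketch and gloss vectors are simply added to
    form one query direction before scoring.
    """
    fused = [s + g for s, g in zip(sketch_vec, gloss_vec)]
    scored = [(projection_score(p, fused)[0], i) for i, p in enumerate(proposals)]
    return [i for _, i in sorted(scored, reverse=True)]


if __name__ == "__main__":
    proposals = [[0.0, 1.0], [2.0, 2.0], [1.0, -1.0]]
    print(rank_proposals(proposals, [1.0, 0.0], [0.0, 1.0]))
```

In this toy setup, a proposal whose features point along the fused query direction ranks above one that is orthogonal to it; a real system would operate on learned region and query embeddings rather than hand-written vectors.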
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z)
- RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments [170.43912741137655]
We construct a comprehensive dataset called Reasoning Intention-Oriented Objects (RIO).
RIO is specifically designed to incorporate diverse real-world scenarios and a wide range of object categories.
We evaluate the ability of some existing models to reason intention-oriented objects in open environments.
arXiv Detail & Related papers (2023-10-26T10:15:21Z)
- LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z)
- Query-guided Attention in Vision Transformers for Localizing Objects Using a Single Sketch [17.63475613154152]
Given a crude hand-drawn sketch of an object, the goal is to localize all instances of the same object on the target image.
This problem proves difficult due to the abstract nature of hand-drawn sketches, variations in the style and quality of sketches, and the large domain gap existing between the sketches and the natural images.
We propose a sketch-guided vision transformer encoder that uses cross-attention after each block of the transformer-based image encoder to learn query-conditioned image features.
arXiv Detail & Related papers (2023-03-15T17:26:17Z)
- Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z)
- FindIt: Generalized Localization with Natural Language Queries [43.07139534653485]
FindIt is a simple and versatile framework that unifies a variety of visual grounding and localization tasks.
Key to our architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements.
Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization or detection queries.
arXiv Detail & Related papers (2022-03-31T17:59:30Z)
- Object Priors for Classifying and Localizing Unseen Actions [45.91275361696107]
We propose three spatial object priors, which encode local person and object detectors along with their spatial relations.
On top of these, we introduce three semantic object priors, which extend semantic matching through word embeddings.
A video embedding combines the spatial and semantic object priors.
arXiv Detail & Related papers (2021-04-10T08:56:58Z)
- Prototypical Region Proposal Networks for Few-Shot Localization and Classification [1.5100087942838936]
We develop a framework to unify segmentation and classification into an end-to-end classification model, PRoPnet.
We empirically demonstrate that our methods improve accuracy on image datasets with natural scenes containing multiple object classes.
arXiv Detail & Related papers (2021-04-08T04:03:30Z)
- Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z)
- Sketch-Guided Object Localization in Natural Images [16.982683600384277]
We introduce the novel problem of localizing all instances of an object (seen or unseen during training) in a natural image via sketch query.
We propose a novel cross-modal attention scheme that guides the region proposal network (RPN) to generate object proposals relevant to the sketch query.
Our method is effective with as little as a single sketch query.
arXiv Detail & Related papers (2020-08-14T19:35:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.