Efficient Zero-shot Visual Search via Target and Context-aware
Transformer
- URL: http://arxiv.org/abs/2211.13470v1
- Date: Thu, 24 Nov 2022 08:27:47 GMT
- Title: Efficient Zero-shot Visual Search via Target and Context-aware
Transformer
- Authors: Zhiwei Ding, Xuezhe Ren, Erwan David, Melissa Vo, Gabriel Kreiman,
Mengmi Zhang
- Abstract summary: We propose a zero-shot deep learning architecture, TCT, that modulates self-attention in the Vision Transformer with target- and context-relevant information.
We conduct visual search experiments on TCT and other competitive visual search models on three natural scene datasets with varying levels of difficulty.
TCT demonstrates human-like performance in terms of search efficiency and beats the SOTA models in challenging visual search tasks.
- Score: 5.652978777706897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual search is a ubiquitous challenge in natural vision, including daily
tasks such as finding a friend in a crowd or searching for a car in a parking
lot. Humans rely heavily on relevant target features to perform goal-directed
visual search. Meanwhile, context is of critical importance for locating a
target object in complex scenes as it helps narrow down the search area and
makes the search process more efficient. However, few works have combined both
target and context information in visual search computational models. Here we
propose a zero-shot deep learning architecture, TCT (Target and Context-aware
Transformer), that modulates self-attention in the Vision Transformer with
target- and context-relevant information to enable human-like zero-shot
visual search performance. Target modulation is computed as patch-wise local
relevance between the target and search images, whereas contextual modulation
is applied in a global fashion. We conduct visual search experiments on TCT and
other competitive visual search models on three natural scene datasets with
varying levels of difficulty. TCT demonstrates human-like performance in terms
of search efficiency and beats the SOTA models in challenging visual search
tasks. Importantly, TCT generalizes well across datasets with novel objects
without retraining or fine-tuning. Furthermore, we also introduce a new dataset
to benchmark models for invariant visual search under incongruent contexts. TCT
manages to search flexibly via target and context modulation, even under
incongruent contexts.
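
The abstract describes two forms of attention modulation: a patch-wise local relevance computed between the target and search images, and a contextual modulation applied globally. The sketch below is a minimal illustration of how such biases could be folded into a ViT self-attention step; it is not the authors' implementation, and the function names, the additive-bias formulation, and the alpha/beta strengths are assumptions for illustration only.

```python
# Minimal sketch (not the TCT authors' code): biasing ViT self-attention with a
# patch-wise target-relevance map (local) and a scene-context prior (global),
# assuming precomputed patch embeddings for the target crop and search image.
import torch
import torch.nn.functional as F

def target_relevance(target_patches, search_patches):
    """Patch-wise cosine similarity between target and search patch embeddings.

    target_patches: (Nt, D) embeddings of the target crop
    search_patches: (Ns, D) embeddings of the search image
    Returns a (Ns,) relevance score per search patch (best-matching target patch).
    """
    t = F.normalize(target_patches, dim=-1)
    s = F.normalize(search_patches, dim=-1)
    sim = s @ t.T                      # (Ns, Nt) local similarities
    return sim.max(dim=-1).values

def modulated_attention(q, k, v, relevance, context_weights, alpha=1.0, beta=1.0):
    """Self-attention over search-image tokens with additive biases on the logits.

    q, k, v: (Ns, D) queries/keys/values for the search-image tokens
    relevance: (Ns,) target relevance per key token (local modulation)
    context_weights: (Ns,) scene-level prior per key token (global modulation)
    alpha, beta: modulation strengths (hypothetical hyper-parameters)
    """
    d = q.shape[-1]
    logits = (q @ k.T) / d ** 0.5                      # (Ns, Ns) attention logits
    logits = logits + alpha * relevance[None, :]       # boost target-like patches
    logits = logits + beta * context_weights[None, :]  # boost contextually likely regions
    attn = logits.softmax(dim=-1)
    return attn @ v

# Example with random embeddings (Ns=196 search patches, Nt=9 target patches, D=64):
# q = k = v = torch.randn(196, 64); tgt = torch.randn(9, 64)
# rel = target_relevance(tgt, k); ctx = torch.zeros(196)
# out = modulated_attention(q, k, v, rel, ctx)
```

An additive bias on the attention logits is only one plausible reading; multiplicative gating of the attention weights would be an equally simple variant under the same assumptions.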
Related papers
- Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model [3.3772986620114387]
We introduce ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features.
Our method achieves competitive results on the HICO-DET and V-COCO benchmarks with much fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.
arXiv Detail & Related papers (2024-04-19T07:24:32Z)
- Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z)
- TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Target Features Affect Visual Search, A Study of Eye Fixations [2.7920304852537527]
We investigate how the performance of human participants during visual search is affected by different parameters.
Our studies show that a bigger and more eccentric target is found faster, with fewer fixations.
arXiv Detail & Related papers (2022-09-28T01:53:16Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [139.0548263507796]
We use vision transformers (ViTs) as our base model for visual reasoning.
We make better use of concepts defined as object entities and their relations to improve the reasoning ability of ViTs.
We show the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA benchmarks.
arXiv Detail & Related papers (2022-04-24T02:46:43Z)
- Global-Local Context Network for Person Search [125.51080862575326]
Person search aims to jointly localize and identify a query person from natural, uncropped images.
We exploit rich context information globally and locally surrounding the target person, which we refer to as scene and group context, respectively.
We propose a unified global-local context network (GLCNet) with the intuitive aim of feature enhancement.
arXiv Detail & Related papers (2021-12-05T07:38:53Z)
- Searching the Search Space of Vision Transformer [98.96601221383209]
Vision Transformer has shown great visual representation power in substantial vision tasks such as recognition and detection.
We propose to use neural architecture search to automate this process, by searching not only the architecture but also the search space.
We provide design guidelines of general vision transformers with extensive analysis according to the space searching process.
arXiv Detail & Related papers (2021-11-29T17:26:07Z)