OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned
Representation Learning
- URL: http://arxiv.org/abs/2108.03704v1
- Date: Sun, 8 Aug 2021 18:13:53 GMT
- Title: OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned
Representation Learning
- Authors: Sheng Liu, Kevin Lin, Lijuan Wang, Junsong Yuan, Zicheng Liu
- Abstract summary: We introduce the task of open-vocabulary visual instance search (OVIS).
Given an arbitrary textual search query, OVIS aims to return a ranked list of visual instances.
We propose to address this search challenge via visual-semantic aligned representation learning (ViSA).
- Score: 79.49199857462087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the task of open-vocabulary visual instance search (OVIS). Given
an arbitrary textual search query, Open-vocabulary Visual Instance Search
(OVIS) aims to return a ranked list of visual instances, i.e., image patches,
from an image database that satisfy the search intent. The term "open
vocabulary" means that there are no restrictions on the visual instances to be
searched, nor on the words that can be used to compose the textual search
query. We propose to address this search challenge via
visual-semantic aligned representation learning (ViSA). ViSA leverages massive
image-caption pairs as weak image-level (not instance-level) supervision to
learn a rich cross-modal semantic space where the representations of visual
instances (not images) and those of textual queries are aligned, thus allowing
us to measure the similarities between any visual instance and an arbitrary
textual query. To evaluate the performance of ViSA, we build two datasets named
OVIS40 and OVIS1600 and also introduce a pipeline for error analysis. Through
extensive experiments on the two datasets, we demonstrate ViSA's ability to
search for visual instances in images not available during training given a
wide range of textual queries including those composed of uncommon words.
Experimental results show that ViSA achieves an mAP@50 of 21.9% on OVIS40 under
the most challenging setting and an mAP@6 of 14.9% on the OVIS1600 dataset.
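
The abstract above describes the core retrieval mechanism: visual instances (image patches) and textual queries are embedded into one aligned space, so search reduces to ranking patches by their similarity to the query embedding. The block below is a minimal sketch of that ranking step using placeholder embeddings; the fixed embedding dimension and the random stand-ins for encoder outputs are assumptions for illustration, not part of the ViSA implementation.

```python
import numpy as np

DIM = 512  # assumed embedding dimension for illustration; not specified by the paper


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Normalize rows to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def rank_instances(query_emb: np.ndarray, instance_embs: np.ndarray, top_k: int = 50):
    """Rank candidate visual instances (image patches) against a textual query.

    query_emb:     (DIM,) embedding of the textual search query
    instance_embs: (N, DIM) embeddings of candidate patches from the image database
    Returns the indices of the top_k most similar patches (best first) and their scores.
    """
    q = l2_normalize(query_emb[None, :])        # (1, DIM)
    v = l2_normalize(instance_embs)             # (N, DIM)
    scores = (v @ q.T).squeeze(-1)              # (N,) cosine similarities
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for the outputs of aligned text and instance encoders.
    query = rng.normal(size=DIM)
    patches = rng.normal(size=(1000, DIM))
    top_idx, top_scores = rank_instances(query, patches, top_k=5)
    print(top_idx, top_scores)
```

Because the space is open-vocabulary, nothing in this ranking step depends on a closed label set: any query the text encoder can embed can be scored against every patch in the database.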
Related papers
- WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild [88.05964311416717]
We introduce WildVis, an interactive tool that enables fast, versatile, and large-scale conversation analysis.
WildVis provides search and visualization capabilities in the text and embedding spaces based on a list of criteria.
We demonstrate WildVis' utility through three case studies: facilitating misuse research, visualizing and comparing topic distributions across datasets, and characterizing user-specific conversation patterns.
arXiv Detail & Related papers (2024-09-05T17:59:15Z) - HKUST at SemEval-2023 Task 1: Visual Word Sense Disambiguation with
Context Augmentation and Visual Assistance [5.5532783549057845]
We propose a multi-modal retrieval framework that maximally leverages pretrained Vision-Language models.
Our system does not produce the most competitive results at SemEval-2023 Task 1, but we are still able to beat nearly half of the teams.
arXiv Detail & Related papers (2023-11-30T06:23:15Z) - VELMA: Verbalization Embodiment of LLM Agents for Vision and Language
Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z) - EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
arXiv Detail & Related papers (2023-05-23T02:59:19Z) - Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z) - See Finer, See More: Implicit Modality Alignment for Text-based Person
Retrieval [19.687373765453643]
We introduce an Implicit Visual-Textual (IVT) framework for text-based person retrieval.
IVT utilizes a single network to learn representations for both modalities, which facilitates visual-textual interaction.
arXiv Detail & Related papers (2022-08-18T03:04:37Z) - ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval [66.66400551173619]
We propose a full transformer architecture to unify cross-modal retrieval scenarios in a single Vision and Scene Text Aggregation framework (ViSTA).
We develop dual contrastive learning losses to embed both image-text pairs and fusion-text pairs into a common cross-modal space (see the contrastive-alignment sketch after this list).
Experimental results show that ViSTA outperforms other methods by at least 8.4% at Recall@1 on the scene text aware retrieval task.
arXiv Detail & Related papers (2022-03-31T03:40:21Z) - LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation [5.064384692591668]
This paper proposes LAViTeR, a novel architecture for visual and textual representation learning.
The main module, Visual Textual Alignment (VTA), is assisted by two auxiliary tasks: GAN-based image synthesis and image captioning.
The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment.
arXiv Detail & Related papers (2021-09-04T22:48:46Z) - StacMR: Scene-Text Aware Cross-Modal Retrieval [19.54677614738065]
Cross-modal retrieval models have benefited from an increasingly rich understanding of visual scenes.
Current models overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval.
We propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances.
arXiv Detail & Related papers (2020-12-08T10:04:25Z) - ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural
Language [36.319953919737245]
Person search by natural language aims at retrieving a specific person in a large-scale image pool that matches the given textual descriptions.
We propose an attribute-aligning perspective that allows grounding specific attribute phrases to the corresponding visual regions.
This attribute alignment enables robust feature learning and boosts retrieval performance.
arXiv Detail & Related papers (2020-05-15T02:22:28Z)
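
Several of the related papers above (ViSTA, LAViTeR, and the open-vocabulary detection work) align image and text embeddings with contrastive objectives. The block below is a generic, minimal sketch of a symmetric InfoNCE-style contrastive loss over a batch of paired embeddings; it illustrates the general technique only and is not the dual loss formulation used in ViSTA.

```python
import numpy as np


def symmetric_contrastive_loss(img_embs: np.ndarray, txt_embs: np.ndarray,
                               temperature: float = 0.07) -> float:
    """Symmetric InfoNCE-style loss for a batch of matched (image, text) embeddings.

    img_embs, txt_embs: (B, D) arrays where row i of each forms a positive pair.
    Matched pairs are pulled together; all other in-batch pairs are pushed apart.
    """
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) cosine similarities / temperature
    idx = np.arange(logits.shape[0])            # positives lie on the diagonal

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_probs[idx, idx].mean())

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In practice such a loss is computed on encoder outputs within each training batch; ViSTA additionally applies a second loss of this form to fusion-text pairs, which this sketch does not model.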
This list is automatically generated from the titles and abstracts of the papers on this site.