Telling the What while Pointing the Where: Fine-grained Mouse Trace and
Language Supervision for Improved Image Retrieval
- URL: http://arxiv.org/abs/2102.04980v1
- Date: Tue, 9 Feb 2021 17:54:34 GMT
- Title: Telling the What while Pointing the Where: Fine-grained Mouse Trace and
Language Supervision for Improved Image Retrieval
- Authors: Soravit Changpinyo, Jordi Pont-Tuset, Vittorio Ferrari, Radu Soricut
- Abstract summary: Fine-grained image retrieval often requires users to also express where in the image the content they are looking for is located.
In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
- Score: 60.24860627782486
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing image retrieval systems use text queries to provide a natural and
practical way for users to express what they are looking for. However,
fine-grained image retrieval often requires the ability to also express the
where in the image the content they are looking for is. The textual modality
can only cumbersomely express such localization preferences, whereas pointing
would be a natural fit. In this paper, we describe an image retrieval setup
where the user simultaneously describes an image using both spoken natural
language (the "what") and mouse traces over an empty canvas (the "where") to
express the characteristics of the desired target image. To this end, we learn
an image retrieval model using the Localized Narratives dataset, which is
capable of performing early fusion between text descriptions and synchronized
mouse traces. Qualitative and quantitative experiments show that our model is
capable of taking this spatial guidance into account, and provides more
accurate retrieval results compared to text-only equivalent systems.
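No code accompanies this listing; the following is a minimal Python/PyTorch sketch of how early fusion between a text description and a synchronized mouse trace could feed a dual-encoder retrieval setup. The class name FusedQueryEncoder, the (x, y, t) trace encoding, the mean pooling, and all dimensions are illustrative assumptions, not the authors' actual architecture.
```python
# Minimal sketch (assumptions, not the paper's architecture): early fusion of
# text tokens and mouse-trace points into one sequence for a query encoder.
import torch
import torch.nn as nn

class FusedQueryEncoder(nn.Module):
    def __init__(self, vocab_size=30000, dim=256, n_layers=4, n_heads=8):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        # Each trace point is (x, y, t), normalized to [0, 1]; a small MLP lifts it to `dim`.
        self.trace_emb = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids, trace_points):
        # Early fusion: concatenate token and trace embeddings into one sequence
        # so self-attention can align the "what" (words) with the "where" (trace).
        tokens = self.token_emb(token_ids)           # (B, T_text, dim)
        trace = self.trace_emb(trace_points)         # (B, T_trace, dim)
        fused = torch.cat([tokens, trace], dim=1)    # (B, T_text + T_trace, dim)
        query = self.encoder(fused).mean(dim=1)      # pooled query embedding
        return nn.functional.normalize(query, dim=-1)

# Retrieval is nearest-neighbour search against precomputed image embeddings.
encoder = FusedQueryEncoder()
token_ids = torch.randint(0, 30000, (1, 12))         # tokenized spoken description
trace_points = torch.rand(1, 50, 3)                  # 50 sampled (x, y, t) mouse points
query_vec = encoder(token_ids, trace_points).squeeze(0)
gallery = nn.functional.normalize(torch.randn(1000, 256), dim=-1)
top5 = (gallery @ query_vec).topk(5).indices         # indices of best-matching images
```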
Related papers
- Composed Image Retrieval for Remote Sensing [24.107610091033997]
This work introduces composed image retrieval to remote sensing.
It allows querying a large image archive by image examples combined with a textual description.
A novel method fusing image-to-image and text-to-image similarity is introduced.
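As a rough illustration only (the weighting scheme below is an assumption, not the fusion method this paper introduces), combining image-to-image and text-to-image similarities for a composed query could be as simple as a convex combination:
```python
# Illustrative sketch: rank a gallery by a convex combination of image-to-image
# and text-to-image cosine similarities (alpha is a placeholder assumption).
import numpy as np

def composed_scores(query_img_emb, query_txt_emb, gallery_embs, alpha=0.5):
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    gallery = normalize(gallery_embs)
    sim_img = gallery @ normalize(query_img_emb)   # image-to-image similarity
    sim_txt = gallery @ normalize(query_txt_emb)   # text-to-image similarity
    return alpha * sim_img + (1.0 - alpha) * sim_txt

gallery_embs = np.random.randn(500, 512)           # stand-in archive embeddings
ranking = np.argsort(-composed_scores(np.random.randn(512), np.random.randn(512), gallery_embs))
```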
arXiv Detail & Related papers (2024-05-24T14:18:31Z)
- You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models.
Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
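A minimal sketch of combining a query sketch and text with a pre-trained CLIP model is given below; the simple embedding sum is an assumption for illustration, not the compositionality framework the paper proposes (it also assumes the openai/CLIP package and a local file query_sketch.png).
```python
# Hypothetical composition of sketch and text embeddings with pre-trained CLIP;
# summing the two embeddings is an illustrative assumption, not the paper's method.
import torch
import clip                      # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

sketch = preprocess(Image.open("query_sketch.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a brown dog lying on a sofa"]).to(device)

with torch.no_grad():
    sketch_emb = model.encode_image(sketch)
    text_emb = model.encode_text(text)
    query = torch.nn.functional.normalize(sketch_emb + text_emb, dim=-1)
# `query` can then be matched against CLIP image embeddings of the gallery.
```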
arXiv Detail & Related papers (2024-03-12T00:27:18Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Bridging the Gap between Local Semantic Concepts and Bag of Visual Words for Natural Scene Image Retrieval [0.0]
A typical content-based image retrieval system deals with the query image and images in the dataset as a collection of low-level features.
Top-ranked images in the retrieved list, which have high similarity to the query image, may nevertheless differ from it in terms of the user's semantic interpretation.
This paper investigates how natural scene retrieval can be performed using the bag-of-visual-words model and the distribution of local semantic concepts.
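A small, generic sketch of the bag-of-visual-words representation follows; the descriptor source, vocabulary size, and histogram-intersection ranking are assumptions for illustration, not the exact pipeline evaluated in the paper.
```python
# Generic bag-of-visual-words sketch: cluster local descriptors into a visual
# vocabulary, then describe each image as a histogram of visual-word counts.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_local_descriptors, n_words=200):
    # Local descriptors could be SIFT or CNN patch features (assumption).
    return KMeans(n_clusters=n_words, n_init=10).fit(all_local_descriptors)

def bovw_histogram(local_descriptors, vocabulary):
    words = vocabulary.predict(local_descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def histogram_intersection(h1, h2):
    # One common way to compare two normalized BoVW histograms.
    return np.minimum(h1, h2).sum()

vocab = build_vocabulary(np.random.rand(5000, 128), n_words=50)
query_hist = bovw_histogram(np.random.rand(300, 128), vocab)
```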
arXiv Detail & Related papers (2022-10-17T09:10:50Z)
- Using Text to Teach Image Retrieval [47.72498265721957]
We build on the concept of an image manifold to represent the feature space of images, learned via neural networks, as a graph.
We augment the manifold samples with geometrically aligned text, thereby using a plethora of sentences to teach us about images.
The experimental results show that the joint embedding manifold is a robust representation, allowing it to be a better basis to perform image retrieval.
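As a rough illustration of one common way to realize an image manifold (a k-nearest-neighbour graph over image embeddings; the features and k below are placeholder assumptions, not the paper's construction):
```python
# Sketch: represent an image feature space as a sparse k-NN graph ("manifold").
import numpy as np
from sklearn.neighbors import kneighbors_graph

image_features = np.random.randn(1000, 512)    # stand-in neural-network embeddings
# Each image connects to its 10 nearest neighbours under cosine distance.
adjacency = kneighbors_graph(image_features, n_neighbors=10,
                             mode="distance", metric="cosine")
```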
arXiv Detail & Related papers (2020-11-19T16:09:14Z)
- Text-to-Image Generation Grounded by Fine-Grained User Attention [62.94737811887098]
Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces.
We propose TReCS, a sequential model that exploits this grounding to generate images.
arXiv Detail & Related papers (2020-11-07T13:23:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.