Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions
- URL: http://arxiv.org/abs/2502.08438v1
- Date: Wed, 12 Feb 2025 14:22:59 GMT
- Title: Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions
- Authors: Prajwal Gatti, Kshitij Parikh, Dhriti Prasanna Paul, Manish Gupta, Anand Mishra,
- Abstract summary: Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them.
We propose a pretrained multimodal transformer-based baseline, STNET (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image.
Our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, and composite query modalities.
- Score: 6.8273484064357515
- License:
- Abstract: Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for numbats. Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., numbat digging in the ground. In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of difficult-to-name but easy-to-draw objects and text describing difficult-to-sketch but easy-to-verbalize object attributes or interaction with the scene. This novel problem statement distinctly differs from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of approx. 2M queries and 108K natural scene images. Further, as a solution to this problem, we propose a pretrained multimodal transformer-based baseline, STNET (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image, and encodes the text and image to perform image retrieval. In addition to contrastive learning, we propose multiple training objectives that improve the performance of our model. Extensive experiments show that our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, and composite query modalities. We make the dataset and code available at our project website.
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models.
Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
arXiv Detail & Related papers (2024-03-12T00:27:18Z) - LLM Blueprint: Enabling Text-to-Image Generation with Complex and
Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z) - A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch [63.12810494378133]
We present an end-to-end trainable model for image retrieval using a text description and a sketch as input.
We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval.
arXiv Detail & Related papers (2022-08-05T18:43:37Z) - BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid
Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task by a novel underlinetextbfBottom-up crunderlinetextbfOss-modal underlinetextbfSemantic compounderlinetextbfSition (textbfBOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z) - Fully Context-Aware Image Inpainting with a Learned Semantic Pyramid [102.24539566851809]
Restoring reasonable and realistic content for arbitrary missing regions in images is an important yet challenging task.
Recent image inpainting models have made significant progress in generating vivid visual details, but they can still lead to texture blurring or structural distortions.
We propose the Semantic Pyramid Network (SPN) motivated by the idea that learning multi-scale semantic priors can greatly benefit the recovery of locally missing content in images.
arXiv Detail & Related papers (2021-12-08T04:33:33Z) - Towards Efficient Cross-Modal Visual Textual Retrieval using
Transformer-Encoder Deep Features [10.163477961551592]
Cross-modal retrieval is an important functionality in modern search engines.
In this paper, we focus on the image-sentence retrieval task.
We use the recently introduced TERN architecture as an image-sentence features extractor.
arXiv Detail & Related papers (2021-06-01T10:11:46Z) - StacMR: Scene-Text Aware Cross-Modal Retrieval [19.54677614738065]
Cross-modal retrieval models have benefited from an increasingly rich understanding of visual scenes.
Current models overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval.
We propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances.
arXiv Detail & Related papers (2020-12-08T10:04:25Z) - SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval [15.074592583852167]
We focus on the task of text-conditioned image retrieval that utilizes support text feedback alongside a reference image to retrieve images.
We propose a novel framework SAC which resolves the above in two major steps: "where to see" (Semantic Feature Attention) and "how to change"
We show how our architecture streamlines the generation of text-aware image features by removing the need for various modules required by other state-of-art techniques.
arXiv Detail & Related papers (2020-09-03T06:55:23Z) - MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding [6.4901484665257545]
We propose a novel multi-head self-attention network to capture various components of visual and textual data by attending to important parts in data.
Our approach achieves the new state-of-the-art results in image-text retrieval tasks on MS-COCO and Flicker30K datasets.
arXiv Detail & Related papers (2020-01-11T05:50:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.