StacMR: Scene-Text Aware Cross-Modal Retrieval
- URL: http://arxiv.org/abs/2012.04329v1
- Date: Tue, 8 Dec 2020 10:04:25 GMT
- Title: StacMR: Scene-Text Aware Cross-Modal Retrieval
- Authors: Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, Diane Larlus
  and Dimosthenis Karatzas
- Abstract summary: Cross-modal retrieval models have benefited from an increasingly rich understanding of visual scenes.
Current models overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval.
We propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances.
- Score: 19.54677614738065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent models for cross-modal retrieval have benefited from an increasingly
rich understanding of visual scenes, afforded by scene graphs and object
interactions, to name a few. This has resulted in improved matching
between the visual representation of an image and the textual representation of
its caption. Yet, current visual representations overlook a key aspect: the
text appearing in images, which may contain crucial information for retrieval.
In this paper, we first propose a new dataset that allows exploration of
cross-modal retrieval where images contain scene-text instances. Then, armed
with this dataset, we describe several approaches which leverage scene text,
including a better scene-text aware cross-modal retrieval method which uses
specialized representations for text from the captions and text from the visual
scene, and reconciles them in a common embedding space. Extensive experiments
confirm that cross-modal retrieval approaches benefit from scene text and
highlight interesting research questions worth exploring further. Dataset and
code are available at http://europe.naverlabs.com/stacmr
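The core idea of the abstract can be illustrated with a small amount of code: separate branches encode the caption, the visual features, and the scene-text (OCR) tokens, and their outputs are reconciled in a single embedding space used for ranking. The snippet below is a minimal, hypothetical sketch, not the released StacMR code; the module names, feature dimensions, and the fusion-by-summation step are assumptions made only for illustration.

```python
# Minimal sketch (not the authors' code) of scene-text aware cross-modal
# retrieval: encode the caption, the visual scene, and the scene-text (OCR)
# tokens with separate branches, reconcile them in one common embedding space,
# and rank gallery images by cosine similarity to a caption query.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SceneTextAwareRetrieval(nn.Module):
    def __init__(self, img_feat_dim=2048, ocr_feat_dim=300,
                 txt_vocab=30000, embed_dim=1024):
        super().__init__()
        # Visual branch: projects precomputed image features.
        self.img_proj = nn.Linear(img_feat_dim, embed_dim)
        # Scene-text branch: specialized representation for OCR tokens.
        self.ocr_proj = nn.Linear(ocr_feat_dim, embed_dim)
        # Caption branch: a simple GRU text encoder.
        self.word_emb = nn.Embedding(txt_vocab, 300)
        self.caption_rnn = nn.GRU(300, embed_dim, batch_first=True)

    def encode_image(self, img_feats, ocr_feats):
        # img_feats: (B, img_feat_dim); ocr_feats: (B, num_ocr, ocr_feat_dim)
        visual = self.img_proj(img_feats)
        scene_text = self.ocr_proj(ocr_feats).mean(dim=1)  # pool OCR tokens
        # "Reconcile" both cues in the common space (sum is an assumption).
        return F.normalize(visual + scene_text, dim=-1)

    def encode_caption(self, token_ids):
        # token_ids: (B, seq_len)
        _, h = self.caption_rnn(self.word_emb(token_ids))
        return F.normalize(h[-1], dim=-1)


# Retrieval: rank gallery images by cosine similarity to a caption query.
model = SceneTextAwareRetrieval()
imgs = model.encode_image(torch.randn(8, 2048), torch.randn(8, 5, 300))
query = model.encode_caption(torch.randint(0, 30000, (1, 12)))
ranking = (query @ imgs.t()).argsort(dim=-1, descending=True)
```

This captures only the overall structure (three branches, one shared space); the released code at the URL above is the reference implementation.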
Related papers
- Scene Graph Based Fusion Network For Image-Text Retrieval [2.962083552798791]
A critical challenge to image-text retrieval is how to learn accurate correspondences between images and texts.
We propose a Scene Graph based Fusion Network (dubbed SGFN) which enhances the images'/texts' features through intra- and cross-modal fusion.
Our SGFN outperforms several state-of-the-art image-text retrieval methods.
arXiv Detail & Related papers (2023-03-20T13:22:56Z) - ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval [66.66400551173619]
We propose a full transformer architecture to unify cross-modal retrieval scenarios in a single Vision and Scene Text Aggregation framework (ViSTA).
We develop dual contrastive learning losses to embed both image-text pairs and fusion-text pairs into a common cross-modal space (a generic formulation of such a loss is sketched after this list).
Experimental results show that ViSTA outperforms other methods by at least 8.4% at Recall@1 for the scene text aware retrieval task.
arXiv Detail & Related papers (2022-03-31T03:40:21Z) - Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis.
The first hierarchical scene text dataset is introduced to enable this novel research task.
We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene
Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks, respectively.
arXiv Detail & Related papers (2022-03-08T08:10:45Z) - Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and retrieve all text instances in an image gallery that are identical or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be simply performed by ranking the detected text instances with the learned similarity.
arXiv Detail & Related papers (2021-04-04T07:18:38Z) - Telling the What while Pointing the Where: Fine-grained Mouse Trace and
Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires the ability to also express where in the image the sought content is located.
In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
arXiv Detail & Related papers (2021-02-09T17:54:34Z) - Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image
Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z) - Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval [41.505920288928365]
The abundance of multimodal data has inspired interest in cross-modal retrieval methods.
We propose novel within-modality losses which encourage semantic coherency in both the text and image subspaces.
Our method ensures that not only are paired images and texts close, but the expected image-image and text-text relationships are also observed.
arXiv Detail & Related papers (2020-07-16T20:32:54Z) - Textual Visual Semantic Dataset for Text Spotting [27.788077963411624]
Text Spotting in the wild consists of detecting and recognizing text appearing in images.
This is a challenging problem due to the complexity of the context where texts appear.
We propose a visual context dataset for Text Spotting in the wild.
arXiv Detail & Related papers (2020-04-21T23:58:16Z) - TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
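Several of the entries above (e.g. the dual contrastive losses in ViSTA and the within-modality losses for preserving semantic neighborhoods) build on a bidirectional contrastive objective over a shared embedding space. The sketch below shows a generic InfoNCE-style formulation of that idea; it is not taken from any of the cited papers, and the temperature value and symmetric averaging are assumptions.

```python
# Generic bidirectional contrastive (InfoNCE-style) loss over a batch of
# matched image/text embeddings in a shared space. A minimal sketch, not code
# from any of the papers listed above.
import torch
import torch.nn.functional as F


def contrastive_retrieval_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D); row i of each tensor forms a matched pair.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; off-diagonal entries act as negatives.
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)


# Example: 8 matched pairs of 1024-d embeddings.
loss = contrastive_retrieval_loss(torch.randn(8, 1024), torch.randn(8, 1024))
```

Triplet ranking losses with hard negatives are a common alternative to the cross-entropy formulation shown here.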