ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
- URL: http://arxiv.org/abs/2203.16778v1
- Date: Thu, 31 Mar 2022 03:40:21 GMT
- Title: ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
- Authors: Mengjun Cheng, Yipeng Sun, Longchao Wang, Xiongwei Zhu, Kun Yao, Jie
Chen, Guoli Song, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang
- Abstract summary: We propose a full transformer architecture to unify cross-modal retrieval scenarios in a single $\textbf{Vi}$sion and $\textbf{S}$cene $\textbf{T}$ext $\textbf{A}$ggregation framework (ViSTA).
We develop dual contrastive learning losses to embed both image-text pairs and fusion-text pairs into a common cross-modal space.
Experimental results show that ViSTA outperforms other methods by at least $\bf{8.4}\%$ at Recall@1 for the scene text aware retrieval task.
- Score: 66.66400551173619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual appearance is considered to be the most important cue to understand
images for cross-modal retrieval, while sometimes the scene text appearing in
images can provide valuable information to understand the visual semantics.
Most existing cross-modal retrieval approaches ignore scene text information,
and directly adding this information may lead to performance degradation in
scene text free scenarios. To address this issue, we propose a
full transformer architecture to unify these cross-modal retrieval scenarios in
a single $\textbf{Vi}$sion and $\textbf{S}$cene $\textbf{T}$ext
$\textbf{A}$ggregation framework (ViSTA). Specifically, ViSTA utilizes
transformer blocks to directly encode image patches and fuse scene text
embedding to learn an aggregated visual representation for cross-modal
retrieval. To tackle the modality missing problem of scene text, we propose a
novel fusion token based transformer aggregation approach to exchange the
necessary scene text information only through the fusion token and concentrate
on the most important features in each modality. To further strengthen the
visual modality, we develop dual contrastive learning losses to embed both
image-text pairs and fusion-text pairs into a common cross-modal space.
Compared to existing methods, ViSTA is able to aggregate relevant scene text
semantics with visual appearance, and hence improves results under both scene
text free and scene text aware scenarios. Experimental results show that ViSTA
outperforms other methods by at least $\bf{8.4}\%$ at Recall@1 for the scene
text aware retrieval task. Compared with state-of-the-art scene text free
retrieval methods, ViSTA achieves better accuracy on Flickr30K and MSCOCO while
running at least three times faster during the inference stage, which validates
the effectiveness of the proposed framework.
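The fusion-token mechanism described in the abstract lends itself to a compact illustration. The sketch below is a minimal PyTorch rendering of that idea under our own assumptions; the class name `FusionTokenAggregator`, the layer counts, and the averaging update of the fusion token are illustrative and are not taken from the ViSTA implementation.
```python
# Hypothetical sketch of fusion-token aggregation (not the official ViSTA code).
# Vision tokens and scene-text tokens run through their own transformer layers,
# and a single shared fusion token is prepended to both streams, so cross-modal
# information is exchanged only through that token.
import torch
import torch.nn as nn


class FusionTokenAggregator(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_layers=2):
        super().__init__()
        self.fusion_token = nn.Parameter(torch.zeros(1, 1, dim))

        def make_layer():
            return nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

        self.vision_layers = nn.ModuleList([make_layer() for _ in range(num_layers)])
        self.text_layers = nn.ModuleList([make_layer() for _ in range(num_layers)])

    def forward(self, vision_tokens, scene_text_tokens):
        # vision_tokens: (B, Nv, dim); scene_text_tokens: (B, Nt, dim)
        fusion = self.fusion_token.expand(vision_tokens.size(0), -1, -1)
        for v_layer, t_layer in zip(self.vision_layers, self.text_layers):
            # Each branch attends over its own tokens plus the shared fusion token.
            v_out = v_layer(torch.cat([fusion, vision_tokens], dim=1))
            t_out = t_layer(torch.cat([fusion, scene_text_tokens], dim=1))
            # The fusion token is re-estimated from both branches (a simple average
            # here), so it is the only channel the two modalities share.
            fusion = 0.5 * (v_out[:, :1] + t_out[:, :1])
            vision_tokens, scene_text_tokens = v_out[:, 1:], t_out[:, 1:]
        # The aggregated visual representation could pool the vision tokens together
        # with the fusion token; returning all three leaves that choice open.
        return vision_tokens, scene_text_tokens, fusion.squeeze(1)
```
When the image contains little or no scene text, the fusion token in this sketch simply carries vision-side information, which is one way to read the paper's claim of robustness in scene text free scenarios.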
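The dual contrastive objective can likewise be sketched as two symmetric InfoNCE-style losses, one over image-text pairs and one over fusion-text pairs. This is an assumed reading of the abstract rather than the paper's exact formulation; the temperature and the weighting `alpha` are placeholders, not reported hyper-parameters.
```python
# Assumed InfoNCE-style dual contrastive objective (weights and temperature are
# illustrative, not the paper's settings).
import torch
import torch.nn.functional as F


def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss between two batches of L2-normalized embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def dual_contrastive_loss(image_emb, fusion_emb, caption_emb, alpha=0.5):
    # Image-text pairs and fusion-text pairs are both pulled toward the same
    # caption embedding, placing all three in a common cross-modal space.
    return info_nce(image_emb, caption_emb) + alpha * info_nce(fusion_emb, caption_emb)
```
In this reading, sharing the caption embedding across both terms is what ties the image branch and the fused (vision plus scene text) branch to a single cross-modal space.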
Related papers
- Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with online text augmentation.
The proposed method significantly outperforms the state of the art in various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z)
- Show Me the World in My Language: Establishing the First Baseline for Scene-Text to Scene-Text Translation [1.9085074258303771]
We study the task of "visually" translating scene text from a source language to a target language.
Visual translation involves not just the recognition and translation of scene text but also the generation of the translated image.
We present a cascaded framework for visual translation that combines state-of-the-art modules for scene text recognition, machine translation, and scene text synthesis.
arXiv Detail & Related papers (2023-08-06T05:23:25Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Scene Graph Based Fusion Network For Image-Text Retrieval [2.962083552798791]
A critical challenge to image-text retrieval is how to learn accurate correspondences between images and texts.
We propose a Scene Graph based Fusion Network (dubbed SGFN) which enhances the images'/texts' features through intra- and cross-modal fusion.
Our SGFN outperforms several state-of-the-art image-text retrieval methods.
arXiv Detail & Related papers (2023-03-20T13:22:56Z)
- Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with the desired one while preserving background and styles of the original text.
Previous methods of editing the whole image have to learn different translation rules of background and text regions simultaneously.
We propose a novel network by MOdifying Scene Text image at strokE Level (MOSTEL).
arXiv Detail & Related papers (2022-12-05T02:10:59Z)
- Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis.
The first hierarchical scene text dataset is introduced to enable this novel research task.
We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z)
- Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and search all text instances from an image gallery, which are the same or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be simply performed by ranking the detected text instances with the learned similarity.
arXiv Detail & Related papers (2021-04-04T07:18:38Z)
- StacMR: Scene-Text Aware Cross-Modal Retrieval [19.54677614738065]
Cross-modal retrieval models have benefited from an increasingly rich understanding of visual scenes.
Current models overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval.
We propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances.
arXiv Detail & Related papers (2020-12-08T10:04:25Z)