Benchmarking Robustness of Text-Image Composed Retrieval
- URL: http://arxiv.org/abs/2311.14837v2
- Date: Thu, 30 Nov 2023 18:14:48 GMT
- Title: Benchmarking Robustness of Text-Image Composed Retrieval
- Authors: Shitong Sun, Jindong Gu, Shaogang Gong
- Abstract summary: Text-image composed retrieval aims to retrieve the target image through the composed query.
It has recently attracted attention due to its ability to leverage both information-rich images and concise language.
However, the robustness of these approaches against real-world corruptions or further text understanding has never been studied.
- Score: 46.98557472744255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-image composed retrieval aims to retrieve the target image through the
composed query, which is specified in the form of an image plus some text that
describes desired modifications to the input image. It has recently attracted
attention due to its ability to leverage both information-rich images and
concise language to precisely express the requirements for target images.
However, the robustness of these approaches against real-world corruptions or
further text understanding has never been studied. In this paper, we perform
the first robustness study and establish three new diversified benchmarks for
systematic analysis of text-image composed retrieval against natural
corruptions in both vision and text and further probe textual understanding.
For natural corruption analysis, we introduce two new large-scale benchmark
datasets, CIRR-C and FashionIQ-C for testing in open domain and fashion domain
respectively, both of which apply 15 visual corruptions and 7 textual
corruptions. For textual understanding analysis, we introduce a new diagnostic
dataset CIRR-D by expanding the original raw data with synthetic data, which
contains modified text to better probe textual understanding ability including
numerical variation, attribute variation, object removal, background variation,
and fine-grained evaluation. The code and benchmark datasets are available at
https://github.com/SunTongtongtong/Benchmark-Robustness-Text-Image-Compose-Retrieval.
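To make the idea of a textual corruption concrete, here is a minimal sketch of one possible character-level corruption applied to the modification text of a composed query, plus a per-query Recall@K check. It is an illustration only, not the benchmark's implementation: the function names `swap_adjacent_chars` and `recall_at_k`, the `rate` parameter, and the choice of adjacent-character swapping are all assumptions made for this example, and the abstract does not list which 7 textual corruptions are used.

```python
import random


def swap_adjacent_chars(text: str, rate: float, seed: int = 0) -> str:
    """Hypothetical text corruption: swap neighbouring letters with probability `rate`.

    This is an illustrative stand-in, not one of the benchmark's documented corruptions.
    """
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each character moves at most once
        else:
            i += 1
    return "".join(chars)


def recall_at_k(ranked_ids, target_id, k: int) -> float:
    """Return 1.0 if the target image id is among the top-k retrieved ids, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


if __name__ == "__main__":
    clean = "make the dog smaller and add a second dog"
    print(swap_adjacent_chars(clean, rate=0.3, seed=1))
```

In a robustness study of this kind, one would typically run the same retrieval model on the clean and the corrupted queries and compare the averaged Recall@K values; the retrieval model itself is outside the scope of this sketch.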
Related papers
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Text Image Inpainting via Global Structure-Guided Diffusion Models [22.859984320894135]
Real-world text can be damaged by corrosion issues caused by environmental or human factors.
Current inpainting techniques often fail to adequately address this problem.
We develop a novel neural framework, Global Structure-guided Diffusion Model (GSDM), as a potential solution.
arXiv Detail & Related papers (2024-01-26T13:01:28Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training [33.78990448307792]
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language.
Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words.
In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
arXiv Detail & Related papers (2023-06-15T00:19:13Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Semantic-Preserving Augmentation for Robust Image-Text Retrieval [27.2916415148638]
RVSE consists of novel image-based and text-based augmentation techniques called semantic preserving augmentation for image (SPAugI) and text (SPAugT).
Since SPAugI and SPAugT change the original data in a way that preserves its semantic information, we enforce the feature extractors to generate semantic-aware embedding vectors.
From extensive experiments using benchmark datasets, we show that RVSE outperforms conventional retrieval schemes in terms of image-text retrieval performance.
arXiv Detail & Related papers (2023-03-10T03:50:44Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% while transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Image Search with Text Feedback by Additive Attention Compositional Learning [1.4395184780210915]
We propose an image-text composition module based on additive attention that can be seamlessly plugged into deep neural networks.
AACL is evaluated on three large-scale datasets (FashionIQ, Fashion200k, and Shopping100k).
arXiv Detail & Related papers (2022-03-08T02:03:49Z)
- DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z)