A New Fine-grained Alignment Method for Image-text Matching
- URL: http://arxiv.org/abs/2311.02183v2
- Date: Thu, 7 Dec 2023 08:48:37 GMT
- Title: A New Fine-grained Alignment Method for Image-text Matching
- Authors: Yang Zhang
- Abstract summary: The Cross-Modal Prominent Fragments Enhancement Aligning Network (CPFEAN) achieves improved retrieval accuracy.
In practice, we first design a novel intra-modal fragment relationship reasoning method.
Our approach outperforms state-of-the-art methods by about 5% to 10% in the rSum metric.
- Score: 4.33417045761714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text retrieval is a widely studied topic in the field of computer
vision due to the exponential growth of multimedia data, whose core concept is
to measure the similarity between images and text. However, most existing
retrieval methods rely heavily on cross-attention mechanisms for cross-modal
fine-grained alignment; these mechanisms take an excessive number of irrelevant
regions into account and treat prominent and non-significant words equally,
thereby limiting retrieval accuracy. This paper investigates an alignment approach that
reduces the involvement of non-significant fragments in images and text while
enhancing the alignment of prominent segments. For this purpose, we introduce
the Cross-Modal Prominent Fragments Enhancement Aligning Network (CPFEAN), which
achieves improved retrieval accuracy by diminishing the participation of
irrelevant regions during alignment and relatively increasing the alignment
similarity of prominent words. Additionally, we incorporate prior textual
information into image regions to reduce misalignment occurrences. In practice,
we first design a novel intra-modal fragment relationship reasoning method,
and subsequently employ our proposed alignment mechanism to compute the
similarity between images and text. Extensive quantitative comparative
experiments on MS-COCO and Flickr30K datasets demonstrate that our approach
outperforms state-of-the-art methods by about 5% to 10% in the rSum metric.
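The abstract specifies neither the exact alignment formulation nor the metric, so the following is a minimal sketch under stated assumptions: the prominence threshold, temperature, and weighting scheme are illustrative choices, not CPFEAN's actual design. The `rsum` helper reflects the standard definition of the metric, the sum of Recall@{1,5,10} over both retrieval directions (600 at most).

```python
import torch
import torch.nn.functional as F

def prominence_weighted_similarity(regions, words, tau=0.1, keep_ratio=0.5):
    """Illustrative sketch (not CPFEAN's actual formulation): compute
    region-word similarities, suppress the least prominent pairs, and
    score the image-text pair from the surviving alignments.

    regions: (n_regions, d) image region features
    words:   (n_words, d)  text word features
    """
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    sim = regions @ words.t()                      # (n_regions, n_words)

    # Keep only the top keep_ratio fraction of region-word pairs;
    # the rest are treated as irrelevant and zeroed out.
    k = max(1, int(keep_ratio * sim.numel()))
    thresh = sim.flatten().topk(k).values.min()
    attn = torch.softmax(sim / tau, dim=1) * (sim >= thresh)

    attended = attn @ words                        # words aligned per region
    return F.cosine_similarity(regions, attended, dim=-1).mean()

def rsum(ranks_i2t, ranks_t2i):
    """rSum = sum of Recall@{1,5,10} for image-to-text and text-to-image.
    ranks_*: 1-based rank of the ground-truth match for each query."""
    recall = lambda ranks, k: 100.0 * sum(r <= k for r in ranks) / len(ranks)
    return sum(recall(r, k) for r in (ranks_i2t, ranks_t2i) for k in (1, 5, 10))
```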
Related papers
- Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation [28.24883865053459]
This paper aims to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations.
Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts.
A text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments.
arXiv Detail & Related papers (2024-04-05T17:25:17Z)
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training [33.78990448307792]
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language.
Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words.
In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
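As a hedged illustration of this coarse-plus-fine idea (not the paper's exact model), a unified score might blend a global embedding similarity with a token-level one; `alpha` and all names here are assumptions:

```python
import torch.nn.functional as F

def unified_similarity(img_global, txt_global, regions, words, alpha=0.5):
    """Illustrative only: combine coarse (global) and fine (token-level)
    similarities into one matching score."""
    coarse = F.cosine_similarity(img_global, txt_global, dim=-1)
    sim = F.normalize(regions, dim=-1) @ F.normalize(words, dim=-1).t()
    fine = sim.max(dim=1).values.mean()   # each region's best-matching word
    return alpha * coarse + (1 - alpha) * fine
```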
arXiv Detail & Related papers (2023-06-15T00:19:13Z)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to retrieve pedestrian images of the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
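The late-interaction mechanism named here is the token-wise max-mean scoring described in the FILIP paper; a compact sketch:

```python
import torch.nn.functional as F

def late_interaction(img_tokens, txt_tokens):
    """FILIP-style late interaction: each token takes its best
    cross-modal match, and scores are averaged per direction."""
    img = F.normalize(img_tokens, dim=-1)   # (n_img_tokens, d)
    txt = F.normalize(txt_tokens, dim=-1)   # (n_txt_tokens, d)
    sim = img @ txt.t()
    i2t = sim.max(dim=1).values.mean()      # image-to-text score
    t2i = sim.max(dim=0).values.mean()      # text-to-image score
    return i2t, t2i
```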
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
- Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
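The abstract does not spell out how ISD is computed; one loose way to picture a consistency measure between the two self-attention structures (an assumption, not the paper's exact construction) is to project the visual self-attention into the word index space via the inter-modal alignment and compare:

```python
import torch

def isd_sketch(attn_vis, attn_txt, v2t):
    """Loose ISD-style sketch (illustrative, not the paper's formula).

    attn_vis: (n_v, n_v) visual self-attention over regions
    attn_txt: (n_t, n_t) textual self-attention over words
    v2t:      (n_v, n_t) inter-modal alignment, rows summing to 1
    """
    projected = v2t.t() @ attn_vis @ v2t    # visual relations in word space
    return torch.norm(projected - attn_txt, p="fro")
```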
arXiv Detail & Related papers (2021-05-28T14:25:49Z)
- Similarity Reasoning and Filtration for Image-Text Matching [85.68854427456249]
We propose a novel Similarity Graph Reasoning and Attention filtration network for image-text matching.
A Similarity Graph Reasoning (SGR) module relying on a graph convolutional neural network is introduced to infer relation-aware similarities from both local and global alignments.
We demonstrate the superiority of the proposed method, achieving state-of-the-art performance on the Flickr30K and MSCOCO datasets.
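A hedged sketch of what SGR-style reasoning could look like, treating local and global alignment similarities as graph nodes refined by a graph-convolution step; dimensions and layout here are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class SimilarityGraphReasoning(nn.Module):
    """Illustrative SGR-style module: similarity vectors are graph nodes,
    edges are learned from node pairs, and one propagation step yields a
    relation-aware matching score."""
    def __init__(self, sim_dim=256):
        super().__init__()
        self.edge = nn.Linear(sim_dim * 2, 1)      # pairwise edge weights
        self.update = nn.Linear(sim_dim, sim_dim)  # node update
        self.score = nn.Linear(sim_dim, 1)

    def forward(self, sim_nodes):
        # sim_nodes: (n_nodes, sim_dim); node 0 is the global similarity.
        n = sim_nodes.size(0)
        pairs = torch.cat([
            sim_nodes.unsqueeze(1).expand(n, n, -1),
            sim_nodes.unsqueeze(0).expand(n, n, -1),
        ], dim=-1)
        adj = torch.softmax(self.edge(pairs).squeeze(-1), dim=-1)  # (n, n)
        nodes = torch.relu(self.update(adj @ sim_nodes))
        return self.score(nodes[0])  # relation-aware matching score
```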
arXiv Detail & Related papers (2021-01-05T06:29:35Z)
- Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders [14.634046503477979]
We present a novel approach called Transformer Reasoning and Alignment Network (TERAN)
TERAN enforces a fine-grained match between the underlying components of images and sentences.
On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% respectively on the image and the sentence retrieval tasks.
arXiv Detail & Related papers (2020-08-12T11:02:40Z)
- Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval [41.505920288928365]
The abundance of multimodal data has inspired interest in cross-modal retrieval methods.
We propose novel within-modality losses which encourage semantic coherency in both the text and image subspaces.
Our method ensures that not only are paired images and texts close, but the expected image-image and text-text relationships are also observed.
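A minimal sketch of a within-modality loss in this spirit, assuming a precomputed semantic neighbor for each item and a triplet-style margin (the paper's actual losses differ in detail):

```python
import torch
import torch.nn.functional as F

def within_modality_neighbor_loss(emb, neighbor_idx, margin=0.2):
    """Illustrative within-modality loss: embeddings of semantically
    similar items from the *same* modality are pulled together.

    emb:          (n, d) embeddings of one modality (all images or all texts)
    neighbor_idx: (n,)  index of a semantic neighbor for each item
    """
    emb = F.normalize(emb, dim=-1)
    pos = (emb * emb[neighbor_idx]).sum(dim=-1)   # neighbor similarities
    perm = torch.randperm(emb.size(0))
    neg = (emb * emb[perm]).sum(dim=-1)           # random negatives (sketch)
    return F.relu(margin - pos + neg).mean()
```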
arXiv Detail & Related papers (2020-07-16T20:32:54Z)
- Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [52.86058031919856]
We propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition.
A global semantic reasoning module (GSRM) is introduced to capture global semantic context through multi-way parallel transmission.
Results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method.
arXiv Detail & Related papers (2020-03-27T09:19:25Z)