MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching
- URL: http://arxiv.org/abs/2402.17237v2
- Date: Wed, 19 Feb 2025 06:39:26 GMT
- Title: MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching
- Authors: Wanqing Cui, Rui Cheng, Jiafeng Guo, Xueqi Cheng,
- Abstract summary: We propose a Multi-view Attention Method (MVAM) for image-text matching.
We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data.
Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
- Score: 65.87255122130188
- License:
- Abstract: Existing two-stream models, such as CLIP, encode images and text through independent representations, showing good performance while ensuring retrieval speed, have attracted attention from industry and academia. However, the single representation often struggles to capture complex content fully. Such models may ignore fine-grained information during matching, resulting in suboptimal retrieval results. To overcome this limitation and enhance the performance of two-stream models, we propose a Multi-view Attention Method (MVAM) for image-text matching. This approach leverages diverse attention heads with unique view codes to learn multiple representations for images and text, which are then concatenated for matching. We also incorporate a diversity objective to explicitly encourage attention heads to focus on distinct aspects of the input data, capturing complementary fine-grained details. This diversity enables the model to represent image-text pairs from multiple perspectives, ensuring a more comprehensive understanding and alignment of critical content. Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance. Our experiments on MSCOCO and Flickr30K demonstrate enhancements over existing models, and further case studies reveal that different attention heads can focus on distinct content, achieving more comprehensive representations.
Related papers
- Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training [30.071860810401933]
This paper advances contrastive language-image pre-training (CLIP) into one novel holistic paradigm.
We use image-to-text captioning to generate multi-texts for each image, from multiple perspectives, granularities, and hierarchies.
Our holistic CLIP significantly outperforms existing CLIP, including image-text retrieval, open-vocabulary classification, and dense visual tasks.
arXiv Detail & Related papers (2024-11-30T11:27:58Z) - Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens.
We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
arXiv Detail & Related papers (2024-05-27T17:59:56Z) - Learning Comprehensive Representations with Richer Self for
Text-to-Image Person Re-Identification [34.289949134802086]
Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text.
Existing methods for TIReID typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view.
We propose a framework, called LCR$2$S, for modeling many-to-many correspondences of the same identity by learning representations for both modalities from a novel perspective.
arXiv Detail & Related papers (2023-10-17T12:39:16Z) - Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the relevant regions to each word by computing the word-conditional visual attention using multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z) - ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View
Semantic Consistency [126.88107868670767]
We propose multi-textbfView textbfConsistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z) - Cross-Media Keyphrase Prediction: A Unified Framework with
Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
arXiv Detail & Related papers (2020-11-03T08:44:18Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.