ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions
- URL: http://arxiv.org/abs/2405.19226v1
- Date: Wed, 29 May 2024 16:06:21 GMT
- Title: ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions
- Authors: Honglin Lin, Siyu Li, Guoshun Nan, Chaoyue Tang, Xueting Wang, Jingxin Xu, Rong Yankai, Zhili Zhou, Yutong Gao, Qimei Cui, Xiaofeng Tao
- Abstract summary: Image retrieval from contextual descriptions (IRCD) aims to identify an image within a set of minimally contrastive candidates based on linguistically complex text.
We propose ContextBLIP, a method that relies on a doubly contextual alignment scheme for challenging IRCD.
We observe that ContextBLIP yields results comparable to GPT-4V despite using about 7,500 times fewer parameters.
- Score: 17.934227561793474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image retrieval from contextual descriptions (IRCD) aims to identify an image within a set of minimally contrastive candidates based on linguistically complex text. Despite the success of vision-language models (VLMs), they still lag significantly behind human performance in IRCD. The main challenge lies in aligning key contextual cues across the two modalities, where these subtle cues are concealed in tiny regions of multiple contrastive images and within the complex linguistics of the textual descriptions. This motivates us to propose ContextBLIP, a simple yet effective method that relies on a doubly contextual alignment scheme for challenging IRCD. Specifically, 1) our model comprises a multi-scale adapter, a matching loss, and a text-guided masking loss. The adapter learns to capture fine-grained visual cues. The two losses provide iterative supervision for the adapter, gradually aligning the focal patches of a single image with the key textual cues. We term this intra-contextual alignment. 2) ContextBLIP then employs an inter-context encoder to learn dependencies among candidates, facilitating alignment between the text and multiple images. We term this step inter-contextual alignment. Consequently, the nuanced cues concealed in each modality can be effectively aligned. Experiments on two benchmarks show the superiority of our method. We observe that ContextBLIP yields results comparable to GPT-4V despite using about 7,500 times fewer parameters.
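To make the two alignment stages concrete, below is a minimal PyTorch sketch of how the described components could interact: a multi-scale adapter supervised by a matching loss and a text-guided masking loss (intra-contextual alignment), followed by an inter-context encoder that contextualizes the candidate set before scoring (inter-contextual alignment). All module names, dimensions, loss forms, and the stand-in random features are illustrative assumptions, not the released ContextBLIP implementation, which builds on BLIP encoders.

```python
# Hedged sketch of a doubly contextual alignment objective (not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleAdapter(nn.Module):
    """Hypothetical adapter that mixes fine (patch-level) and coarse (pooled) cues."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.fine = nn.Linear(dim, dim)    # patch-level branch
        self.coarse = nn.Linear(dim, dim)  # pooled, image-level branch

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, dim) visual tokens from a frozen image encoder
        fine = self.fine(patches)
        coarse = self.coarse(patches.mean(dim=1, keepdim=True))
        return fine + coarse  # broadcast coarse cues over all patches


def intra_contextual_loss(patches, text_emb, masked_target, adapter):
    """Matching loss + text-guided masking loss for single images (sketch)."""
    refined = adapter(patches)                       # (B, P, D)
    image_emb = refined.mean(dim=1)                  # (B, D)

    # Image-text matching: in-batch contrastive objective over paired samples.
    logits = image_emb @ text_emb.t() / 0.07         # (B, B)
    targets = torch.arange(logits.size(0))
    matching = F.cross_entropy(logits, targets)

    # Text-guided masking: weight patches by their relevance to the text and
    # reconstruct the (hypothetically) masked patch features.
    attn = torch.softmax(refined @ text_emb.unsqueeze(-1) / refined.size(-1) ** 0.5, dim=1)
    masking = F.mse_loss(attn * refined, masked_target)
    return matching + masking


def inter_contextual_scores(candidate_embs, text_emb, inter_encoder):
    """Contextualize the candidate set, then score the text against each image."""
    # candidate_embs: (num_candidates, D), treated as one sequence.
    contextual = inter_encoder(candidate_embs.unsqueeze(0)).squeeze(0)
    return contextual @ text_emb  # (num_candidates,)


if __name__ == "__main__":
    B, P, D, N = 4, 16, 256, 10  # batch, patches per image, feature dim, candidates
    adapter = MultiScaleAdapter(D)
    inter_encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2
    )
    loss = intra_contextual_loss(
        torch.randn(B, P, D), torch.randn(B, D), torch.randn(B, P, D), adapter
    )
    scores = inter_contextual_scores(torch.randn(N, D), torch.randn(D), inter_encoder)
    print(loss.item(), scores.argmax().item())
```

In an actual system, `patches` and `text_emb` would come from frozen BLIP image and text encoders rather than random tensors, and the two losses would be applied iteratively during training rather than in a single summed step.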
Related papers
- TMCIR: Token Merge Benefits Composed Image Retrieval [13.457620649082504]
Composed Image Retrieval (CIR) retrieves target images using a multi-modal query that combines a reference image with text describing desired modifications (a generic sketch of this setup follows the entry).
Current cross-modal feature fusion approaches for CIR exhibit an inherent bias in intention interpretation.
We propose a novel framework that advances composed image retrieval through two key innovations.
arXiv Detail & Related papers (2025-04-15T09:14:04Z)
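As a generic illustration of the composed-query setup described in the entry above: a reference-image embedding and a modification-text embedding are fused into a single query, and candidate images are ranked by cosine similarity. The additive fusion, function names, and dimensions below are placeholders, not TMCIR's token-merge mechanism.

```python
# Generic composed image retrieval (CIR) sketch; the fusion is a naive placeholder.
import torch
import torch.nn.functional as F


def compose_query(ref_image_emb: torch.Tensor, mod_text_emb: torch.Tensor) -> torch.Tensor:
    """Naive fusion: sum the two modalities (assumed to share an embedding dim)."""
    return F.normalize(ref_image_emb + mod_text_emb, dim=-1)


def retrieve(query: torch.Tensor, gallery_embs: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return indices of the top-k gallery images by cosine similarity."""
    sims = F.normalize(gallery_embs, dim=-1) @ query
    return sims.topk(k).indices


if __name__ == "__main__":
    dim, gallery_size = 512, 1000
    query = compose_query(torch.randn(dim), torch.randn(dim))
    print(retrieve(query, torch.randn(gallery_size, dim)))
```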
- Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training [30.071860810401933]
This paper advances contrastive language-image pre-training (CLIP) into a novel holistic paradigm.
We use image-to-text captioning to generate multi-texts for each image, from multiple perspectives, granularities, and hierarchies.
Our holistic CLIP significantly outperforms existing CLIP on image-text retrieval, open-vocabulary classification, and dense visual tasks.
arXiv Detail & Related papers (2024-11-30T11:27:58Z)
- SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment [11.556516260190737]
Multimodal alignment between language and vision is a fundamental topic in current vision-language model research.
This paper builds on Contrastive Captioners (CoCa), which integrate Contrastive Language-Image Pretraining (CLIP) and Image Captioning (IC) into a unified framework.
arXiv Detail & Related papers (2024-01-04T08:42:36Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- Augmenters at SemEval-2023 Task 1: Enhancing CLIP in Handling Compositionality and Ambiguity for Zero-Shot Visual WSD through Prompt Augmentation and Text-To-Image Diffusion [7.708214550816408]
This paper describes our zero-shot approaches for the Visual Word Sense Disambiguation Task in English.
Our preliminary study shows that the simple baseline of matching candidate images with the phrase using CLIP (sketched after this entry) suffers from the many-to-many nature of image-text pairs.
We find that the CLIP text encoder may have limited ability to capture compositionality in natural language.
arXiv Detail & Related papers (2023-07-09T22:39:37Z)
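For reference, the CLIP-matching baseline mentioned in the Visual WSD entry above can be sketched with the Hugging Face CLIP wrapper as follows; the prompt-augmentation and text-to-image diffusion components of that paper are not shown, and the checkpoint and file paths are illustrative choices.

```python
# Baseline CLIP matching for Visual WSD: score each candidate image against the
# ambiguous phrase and pick the best. Not the paper's full pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def rank_candidates(phrase: str, image_paths: list[str]) -> int:
    """Return the index of the candidate image that best matches the phrase."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[phrase], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text: (num_texts, num_images) similarity scores.
    return outputs.logits_per_text.argmax(dim=-1).item()


# Example usage (paths are placeholders):
# best = rank_candidates("angora cat", ["cand0.jpg", "cand1.jpg", "cand2.jpg"])
```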
- PV2TEA: Patching Visual Modality to Textual-Established Information Extraction [59.76117533540496]
We patch the visual modality to the textual-established attribute information extractor.
PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes.
Empirical results on real-world e-Commerce datasets demonstrate up to an 11.74% absolute (20.97% relative) F1 increase over unimodal baselines.
arXiv Detail & Related papers (2023-06-01T05:39:45Z)
- Exploring Diverse In-Context Configurations for Image Captioning [39.54017777410428]
In this paper, we explore the effects of varying configurations on Vision-Language (VL) in-context learning.
We devised four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning.
Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning.
arXiv Detail & Related papers (2023-05-24T06:52:47Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods that employ visual information as assistant signals for general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and a convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment (a minimal sketch of such alignment follows this entry).
Our proposed framework significantly outperforms prior state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
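As a rough illustration of what text-to-pixel alignment means in the CRIS entry above, the sketch below projects a sentence embedding and per-pixel features into a shared space and supervises the resulting similarity map with the segmentation mask. The module, dimensions, and loss choice are assumptions, not the actual CRIS architecture.

```python
# Minimal text-to-pixel contrastive alignment sketch (not the CRIS code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextToPixelAlignment(nn.Module):
    def __init__(self, text_dim: int = 512, pixel_dim: int = 256, proj_dim: int = 128):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, proj_dim)
        self.pixel_proj = nn.Conv2d(pixel_dim, proj_dim, kernel_size=1)

    def forward(self, text_emb, pixel_feats, mask):
        # text_emb: (B, text_dim); pixel_feats: (B, pixel_dim, H, W); mask: (B, H, W) in {0, 1}
        t = F.normalize(self.text_proj(text_emb), dim=-1)      # (B, D)
        p = F.normalize(self.pixel_proj(pixel_feats), dim=1)   # (B, D, H, W)
        sim = torch.einsum("bd,bdhw->bhw", t, p) / 0.07        # per-pixel similarity logits
        # Pixels inside the referred region are positives, the rest negatives.
        return F.binary_cross_entropy_with_logits(sim, mask.float())


if __name__ == "__main__":
    loss = TextToPixelAlignment()(
        torch.randn(2, 512), torch.randn(2, 256, 14, 14), (torch.rand(2, 14, 14) > 0.5).long()
    )
    print(loss.item())
```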
- Semantically Self-Aligned Network for Text-to-Image Part-aware Person Re-identification [78.45528514468836]
Text-to-image person re-identification (ReID) aims to search for images containing a person of interest using textual descriptions.
We propose a Semantically Self-Aligned Network (SSAN) to handle the above problems.
To expedite future research in text-to-image ReID, we build a new database named ICFG-PEDES.
arXiv Detail & Related papers (2021-07-27T08:26:47Z)