Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation
- URL: http://arxiv.org/abs/2407.17274v1
- Date: Wed, 24 Jul 2024 13:39:51 GMT
- Title: Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation
- Authors: Yongqi Li, Hongru Cai, Wenjie Wang, Leigang Qu, Yinwei Wei, Wenjie Li, Liqiang Nie, Tat-Seng Chua,
- Abstract summary: Text-to-image retrieval is a fundamental task in multimedia processing.
We propose an autoregressive voken generation method, named AVG.
We show that AVG achieves superior results in both effectiveness and efficiency.
- Score: 90.71613903956451
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image retrieval is a fundamental task in multimedia processing, aiming to retrieve semantically relevant cross-modal content. Traditional studies have typically approached this task as a discriminative problem, matching the text and image via the cross-attention mechanism (one-tower framework) or in a common embedding space (two-tower framework). Recently, generative cross-modal retrieval has emerged as a new research line, which assigns images with unique string identifiers and generates the target identifier as the retrieval target. Despite its great potential, existing generative approaches are limited due to the following issues: insufficient visual information in identifiers, misalignment with high-level semantics, and learning gap towards the retrieval target. To address the above issues, we propose an autoregressive voken generation method, named AVG. AVG tokenizes images into vokens, i.e., visual tokens, and innovatively formulates the text-to-image retrieval task as a token-to-voken generation problem. AVG discretizes an image into a sequence of vokens as the identifier of the image, while maintaining the alignment with both the visual information and high-level semantics of the image. Additionally, to bridge the learning gap between generative training and the retrieval target, we incorporate discriminative training to modify the learning direction during token-to-voken training. Extensive experiments demonstrate that AVG achieves superior results in both effectiveness and efficiency.
Related papers
- Dual-Level Cross-Modal Contrastive Clustering [4.083185193413678]
We propose a novel image clustering framwork, named Dual-level Cross-Modal Contrastive Clustering (DXMC)
external textual information is introduced for constructing a semantic space which is adopted to generate image-text pairs.
The image-text pairs are respectively sent to pre-trained image and text encoder to obtain image and text embeddings which subsquently are fed into four well-designed networks.
arXiv Detail & Related papers (2024-09-06T18:49:45Z) - Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens [28.356269620160937]
We propose a Contextual-Injection Attack (CIA) that employs gradient-based perturbation to inject target tokens into both visual and textual contexts.
CIA enhances the cross-prompt transferability of adversarial images.
arXiv Detail & Related papers (2024-06-19T07:32:55Z) - IR-GAN: Image Manipulation with Linguistic Instruction by Increment
Reasoning [110.7118381246156]
Increment Reasoning Generative Adversarial Network (IR-GAN) aims to reason consistency between visual increment in images and semantic increment in instructions.
First, we introduce the word-level and instruction-level instruction encoders to learn user's intention from history-correlated instructions as semantic increment.
Second, we embed the representation of semantic increment into that of source image for generating target image, where source image plays the role of referring auxiliary.
arXiv Detail & Related papers (2022-04-02T07:48:39Z) - Semantic-Aware Generation for Self-Supervised Visual Representation
Learning [116.5814634936371]
We advocate for Semantic-aware Generation (SaGe) to facilitate richer semantics rather than details to be preserved in the generated image.
SaGe complements the target network with view-specific features and thus alleviates the semantic degradation brought by intensive data augmentations.
We execute SaGe on ImageNet-1K and evaluate the pre-trained models on five downstream tasks including nearest neighbor test, linear classification, and fine-scaled image recognition.
arXiv Detail & Related papers (2021-11-25T16:46:13Z) - FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z) - SSKD: Self-Supervised Knowledge Distillation for Cross Domain Adaptive
Person Re-Identification [25.96221714337815]
Domain adaptive person re-identification (re-ID) is a challenging task due to the large discrepancy between the source domain and the target domain.
Existing methods mainly attempt to generate pseudo labels for unlabeled target images by clustering algorithms.
We propose a Self-Supervised Knowledge Distillation (SSKD) technique containing two modules, the identity learning and the soft label learning.
arXiv Detail & Related papers (2020-09-13T10:12:02Z) - Tasks Integrated Networks: Joint Detection and Retrieval for Image
Search [99.49021025124405]
In many real-world searching scenarios (e.g., video surveillance), the objects are seldom accurately detected or annotated.
We first introduce an end-to-end Integrated Net (I-Net), which has three merits.
We further propose an improved I-Net, called DC-I-Net, which makes two new contributions.
arXiv Detail & Related papers (2020-09-03T03:57:50Z) - Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets new state-of-the-arts on all these settings, demonstrating well its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.