TokenFlow: Rethinking Fine-grained Cross-modal Alignment in
Vision-Language Retrieval
- URL: http://arxiv.org/abs/2209.13822v1
- Date: Wed, 28 Sep 2022 04:11:05 GMT
- Title: TokenFlow: Rethinking Fine-grained Cross-modal Alignment in
Vision-Language Retrieval
- Authors: Xiaohan Zou, Changqiao Wu, Lele Cheng, Zhongyuan Wang
- Abstract summary: We devise a new model-agnostic formulation for fine-grained cross-modal alignment.
Inspired by optimal transport theory, we introduce emphTokenFlow, an instantiation of the proposed scheme.
- Score: 30.429340065755436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing methods in vision-language retrieval match two modalities by
either comparing their global feature vectors which misses sufficient
information and lacks interpretability, detecting objects in images or videos
and aligning the text with fine-grained features which relies on complicated
model designs, or modeling fine-grained interaction via cross-attention upon
visual and textual tokens which suffers from inferior efficiency. To address
these limitations, some recent works simply aggregate the token-wise
similarities to achieve fine-grained alignment, but they lack intuitive
explanations as well as neglect the relationships between token-level features
and global representations with high-level semantics. In this work, we rethink
fine-grained cross-modal alignment and devise a new model-agnostic formulation
for it. We additionally demystify the recent popular works and subsume them
into our scheme. Furthermore, inspired by optimal transport theory, we
introduce \emph{TokenFlow}, an instantiation of the proposed scheme. By
modifying only the similarity function, the performance of our method is
comparable to the SoTA algorithms with heavy model designs on major video-text
retrieval benchmarks. The visualization further indicates that \emph{TokenFlow}
successfully leverages the fine-grained information and achieves better
interpretability.
Related papers
- FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers.
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
arXiv Detail & Related papers (2024-08-09T02:16:37Z) - VisMin: Visual Minimal-Change Understanding [7.226130826257802]
We introduce a new, challenging benchmark termed textbfVisual textbfMinimal-Change Understanding (VisMin)
VisMin requires models to predict the correct image-caption match given two images and two captions.
We generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks.
arXiv Detail & Related papers (2024-07-23T18:10:43Z) - Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment [40.63340635482609]
Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner.
We advocate for assigning distinct contributions for each text token based on its visual correlation.
We introduce Contrastive ALignment (CAL), a simple yet effective re-weighting strategy that prioritizes training visually correlated tokens.
arXiv Detail & Related papers (2024-05-28T06:44:13Z) - Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN)
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z) - Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z) - What Are You Token About? Dense Retrieval as Distributions Over the
Vocabulary [68.77983831618685]
We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space.
We show that the resulting projections contain rich semantic information, and draw connection between them and sparse retrieval.
arXiv Detail & Related papers (2022-12-20T16:03:25Z) - Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z) - SA-VQA: Structured Alignment of Visual and Semantic Representations for
Visual Question Answering [29.96818189046649]
We propose to apply structured alignments, which work with graph representation of visual and textual content.
As demonstrated in our experimental results, such a structured alignment improves reasoning performance.
The proposed model, without any pretraining, outperforms the state-of-the-art methods on GQA dataset, and beats the non-pretrained state-of-the-art methods on VQA-v2 dataset.
arXiv Detail & Related papers (2022-01-25T22:26:09Z) - Towards Efficient Scene Understanding via Squeeze Reasoning [71.1139549949694]
We propose a novel framework called Squeeze Reasoning.
Instead of propagating information on the spatial map, we first learn to squeeze the input feature into a channel-wise global vector.
We show that our approach can be modularized as an end-to-end trained block and can be easily plugged into existing networks.
arXiv Detail & Related papers (2020-11-06T12:17:01Z) - Logic Constrained Pointer Networks for Interpretable Textual Similarity [11.142649867439406]
We introduce a novel pointer network based model with a sentinel gating function to align constituent chunks.
We improve this base model with a loss function to equally penalize misalignments in both sentences, ensuring the alignments are bidirectional.
The model achieves an F1 score of 97.73 and 96.32 on the benchmark SemEval datasets for the chunk alignment task.
arXiv Detail & Related papers (2020-07-15T13:01:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.