Unified Lexical Representation for Interpretable Visual-Language Alignment
- URL: http://arxiv.org/abs/2407.17827v2
- Date: Mon, 11 Nov 2024 13:46:50 GMT
- Title: Unified Lexical Representation for Interpretable Visual-Language Alignment
- Authors: Yifan Li, Yikai Wang, Yanwei Fu, Dongyu Ru, Zheng Zhang, Tong He,
- Abstract summary: We introduce LexVLA, a framework for learning a unified lexical representation for both modalities without complex design.
We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability.
We demonstrate that these two pre-trained uni-modal models can be well-aligned by fine-tuning on the modest multi-modal dataset.
- Score: 52.059812317944434
- License:
- Abstract: Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's groundbreaking work. Although CLIP performs well, the typical direct latent feature alignment lacks clarity in its representation and similarity scores. On the other hand, lexical representation, a vector whose element represents the similarity between the sample and a word from the vocabulary, is a natural sparse representation and interpretable, providing exact matches for individual words. However, lexical representations are difficult to learn due to no ground-truth supervision and false-discovery issues, and thus requires complex design to train effectively. In this paper, we introduce LexVLA, a more interpretable VLA framework by learning a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability. To avoid the false discovery, we propose an overuse penalty to refrain the lexical representation from falsely frequently activating meaningless words. We demonstrate that these two pre-trained uni-modal models can be well-aligned by fine-tuning on the modest multi-modal dataset and avoid intricate training configurations. On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M). We conduct extensive experiments to analyze LexVLA. Codes are available at https://github.com/Clementine24/LexVLA.
Related papers
- Towards Realistic Zero-Shot Classification via Self Structural Semantic
Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z) - Visualizing Linguistic Diversity of Text Datasets Synthesized by Large
Language Models [9.808214545408541]
LinguisticLens is a novel inter-active visualization tool for making sense of and analyzing syntactic diversity of datasets.
It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples.
arXiv Detail & Related papers (2023-05-19T00:53:45Z) - STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense presentations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model with +$4.9%$ and +$4.3%$ absolute Recall@1 improvement.
arXiv Detail & Related papers (2023-01-30T17:21:30Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - Is a Caption Worth a Thousand Images? A Controlled Study for
Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z) - Does Vision-and-Language Pretraining Improve Lexical Grounding? [25.357191933430627]
Vision-and-Language models are trained jointly on text and image or video data.
It is not yet known how the internal linguistic representations themselves compare to their text-only counterparts.
arXiv Detail & Related papers (2021-09-21T15:12:39Z) - Does language help generalization in vision models? [0.0]
We show that a visual model trained on a very large supervised image dataset (ImageNet-21k) can be as efficient for generalization as its multimodal counterpart (CLIP)
When compared to other standard visual or language models, the latent representations of BiT-M were found to be just as "linguistic" as those of CLIP.
arXiv Detail & Related papers (2021-04-16T18:54:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.