Probing Cross-modal Semantics Alignment Capability from the Textual
Perspective
- URL: http://arxiv.org/abs/2210.09550v1
- Date: Tue, 18 Oct 2022 02:55:58 GMT
- Title: Probing Cross-modal Semantics Alignment Capability from the Textual
Perspective
- Authors: Zheng Ma, Shi Zong, Mianzhi Pan, Jianbing Zhang, Shujian Huang, Xinyu
Dai and Jiajun Chen
- Abstract summary: Aligning cross-modal semantics is claimed to be one of the essential capabilities of vision and language pre-training models.
We propose a new probing method that is based on image captioning to first empirically study the cross-modal semantics alignment of VLP models.
- Score: 52.52870614418373
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In recent years, vision and language pre-training (VLP) models have advanced
the state-of-the-art results in a variety of cross-modal downstream tasks.
Aligning cross-modal semantics is claimed to be one of the essential
capabilities of VLP models. However, the inner working mechanism of alignment
in VLP models remains unclear. In this paper, we propose a new
probing method that is based on image captioning to first empirically study the
cross-modal semantics alignment of VLP models. Our probing method is built upon
the fact that, given an image-caption pair, a VLP model produces a score
indicating how well the two modalities are aligned; maximizing this score
generates sentences that the model believes are well aligned. Analyzing these
sentences thus reveals in what way different modalities are aligned and how
good these alignments are in VLP models. We apply our probing method to
five popular VLP models, including UNITER, ROSITA, ViLBERT, CLIP, and LXMERT,
and provide a comprehensive analysis of the generated captions guided by these
models. Our results show that VLP models (1) focus more on just aligning
objects with visual words, while neglecting global semantics; (2) prefer fixed
sentence patterns, thus ignoring more important textual information including
fluency and grammar; and (3) deem that captions with more visual words are
better aligned with images. These findings indicate that VLP models still have
weaknesses in cross-modal semantics alignment and we hope this work will draw
researchers' attention to such problems when designing a new VLP model.
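The core of the probing method is the alignment score a VLP model assigns to an image-caption pair; captions are then generated by maximizing that score. Below is a minimal sketch of the scoring step, using CLIP (one of the five probed models) through the Hugging Face transformers API; the checkpoint name, image path, candidate captions, and the simple ranking loop are illustrative assumptions, not the paper's actual search procedure.
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative setup: any CLIP checkpoint works; the paper probes CLIP among others.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def alignment_score(image: Image.Image, caption: str) -> float:
    """Image-text alignment score: CLIP's scaled cosine similarity."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.item()

image = Image.open("example.jpg")  # placeholder image path
candidates = [
    "a dog is running on a sandy beach",  # fluent caption
    "dog beach sand waves ocean sky",     # bag of visual words, no grammar
    "a photo of something outdoors",      # fluent but vague
]

# Rank candidate captions by alignment score; the paper's probing method instead
# searches the caption space to maximize this score and analyzes the resulting text.
for cap in sorted(candidates, key=lambda c: alignment_score(image, cap), reverse=True):
    print(f"{alignment_score(image, cap):6.2f}  {cap}")
```
If the bag-of-visual-words candidate scores close to, or above, the fluent caption, that is the kind of symptom findings (2) and (3) describe.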
Related papers
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z) - Zero-shot Visual Question Answering with Language Model Feedback [83.65140324876536]
We propose a language model guided captioning approach, LAMOC, for knowledge-based visual question answering (VQA)
Our approach employs the captions generated by a captioning model as the context for an answer prediction model, which is a Pre-trained Language Model (PLM)
arXiv Detail & Related papers (2023-05-26T15:04:20Z) - Position-guided Text Prompt for Vision-Language Pre-training [121.15494549650548]
We propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with Vision-Language Pre-Training.
PTP reformulates the visual grounding task into a fill-in-the-blank problem: given a position-guided text prompt, the model is encouraged to predict the objects in the given blocks or to regress the blocks of a given object.
PTP achieves results comparable to object-detector-based methods, with much faster inference, since PTP discards its object detector at inference time while the latter cannot.
arXiv Detail & Related papers (2022-12-19T18:55:43Z) - Learning by Hallucinating: Vision-Language Pre-training with Weak
Supervision [6.8582563015193]
Weakly-supervised vision-language pre-training aims at learning cross-modal alignment with little or no paired data.
Recent methods, which pair visual features with object tags, help achieve performances comparable with some models trained with aligned pairs in various V-L downstream tasks.
We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH)
WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities.
arXiv Detail & Related papers (2022-10-24T20:30:55Z) - Counterfactually Measuring and Eliminating Social Bias in
Vision-Language Pre-training Models [13.280828458515062]
We introduce a counterfactual-based bias measurement, CounterBias, to quantify the social bias in Vision-Language Pre-training models.
We also construct a novel VL-Bias dataset including 24K image-text pairs for measuring gender bias.
arXiv Detail & Related papers (2022-07-03T14:39:32Z) - VL-CheckList: Evaluating Pre-trained Vision-Language Models with
Objects, Attributes and Relations [28.322824790738768]
Vision-Language Pretraining models have successfully facilitated many cross-modal downstream tasks.
Most existing works evaluated their systems by comparing the fine-tuned downstream task performance.
Inspired by the CheckList for testing natural language processing, we introduce VL-CheckList, a novel evaluation framework.
arXiv Detail & Related papers (2022-07-01T06:25:53Z) - VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix [59.25846149124199]
This paper proposes a data augmentation method, namely cross-modal CutMix (CMC).
CMC transforms natural sentences from the textual view into a multi-modal view.
By attaching cross-modal noise on uni-modal data, it guides models to learn token-level interactions across modalities for better denoising.
arXiv Detail & Related papers (2022-06-17T17:56:47Z) - Word Shape Matters: Robust Machine Translation with Visual Embedding [78.96234298075389]
We introduce a new encoding of the input symbols for character-level NLP models.
It encodes the shape of each character through images of the letters as they appear when printed.
We name this new strategy visual embedding, and it is expected to improve the robustness of NLP models.
arXiv Detail & Related papers (2020-10-20T04:08:03Z)
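The visual embedding idea in the last entry ("Word Shape Matters") can be sketched as rendering each character to a small bitmap and using the normalized pixels as its vector, so visually similar characters receive nearby embeddings. The glyph size, font, and cosine check below are illustrative assumptions, not that paper's actual encoder.
```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def glyph_embedding(ch: str, size: int = 16) -> np.ndarray:
    """Render one character to a size x size grayscale bitmap and flatten it
    into a unit-range vector (an illustrative 'visual embedding')."""
    img = Image.new("L", (size, size), color=0)  # black canvas
    ImageDraw.Draw(img).text((2, 2), ch, fill=255, font=ImageFont.load_default())
    return np.asarray(img, dtype=np.float32).flatten() / 255.0

# Visually similar characters should yield similar vectors.
a, b = glyph_embedding("o"), glyph_embedding("0")
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
print(f"cosine similarity between 'o' and '0' glyphs: {cosine:.3f}")
```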