Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language
Pretraining?
- URL: http://arxiv.org/abs/2308.12898v2
- Date: Fri, 25 Aug 2023 12:22:53 GMT
- Title: Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language
Pretraining?
- Authors: Fei Wang, Liang Ding, Jun Rao, Ye Liu, Li Shen, Changxing Ding
- Abstract summary: We aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment.
Specifically, we design and release SNARE, the first large-scale multimodal alignment probing benchmark.
- Score: 34.609984453754656
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The multimedia community has shown a significant interest in perceiving and
representing the physical world with multimodal pretrained neural network
models, and among them, vision-language pretraining (VLP) is currently the
most captivating topic. However, few endeavors have been dedicated to
exploring 1) whether essential linguistic knowledge (e.g., semantics and
syntax) can be extracted during VLP, and 2) how such linguistic knowledge
impacts or enhances multimodal alignment. In response, here we aim to
elucidate the impact of comprehensive linguistic knowledge, including semantic
expression and syntactic structure, on multimodal alignment. Specifically, we
design and release SNARE, the first large-scale multimodal alignment probing
benchmark, which detects vital linguistic components, e.g., lexical, semantic,
and syntactic knowledge, through four tasks: Semantic structure,
Negation logic, Attribute ownership, and Relationship composition. Based on our
proposed probing benchmark, our holistic analyses of five advanced VLP models
illustrate that these models: i) show insensitivity toward complex syntactic
structures and rely on content words for sentence comprehension; ii)
demonstrate limited comprehension of combinations of sentences and negations;
iii) face challenges in determining the presence of actions or spatial
relationships within visual information and struggle to verify the
correctness of triple combinations. We make our benchmark and code available
at https://github.com/WangFei-2019/SNARE/.
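To make the probing setup concrete, the sketch below shows the kind of image-text matching check such a benchmark builds on: an image is paired with a correct caption and a minimally perturbed distractor (here, swapped attributes), and a vision-language model passes if it scores the correct caption higher. This is a minimal illustration assuming a generic CLIP checkpoint from HuggingFace transformers, not the released SNARE code; the checkpoint name, image path, and captions are placeholders.

```python
# Minimal alignment-probing sketch (illustrative; not the official SNARE code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed generic checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def probe(image_path: str, correct: str, distractor: str) -> bool:
    """Return True if the model scores the correct caption above the distractor."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[correct, distractor], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1 image, 2 captions)
    return bool(logits[0, 0] > logits[0, 1])

# Hypothetical attribute-ownership style probe: the distractor swaps the attributes.
print(probe("example.jpg",
            correct="a black dog next to a white cat",
            distractor="a white dog next to a black cat"))
```

A full probe would aggregate such comparisons over many image-caption pairs per task and report accuracy against chance.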
Related papers
- HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model [9.762722976833581]
Current models rely extensively on instance-level alignment between video and language modalities.
We take inspiration from human perception and explore a compositional approach to egocentric video representation.
arXiv Detail & Related papers (2024-06-01T05:41:12Z)
- Resolving Word Vagueness with Scenario-guided Adapter for Natural Language Inference [24.58277380514406]
Natural Language Inference (NLI) is a crucial task in natural language processing.
We propose an innovative ScenaFuse adapter that simultaneously integrates large-scale pre-trained linguistic knowledge and relevant visual information.
Our approach bridges the gap between language and vision, leading to improved understanding and inference capabilities in NLI tasks.
arXiv Detail & Related papers (2024-05-21T01:19:52Z)
- A semantically enhanced dual encoder for aspect sentiment triplet extraction [0.7291396653006809]
Aspect sentiment triplet extraction (ASTE) is a crucial subtask of aspect-based sentiment analysis (ABSA).
Previous research has focused on enhancing ASTE through innovative table-filling strategies.
We propose a framework that leverages both a basic encoder, primarily based on BERT, and a particular encoder comprising a Bi-LSTM network and a graph convolutional network (GCN); a hedged sketch of this dual-encoder layout appears after the list.
Experiments conducted on benchmark datasets demonstrate the state-of-the-art performance of our proposed framework.
arXiv Detail & Related papers (2023-06-14T09:04:14Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals for general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and a convolutional neural network, respectively (a sketch of this text/image encoding split appears after the list).
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both the vision and language branches to improve alignment between the vision and language representations (a minimal sketch of the coupled-prompt idea appears after the list).
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning [97.10875695679499]
We propose ERICA, a novel contrastive learning framework used in the pre-training phase to obtain a deeper understanding of entities and their relations in text.
Experimental results demonstrate that our proposed ERICA framework achieves consistent improvements on several document-level language understanding tasks.
arXiv Detail & Related papers (2020-12-30T03:35:22Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
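The dual-encoder ASTE entry above describes a basic encoder built on BERT plus a particular encoder combining a Bi-LSTM and a graph convolutional network. The sketch below shows one plausible way to wire such a dual encoder in PyTorch; the hidden sizes, the single GCN layer, the fusion step, and the omitted triplet-decoding head are assumptions, not that paper's exact design.

```python
# Hedged sketch of a BERT + (Bi-LSTM -> GCN) dual encoder; dims and fusion are assumed.
import torch
import torch.nn as nn
from transformers import BertModel

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: aggregate neighbours via a row-normalised adjacency."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (batch, seq, dim); adj: (batch, seq, seq), e.g. from a dependency parse
        return torch.relu(self.linear(adj @ x))

class DualEncoder(nn.Module):
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # basic encoder
        self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True,
                              bidirectional=True)                   # particular encoder, part 1
        self.gcn = SimpleGCNLayer(hidden)                           # particular encoder, part 2
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, input_ids, attention_mask, adj):
        basic = self.bert(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state
        seq, _ = self.bilstm(basic)
        particular = self.gcn(seq, adj)
        # concatenate the two views; a table-filling/triplet head would follow here
        return self.fuse(torch.cat([basic, particular], dim=-1))
```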
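The Universal Multimodal Representation entry above pairs a Transformer text encoder with a CNN image encoder over images retrieved per sentence. The sketch below illustrates only that encoding split; the topic-image retrieval step, the fusion with the downstream NLP model, and all module sizes are assumptions.

```python
# Hedged sketch: Transformer for text, CNN (ResNet-18) for retrieved images.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TextImageEncoder(nn.Module):
    def __init__(self, vocab_size: int = 30000, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.text_enc = nn.TransformerEncoder(layer, num_layers=2)
        cnn = resnet18(weights=None)       # untrained backbone, for illustration only
        cnn.fc = nn.Identity()             # keep the 512-d pooled features
        self.image_enc = cnn
        self.image_proj = nn.Linear(512, dim)

    def forward(self, token_ids, images):
        # token_ids: (batch, seq); images: (num_retrieved, 3, H, W) for the batch
        text = self.text_enc(self.embed(token_ids))      # (batch, seq, dim)
        img = self.image_proj(self.image_enc(images))    # (num_retrieved, dim)
        return text, img                                 # downstream fusion omitted
```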
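The MaPLe entry above learns prompts in both the language and vision branches rather than in one branch alone. The sketch below captures only that core idea, with vision prompts generated from the language prompts by a small coupling layer; the prompt length, dimensions, and how the prompts are injected into a specific backbone are assumptions.

```python
# Hedged sketch of coupled multi-modal prompts; backbone integration is omitted.
import torch
import torch.nn as nn

class MultiModalPrompts(nn.Module):
    def __init__(self, n_prompts: int = 4, text_dim: int = 512, vision_dim: int = 768):
        super().__init__()
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, text_dim) * 0.02)
        # vision prompts are projected from the text prompts so the branches stay coupled
        self.couple = nn.Linear(text_dim, vision_dim)

    def forward(self, text_tokens, vision_tokens):
        # text_tokens: (batch, seq, text_dim); vision_tokens: (batch, patches, vision_dim)
        b = text_tokens.size(0)
        t = self.text_prompts.unsqueeze(0).expand(b, -1, -1)
        v = self.couple(self.text_prompts).unsqueeze(0).expand(b, -1, -1)
        return (torch.cat([t, text_tokens], dim=1),
                torch.cat([v, vision_tokens], dim=1))
```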