Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment
Analysis
- URL: http://arxiv.org/abs/2204.07955v2
- Date: Thu, 21 Apr 2022 12:46:38 GMT
- Title: Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment
Analysis
- Authors: Yan Ling, Jianfei Yu, Rui Xia
- Abstract summary: Multimodal Aspect-Based Sentiment Analysis (MABSA) has attracted increasing attention in recent years.
Previous approaches either (i) use separately pre-trained visual and textual models, which ignore the crossmodal alignment or (ii) use vision-grained models pre-trained with general pre-training tasks.
We propose a task-specific Vision-Language Pre-training framework for MABSA (MABSA), which is a unified multimodal encoder-decoder architecture for all the pretraining and downstream tasks.
- Score: 25.482853330324748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As an important task in sentiment analysis, Multimodal Aspect-Based Sentiment
Analysis (MABSA) has attracted increasing attention in recent years. However,
previous approaches either (i) use separately pre-trained visual and textual
models, which ignore the crossmodal alignment or (ii) use vision-language
models pre-trained with general pre-training tasks, which are inadequate to
identify finegrained aspects, opinions, and their alignments across modalities.
To tackle these limitations, we propose a task-specific Vision-Language
Pre-training framework for MABSA (VLPMABSA), which is a unified multimodal
encoder-decoder architecture for all the pretraining and downstream tasks. We
further design three types of task-specific pre-training tasks from the
language, vision, and multimodal modalities, respectively. Experimental results
show that our approach generally outperforms the state-of-the-art approaches on
three MABSA subtasks. Further analysis demonstrates the effectiveness of each
pretraining task. The source code is publicly released at
https://github.com/NUSTM/VLP-MABSA.
Related papers
- A Multi-Task Semantic Decomposition Framework with Task-specific
Pre-training for Few-Shot NER [26.008350261239617]
We propose a Multi-Task Semantic Decomposition Framework via Joint Task-specific Pre-training for few-shot NER.
We introduce two novel pre-training tasks: Demonstration-based Masked Language Modeling (MLM) and Class Contrastive Discrimination.
In the downstream main task, we introduce a multi-task joint optimization framework with the semantic decomposing method, which facilitates the model to integrate two different semantic information for entity classification.
arXiv Detail & Related papers (2023-08-28T12:46:21Z) - Towards All-in-one Pre-training via Maximizing Multi-modal Mutual
Information [77.80071279597665]
We propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training)
Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation.
arXiv Detail & Related papers (2022-11-17T18:59:49Z) - SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM)
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z) - VLM: Task-agnostic Video-Language Model Pre-training for Video
Understanding [78.28397557433544]
We present a task-agnostic multi-modal pre-training approach that can accept either video or text input, or both for a variety of end tasks.
Experimental results show strong performance across a wider range of tasks than any previous methods, often outperforming task-specific pre-training.
arXiv Detail & Related papers (2021-05-20T19:13:27Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z) - Multi-Task Learning for Dense Prediction Tasks: A Survey [87.66280582034838]
Multi-task learning (MTL) techniques have shown promising results w.r.t. performance, computations and/or memory footprint.
We provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision.
arXiv Detail & Related papers (2020-04-28T09:15:50Z) - Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm which directly optimize model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.