Related papers: Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

URL: http://arxiv.org/abs/2011.15124v2
Date: Sun, 30 May 2021 23:37:58 GMT
Title: Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
Authors: Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, Desmond Elliott
Abstract summary: Methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We conduct controlled experiments to discern the empirical differences between five V&L BERTs.
Score: 57.74359320513427
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorised into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five V&L BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.

Related papers

MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models.<n>MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
Knowledge-Decoupled Synergetic Learning: An MLLM based Collaborative Approach to Few-shot Multimodal Dialogue Intention Recognition [17.790383360652704]
Training for few-shot multimodal dialogue intention recognition involves two interconnected tasks. This phenomenon is attributed to knowledge interference stemming from the superposition of weight matrix updates during the training process. We propose Knowledge-Decoupled Synergetic Learning, which transforms knowledge into interpretable rules, while applying the post-training of larger models.
arXiv Detail & Related papers (2025-03-06T08:28:44Z)
Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework [58.362064122489166]
This paper introduces the Cross-modal Few-Shot Learning task, which aims to recognize instances from multiple modalities when only a few labeled examples are available. We propose a Generative Transfer Learning framework consisting of two stages: the first involves training on abundant unimodal data, and the second focuses on transfer learning to adapt to novel data. Our finds demonstrate that GTL has superior performance compared to state-of-the-art methods across four distinct multi-modal datasets.
arXiv Detail & Related papers (2024-10-14T16:09:38Z)
Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition [83.13280812128411]
Recent studies have uncovered intriguing phenomena in deep learning, such as grokking, double descent, and emergent abilities in large language models. We present a comprehensive framework that provides a unified view of these three phenomena, focusing on the competition between memorization and generalization circuits.
arXiv Detail & Related papers (2024-02-23T08:14:36Z)
Towards Understanding How Transformers Learn In-context Through a Representation Learning Lens [9.590540796223715]
In this paper, we attempt to explore the in-context learning process in Transformers through a lens of representation learning. The ICL inference process of the attention layer aligns with the training procedure of its dual model, generating token representation predictions. We extend our theoretical conclusions to more complicated scenarios, including one Transformer layer and multiple attention layers.
arXiv Detail & Related papers (2023-10-20T01:55:34Z)
BERT-ERC: Fine-tuning BERT is Enough for Emotion Recognition in Conversation [19.663265448700002]
Previous works on emotion recognition in conversation (ERC) follow a two-step paradigm. We propose a novel paradigm, i.e., exploring contextual information and dialogue structure information in the fine-tuning step. We develop our model BERT-ERC according to the proposed paradigm, which improves ERC performance in three aspects.
arXiv Detail & Related papers (2023-01-17T08:03:32Z)
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language. We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language. We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
A Multi-level Supervised Contrastive Learning Framework for Low-Resource Natural Language Inference [54.678516076366506]
Natural Language Inference (NLI) is a growingly essential task in natural language understanding. Here we propose a multi-level supervised contrastive learning framework named MultiSCL for low-resource natural language inference.
arXiv Detail & Related papers (2022-05-31T05:54:18Z)
Image Difference Captioning with Pre-training and Contrastive Learning [45.59621065755761]
The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language. The major challenges of this task lie in two aspects: 1) fine-grained visual differences that require learning stronger vision and language association and 2) high-cost of manual annotations. We propose a new modeling framework following the pre-training-finetuning paradigm to address these challenges.
arXiv Detail & Related papers (2022-02-09T06:14:22Z)
MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model textbfMEmoBERT for multimodal emotion recognition. Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction. Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research. We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training. Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.