Enhance Multimodal Model Performance with Data Augmentation: Facebook
Hateful Meme Challenge Solution
- URL: http://arxiv.org/abs/2105.13132v1
- Date: Tue, 25 May 2021 01:07:09 GMT
- Title: Enhance Multimodal Model Performance with Data Augmentation: Facebook
Hateful Meme Challenge Solution
- Authors: Yang Li, Zinc Zhang, Hutchin Huang
- Abstract summary: The Hateful Memes Challenge from Facebook helps fulfill the potential of deep learning for hateful content detection by challenging contestants to detect hateful speech in multimodal memes.
In this paper, we utilize the multimodal pre-trained models ViLBERT and Visual BERT.
Our approach achieved an AUROC of 0.7439 and an accuracy of 0.7037 on the challenge's test set, demonstrating remarkable progress.
- Score: 3.8325907381729496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hateful content detection is one of the areas where deep learning can and
should make a significant difference. The Hateful Memes Challenge from Facebook
helps fulfill such potential by challenging the contestants to detect hateful
speech in multi-modal memes using deep learning algorithms. In this paper, we
utilize the multimodal pre-trained models ViLBERT and Visual BERT. We improved
the models' performance by adding training data generated through data
augmentation. Enlarging the training set gave us a boost of more than 2% in
AUROC with the Visual BERT model. Our approach achieved an AUROC of 0.7439 and
an accuracy of 0.7037 on the challenge's test set, demonstrating remarkable
progress.
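As a rough illustration of the pipeline the abstract describes (augment the training data to enlarge the set, fine-tune, then report AUROC and accuracy), here is a minimal Python sketch. The specific augmentations (horizontal flip, color jitter, a toy caption synonym swap) and every helper name are assumptions for illustration only; they are not the authors' released code.
```python
# Hedged sketch of augmentation-based dataset enlargement plus the two reported
# metrics. All augmentation choices and helper names are illustrative assumptions.
import random
from torchvision import transforms
from sklearn.metrics import roc_auc_score, accuracy_score

# Image-side augmentation: mild changes that keep the meme text legible for the
# vision backbone feeding Visual BERT / ViLBERT.
image_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# Text-side augmentation: a toy synonym swap; back-translation or EDA-style
# edits could play the same role.
SYNONYMS = {"people": "folks", "love": "adore", "hate": "despise"}

def augment_text(caption: str) -> str:
    return " ".join(SYNONYMS.get(w.lower(), w) for w in caption.split())

def enlarge_dataset(samples):
    """samples: list of dicts with 'image' (PIL.Image), 'text' (str), 'label' (0/1).
    Returns the originals plus one augmented copy of each, roughly doubling the set."""
    enlarged = list(samples)
    for s in samples:
        enlarged.append({
            "image": image_aug(s["image"]),
            "text": augment_text(s["text"]),
            "label": s["label"],
        })
    random.shuffle(enlarged)
    return enlarged

def evaluate(y_true, y_score, threshold=0.5):
    """AUROC and accuracy, the two metrics reported on the challenge test set."""
    y_pred = [int(p >= threshold) for p in y_score]
    return roc_auc_score(y_true, y_score), accuracy_score(y_true, y_pred)
```
In this setting, the enlarged samples would feed the ViLBERT / Visual BERT fine-tuning loop, and evaluate would be applied to the models' scores on the challenge test set.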
Related papers
- Cross-Modal Transfer from Memes to Videos: Addressing Data Scarcity in Hateful Video Detection [8.05088621131726]
Video-based hate speech detection remains under-explored, hindered by a lack of annotated datasets and the high cost of video annotation.
We leverage meme datasets as both a substitution and an augmentation strategy for training hateful video detection models.
Our results consistently outperform state-of-the-art benchmarks.
arXiv Detail & Related papers (2025-01-26T07:50:14Z)
- T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme can boost the performance of long video understanding without training with long video samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the scarcity of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of learning robust feature representations, aiming to generalize well across multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets: Kinetics-400, Kinetics-700, Moments-in-Time, ActivityNet, and Something-Something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z)
- Dynamic Contrastive Distillation for Image-Text Retrieval [90.05345397400144]
We present a novel plug-in dynamic contrastive distillation (DCD) framework to compress image-text retrieval models.
We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e. ViLT and METER.
Experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework.
arXiv Detail & Related papers (2022-07-04T14:08:59Z)
- Hateful Memes Challenge: An Enhanced Multimodal Framework [0.0]
The Hateful Memes Challenge proposed by Facebook AI has attracted contestants around the world.
Various state-of-the-art deep learning models have been applied to this problem.
In this paper, we enhance the hateful meme detection framework, including utilizing Detectron for feature extraction.
arXiv Detail & Related papers (2021-12-20T07:47:17Z)
- Classification of Multimodal Hate Speech -- The Winning Solution of Hateful Memes Challenge [0.0]
Hateful Memes is a new challenge set for multimodal classification.
Difficult examples are added to the dataset to make it hard to rely on unimodal signals.
I propose a new model that combines multimodal features with rules, which achieved first place with an accuracy of 86.8% and an AUROC of 0.923.
arXiv Detail & Related papers (2020-12-02T07:38:26Z)
- Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning [0.0]
We propose a novel transfer learning method for speech emotion recognition.
With as few as 125 examples per emotion class, we were able to reach higher accuracy than a strong baseline trained on 8 times more data.
arXiv Detail & Related papers (2020-11-11T06:18:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.