Enhance Multimodal Model Performance with Data Augmentation: Facebook
Hateful Meme Challenge Solution
- URL: http://arxiv.org/abs/2105.13132v1
- Date: Tue, 25 May 2021 01:07:09 GMT
- Title: Enhance Multimodal Model Performance with Data Augmentation: Facebook
Hateful Meme Challenge Solution
- Authors: Yang Li, Zinc Zhang, Hutchin Huang
- Abstract summary: The Hateful Memes Challenge from Facebook helps fulfill the potential of deep learning for hateful content detection by challenging contestants to detect hateful speech in multimodal memes.
In this paper, we utilize the multimodal pre-trained models ViLBERT and Visual BERT.
Our approach achieved an AUROC of 0.7439 and an accuracy of 0.7037 on the challenge's test set, demonstrating remarkable progress.
- Score: 3.8325907381729496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hateful content detection is one of the areas where deep learning can and
should make a significant difference. The Hateful Memes Challenge from Facebook
helps fulfill such potential by challenging the contestants to detect hateful
speech in multi-modal memes using deep learning algorithms. In this paper, we
utilize the multimodal pre-trained models ViLBERT and Visual BERT. We improved
the models' performance by adding training data generated through data
augmentation. Enlarging the training set gave us a boost of more than 2% in
AUROC with the Visual BERT model. Our approach achieved an AUROC of 0.7439 and
an accuracy of 0.7037 on the challenge's test set, demonstrating remarkable
progress.
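As a rough illustration of the pipeline the abstract describes (augment the training data to enlarge the set, fine-tune, then report AUROC and accuracy), here is a minimal Python sketch. The specific augmentations (horizontal flip, color jitter, a toy caption synonym swap) and every helper name are assumptions for illustration only; they are not the authors' released code.
```python
# Hedged sketch of augmentation-based dataset enlargement plus the two reported
# metrics. All augmentation choices and helper names are illustrative assumptions.
import random
from torchvision import transforms
from sklearn.metrics import roc_auc_score, accuracy_score

# Image-side augmentation: mild changes that keep the meme text legible for the
# vision backbone feeding Visual BERT / ViLBERT.
image_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# Text-side augmentation: a toy synonym swap; back-translation or EDA-style
# edits could play the same role.
SYNONYMS = {"people": "folks", "love": "adore", "hate": "despise"}

def augment_text(caption: str) -> str:
    return " ".join(SYNONYMS.get(w.lower(), w) for w in caption.split())

def enlarge_dataset(samples):
    """samples: list of dicts with 'image' (PIL.Image), 'text' (str), 'label' (0/1).
    Returns the originals plus one augmented copy of each, roughly doubling the set."""
    enlarged = list(samples)
    for s in samples:
        enlarged.append({
            "image": image_aug(s["image"]),
            "text": augment_text(s["text"]),
            "label": s["label"],
        })
    random.shuffle(enlarged)
    return enlarged

def evaluate(y_true, y_score, threshold=0.5):
    """AUROC and accuracy, the two metrics reported on the challenge test set."""
    y_pred = [int(p >= threshold) for p in y_score]
    return roc_auc_score(y_true, y_score), accuracy_score(y_true, y_pred)
```
In this setting, the enlarged samples would feed the ViLBERT / Visual BERT fine-tuning loop, and evaluate would be applied to the models' scores on the challenge test set.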
Related papers
- Cross-Modal Transfer from Memes to Videos: Addressing Data Scarcity in Hateful Video Detection [8.05088621131726]
Video-based hate speech detection remains under-explored, hindered by a lack of annotated datasets and the high cost of video annotation.
We leverage meme datasets as both a substitution and an augmentation strategy for training hateful video detection models.
Our results consistently outperform state-of-the-art benchmarks.
arXiv Detail & Related papers (2025-01-26T07:50:14Z)
- T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs [102.66246727371583]
We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme can boost the performance of long video understanding without training with long video samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the scarcity of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of learning robust feature representations, aiming to generalize well across multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets: Kinetics-400, Kinetics-700, Moments-in-Time, ActivityNet, and Something-Something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z)
- Dynamic Contrastive Distillation for Image-Text Retrieval [90.05345397400144]
We present a novel plug-in dynamic contrastive distillation (DCD) framework to compress image-text retrieval models.
We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e. ViLT and METER.
Experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework.
arXiv Detail & Related papers (2022-07-04T14:08:59Z)
- Hateful Memes Challenge: An Enhanced Multimodal Framework [0.0]
The Hateful Memes Challenge proposed by Facebook AI has attracted contestants around the world.
Various state-of-the-art deep learning models have been applied to this problem.
In this paper, we enhance the hateful meme detection framework, including utilizing Detectron for feature extraction.
arXiv Detail & Related papers (2021-12-20T07:47:17Z)
- Classification of Multimodal Hate Speech -- The Winning Solution of Hateful Memes Challenge [0.0]
Hateful Memes is a new challenge set for multimodal classification.
Difficult examples are added to the dataset to make it hard to rely on unimodal signals.
I propose a new model that combines multimodal features with rules, which achieved first place with an accuracy of 86.8% and an AUROC of 0.923.
arXiv Detail & Related papers (2020-12-02T07:38:26Z)
- Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning [0.0]
We propose a novel transfer learning method for speech emotion recognition.
With as few as 125 examples per emotion class, we were able to reach higher accuracy than a strong baseline trained on 8 times more data.
arXiv Detail & Related papers (2020-11-11T06:18:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.