Caption Enriched Samples for Improving Hateful Memes Detection
- URL: http://arxiv.org/abs/2109.10649v1
- Date: Wed, 22 Sep 2021 10:57:51 GMT
- Title: Caption Enriched Samples for Improving Hateful Memes Detection
- Authors: Efrat Blaier, Itzik Malkiel, Lior Wolf
- Abstract summary: The hateful meme challenge demonstrates the difficulty of determining whether a meme is hateful or not.
Neither unimodal language models nor multimodal vision-language models reach the human level of performance.
- Score: 78.5136090997431
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently introduced hateful meme challenge demonstrates the difficulty of
determining whether a meme is hateful or not. Specifically, neither unimodal
language models nor multimodal vision-language models reach the human level of
performance. Motivated by the need to model the contrast between the image
content and the overlaid text, we suggest applying an off-the-shelf image
captioning tool in order to capture the former. We demonstrate that the
incorporation of such automatic captions during fine-tuning improves the
results for various unimodal and multimodal models. Moreover, in the unimodal
case, continuing the pre-training of language models on augmented and original
caption pairs is highly beneficial to classification accuracy.
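
As a minimal sketch of the caption-enrichment idea, the snippet below generates an automatic caption for each meme image and pairs it with the overlaid text before fine-tuning a text classifier. The concrete choices are assumptions rather than the paper's reported setup: BLIP served through the Hugging Face image-to-text pipeline stands in for the off-the-shelf captioner, and BERT stands in for the unimodal language model.

```python
# Minimal sketch of caption enrichment (assumptions, not the paper's exact setup):
# BLIP via the Hugging Face "image-to-text" pipeline stands in for the
# off-the-shelf captioner, and BERT stands in for the unimodal language model.
from PIL import Image
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
classifier = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # not hateful / hateful
)

def caption_enrich(image_path: str, meme_text: str):
    """Generate an automatic caption for the meme image and pair it with the
    overlaid text, so the classifier can model the contrast between the two."""
    caption = captioner(Image.open(image_path))[0]["generated_text"]
    return tokenizer(
        caption, meme_text, truncation=True, max_length=128, return_tensors="pt"
    )

# Example (hypothetical file name and text): the enriched encoding would be fed
# to a standard fine-tuning loop (e.g., transformers.Trainer) on the challenge labels.
inputs = caption_enrich("meme_01.png", "overlaid meme text goes here")
logits = classifier(**inputs).logits
```

For the unimodal variant described in the abstract, the same pairs of original meme text and automatically generated captions would also be used to continue language-model pre-training (e.g., with masked-language modeling) before the classification fine-tuning step.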
Related papers
- Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models [63.01630478059315]
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance.
However, the interaction between synthetic captions and the original web-crawled AltTexts during pre-training is still not well understood.
We propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models.
arXiv Detail & Related papers (2024-10-03T17:54:52Z)
- FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers.
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
arXiv Detail & Related papers (2024-08-09T02:16:37Z)
- Reading Is Believing: Revisiting Language Bottleneck Models for Image Classification [4.1205832766381985]
We revisit language bottleneck models as an approach to ensuring the explainability of deep learning models for image classification.
We experimentally show that a language bottleneck model that combines a modern image captioner with a pre-trained language model can achieve image classification accuracy that exceeds that of black-box models.
arXiv Detail & Related papers (2024-06-22T10:49:34Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- Macroscopic Control of Text Generation for Image Captioning [4.742874328556818]
Two novel methods are introduced, each addressing one of two problems in caption generation.
For the former problem, we introduce a control signal that governs macroscopic sentence attributes, such as sentence quality, length, tense, and the number of nouns.
For the latter problem, we propose a strategy in which an image-text matching model is trained to measure the quality of sentences generated in both the forward and backward directions, and the better one is chosen.
arXiv Detail & Related papers (2021-01-20T07:20:07Z)
- Detecting Hate Speech in Multi-modal Memes [14.036769355498546]
We focus on hate speech detection in multi-modal memes, which pose an interesting multi-modal fusion problem.
We aim to solve the Facebook Hateful Memes Challenge (Kiela et al., 2020), a binary classification problem of predicting whether a meme is hateful or not.
arXiv Detail & Related papers (2020-12-29T18:30:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.