Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection
- URL: http://arxiv.org/abs/2308.08088v1
- Date: Wed, 16 Aug 2023 01:38:49 GMT
- Title: Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection
- Authors: Rui Cao, Ming Shan Hee, Adriel Kuek, Wen-Haw Chong, Roy Ka-Wei Lee,
Jing Jiang
- Abstract summary: We propose a probing-based captioning approach to leverage PVLMs in a zero-shot visual question answering (VQA) manner.
Specifically, we prompt a frozen PVLM by asking hateful content-related questions and use the answers as image captions.
The good performance of models with Pro-Cap on three benchmarks validates the effectiveness and generalization of the proposed method.
- Score: 17.182722268446604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hateful meme detection is a challenging multimodal task that requires
comprehension of both vision and language, as well as cross-modal interactions.
Recent studies have tried to fine-tune pre-trained vision-language models
(PVLMs) for this task. However, with increasing model sizes, it becomes
important to leverage powerful PVLMs more efficiently, rather than simply
fine-tuning them. Recently, researchers have attempted to convert meme images
into textual captions and prompt language models for predictions. This approach
has shown good performance but suffers from non-informative image captions.
Considering the two factors mentioned above, we propose a probing-based
captioning approach to leverage PVLMs in a zero-shot visual question answering
(VQA) manner. Specifically, we prompt a frozen PVLM by asking hateful
content-related questions and use the answers as image captions (which we call
Pro-Cap), so that the captions contain information critical for hateful content
detection. The good performance of models with Pro-Cap on three benchmarks
validates the effectiveness and generalization of the proposed method.
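
Below is a minimal sketch of how the probing-based captioning step could look, assuming BLIP-2 (via Hugging Face transformers) stands in for the frozen PVLM. The model checkpoint, the probing questions, and the downstream use of the caption are illustrative assumptions for this sketch, not the paper's exact setup.

```python
# Sketch of probing-based captioning (Pro-Cap style) with a frozen VQA-capable PVLM.
# Assumptions: BLIP-2 (Salesforce/blip2-opt-2.7b) as the frozen PVLM; the probing
# questions below are illustrative placeholders, not the paper's exact list.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

PROBING_QUESTIONS = [
    "What is shown in the image?",
    "Is there a person in the image? What is their race?",
    "What is the gender of the person in the image?",
    "What is the religion of the person in the image?",
    "Is there an animal in the image? What is it?",
]

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)
model.eval()  # the PVLM stays frozen; no fine-tuning or gradient updates


def pro_cap(image_path: str) -> str:
    """Ask hateful content-related probing questions in zero-shot VQA mode and
    join the answers into an image caption (the Pro-Cap)."""
    image = Image.open(image_path).convert("RGB")
    answers = []
    for question in PROBING_QUESTIONS:
        # BLIP-2 zero-shot VQA uses a "Question: ... Answer:" prompt.
        prompt = f"Question: {question} Answer:"
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
        with torch.no_grad():
            generated_ids = model.generate(**inputs, max_new_tokens=20)
        answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
        answers.append(answer)
    return " ".join(answers)


# The resulting caption would then serve as the image description that a text-based
# hateful meme detection model consumes alongside the meme's overlaid text.
caption = pro_cap("meme.jpg")
print(caption)
```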
Related papers
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [73.62572976072578]
We propose AuroraCap, a video captioner based on a large multimodal model.
We implement the token merging strategy, reducing the number of input visual tokens.
AuroraCap shows superior performance on various video and image captioning benchmarks.
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
- Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation [34.45033554641476]
Existing automatic captioning methods for visual content face challenges such as lack of detail, hallucinated content, and poor instruction following.
We propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects.
VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; and 3) captioning, where an LLM produces the final caption by summarizing the proposals and the verification results.
arXiv Detail & Related papers (2024-04-30T17:55:27Z)
- Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts [3.6064695344878093]
Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and linguistic content.
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
arXiv Detail & Related papers (2024-04-12T16:35:23Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields absolute zero-shot accuracy gains of 3.85% on VQAv2, 6.41% on A-OKVQA, and 7.94% on VizWiz.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- Zero-shot Visual Question Answering with Language Model Feedback [83.65140324876536]
We propose a language model guided captioning approach, LAMOC, for knowledge-based visual question answering (VQA).
Our approach employs the captions generated by a captioning model as context for an answer prediction model, which is a pre-trained language model (PLM).
arXiv Detail & Related papers (2023-05-26T15:04:20Z)
- MAViC: Multimodal Active Learning for Video Captioning [8.454261564411436]
In this paper, we introduce MAViC to address the challenges of active learning approaches for video captioning.
Our approach integrates semantic similarity and uncertainty of both visual and language dimensions in the acquisition function.
arXiv Detail & Related papers (2022-12-11T18:51:57Z)
- A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models [50.27305012063483]
FewVLM is a few-shot prompt-based learner on vision-language tasks.
We pretrain a sequence-to-sequence Transformer model with both prefix language modeling (PrefixLM) and masked language modeling (MaskedLM).
We observe that prompts significantly affect zero-shot performance but marginally affect few-shot performance.
arXiv Detail & Related papers (2021-10-16T06:07:59Z)
- Caption Enriched Samples for Improving Hateful Memes Detection [78.5136090997431]
The hateful meme challenge demonstrates the difficulty of determining whether a meme is hateful or not.
Neither unimodal language models nor multimodal vision-language models reach human-level performance.
arXiv Detail & Related papers (2021-09-22T10:57:51Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.