MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification
- URL: http://arxiv.org/abs/2409.14703v2
- Date: Mon, 28 Oct 2024 03:32:28 GMT
- Title: MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification
- Authors: Siddhant Bikram Shah, Shuvam Shiwakoti, Maheep Chaudhary, Haohan Wang
- Abstract summary: We introduce a novel dataset PrideMM comprising 5,063 text-embedded images associated with the LGBTQ+ Pride movement.
We propose a novel framework MemeCLIP for efficient downstream learning while preserving the knowledge of the pre-trained CLIP model.
- Score: 11.270267165348626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The complexity of text-embedded images presents a formidable challenge in machine learning given the need for multimodal understanding of multiple aspects of expression conveyed by them. While previous research in multimodal analysis has primarily focused on singular aspects such as hate speech and its subclasses, this study expands this focus to encompass multiple aspects of linguistics: hate, targets of hate, stance, and humor. We introduce a novel dataset PrideMM comprising 5,063 text-embedded images associated with the LGBTQ+ Pride movement, thereby addressing a serious gap in existing resources. We conduct extensive experimentation on PrideMM by using unimodal and multimodal baseline methods to establish benchmarks for each task. Additionally, we propose a novel framework MemeCLIP for efficient downstream learning while preserving the knowledge of the pre-trained CLIP model. The results of our experiments show that MemeCLIP achieves superior performance compared to previously proposed frameworks on two real-world datasets. We further compare the performance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively analyzing misclassified samples. Our code and dataset are publicly available at: https://github.com/SiddhantBikram/MemeCLIP.
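The abstract describes MemeCLIP only at a high level: a pre-trained CLIP model whose knowledge is preserved while lightweight components are trained for downstream classification. As a rough illustration of that general recipe rather than the paper's actual architecture (which is available in the linked repository), the sketch below freezes a Hugging Face CLIP model and trains a small head on the concatenated image and text features of a text-embedded image; the class name FrozenCLIPMemeClassifier, the concatenation-based fusion, and the head dimensions are illustrative assumptions.

import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

class FrozenCLIPMemeClassifier(nn.Module):
    """Hypothetical sketch: frozen CLIP encoders + small trainable head (not the official MemeCLIP model)."""
    def __init__(self, num_classes: int, clip_name: str = "openai/clip-vit-base-patch32"):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        for p in self.clip.parameters():           # freeze CLIP to preserve its pre-trained knowledge
            p.requires_grad = False
        dim = self.clip.config.projection_dim      # 512 for ViT-B/32
        self.head = nn.Sequential(                 # the only trainable part
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(dim, num_classes),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=input_ids, attention_mask=attention_mask)
        img = img / img.norm(dim=-1, keepdim=True)     # L2-normalize, as CLIP does before similarity
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return self.head(torch.cat([img, txt], dim=-1))  # class logits

# Usage on one meme image and its overlaid caption:
# processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# batch = processor(text=["overlaid meme text"], images=pil_image,
#                   return_tensors="pt", padding=True, truncation=True)
# logits = FrozenCLIPMemeClassifier(num_classes=2)(**batch)   # e.g. hate vs. no hate

Because only the head receives gradients, the pre-trained vision-language representations are left intact, which mirrors the stated goal of efficient downstream learning without overwriting CLIP's knowledge.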
Related papers
- Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs [28.20725794099928]
We present UniME, a novel framework that learns discriminative representations for diverse downstream tasks.
In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model.
In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning.
arXiv Detail & Related papers (2025-04-24T10:51:52Z)
- CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation [3.1667055223489786]
Contrastive Language-Image Pre-training models excel in zero-shot classification, yet face challenges in complex multi-object scenarios.
This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO.
Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects.
arXiv Detail & Related papers (2025-02-27T07:34:42Z)
- Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision? [62.12375949429938]
Building transferable Graph Neural Networks (GNNs) with the CLIP pipeline is challenging because of three fundamental issues.
We leverage multi-modal prompt learning to effectively adapt pre-trained GNN to downstream tasks and data.
Our new paradigm embeds the graphs directly in the same space as the Large Language Models (LLMs) by learning both graph prompts and text prompts simultaneously.
arXiv Detail & Related papers (2024-12-11T08:03:35Z)
- ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements [0.6990493129893112]
Recent advances in Vision Language Models have reshaped the evaluation paradigm in computer vision tasks.
These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks.
In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications.
Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks.
arXiv Detail & Related papers (2024-11-18T20:31:38Z)
- An Information Criterion for Controlled Disentanglement of Multimodal Data [39.601584166020274]
Multimodal representation learning seeks to relate and decompose information inherent in multiple modalities.
Disentangled Self-Supervised Learning (DisentangledSSL) is a novel self-supervised approach for learning disentangled representations.
arXiv Detail & Related papers (2024-10-31T14:57:31Z)
- Diffusion Feedback Helps CLIP See Better [40.125318318373715]
Contrastive Language-Image Pre-training (CLIP) excels at abstracting open-world representations across domains and modalities.
CLIP has severe visual shortcomings: for example, it can hardly distinguish orientation, quantity, color, and structure.
We present a post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process.
arXiv Detail & Related papers (2024-07-29T17:00:09Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data [69.20254987896674]
SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription.
This paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture.
Second, we propose a new hybrid architecture that merges the cascaded and parallel architectures of SpeechCLIP into a multi-task learning framework.
arXiv Detail & Related papers (2024-02-10T14:26:42Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Learning to Compose Diversified Prompts for Image Emotion Classification [5.586293129420233]
Contrastive Language-Image Pre-training (CLIP) represents the latest incarnation of pre-trained vision-language models.
CLIP has recently shown strong performance on a wide range of downstream vision-language tasks such as Visual Question Answering.
We propose a general framework that shows how CLIP can be effectively applied to Image Emotion Classification.
arXiv Detail & Related papers (2022-01-26T14:31:55Z)
- Multi-Task Learning for Dense Prediction Tasks: A Survey [87.66280582034838]
Multi-task learning (MTL) techniques have shown promising results w.r.t. performance, computations and/or memory footprint.
We provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision.
arXiv Detail & Related papers (2020-04-28T09:15:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.