MemeTector: Enforcing deep focus for meme detection
- URL: http://arxiv.org/abs/2205.13268v1
- Date: Thu, 26 May 2022 10:50:29 GMT
- Title: MemeTector: Enforcing deep focus for meme detection
- Authors: Christos Koutlis, Manos Schinas, Symeon Papadopoulos
- Abstract summary: It is important to accurately retrieve image memes from social media to better capture the cultural and social aspects of online phenomena.
We propose a methodology that utilizes the visual part of image memes as instances of the regular image class and the initial image memes as instances of the image meme class.
We employ a trainable attention mechanism on top of a standard ViT architecture to enhance the model's ability to focus on these critical parts.
- Score: 8.794414326545697
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract:   Image memes, and specifically their widely known variation image macros, are a
special new media type that combines text with images and is used in social
media to playfully or subtly express humour, irony, sarcasm and even hate. It
is important to accurately retrieve image memes from social media to better
capture the cultural and social aspects of online phenomena and detect
potential issues (hate-speech, disinformation). Essentially, the background
image of an image macro is a regular image that humans easily recognize as such,
but that is hard for a machine to recognize because its feature maps resemble those of the
complete image macro. Hence, accumulating suitable feature maps in such cases
can lead to deep understanding of the notion of image memes. To this end, we
propose a methodology that utilizes the visual part of image memes as instances
of the regular image class and the initial image memes as instances of the
image meme class to force the model to concentrate on the critical parts that
characterize an image meme. Additionally, we employ a trainable attention
mechanism on top of a standard ViT architecture to enhance the model's ability
to focus on these critical parts and make the predictions interpretable.
Several training and test scenarios involving web-scraped regular images of
controlled text presence are considered in terms of model robustness and
accuracy. The findings indicate that light visual part utilization combined
with sufficient text presence during training provides the best and most robust
model, surpassing the state of the art.
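
The training-set idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `build_training_set`, the parameters `visual_part_share` and `text_share`, the file names, and the choice to balance the two classes are all assumptions made for the example. The abstract only states that visual parts of memes are labeled as regular images and that the text presence in web-scraped regular images is controlled.

```python
import random

MEME, REGULAR = 1, 0  # class labels (hypothetical encoding)


def build_training_set(memes, visual_parts, text_regulars, plain_regulars,
                       visual_part_share, text_share, seed=0):
    """Return a shuffled list of (item, label) pairs.

    Assumption: the regular class is given as many samples as there are memes.
    visual_part_share: fraction of the regular class drawn from meme visual parts.
    text_share: fraction of the remaining web-scraped regular images that contain text.
    """
    if not (0.0 <= visual_part_share <= 1.0 and 0.0 <= text_share <= 1.0):
        raise ValueError("shares must lie in [0, 1]")
    rng = random.Random(seed)
    n_regular = len(memes)
    n_visual = round(visual_part_share * n_regular)
    n_web = n_regular - n_visual
    n_text = round(text_share * n_web)
    n_plain = n_web - n_text
    # rng.sample raises ValueError if a pool is smaller than the requested count.
    regular = (rng.sample(visual_parts, n_visual)
               + rng.sample(text_regulars, n_text)
               + rng.sample(plain_regulars, n_plain))
    samples = [(m, MEME) for m in memes] + [(r, REGULAR) for r in regular]
    rng.shuffle(samples)
    return samples


# Usage with placeholder file names: "light" visual part use, full text presence.
memes = [f"meme_{i}.jpg" for i in range(10)]
visual_parts = [f"meme_{i}_visual.jpg" for i in range(10)]
text_regulars = [f"web_text_{i}.jpg" for i in range(10)]
plain_regulars = [f"web_plain_{i}.jpg" for i in range(10)]
train = build_training_set(memes, visual_parts, text_regulars, plain_regulars,
                           visual_part_share=0.3, text_share=1.0)
print(len(train), "training samples")
```

Varying the two shares would correspond to the different training scenarios mentioned in the abstract; the specific values 0.3 and 1.0 are placeholders, not figures reported by the paper.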
 
      
Related papers
- MemeReaCon: Probing Contextual Meme Understanding in Large Vision-Language Models [50.2355423914562]
 We introduce MemeReaCon, a novel benchmark designed to evaluate how Large Vision Language Models (LVLMs) understand memes in their original context.
We collected memes from five different Reddit communities, keeping each meme's image, the post text, and user comments together.
Our tests with leading LVLMs show a clear weakness: models either fail to interpret critical information in the contexts, or overly focus on visual details while overlooking communicative purpose.
 arXiv  Detail & Related papers  (2025-05-23T03:27:23Z)
- Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes [5.243460995467895]
 This study introduces ClassicMemes-50-templates (CM50), a large-scale dataset consisting of over 33,000 memes, centered around 50 popular meme templates.
We also present an automated knowledge-grounded annotation pipeline leveraging large vision-language models to produce high-quality image captions, meme captions, and literary device labels.
 arXiv  Detail & Related papers  (2025-01-23T17:18:30Z)
- Decoding Memes: A Comparative Study of Machine Learning Models for Template Identification [0.0]
 A "meme template" is a layout or format that is used to create memes.
Despite extensive research on meme virality, the task of automatically identifying meme templates remains a challenge.
This paper presents a comprehensive comparison and evaluation of existing meme template identification methods.
 arXiv  Detail & Related papers  (2024-08-15T12:52:06Z)
- XMeCap: Meme Caption Generation with Sub-Image Adaptability [53.2509590113364]
 Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines.
We introduce the XMeCap framework, which adopts supervised fine-tuning and reinforcement learning.
XMeCap achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71% and 4.82%, respectively.
 arXiv  Detail & Related papers  (2024-07-24T10:51:46Z)
- Meme-ingful Analysis: Enhanced Understanding of Cyberbullying in Memes Through Multimodal Explanations [48.82168723932981]
 We introduce MultiBully-Ex, the first benchmark dataset for multimodal explanation from code-mixed cyberbullying memes.
A Contrastive Language-Image Pretraining (CLIP) approach has been proposed for visual and textual explanation of a meme.
 arXiv  Detail & Related papers  (2024-01-18T11:24:30Z)
- A Template Is All You Meme [83.05919383106715]
 We release a knowledge base of memes and information found on www.knowyourmeme.com, composed of more than 54,000 images.
We hypothesize that meme templates can be used to inject models with the context missing from previous approaches.
 arXiv  Detail & Related papers  (2023-11-11T19:38:14Z)
- MemeGraphs: Linking Memes to Knowledge Graphs [5.857287622337647]
 We propose to use scene graphs, that express images in terms of objects and their visual relations, and knowledge graphs as structured representations for meme classification with a Transformer-based architecture.
We compare our approach with ImgBERT, a multimodal model that uses only learned (instead of structured) representations of the meme, and observe consistent improvements.
Analysis shows that automatic methods link more entities than human annotators and that automatically generated graphs are better suited for hatefulness classification in memes.
 arXiv  Detail & Related papers  (2023-05-28T11:17:30Z)
- Hate-CLIPper: Multimodal Hateful Meme Classification based on Cross-modal Interaction of CLIP Features [5.443781798915199]
 Hateful memes are a growing menace on social media.
Detecting hateful memes requires careful consideration of both visual and textual information.
We propose the Hate-CLIPper architecture, which explicitly models the cross-modal interactions between the image and text representations.
 arXiv  Detail & Related papers  (2022-10-12T04:34:54Z)
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
 Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task by a novel Bottom-up crOss-modal Semantic compoSition (BOSS) with Hybrid Counterfactual Training framework.
 arXiv  Detail & Related papers  (2022-07-09T07:14:44Z)
- Caption Enriched Samples for Improving Hateful Memes Detection [78.5136090997431]
 The hateful meme challenge demonstrates the difficulty of determining whether a meme is hateful or not.
Neither unimodal language models nor multimodal vision-language models reach human-level performance.
 arXiv  Detail & Related papers  (2021-09-22T10:57:51Z)
- Do Images really do the Talking? Analysing the significance of Images in Tamil Troll meme classification [0.16863755729554888]
 We try to explore the significance of visual features of images in classifying memes.
We try to incorporate the memes as troll and non-trolling memes based on the images and the text on them.
 arXiv  Detail & Related papers  (2021-08-09T09:04:42Z)
- Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
 We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
 arXiv  Detail & Related papers  (2020-11-03T08:44:18Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
 In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
 arXiv  Detail & Related papers  (2020-08-11T07:07:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     