RoMemes: A multimodal meme corpus for the Romanian language
- URL: http://arxiv.org/abs/2410.15497v1
- Date: Sun, 20 Oct 2024 20:26:53 GMT
- Title: RoMemes: A multimodal meme corpus for the Romanian language
- Authors: Vasile Păiş, Sara Niţă, Alexandru-Iulius Jerpelea, Luca Pană, Eric Curea
- Abstract summary: We introduce a curated dataset of real memes in the Romanian language, with multiple annotation levels.
Results indicate that further research is needed to improve the processing capabilities of AI tools when faced with Internet memes.
- Score: 39.58317527488534
- License:
- Abstract: Memes are becoming increasingly more popular in online media, especially in social networks. They usually combine graphical representations (images, drawings, animations or video) with text to convey powerful messages. In order to extract, process and understand the messages, AI applications need to employ multimodal algorithms. In this paper, we introduce a curated dataset of real memes in the Romanian language, with multiple annotation levels. Baseline algorithms were employed to demonstrate the usability of the dataset. Results indicate that further research is needed to improve the processing capabilities of AI tools when faced with Internet memes.
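To make the multimodal requirement concrete, the sketch below shows a minimal late-fusion classifier in plain PyTorch: pre-extracted image and text embeddings are concatenated and fed to a small classification head. All dimensions and names are illustrative assumptions, not the paper's actual baselines.

```python
import torch
import torch.nn as nn

class LateFusionMemeClassifier(nn.Module):
    """Toy late-fusion baseline: concatenate pre-extracted image and
    text embeddings, then classify. Dimensions are illustrative only."""
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_feat, txt_feat):
        # Fuse modalities by simple concatenation before classification.
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))

# Random tensors stand in for outputs of an image and a text encoder.
model = LateFusionMemeClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```

Late fusion of this kind is the simplest possible multimodal baseline; the baselines actually used in the paper may differ.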
Related papers
- XMeCap: Meme Caption Generation with Sub-Image Adaptability [53.2509590113364]
Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines.
We introduce the XMeCap framework, which adopts supervised fine-tuning and reinforcement learning.
XMeCap achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71% and 4.82%, respectively.
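The supervised fine-tuning plus reinforcement learning recipe can be illustrated with a generic reward-weighted (REINFORCE-style) loss, sketched below; the reward values and tensor shapes are invented, and this is not the XMeCap objective.

```python
import torch

# Generic REINFORCE-style caption loss: token log-probabilities weighted
# by a per-caption scalar reward (e.g., a humor score), minus a baseline.
log_probs = torch.randn(4, 12, requires_grad=True)  # (batch, caption_len), stand-in values
rewards = torch.tensor([0.9, 0.2, 0.5, 0.7])        # hypothetical per-caption rewards
baseline = rewards.mean()                           # simple variance-reduction baseline

loss = -((rewards - baseline).unsqueeze(1) * log_probs).mean()
loss.backward()
print(loss.item())
```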
arXiv Detail & Related papers (2024-07-24T10:51:46Z)
- ArMeme: Propagandistic Content in Arabic Memes [9.48177009736915]
We develop an Arabic memes dataset with manual annotations of propagandistic content.
We provide a comprehensive analysis aiming to develop computational tools for their detection.
arXiv Detail & Related papers (2024-06-06T09:56:49Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
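The described connection of a vision encoder, a visual prompt encoder, and an LLM can be sketched as follows; every module here is a stand-in with invented dimensions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VisualPromptMLLMSkeleton(nn.Module):
    """Illustrative wiring only: project image features and visual-prompt
    features (e.g., points or boxes) into the LLM token space and prepend
    them to the text embeddings. All dimensions are made up."""
    def __init__(self, d_img=1024, d_prompt=256, d_llm=4096):
        super().__init__()
        self.img_proj = nn.Linear(d_img, d_llm)        # stand-in for a ViT + projector
        self.prompt_proj = nn.Linear(d_prompt, d_llm)  # stand-in for a prompt encoder

    def forward(self, img_feat, prompt_feat, text_emb):
        # Build the LLM input sequence: [image tokens | prompt tokens | text tokens]
        return torch.cat([self.img_proj(img_feat),
                          self.prompt_proj(prompt_feat),
                          text_emb], dim=1)

seq = VisualPromptMLLMSkeleton()(
    torch.randn(1, 16, 1024),  # 16 image patch features
    torch.randn(1, 2, 256),    # 2 visual prompts (e.g., a point and a box)
    torch.randn(1, 8, 4096),   # 8 text token embeddings
)
print(seq.shape)  # torch.Size([1, 26, 4096])
```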
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Meme-ingful Analysis: Enhanced Understanding of Cyberbullying in Memes Through Multimodal Explanations [48.82168723932981]
We introduce MultiBully-Ex, the first benchmark dataset for multimodal explanation of code-mixed cyberbullying memes.
A Contrastive Language-Image Pretraining (CLIP) approach has been proposed for visual and textual explanation of a meme.
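A CLIP-based scoring step of the general kind mentioned can be written with the Hugging Face transformers API, as below; the checkpoint and candidate texts are assumptions for illustration, not the authors' setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A blank image stands in for a real meme; the candidate texts are illustrative.
image = Image.new("RGB", (224, 224))
candidates = ["a harmless joke", "a bullying message targeting a person"]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(candidates, probs[0].tolist())))
```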
arXiv Detail & Related papers (2024-01-18T11:24:30Z)
- MEMEX: Detecting Explanatory Evidence for Memes via Knowledge-Enriched Contextualization [31.209594252045566]
We propose a novel task, MEMEX: given a meme and a related document, the aim is to mine the context that succinctly explains the background of the meme.
To benchmark MCC (the accompanying Meme Context Corpus), we propose MIME, a multimodal neural framework that uses commonsense-enriched meme representations and a layered approach to capture the cross-modal semantic dependencies between the meme and its context.
arXiv Detail & Related papers (2023-05-25T10:19:35Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images, either from a light topic-image lookup table extracted over existing sentence-image pairs or from a shared cross-modal embedding space pre-trained on off-the-shelf text-image pairs.
Then, the text and images are encoded by a Transformer encoder and a convolutional neural network, respectively.
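The two-branch encoding described (a Transformer encoder for text, a CNN for images) looks roughly like the sketch below, using torchvision's ResNet-50 and a small nn.TransformerEncoder as stand-ins; all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# CNN branch for retrieved images (pretrained weights omitted for brevity).
cnn = resnet50(weights=None)
cnn.fc = nn.Identity()  # expose the 2048-d pooled features

# Transformer branch for the token embeddings of the sentence.
txt_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)

img_feat = cnn(torch.randn(3, 3, 224, 224))      # 3 retrieved images -> (3, 2048)
txt_feat = txt_encoder(torch.randn(1, 12, 512))  # 12 token embeddings -> (1, 12, 512)
print(img_feat.shape, txt_feat.shape)
```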
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- What do you MEME? Generating Explanations for Visual Semantic Role Labelling in Memes [42.357272117919464]
We introduce a novel task - EXCLAIM, generating explanations for visual semantic role labeling in memes.
To this end, we curate ExHVV, a novel dataset that offers natural language explanations of connotative roles for three types of entities.
We also posit LUMEN, a novel multimodal, multi-task learning framework that endeavors to address EXCLAIM optimally.
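A multi-task arrangement in the spirit of LUMEN can be sketched as a shared encoder with per-task heads; the head names, role inventory, and shapes below are assumptions, not the actual framework.

```python
import torch
import torch.nn as nn

class MultiTaskMemeSketch(nn.Module):
    """Shared multimodal representation feeding two task heads:
    connotative-role classification and explanation-generation logits.
    All sizes are invented for illustration."""
    def __init__(self, d_in=1024, d=512, n_roles=3, vocab=30000):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_in, d), nn.ReLU())
        self.role_head = nn.Linear(d, n_roles)  # e.g., hero / villain / victim
        self.expl_head = nn.Linear(d, vocab)    # next-token logits for explanations

    def forward(self, fused_feat):
        h = self.shared(fused_feat)
        return self.role_head(h), self.expl_head(h)

roles, expl = MultiTaskMemeSketch()(torch.randn(2, 1024))
print(roles.shape, expl.shape)  # torch.Size([2, 3]) torch.Size([2, 30000])
```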
arXiv Detail & Related papers (2022-12-01T18:21:36Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
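A visual knowledge fusion layer of the general kind described can be approximated with cross-attention from text tokens to retrieved image features; the sketch below is a generic stand-in with invented sizes, not VaLM's actual layer.

```python
import torch
import torch.nn as nn

class VisualKnowledgeFusion(nn.Module):
    """Generic fusion sketch: text tokens attend over retrieved image
    features via cross-attention, with a residual connection."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_tokens, image_feats):
        fused, _ = self.attn(query=text_tokens, key=image_feats, value=image_feats)
        return self.norm(text_tokens + fused)

out = VisualKnowledgeFusion()(torch.randn(2, 10, 768), torch.randn(2, 4, 768))
print(out.shape)  # torch.Size([2, 10, 768])
```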
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- Do Images really do the Talking? Analysing the significance of Images in Tamil Troll meme classification [0.16863755729554888]
We explore the significance of the visual features of images in classifying memes.
We classify memes as troll or non-troll based on both the images and the text on them.
arXiv Detail & Related papers (2021-08-09T09:04:42Z)
- NLP-CUET@DravidianLangTech-EACL2021: Investigating Visual and Textual Features to Identify Trolls from Multimodal Social Media Memes [0.0]
A shared task is organized to develop models that can identify trolls from multimodal social media memes.
This work presents a computational model that we have developed as part of our participation in the task.
We investigated visual and textual features using CNN, VGG16, Inception, multilingual BERT, XLM-RoBERTa, and XLNet models.
arXiv Detail & Related papers (2021-02-28T11:36:50Z)
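Feature extraction with two of the model families named above (VGG16 for images, multilingual BERT for text) can be sketched as follows; the checkpoint names, shapes, and linear head are assumptions, not the participants' exact pipeline.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16
from transformers import AutoModel, AutoTokenizer

# Visual branch: VGG16 with the final classifier layer removed -> 4096-d features.
vision = vgg16(weights=None)
vision.classifier[-1] = nn.Identity()

# Textual branch: multilingual BERT [CLS] embedding -> 768-d features.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

img_feat = vision(torch.randn(1, 3, 224, 224))  # (1, 4096)
enc = tok("meme text goes here", return_tensors="pt")
txt_feat = bert(**enc).last_hidden_state[:, 0]  # (1, 768), the [CLS] token

troll_head = nn.Linear(4096 + 768, 2)  # hypothetical binary troll / non-troll head
logits = troll_head(torch.cat([img_feat, txt_feat], dim=-1))
print(logits.shape)  # torch.Size([1, 2])
```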
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.