Hateful Memes Detection via Complementary Visual and Linguistic Networks
- URL: http://arxiv.org/abs/2012.04977v1
- Date: Wed, 9 Dec 2020 11:11:09 GMT
- Title: Hateful Memes Detection via Complementary Visual and Linguistic Networks
- Authors: Weibo Zhang, Guihua Liu, Zhuohua Li, Fuqing Zhu
- Abstract summary: We investigate a solution based on a complementary visual and linguistic network for the Hateful Memes Challenge 2020.
Both contextual-level and sensitive object-level information are considered in the visual and linguistic embeddings.
Results show that CVL delivers solid performance, reaching 78.48% AUROC and 72.95% accuracy.
- Score: 4.229588654547344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hateful memes are widespread in social media and convey negative
information. The main challenge of hateful meme detection is that the
expressive meaning cannot be well recognized by a single modality. In order to
further integrate modal information, we investigate a candidate solution based
on a complementary visual and linguistic network for the Hateful Memes
Challenge 2020. In this way, more comprehensive multi-modal information can be
explored in detail. Both contextual-level and sensitive object-level
information are considered in the visual and linguistic embeddings to model
complex multi-modal scenarios. Specifically, a pre-trained classifier and an
object detector are utilized to obtain contextual features and regions of
interest (RoIs) from the input, followed by position representation fusion to
form the visual embedding. The linguistic embedding is composed of three
components: the sentence word embedding, the position embedding, and the
corresponding spaCy embedding (Sembedding), a symbolic representation built
from the vocabulary extracted by spaCy. Both the visual and linguistic
embeddings are fed into the designed Complementary Visual and Linguistic (CVL)
network to produce the prediction for hateful memes. Experimental results on
the Hateful Memes Challenge Dataset demonstrate that CVL delivers solid
performance, reaching 78.48% AUROC and 72.95% accuracy. Code is available at
https://github.com/webYFDT/hateful.
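The abstract sketches a two-branch embedding scheme: contextual and RoI features fused with position information on the visual side, and word, position, and spaCy-derived (Sembedding) embeddings combined on the linguistic side. Below is a minimal, hypothetical PyTorch-style sketch of that structure; the module names, feature dimensions, and the simple projection-and-sum fusion are illustrative assumptions, not the authors' implementation (see the repository above for the actual CVL code).

```python
# Hypothetical sketch of the two-branch embedding described in the abstract.
# Dimensions, module names, and fusion details are illustrative assumptions,
# not the authors' implementation (see https://github.com/webYFDT/hateful).
import torch
import torch.nn as nn


class VisualEmbedding(nn.Module):
    """Fuses contextual features, RoI features, and RoI position information."""

    def __init__(self, ctx_dim=2048, roi_dim=2048, pos_dim=4, hidden=768):
        super().__init__()
        self.ctx_proj = nn.Linear(ctx_dim, hidden)  # from a pre-trained classifier
        self.roi_proj = nn.Linear(roi_dim, hidden)  # from an object detector
        self.pos_proj = nn.Linear(pos_dim, hidden)  # normalized RoI box coordinates

    def forward(self, ctx_feat, roi_feats, roi_boxes):
        # ctx_feat: (B, ctx_dim); roi_feats: (B, N, roi_dim); roi_boxes: (B, N, 4)
        ctx = self.ctx_proj(ctx_feat).unsqueeze(1)                  # (B, 1, hidden)
        rois = self.roi_proj(roi_feats) + self.pos_proj(roi_boxes)  # (B, N, hidden)
        return torch.cat([ctx, rois], dim=1)                        # (B, 1+N, hidden)


class LinguisticEmbedding(nn.Module):
    """Sums word, position, and spaCy-derived (Sembedding) token embeddings."""

    def __init__(self, vocab_size, spacy_vocab_size, max_len=128, hidden=768):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)
        self.position = nn.Embedding(max_len, hidden)
        self.sembed = nn.Embedding(spacy_vocab_size, hidden)  # spaCy-extracted symbols

    def forward(self, token_ids, spacy_ids):
        # token_ids, spacy_ids: (B, T) with T <= max_len
        pos_ids = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.word(token_ids) + self.position(pos_ids) + self.sembed(spacy_ids)
```

In this sketch the two embedding sequences could simply be concatenated before a joint classifier; in the paper both embeddings are instead fed into the proposed CVL network to produce the final prediction.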
Related papers
- Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering [0.5587293092389789]
Hateful memes often evade traditional text-only or image-only detection systems, particularly when they employ subtle or coded references.
We propose a multimodal hate detection framework that integrates OCR to extract embedded text, captioning to describe visual content neutrally, sub-label classification for granular categorization, RAG for contextually relevant retrieval, and VQA for iterative analysis of symbolic and contextual cues.
Experimental results on the Facebook Hateful Memes dataset reveal that the proposed framework exceeds the performance of unimodal and conventional multimodal models in both accuracy and AUC-ROC.
arXiv Detail & Related papers (2025-04-23T13:52:14Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Align before Attend: Aligning Visual and Textual Features for Multimodal Hateful Content Detection [4.997673761305336]
This paper proposes a context-aware attention framework for multimodal hateful content detection.
We evaluate the proposed approach on two benchmark hateful meme datasets, viz. MUTE (Bengali code-mixed) and MultiOFF (English).
arXiv Detail & Related papers (2024-02-15T06:34:15Z)
- CLIP-Driven Semantic Discovery Network for Visible-Infrared Person Re-Identification [39.262536758248245]
Cross-modality identity matching poses significant challenges in VIReID.
We propose a CLIP-Driven Semantic Discovery Network (CSDN) that consists of Modality-specific Prompt Learner, Semantic Information Integration, and High-level Semantic Embedding.
arXiv Detail & Related papers (2024-01-11T10:20:13Z)
- Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining? [34.609984453754656]
We aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment.
Specifically, we design and release the SNARE, the first large-scale multimodal alignment probing benchmark.
arXiv Detail & Related papers (2023-08-24T16:17:40Z)
- OPI at SemEval 2023 Task 1: Image-Text Embeddings and Multimodal Information Retrieval for Visual Word Sense Disambiguation [0.0]
We present our submission to SemEval 2023 visual word sense disambiguation shared task.
The proposed system integrates multimodal embeddings, learning to rank methods, and knowledge-based approaches.
Our solution was ranked third in the multilingual task and won in the Persian track, one of the three language subtasks.
arXiv Detail & Related papers (2023-04-14T13:45:59Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- What do you MEME? Generating Explanations for Visual Semantic Role Labelling in Memes [42.357272117919464]
We introduce a novel task - EXCLAIM, generating explanations for visual semantic role labeling in memes.
To this end, we curate ExHVV, a novel dataset that offers natural language explanations of connotative roles for three types of entities.
We also posit LUMEN, a novel multimodal, multi-task learning framework that endeavors to address EXCLAIM optimally.
arXiv Detail & Related papers (2022-12-01T18:21:36Z)
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z)