Volta at SemEval-2021 Task 6: Towards Detecting Persuasive Texts and Images using Textual and Multimodal Ensemble
- URL: http://arxiv.org/abs/2106.00240v1
- Date: Tue, 1 Jun 2021 05:41:03 GMT
- Title: Volta at SemEval-2021 Task 6: Towards Detecting Persuasive Texts and Images using Textual and Multimodal Ensemble
- Authors: Kshitij Gupta, Devansh Gautam, Radhika Mamidi
- Abstract summary: We propose a transfer learning approach to fine-tune BERT-based models in different modalities.
We achieve F1-scores of 57.0, 48.2, and 52.1 in the corresponding subtasks.
- Score: 7.817598216459955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Memes are one of the most popular types of content used to spread information
online. They can influence a large number of people through rhetorical and
psychological techniques. The task, Detection of Persuasion Techniques in Texts
and Images, is to detect these persuasive techniques in memes. It consists of
three subtasks: (A) Multi-label classification using textual content, (B)
Multi-label classification and span identification using textual content, and
(C) Multi-label classification using visual and textual content. In this paper,
we propose a transfer learning approach to fine-tune BERT-based models in
different modalities. We also explore the effectiveness of ensembles of models
trained in different modalities. We achieve F1-scores of 57.0, 48.2, and 52.1
in the corresponding subtasks.
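As a rough illustration of the ensembling idea described in the abstract (a minimal sketch, not the authors' released code), the snippet below averages the per-technique sigmoid probabilities of a fine-tuned text-only model and a multimodal model, then thresholds them for the multi-label prediction; the label count and decision threshold are illustrative assumptions.

```python
# Minimal sketch of a textual + multimodal ensemble for multi-label classification.
# The logits would come from models fine-tuned on different modalities (e.g., a
# BERT-based text classifier and a vision-and-language model); here they are stand-ins.
import torch

NUM_TECHNIQUES = 22  # assumption: number of persuasion-technique labels

def ensemble_predict(logits_per_model: list[torch.Tensor], threshold: float = 0.5) -> torch.Tensor:
    """Average per-model sigmoid probabilities and threshold them.

    Each tensor in `logits_per_model` has shape (batch, NUM_TECHNIQUES);
    the result is a 0/1 tensor of the same shape marking predicted techniques.
    """
    probs = torch.stack([torch.sigmoid(l) for l in logits_per_model]).mean(dim=0)
    return (probs >= threshold).long()

if __name__ == "__main__":
    # Stand-ins for the outputs of a text-only model and a multimodal model on 4 memes.
    text_logits = torch.randn(4, NUM_TECHNIQUES)
    multimodal_logits = torch.randn(4, NUM_TECHNIQUES)
    print(ensemble_predict([text_logits, multimodal_logits]).shape)  # torch.Size([4, 22])
```

In practice, the text-only logits would come from a BERT-based classifier over the meme text and the multimodal logits from a model over the text-image pair, consistent with the setup described in the abstract.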
Related papers
- Contrastive Learning-based Multi Modal Architecture for Emoticon Prediction by Employing Image-Text Pairs [13.922091192207718]
This research aims to analyze the relationship among sentences, visuals, and emoticons.
We have proposed a novel contrastive learning based multimodal architecture.
The proposed model attained an accuracy of 91% and an MCC-score of 90% while assessing emoticons.
arXiv Detail & Related papers (2024-08-05T15:45:59Z) - IITK at SemEval-2024 Task 4: Hierarchical Embeddings for Detection of Persuasion Techniques in Memes [4.679320772294786]
This paper proposes an ensemble of Class Definition Prediction (CDP) and hyperbolic embeddings-based approaches for this task.
We enhance meme classification accuracy and comprehensiveness by integrating HypEmo's hierarchical label embeddings and a multi-task learning framework for emotion prediction.
arXiv Detail & Related papers (2024-04-06T06:28:02Z) - Leveraging Open-Vocabulary Diffusion to Camouflaged Instance
Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by an open vocabulary, to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z) - Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object
Detection [72.36017150922504]
We propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer to a student detector.
Diverse multi-modal masked language modeling is realized by imposing an object divergence constraint on traditional multi-modal masked language modeling (MLM).
arXiv Detail & Related papers (2023-08-30T08:33:13Z) - Borrowing Human Senses: Comment-Aware Self-Training for Social Media
Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved by jointly leveraging visual and lingual similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the usually limited scale of labeled data.
The results show that our method further advances the performance of previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-27T08:59:55Z) - Towards Unifying Medical Vision-and-Language Pre-training via Soft
Prompts [63.84720380390935]
There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z) - Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images, for example from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z) - Multimodal Lecture Presentations Dataset: Understanding Multimodality in
Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z) - Open-Vocabulary Multi-Label Classification via Multi-modal Knowledge
Transfer [55.885555581039895]
Multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge by a pre-trained textual label embedding.
We propose a novel open-vocabulary framework, named multimodal knowledge transfer (MKT) for multi-label classification.
arXiv Detail & Related papers (2022-07-05T08:32:18Z) - MultiGBS: A multi-layer graph approach to biomedical summarization [6.11737116137921]
We propose a domain-specific method that models a document as a multi-layer graph to enable multiple features of the text to be processed at the same time.
The unsupervised method selects sentences from the multi-layer graph based on the MultiRank algorithm and the number of concepts.
The proposed MultiGBS algorithm employs UMLS and extracts the concepts and relationships using different tools such as SemRep, MetaMap, and OGER.
arXiv Detail & Related papers (2020-08-27T04:22:37Z) - Evaluating Multimodal Representations on Visual Semantic Textual
Similarity [22.835699807110018]
We present a novel task, Visual Semantic Textual Similarity (vSTS), where such inference ability can be tested directly.
Our experiments using simple multimodal representations show that the addition of image representations produces better inference, compared to text-only representations.
Our work shows, for the first time, the successful contribution of visual information to textual inference, with ample room for more complex multimodal representation options.
arXiv Detail & Related papers (2020-04-04T09:03:04Z)