Argumentative Stance Prediction: An Exploratory Study on Multimodality and Few-Shot Learning
- URL: http://arxiv.org/abs/2310.07093v1
- Date: Wed, 11 Oct 2023 00:18:29 GMT
- Title: Argumentative Stance Prediction: An Exploratory Study on Multimodality and Few-Shot Learning
- Authors: Arushi Sharma, Abhibha Gupta, Maneesh Bilalpur
- Abstract summary: We evaluate the necessity of images for stance prediction in tweets.
Our work suggests that an ensemble of fine-tuned text-based language models outperforms both multimodal models and few-shot prediction with a large language model.
Our findings suggest that the multimodal models tend to perform better when image content is summarized as natural language.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To advance argumentative stance prediction as a multimodal problem, the First Shared Task in Multimodal Argument Mining hosted stance prediction on the crucial social topics of gun control and abortion. Our exploratory study evaluates the necessity of images for stance prediction in tweets and compares out-of-the-box text-based large language models (LLMs) in few-shot settings against fine-tuned unimodal and multimodal models. Our work suggests that an ensemble of fine-tuned text-based language models (0.817 F1-score) outperforms both the multimodal approach (0.677 F1-score) and text-based few-shot prediction using a recent state-of-the-art LLM (0.550 F1-score). Beyond these performance differences, our findings suggest that multimodal models tend to perform better when image content is summarized as natural language rather than consumed in its native pixel form, and that using in-context examples improves the few-shot performance of LLMs.
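The listing does not include the authors' code; the block below is only a minimal sketch of the two reported findings, assuming Hugging Face transformers pipelines, a public BLIP captioning checkpoint, and hypothetical names for the fine-tuned stance classifiers (the actual checkpoints used in the shared task are not named here).

```python
# Illustrative sketch only -- not the authors' released code.
# Assumes the Hugging Face `transformers` library; the two stance
# checkpoints below are hypothetical placeholders for text models
# fine-tuned on the shared-task gun-control/abortion tweets.
from collections import Counter
from transformers import pipeline

# Finding 1: image content helps most when summarized as natural language,
# so the image is first turned into a caption and appended to the tweet text.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Finding 2: an ensemble of fine-tuned text-only models outperformed the
# multimodal and few-shot alternatives; a simple majority vote is shown here.
stance_models = [
    pipeline("text-classification", model=name)
    for name in (
        "my-org/roberta-stance-gun-control",   # hypothetical checkpoint
        "my-org/bertweet-stance-abortion",     # hypothetical checkpoint
    )
]

def predict_stance(tweet_text: str, image_path: str) -> str:
    """Caption the image, fuse it with the tweet, and take a majority vote."""
    caption = captioner(image_path)[0]["generated_text"]
    fused_text = f"{tweet_text} [IMAGE] {caption}"
    votes = [model(fused_text)[0]["label"] for model in stance_models]
    return Counter(votes).most_common(1)[0][0]  # e.g. "support" / "oppose"

def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Build an in-context prompt for an out-of-the-box LLM (shots improve few-shot F1)."""
    shots = "\n\n".join(f"Tweet: {text}\nStance: {label}" for text, label in examples)
    return f"{shots}\n\nTweet: {query}\nStance:"
```

The ensemble and few-shot routes are deliberately kept separate, mirroring the paper's comparison of fine-tuned text models (0.817 F1-score) against few-shot prompting of an LLM (0.550 F1-score).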
Related papers
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) tasks or to isolated capability-specific tasks.
We present a pipeline for selecting suitable benchmarks from the large pool available, addressing an oversight in previous work regarding the utility of such benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning [53.766434746801366]
Multimodal contrastive learning (MCL) has shown remarkable advances in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet.
Hackers may exploit image-text data for model training without authorization, potentially including personal and privacy-sensitive information.
Recent works propose generating unlearnable examples by adding imperceptible perturbations to training images to build shortcuts for protection.
We propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples.
arXiv Detail & Related papers (2024-07-23T09:00:52Z) - MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate that, for large-scale multimodal pre-training, a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art few-shot results.
arXiv Detail & Related papers (2024-03-14T17:51:32Z) - Benchmarking Sequential Visual Input Reasoning and Prediction in
Multimodal Large Language Models [21.438427686724932]
We introduce a novel benchmark that assesses the predictive reasoning capabilities of MLLMs across diverse scenarios.
Our benchmark targets three important domains: abstract pattern reasoning, human activity prediction, and physical interaction prediction.
Empirical experiments confirm the soundness of the proposed benchmark and evaluation methods.
arXiv Detail & Related papers (2023-10-20T13:14:38Z) - A Comparative Analysis of Pretrained Language Models for Text-to-Speech [13.962029761484022]
State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech.
While PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked.
This is the first study of its kind to investigate the impact of different PLMs on TTS.
arXiv Detail & Related papers (2023-09-04T13:02:27Z) - A Multi-Modal Context Reasoning Approach for Conditional Inference on
Joint Textual and Visual Clues [23.743431157431893]
Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
We conduct extensive experiments on the two corresponding datasets, and the results show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z) - Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z) - Enabling Multimodal Generation on CLIP via Vision-Language Knowledge
Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z) - WenLan: Bridging Vision and Language by Large-Scale Multi-Modal
Pre-Training [71.37731379031487]
We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the recent MoCo method to the cross-modal scenario.
By building a large queue-based dictionary, BriVL can incorporate more negative samples with limited GPU resources (see the sketch after this list).
arXiv Detail & Related papers (2021-03-11T09:39:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.