Exploring Vision Language Models for Multimodal and Multilingual Stance Detection
- URL: http://arxiv.org/abs/2501.17654v1
- Date: Wed, 29 Jan 2025 13:39:53 GMT
- Title: Exploring Vision Language Models for Multimodal and Multilingual Stance Detection
- Authors: Jake Vasilakes, Carolina Scarton, Zhixue Zhao,
- Abstract summary: Social media's global reach amplifies the spread of information, highlighting the need for robust Natural Language Processing tasks.
Prior research predominantly focuses on text-only inputs, leaving multimodal scenarios relatively underexplored.
This paper evaluates state-of-the-art Vision-Language Models (VLMs) on multimodal and multilingual stance detection tasks.
- Score: 9.079302402271491
- License:
- Abstract: Social media's global reach amplifies the spread of information, highlighting the need for robust Natural Language Processing tasks like stance detection across languages and modalities. Prior research predominantly focuses on text-only inputs, leaving multimodal scenarios, such as those involving both images and text, relatively underexplored. Meanwhile, the prevalence of multimodal posts has increased significantly in recent years. Although state-of-the-art Vision-Language Models (VLMs) show promise, their performance on multimodal and multilingual stance detection tasks remains largely unexamined. This paper evaluates state-of-the-art VLMs on a newly extended dataset covering seven languages and multimodal inputs, investigating their use of visual cues, language-specific performance, and cross-modality interactions. Our results show that VLMs generally rely more on text than images for stance detection and this trend persists across languages. Additionally, VLMs rely significantly more on text contained within the images than other visual content. Regarding multilinguality, the models studied tend to generate consistent predictions across languages whether they are explicitly multilingual or not, although there are outliers that are incongruous with macro F1, language support, and model size.
Related papers
- Cross-modal Information Flow in Multimodal Large Language Models [14.853197288189579]
We investigate the information flow between different modalities -- language and vision -- in large language models.
We find that there are two distinct stages in the process of integration of the two modalities.
arXiv Detail & Related papers (2024-11-27T18:59:26Z) - ICU: Conquering Language Barriers in Vision-and-Language Modeling by
Dividing the Tasks into Image Captioning and Language Understanding [1.9906814758497542]
ICU, Image Caption Understanding, divides a V&L task into two stages: a V&L model performs image captioning in English, and a multilingual language model (mLM) takes the caption as the alt text and performs cross-lingual language understanding.
We show that ICU can achieve new state-of-the-art results for five languages, and comparable results for the rest.
arXiv Detail & Related papers (2023-10-19T07:11:48Z) - RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training [84.23022072347821]
We propose a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs.
Experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-13T14:41:05Z) - TextMI: Textualize Multimodal Information for Integrating Non-verbal
Cues in Pre-trained Language Models [5.668457303716451]
We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks.
Our approach significantly reduces model complexity, adds interpretability to the model's decision, and can be applied for a diverse set of tasks.
arXiv Detail & Related papers (2023-03-27T17:54:32Z) - PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z) - LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine
Translation [94.33019040320507]
Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features.
Recent advances still struggle to train a separate model for each language pair, which is costly and unaffordable when the number of languages increases.
We propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages.
arXiv Detail & Related papers (2022-10-19T12:21:39Z) - xGQA: Cross-Lingual Visual Question Answering [100.35229218735938]
xGQA is a new multilingual evaluation benchmark for the visual question answering task.
We extend the established English GQA dataset to 7 typologically diverse languages.
We propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual.
arXiv Detail & Related papers (2021-09-13T15:58:21Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language
Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM)
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z) - Unsupervised Multimodal Neural Machine Translation with Pseudo Visual
Pivoting [105.5303416210736]
Unsupervised machine translation (MT) has recently achieved impressive results with monolingual corpora only.
It is still challenging to associate source-target sentences in the latent space.
As people speak different languages biologically share similar visual systems, the potential of achieving better alignment through visual content is promising.
arXiv Detail & Related papers (2020-05-06T20:11:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.