Improving Multimodal Classification of Social Media Posts by Leveraging
Image-Text Auxiliary Tasks
- URL: http://arxiv.org/abs/2309.07794v2
- Date: Sat, 3 Feb 2024 22:42:29 GMT
- Title: Improving Multimodal Classification of Social Media Posts by Leveraging
Image-Text Auxiliary Tasks
- Authors: Danae Sánchez Villegas, Daniel Preoţiuc-Pietro, Nikolaos Aletras
- Abstract summary: We present an extensive study on the effectiveness of using two auxiliary losses jointly with the main task when fine-tuning multimodal models.
First, Image-Text Contrastive (ITC) is designed to minimize the distance between image-text representations within a post.
Second, Image-Text Matching (ITM) enhances the model's ability to understand the semantic relationship between images and text.
- Score: 38.943074586111564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effectively leveraging multimodal information from social media posts is
essential to various downstream tasks such as sentiment analysis, sarcasm
detection or hate speech classification. Jointly modeling text and images is
challenging because cross-modal semantics might be hidden or the relation
between image and text is weak. However, prior work on multimodal
classification of social media posts has not yet addressed these challenges. In
this work, we present an extensive study on the effectiveness of using two
auxiliary losses jointly with the main task when fine-tuning multimodal
models. First, Image-Text Contrastive (ITC) is designed to minimize the
distance between image-text representations within a post, thereby effectively
bridging the gap between posts where the image plays an important role in
conveying the post's meaning. Second, Image-Text Matching (ITM) enhances the
model's ability to understand the semantic relationship between images and
text, thus improving its capacity to handle ambiguous or loosely related
modalities. We combine these objectives with five multimodal models across five
diverse social media datasets, demonstrating consistent improvements of up to
2.6 points F1. Our comprehensive analysis shows the specific scenarios where
each auxiliary task is most effective.
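As a concrete illustration of how the two auxiliary objectives can be combined with the main task loss during fine-tuning, the sketch below shows one possible training step in PyTorch. The model interface (img_emb, txt_emb, itm_logits, task_logits), the loss weights, and the in-batch negative sampling for ITM are illustrative assumptions, not the paper's exact implementation.
```python
# Minimal sketch (PyTorch) of fine-tuning with ITC and ITM auxiliary losses.
# Names, weights, and the negative-sampling scheme are assumptions for illustration.
import torch
import torch.nn.functional as F

def itc_loss(img_emb, txt_emb, temperature=0.07):
    """Image-Text Contrastive loss: pull each post's image and text embeddings
    together and push apart other pairs in the batch (InfoNCE in both directions)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def itm_loss(match_logits, match_labels):
    """Image-Text Matching loss: binary classification of whether an image-text
    pair comes from the same post (negatives drawn by shuffling pairs in the batch)."""
    return F.cross_entropy(match_logits, match_labels)

def training_step(model, batch, lambda_itc=0.1, lambda_itm=0.1):
    """One fine-tuning step combining the main classification loss with the
    two auxiliary objectives."""
    out = model(batch["images"], batch["texts"])             # hypothetical multimodal model
    main_loss = F.cross_entropy(out["task_logits"], batch["labels"])
    loss = (main_loss
            + lambda_itc * itc_loss(out["img_emb"], out["txt_emb"])
            + lambda_itm * itm_loss(out["itm_logits"], batch["itm_labels"]))
    return loss
```
In this sketch, lambda_itc and lambda_itm are tunable weights, and the ITM labels would be built by pairing each text with either its own image (positive) or another image from the batch (negative); the abstract reports that adding such objectives yields improvements of up to 2.6 F1 points across five models and five datasets.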
Related papers
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - OT-Attack: Enhancing Adversarial Transferability of Vision-Language
Models via Optimal Transport Optimization [65.57380193070574]
Vision-language pre-training models are vulnerable to multi-modal adversarial examples.
Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples.
We propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack.
arXiv Detail & Related papers (2023-12-07T16:16:50Z) - Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal
Contrastive Training [33.78990448307792]
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language.
Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words.
In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
arXiv Detail & Related papers (2023-06-15T00:19:13Z) - Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z) - Towards Unifying Medical Vision-and-Language Pre-training via Soft
Prompts [63.84720380390935]
There exist two typical types of medical vision-and-language pre-training models, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z) - Multi-Granularity Cross-Modality Representation Learning for Named
Entity Recognition on Social Media [11.235498285650142]
Named Entity Recognition (NER) on social media refers to discovering and classifying entities from unstructured free-form content.
This work introduces multi-granularity cross-modality representation learning.
Experiments show that our proposed approach achieves SOTA or near-SOTA performance on two benchmark datasets of tweets.
arXiv Detail & Related papers (2022-10-19T15:14:55Z) - Cross-Media Keyphrase Prediction: A Unified Framework with
Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
arXiv Detail & Related papers (2020-11-03T08:44:18Z) - Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval [41.505920288928365]
Multimodal data has inspired interest in cross-modal retrieval methods.
We propose novel within-modality losses which encourage semantic coherency in both the text and image subspaces.
Our method ensures that not only are paired images and texts close, but the expected image-image and text-text relationships are also observed.
arXiv Detail & Related papers (2020-07-16T20:32:54Z)