Multimodal Categorization of Crisis Events in Social Media
- URL: http://arxiv.org/abs/2004.04917v1
- Date: Fri, 10 Apr 2020 06:31:30 GMT
- Title: Multimodal Categorization of Crisis Events in Social Media
- Authors: Mahdi Abavisani and Liwei Wu and Shengli Hu and Joel Tetreault and
Alejandro Jaimes
- Abstract summary: We present a new multimodal fusion method that leverages both images and texts as input.
In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities.
We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
- Score: 81.07061295887172
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent developments in image classification and natural language processing,
coupled with the rapid growth in social media usage, have enabled fundamental
advances in detecting breaking events around the world in real-time. Emergency
response is one such area that stands to gain from these advances. Systems
that process billions of texts and images a minute can automatically detect
events, enabling emergency response workers to better assess rapidly evolving
situations and deploy resources accordingly. To date, most event detection
techniques in this area have focused on image-only or text-only approaches,
limiting detection performance and impacting the quality of information
delivered to crisis response teams. In this paper, we present a new multimodal
fusion method that leverages both images and texts as input. In particular, we
introduce a cross-attention module that can filter uninformative and misleading
components from weak modalities on a sample by sample basis. In addition, we
employ a multimodal graph-based approach to stochastically transition between
embeddings of different multimodal pairs during training to better regularize
the learning process and to cope with limited training data by constructing
new matched pairs from different samples. We show that our method
outperforms the unimodal approaches and strong multimodal baselines by a large
margin on three crisis-related tasks.
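
The two mechanisms named in the abstract, per-sample filtering of a weak modality via cross-attention and stochastic re-pairing of multimodal embeddings during training, can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical rendering of those ideas, not the authors' released implementation: the sigmoid gating, the shared embedding dimension, and the label-conditioned pair swap (standing in for the paper's graph-based transitions) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class CrossAttentionGate(nn.Module):
    """Weight each modality's embedding by its per-sample agreement with the other
    modality, so an uninformative or misleading modality is scaled down (illustrative)."""

    def __init__(self, dim: int):
        super().__init__()
        self.img_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        scale = img_emb.shape[-1] ** 0.5
        # Per-sample agreement scores in (0, 1); low agreement gates that modality down.
        img_gate = torch.sigmoid((self.img_proj(img_emb) * txt_emb).sum(-1, keepdim=True) / scale)
        txt_gate = torch.sigmoid((self.txt_proj(txt_emb) * img_emb).sum(-1, keepdim=True) / scale)
        # Concatenate the gated modalities for a downstream classifier.
        return torch.cat([img_gate * img_emb, txt_gate * txt_emb], dim=-1)


def stochastic_pair_swap(img_emb, txt_emb, labels, p: float = 0.3):
    """With probability p per sample, re-pair the image embedding with the text embedding
    of another sample sharing its label, creating new matched pairs (a simplified stand-in
    for the paper's graph-based transitions between multimodal pairs)."""
    txt_out = txt_emb.clone()
    for i in range(labels.shape[0]):
        if torch.rand(1).item() < p:
            candidates = (labels == labels[i]).nonzero(as_tuple=True)[0]
            j = candidates[torch.randint(len(candidates), (1,))].item()
            txt_out[i] = txt_emb[j]
    return img_emb, txt_out


if __name__ == "__main__":
    gate = CrossAttentionGate(dim=256)
    img, txt = torch.randn(8, 256), torch.randn(8, 256)
    labels = torch.randint(0, 3, (8,))
    img_m, txt_m = stochastic_pair_swap(img, txt, labels)
    print(gate(img_m, txt_m).shape)  # expected: torch.Size([8, 512])
```

In a full pipeline, the gated, concatenated representation would feed a classification head for the crisis-related tasks, and the pair swap would be applied to training batches only.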
Related papers
- From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images.
Our method not only produces visually superior fusion results but also achieves a higher detection mAP than existing methods, reaching state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
- TextAug: Test time Text Augmentation for Multimodal Person Re-identification [8.557492202759711]
The bottleneck for multimodal deep learning is the need for a large volume of multimodal training examples.
Data augmentation techniques such as cropping, flipping, rotation, etc. are often employed in the image domain to improve the generalization of deep learning models.
In this study, we investigate the effectiveness of two computer vision data augmentation techniques: cutout and cutmix, for text augmentation in multi-modal person re-identification.
arXiv Detail & Related papers (2023-12-04T03:38:04Z)
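
For the cutout and cutmix ideas mentioned in the TextAug entry above, here is a generic, hypothetical illustration of how those two image-style augmentations translate to token sequences; it is not the TextAug implementation, and the mask-token convention is an assumption.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder; a real pipeline would use its tokenizer's mask id


def text_cutout(tokens, span_frac=0.2):
    """Mask a random contiguous span of tokens (cutout adapted from images to text)."""
    n = len(tokens)
    span = max(1, int(n * span_frac))
    start = random.randint(0, n - span)
    return tokens[:start] + [MASK_TOKEN] * span + tokens[start + span:]


def text_cutmix(tokens_a, tokens_b, span_frac=0.2):
    """Replace a random span in one caption with a span from another caption (cutmix)."""
    n = min(len(tokens_a), len(tokens_b))
    span = max(1, int(n * span_frac))
    start = random.randint(0, n - span)
    return tokens_a[:start] + tokens_b[start:start + span] + tokens_a[start + span:]


if __name__ == "__main__":
    a = "a person in a red jacket crossing the street".split()
    b = "a man wearing a blue coat near the station".split()
    print(text_cutout(a))
    print(text_cutmix(a, b))
```

Cutout drops a contiguous span while cutmix splices in a span from another caption; both create perturbed text views without collecting new multimodal pairs.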
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Iterative Adversarial Attack on Image-guided Story Ending Generation [37.42908817585858]
Multimodal learning involves developing models that can integrate information from various sources like images and texts.
Deep neural networks, which are the backbone of recent IgSEG models, are vulnerable to adversarial samples.
We propose an iterative adversarial attack method (Iterative-attack) that fuses image and text modality attacks.
arXiv Detail & Related papers (2023-05-16T06:19:03Z)
- Multi-modal Fake News Detection on Social Media via Multi-grained Information Fusion [21.042970740577648]
We present a Multi-grained Multi-modal Fusion Network (MMFN) for fake news detection.
Inspired by the multi-grained process of human assessment of news authenticity, we respectively employ two Transformer-based pre-trained models to encode token-level features from text and images.
The multi-modal module fuses fine-grained features, taking into account coarse-grained features encoded by the CLIP encoder.
arXiv Detail & Related papers (2023-04-03T09:13:59Z)
- Cross-modal Contrastive Learning for Multimodal Fake News Detection [10.760000041969139]
COOLANT is a cross-modal contrastive learning framework for multimodal fake news detection.
A cross-modal fusion module is developed to learn the cross-modality correlations.
An attention guidance module is implemented to help effectively and interpretably aggregate the aligned unimodal representations.
arXiv Detail & Related papers (2023-02-25T10:12:34Z)
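
For the cross-modal contrastive learning idea in the COOLANT entry above, a minimal, generic InfoNCE-style objective between matched image and text embeddings looks as follows; COOLANT's specific alignment, fusion, and attention-guidance modules are not reproduced, so this is a sketch of the general technique only.

```python
import torch
import torch.nn.functional as F


def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Pull matched image/text pairs together and push mismatched pairs apart."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(img.shape[0], device=img.device)
    # Symmetric loss: image-to-text and text-to-image retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    loss = cross_modal_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
    print(loss.item())
```

Each image is trained to score its own caption above the other captions in the batch, and vice versa.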
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment [99.29153138760417]
Cross-modal generalization is a learning paradigm to train a model that can quickly perform new tasks in a target modality.
We study a key research question: how can we ensure generalization across modalities despite using separate encoders for different source and target modalities?
Our solution is based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data.
arXiv Detail & Related papers (2020-12-04T19:27:26Z)
- Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
arXiv Detail & Related papers (2020-11-03T08:44:18Z)