MARMOT: A Deep Learning Framework for Constructing Multimodal
Representations for Vision-and-Language Tasks
- URL: http://arxiv.org/abs/2109.11526v1
- Date: Thu, 23 Sep 2021 17:48:48 GMT
- Title: MARMOT: A Deep Learning Framework for Constructing Multimodal
Representations for Vision-and-Language Tasks
- Authors: Patrick Y. Wu, Walter R. Mebane Jr
- Abstract summary: This paper proposes a novel vision-and-language framework called multimodal representations using modality translation (MARMOT).
MARMOT outperforms an ensemble text-only classifier in 19 of 20 categories in multilabel classifications of tweets reporting election incidents during the 2016 U.S. general election.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Political activity on social media presents a data-rich window into political
behavior, but the vast amount of data means that almost all content analyses of
social media require a data labeling step. However, most automated machine
classification methods ignore the multimodality of posted content, focusing
either on text or images. State-of-the-art vision-and-language models are
unusable for most political science research: they require all observations to
have both image and text and require computationally expensive pretraining.
This paper proposes a novel vision-and-language framework called multimodal
representations using modality translation (MARMOT). MARMOT presents two
methodological contributions: it can construct representations for observations
missing image or text, and it replaces the computationally expensive
pretraining with modality translation. MARMOT outperforms an ensemble text-only
classifier in 19 of 20 categories in multilabel classifications of tweets
reporting election incidents during the 2016 U.S. general election. Moreover,
MARMOT shows significant improvements over the results of benchmark multimodal
models on the Hateful Memes dataset, improving the best result set by
VisualBERT in terms of accuracy from 0.6473 to 0.6760 and area under the
receiver operating characteristic curve (AUC) from 0.7141 to 0.7530.
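To make the two contributions above concrete, here is a minimal sketch (not the authors' implementation) of a classifier that (1) projects image features into the text embedding space in the spirit of modality translation and (2) still produces a representation when either image or text is missing. The feature dimensions, learned placeholders, and pooling/classifier head are assumptions.

```python
# Hedged sketch of the two ideas highlighted in the abstract: translate image
# features into the text embedding space, and handle missing modalities.
import torch
import torch.nn as nn


class MultimodalClassifier(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, image_feat_dim=2048, n_labels=20):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # "Modality translation": map region-level image features into the
        # same embedding space as the text tokens.
        self.image_to_text_space = nn.Linear(image_feat_dim, d_model)
        # Learned placeholders used when a modality is absent (an assumption).
        self.missing_image = nn.Parameter(torch.zeros(1, 1, d_model))
        self.missing_text = nn.Parameter(torch.zeros(1, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_labels)  # multilabel logits

    def forward(self, token_ids=None, image_feats=None):
        parts = []
        if token_ids is not None:                      # (B, L) token ids
            parts.append(self.text_embed(token_ids))
        if image_feats is not None:                    # (B, R, image_feat_dim)
            parts.append(self.image_to_text_space(image_feats))
        batch = (token_ids if token_ids is not None else image_feats).shape[0]
        if token_ids is None:
            parts.append(self.missing_text.expand(batch, -1, -1))
        if image_feats is None:
            parts.append(self.missing_image.expand(batch, -1, -1))
        x = torch.cat(parts, dim=1)                    # joint token sequence
        h = self.encoder(x).mean(dim=1)                # pooled representation
        return self.classifier(h)                      # train with BCEWithLogitsLoss


model = MultimodalClassifier()
text_only = model(token_ids=torch.randint(0, 30522, (2, 16)))   # observation missing image
image_only = model(image_feats=torch.randn(2, 49, 2048))        # observation missing text
both = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 49, 2048))
print(text_only.shape, image_only.shape, both.shape)            # each (2, 20)
```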
Related papers
- New Benchmark Dataset and Fine-Grained Cross-Modal Fusion Framework for Vietnamese Multimodal Aspect-Category Sentiment Analysis [1.053698976085779]
We introduce a new Vietnamese multimodal dataset, named ViMACSA, which consists of 4,876 text-image pairs with 14,618 fine-grained annotations for both text and image in the hotel domain.
We propose a Fine-Grained Cross-Modal Fusion Framework (FCMF) that effectively learns both intra- and inter-modality interactions and then fuses this information to produce a unified multimodal representation.
Experimental results show that our framework outperforms SOTA models on the ViMACSA dataset, achieving the highest F1 score of 79.73%.
arXiv Detail & Related papers (2024-05-01T14:29:03Z)
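A generic sketch of the intra-/inter-modality attention pattern the entry above describes (not the FCMF code; all layer names and dimensions are assumptions): self-attention within each modality, cross-attention between modalities, then fusion into one representation.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.text_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.image_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_to_image = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, text, image):                  # (B, Lt, d), (B, Li, d)
        t, _ = self.text_self(text, text, text)      # intra-modality (text)
        v, _ = self.image_self(image, image, image)  # intra-modality (image)
        t2v, _ = self.text_to_image(t, v, v)         # text attends to image
        v2t, _ = self.image_to_text(v, t, t)         # image attends to text
        pooled = torch.cat([t2v.mean(1), v2t.mean(1)], dim=-1)
        return self.fuse(pooled)                     # unified multimodal representation


fusion = CrossModalFusion()
out = fusion(torch.randn(2, 12, 128), torch.randn(2, 36, 128))
print(out.shape)  # torch.Size([2, 128])
```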
- Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery [69.91441987063307]
Generalized Category Discovery (GCD) aims to cluster unlabeled data from both known and unknown categories.
Current GCD methods rely only on visual cues, which neglects the multimodal perceptive nature of human cognitive processes in discovering novel visual categories.
We propose a two-phase TextGCD framework to accomplish multi-modality GCD by exploiting powerful Visual-Language Models.
arXiv Detail & Related papers (2024-03-12T07:06:50Z)
- Text2Topic: Multi-Label Text Classification System for Efficient Topic Detection in User Generated Content with Zero-Shot Capabilities [2.7311827519141363]
We propose Text to Topic (Text2Topic), which achieves high multi-label classification performance.
Text2Topic supports zero-shot predictions, produces domain-specific text embeddings, and enables production-scale batch-inference.
The model is deployed on a real-world stream processing platform, and it outperforms other models with 92.9% micro mAP.
arXiv Detail & Related papers (2023-10-23T11:33:24Z)
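One way to read the zero-shot claim in the entry above is that texts and topic descriptions are embedded and scored pairwise, so unseen topics can still be scored at prediction time. The sketch below illustrates that reading only; the toy bag-of-words encoder and the 0.5 decision threshold are assumptions, not the Text2Topic architecture.

```python
import torch
import torch.nn as nn


class PairwiseTopicScorer(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, d_model)  # toy bag-of-words encoder
        self.proj = nn.Linear(d_model, d_model)            # align text to topic space

    def forward(self, text_ids, topic_ids):
        text_vec = self.proj(self.embed(text_ids))         # (B, d)
        topic_vec = self.embed(topic_ids)                   # (T, d)
        logits = text_vec @ topic_vec.t()                   # (B, T) match scores
        return torch.sigmoid(logits)                        # independent per-topic probabilities


scorer = PairwiseTopicScorer()
texts = torch.randint(0, 1000, (2, 20))    # 2 documents, 20 tokens each
topics = torch.randint(0, 1000, (5, 8))    # 5 topic descriptions, possibly unseen in training
probs = scorer(texts, topics)
predicted = probs > 0.5                    # multi-label decision per topic
print(probs.shape, predicted.shape)        # (2, 5) each
```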
- T-MARS: Improving Visual Representations by Circumventing Text Feature Learning [99.3682210827572]
We propose a new data filtering approach motivated by our observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption.
Our simple and scalable approach, T-MARS, filters out only those pairs where the text dominates the remaining visual features.
arXiv Detail & Related papers (2023-07-06T16:59:52Z)
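A simplified proxy for the filtering idea above (not the T-MARS method, which scores the remaining visual features): drop image-text pairs whose caption is mostly just the text rendered in the image. The pre-computed OCR text and the 0.5 cutoff are assumptions.

```python
def caption_ocr_overlap(caption: str, ocr_text: str) -> float:
    """Fraction of caption words that also appear in the OCR-detected image text."""
    caption_words = set(caption.lower().split())
    ocr_words = set(ocr_text.lower().split())
    if not caption_words:
        return 0.0
    return len(caption_words & ocr_words) / len(caption_words)


def filter_pairs(pairs, max_overlap=0.5):
    """Keep pairs where visual content, not rendered text, carries the signal."""
    return [p for p in pairs
            if caption_ocr_overlap(p["caption"], p["ocr_text"]) <= max_overlap]


pairs = [
    {"caption": "a dog playing in the park", "ocr_text": ""},
    {"caption": "50% off summer sale", "ocr_text": "50% OFF SUMMER SALE today only"},
]
print(len(filter_pairs(pairs)))  # 1: the second pair is dominated by rendered text
```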
- Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved via jointly leveraging visual and lingual similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the typically limited scale of labeled data.
The results show that our method further advances the performance of previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-27T08:59:55Z)
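A generic teacher-student self-training loop of the kind the entry above motivates; a logistic regression stands in for the multimodal classifier, and the 0.9 confidence cutoff is an assumption, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(100, 16))
y_labeled = (X_labeled[:, 0] > 0).astype(int)        # toy labels
X_unlabeled = rng.normal(size=(1000, 16))            # plentiful unlabeled posts

# Teacher: trained on the small labeled set only.
teacher = LogisticRegression().fit(X_labeled, y_labeled)

# Pseudo-label only the unlabeled examples the teacher is confident about.
probs = teacher.predict_proba(X_unlabeled)
confident = probs.max(axis=1) >= 0.9
X_pseudo, y_pseudo = X_unlabeled[confident], probs[confident].argmax(axis=1)

# Student: trained on labeled plus confidently pseudo-labeled data.
student = LogisticRegression().fit(
    np.vstack([X_labeled, X_pseudo]),
    np.concatenate([y_labeled, y_pseudo]),
)
print(f"pseudo-labeled {confident.sum()} of {len(X_unlabeled)} unlabeled examples")
```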
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation (MMT) aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z)
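A minimal sketch of the auxiliary-sentence idea in the entry above: object labels assumed to come from a detector are verbalized and joined with the original text, so a text-only language model receives multimodal information. The template and separator are assumptions, not the paper's construction.

```python
def build_auxiliary_sentence(detected_objects, target):
    """Verbalize detector output into text that a language model can consume."""
    objects = ", ".join(detected_objects) if detected_objects else "nothing notable"
    return f"The image shows {objects}. The target is {target}."


def build_model_input(tweet_text, detected_objects, target, sep="[SEP]"):
    # Auxiliary (image-derived) sentence + original tweet, joined for a BERT-style model.
    return f"{build_auxiliary_sentence(detected_objects, target)} {sep} {tweet_text}"


example = build_model_input(
    tweet_text="Long lines at the polling station again this year",
    detected_objects=["crowd", "voting booth", "sign"],
    target="polling station",
)
print(example)
# The image shows crowd, voting booth, sign. The target is polling station. [SEP] Long lines at the polling station again this year
```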
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study provides an assessment of existing language models for distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
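A minimal sketch of the dual-encoder contrastive setup the entry above describes, with in-batch negatives and a symmetric loss; the stand-in encoders and the temperature value are assumptions, not the paper's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualEncoder(nn.Module):
    def __init__(self, image_dim=2048, vocab_size=30522, d_model=256):
        super().__init__()
        self.image_encoder = nn.Linear(image_dim, d_model)        # stand-in vision tower
        self.text_encoder = nn.EmbeddingBag(vocab_size, d_model)  # stand-in text tower
        self.temperature = 0.07

    def forward(self, image_feats, token_ids):
        img = F.normalize(self.image_encoder(image_feats), dim=-1)  # (B, d)
        txt = F.normalize(self.text_encoder(token_ids), dim=-1)     # (B, d)
        logits = img @ txt.t() / self.temperature                   # (B, B) similarities
        targets = torch.arange(logits.shape[0])                     # matched pairs on the diagonal
        # Symmetric contrastive loss: image-to-text and text-to-image.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


model = DualEncoder()
loss = model(torch.randn(8, 2048), torch.randint(0, 30522, (8, 12)))
print(loss.item())
```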
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.