The Emotions of the Crowd: Learning Image Sentiment from Tweets via
Cross-modal Distillation
- URL: http://arxiv.org/abs/2304.14942v1
- Date: Fri, 28 Apr 2023 15:56:02 GMT
- Title: The Emotions of the Crowd: Learning Image Sentiment from Tweets via
Cross-modal Distillation
- Authors: Alessio Serra, Fabio Carrara, Maurizio Tesconi and Fabrizio Falchi
- Abstract summary: We propose an automated approach for building sentiment polarity classifiers based on a cross-modal distillation paradigm.
We applied our method to randomly collected images crawled from Twitter over three months and produced a weakly-labeled dataset.
- Score: 7.5543161581406775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Trends and opinion mining in social media increasingly focus on novel
interactions involving visual media, like images and short videos, in addition
to text. In this work, we tackle the problem of visual sentiment analysis of
social media images -- specifically, the prediction of image sentiment
polarity. While previous work relied on manually labeled training sets, we
propose an automated approach for building sentiment polarity classifiers based
on a cross-modal distillation paradigm; starting from scraped multimodal (text
+ images) data, we train a student model on the visual modality based on the
outputs of a textual teacher model that analyses the sentiment of the
corresponding textual modality. We applied our method to randomly collected
images crawled from Twitter over three months and produced, after automatic
cleaning, a weakly-labeled dataset of $\sim$1.5 million images. Despite
exploiting noisy labeled samples, our training pipeline produces classifiers
showing strong generalization capabilities and outperforming the current state
of the art on five manually labeled benchmarks for image sentiment polarity
prediction.
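
The training recipe above can be sketched in a few lines of PyTorch. This is a minimal illustration of the cross-modal distillation idea, assuming an off-the-shelf text sentiment pipeline as the teacher and a ResNet-50 student; the model choices, hard pseudo-labels, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50
from transformers import pipeline

# Teacher: an off-the-shelf text sentiment classifier; the default HF pipeline
# is only a stand-in for the paper's textual teacher model.
teacher = pipeline("sentiment-analysis")

# Student: a standard image backbone with a binary (negative/positive) head.
student = resnet50(weights="IMAGENET1K_V2")
student.fc = torch.nn.Linear(student.fc.in_features, 2)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distill_step(images, texts):
    """One weakly supervised step: the teacher scores the tweet texts, and the
    visual student learns to reproduce those labels from the paired images."""
    with torch.no_grad():
        preds = teacher(list(texts))
        # Hard pseudo-labels for brevity; soft targets would reuse the
        # teacher's full score distribution instead.
        targets = torch.tensor([1 if p["label"] == "POSITIVE" else 0
                                for p in preds])
    logits = student(images)                 # images: (B, 3, 224, 224) tensor
    loss = F.cross_entropy(logits, targets)  # weak supervision from the teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```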
Related papers
- Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that take a single image of an individual and ground the generation process on it together with text describing the desired visual context.
We introduce a standardized dataset (Stellar) of personalized prompts coupled with images of individuals; it is an order of magnitude larger than existing relevant datasets and provides rich semantic ground-truth annotations.
We derive a simple yet efficient personalized text-to-image baseline that does not require test-time fine-tuning for each subject and sets a new SoTA both quantitatively and in human trials.
arXiv Detail & Related papers (2023-12-11T04:47:39Z)
- Shatter and Gather: Learning Referring Image Segmentation with Text Supervision [52.46081425504072]
We present a new model that discovers semantic entities in the input image and then combines the entities relevant to the text query to predict the mask of the referent.
Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed existing methods for the same task and recent open-vocabulary segmentation models on all benchmarks.
arXiv Detail & Related papers (2023-08-29T15:39:15Z)
- Dense Text-to-Image Generation with Attention Modulation [49.287458275920514]
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions.
We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions.
We achieve visual results of similar quality to models specifically trained with layout conditions.
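
A hedged sketch of the attention-modulation idea (not the official DenseDiffusion implementation): cross-attention logits are biased so that tokens of each caption segment attend to their designated image region. Tensor shapes and the `strength` parameter below are illustrative assumptions.

```python
import torch

def modulate_cross_attention(scores, region_masks, token_spans, strength=2.0):
    """Bias cross-attention so each caption segment attends to its region.

    scores:       (heads, num_pixels, num_tokens) raw cross-attention logits
    region_masks: list of (num_pixels,) boolean masks, one per caption segment
    token_spans:  list of (start, end) token index ranges aligned with the masks
    """
    bias = torch.zeros_like(scores)
    for mask, (start, end) in zip(region_masks, token_spans):
        inside = mask.float().unsqueeze(-1)          # (num_pixels, 1)
        # Boost attention from pixels inside the region to the segment's tokens
        # and suppress it elsewhere.
        bias[:, :, start:end] += strength * (2.0 * inside - 1.0)
    return torch.softmax(scores + bias, dim=-1)
```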
arXiv Detail & Related papers (2023-08-24T17:59:01Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
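
The word-conditional soft masking can be illustrated with a short sketch (an interpretation of the summary above, not the authors' code): patches that a sampled word attends to strongly are attenuated, yielding a harder, word-conditioned view for the ITM task.

```python
import torch

def soft_mask_regions(patch_feats, word_attn, temperature=0.1):
    """Attenuate the patches most attended by a word (illustrative sketch).

    patch_feats: (num_patches, dim) visual features from the image encoder
    word_attn:   (num_patches,) word-conditional attention over the patches
    """
    attn = torch.softmax(word_attn / temperature, dim=0)
    keep = 1.0 - attn / attn.max()      # most-attended patch is suppressed most
    return patch_feats * keep.unsqueeze(-1)
```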
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved by jointly leveraging visual and linguistic similarity.
The classification tasks are tackled via self-training in a teacher-student framework, motivated by the typically limited scale of labeled data.
The results show that our method further advances the performance of previous state-of-the-art models.
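
A generic teacher-student self-training step in the spirit of the summary above is sketched below; the comment-retrieval features are abstracted into the input batches, and the confidence threshold is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def self_training_step(student, teacher, optimizer,
                       labeled_batch, unlabeled_batch, threshold=0.9):
    """Train the student on gold labels plus confident teacher pseudo-labels."""
    x_l, y_l = labeled_batch          # features and gold labels
    x_u = unlabeled_batch             # features only

    with torch.no_grad():
        probs = torch.softmax(teacher(x_u), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf >= threshold      # keep only confident pseudo-labels

    loss = F.cross_entropy(student(x_l), y_l)
    if keep.any():
        loss = loss + F.cross_entropy(student(x_u[keep]), pseudo[keep])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```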
arXiv Detail & Related papers (2023-03-27T08:59:55Z)
- Correlational Image Modeling for Self-Supervised Visual Pre-Training [81.82907503764775]
Correlational Image Modeling is a novel and surprisingly effective approach to self-supervised visual pre-training.
Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task.
arXiv Detail & Related papers (2023-03-22T15:48:23Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
In practice, massive pre-training data are usually collected in a streaming fashion.
We propose multi-modal knowledge distillation between images and texts to align the instance-wise predictions of the old and new models.
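
One way to realize such instance-wise alignment is a KL-based distillation between the similarity distributions of the frozen old model and the new model; the sketch below is an illustrative assumption, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def instance_distillation_loss(new_img, new_txt, old_img, old_txt, tau=0.05):
    """Push the new model's image-to-text similarity distribution over the
    batch toward the frozen old model's distribution.

    *_img, *_txt: (B, dim) L2-normalized image / text embeddings.
    """
    new_logits = new_img @ new_txt.t() / tau        # (B, B) similarities
    with torch.no_grad():
        old_logits = old_img @ old_txt.t() / tau
    return F.kl_div(F.log_softmax(new_logits, dim=-1),
                    F.softmax(old_logits, dim=-1),
                    reduction="batchmean")
```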
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
- Transfer Learning with Joint Fine-Tuning for Multimodal Sentiment Analysis [0.6091702876917281]
We introduce a transfer learning approach using joint fine-tuning for sentiment analysis.
Our proposal allows flexibility when incorporating any pre-trained model for texts and images during the joint fine-tuning stage.
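
A minimal sketch of that flexibility: any pre-trained text and image encoders (assumed here to return one pooled feature vector per example) are fine-tuned jointly with a small fusion head. The fusion design and dimensions below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class JointFineTuneClassifier(nn.Module):
    """Joint fine-tuning of pluggable text and image encoders (sketch)."""

    def __init__(self, text_encoder, image_encoder,
                 text_dim, image_dim, num_classes=2):
        super().__init__()
        self.text_encoder = text_encoder      # e.g. a BERT-like pooled encoder
        self.image_encoder = image_encoder    # e.g. a ResNet/ViT backbone
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_inputs, images):
        t = self.text_encoder(text_inputs)    # (B, text_dim) pooled features
        v = self.image_encoder(images)        # (B, image_dim) pooled features
        return self.head(torch.cat([t, v], dim=-1))
```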
arXiv Detail & Related papers (2022-10-11T21:16:14Z)
- An AutoML-based Approach to Multimodal Image Sentiment Analysis [1.0499611180329804]
We propose a method that combines both textual and image individual sentiment analysis into a final fused classification based on AutoML.
Our method achieved state-of-the-art performance in the B-T4SA dataset, with 95.19% accuracy.
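
The late-fusion step can be illustrated as follows: per-modality sentiment probabilities are concatenated and a final classifier is trained on top. The cited work searches for that fusion model with AutoML; a plain logistic regression is used here only as a stand-in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion_classifier(text_probs, image_probs, labels):
    """Train a final classifier on concatenated per-modality probabilities.

    text_probs, image_probs: (N, C) class probabilities from each modality
    labels: (N,) gold sentiment labels
    """
    fused = np.concatenate([text_probs, image_probs], axis=1)
    return LogisticRegression(max_iter=1000).fit(fused, labels)
```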
arXiv Detail & Related papers (2021-02-16T11:28:50Z)
- Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which are a mismatch (negative) using a hinge-based triplet ranking loss.
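
A standard hinge-based triplet ranking loss for cross-modal retrieval looks like the sketch below (a generic formulation; the cited model's exact variant may differ): matched pairs sit on the diagonal of the similarity matrix and every other pair in the batch acts as a negative.

```python
import torch

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based ranking over in-batch negatives (generic sketch).

    img_emb, txt_emb: (B, dim) L2-normalized embeddings of matched pairs.
    """
    sims = img_emb @ txt_emb.t()                  # (B, B) cosine similarities
    pos = sims.diag().unsqueeze(1)                # similarity of the true pairs
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)

    # Penalize negatives that come within `margin` of the positive pair,
    # both for mismatched captions (rows) and mismatched images (columns).
    cost_txt = (margin + sims - pos).clamp(min=0).masked_fill(mask, 0)
    cost_img = (margin + sims - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_txt.mean() + cost_img.mean()
```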
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.