Transfer Learning with Joint Fine-Tuning for Multimodal Sentiment
Analysis
- URL: http://arxiv.org/abs/2210.05790v1
- Date: Tue, 11 Oct 2022 21:16:14 GMT
- Title: Transfer Learning with Joint Fine-Tuning for Multimodal Sentiment
Analysis
- Authors: Guilherme Lourenço de Toledo and Ricardo Marcondes Marcacini
- Abstract summary: We introduce a transfer learning approach using joint fine-tuning for sentiment analysis.
Our proposal can incorporate any pre-trained text or image model during the joint fine-tuning stage.
- Score: 0.6091702876917281
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing methods focus on sentiment analysis of textual data.
However, images and videos have recently seen massive use on social platforms,
motivating sentiment analysis over other modalities. Current studies show that
exploring other modalities (e.g., images) increases sentiment analysis
performance. State-of-the-art multimodal models, such as CLIP and VisualBERT,
are pre-trained on datasets of text paired with images. Although these models
achieve promising results, both pre-training them and fine-tuning them for
sentiment analysis are computationally expensive. This paper introduces a
transfer learning approach using joint fine-tuning for sentiment analysis. Our
proposal achieves competitive results with a more straightforward fine-tuning
strategy that leverages different pre-trained unimodal models and efficiently
combines them in a multimodal space. Moreover, our proposal can incorporate any
pre-trained text or image model during the joint fine-tuning stage, which is
especially attractive for sentiment classification in low-resource scenarios.
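To make the joint fine-tuning idea concrete, here is a minimal PyTorch sketch
(not the authors' released code): a pre-trained text encoder and a pre-trained
image encoder are projected into a shared multimodal space, fused, and the
whole stack is fine-tuned end-to-end on sentiment labels. The backbone choices
(BERT, ResNet-50), the 512-dimensional projections, and fusion by concatenation
are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel          # pre-trained text backbone
from torchvision import models              # pre-trained image backbone


class JointMultimodalSentiment(nn.Module):
    """Combines two pre-trained unimodal encoders in a shared multimodal space."""

    def __init__(self, num_classes: int = 3, text_model: str = "bert-base-uncased"):
        super().__init__()
        # Any pre-trained unimodal backbones can be plugged in here.
        self.text_encoder = AutoModel.from_pretrained(text_model)       # 768-d [CLS]
        self.image_encoder = models.resnet50(weights="IMAGENET1K_V2")
        self.image_encoder.fc = nn.Identity()                           # expose 2048-d features

        # Project both modalities into a shared 512-d multimodal space.
        self.text_proj = nn.Linear(768, 512)
        self.image_proj = nn.Linear(2048, 512)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(2 * 512, num_classes),   # concatenation-based fusion
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]              # [CLS] token embedding
        image_feat = self.image_encoder(pixel_values)
        fused = torch.cat(
            [self.text_proj(text_feat), self.image_proj(image_feat)], dim=-1
        )
        return self.classifier(fused)


# Joint fine-tuning: a single optimizer updates both backbones and the fusion
# head together, rather than training a classifier on frozen unimodal features.
model = JointMultimodalSentiment()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```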
Related papers
- Utilizing Large Language Models for Event Deconstruction to Enhance Multimodal Aspect-Based Sentiment Analysis [2.1329326061804816]
This paper introduces Large Language Models (LLMs) for event decomposition and proposes a reinforcement learning framework for Multimodal Aspect-based Sentiment Analysis (MABSA-RL).
Experimental results show that MABSA-RL outperforms existing advanced methods on two benchmark datasets.
arXiv Detail & Related papers (2024-10-18T03:40:45Z) - YaART: Yet Another ART Rendering Technology [119.09155882164573]
This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences.
We analyze how these choices affect both the efficiency of the training process and the quality of the generated images.
We demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets.
arXiv Detail & Related papers (2024-04-08T16:51:19Z) - MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate the importance of a careful mix of image-caption, interleaved image-text, and text-only data for large-scale multimodal pre-training.
arXiv Detail & Related papers (2024-03-14T17:51:32Z) - OT-Attack: Enhancing Adversarial Transferability of Vision-Language
Models via Optimal Transport Optimization [65.57380193070574]
Vision-language pre-training models are vulnerable to multi-modal adversarial examples.
Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples.
We propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack.
arXiv Detail & Related papers (2023-12-07T16:16:50Z) - Self-training Strategies for Sentiment Analysis: An Empirical Study [7.416913210816592]
Self-training is an economical and efficient technique for developing sentiment analysis models.
We compare several self-training strategies with the intervention of large language models.
arXiv Detail & Related papers (2023-09-15T21:42:46Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Informative Sample Mining Network for Multi-Domain Image-to-Image
Translation [101.01649070998532]
We show that improving the sample selection strategy is an effective solution for image-to-image translation tasks.
We propose a novel multi-stage sample training scheme to reduce sample hardness while preserving sample informativeness.
arXiv Detail & Related papers (2020-01-05T05:48:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.