M2DF: Multi-grained Multi-curriculum Denoising Framework for Multimodal
Aspect-based Sentiment Analysis
- URL: http://arxiv.org/abs/2310.14605v1
- Date: Mon, 23 Oct 2023 06:22:39 GMT
- Title: M2DF: Multi-grained Multi-curriculum Denoising Framework for Multimodal
Aspect-based Sentiment Analysis
- Authors: Fei Zhao, Chunhui Li, Zhen Wu, Yawen Ouyang, Jianbing Zhang, Xinyu Dai
- Abstract summary: Multimodal Aspect-based Sentiment Analysis (MABSA) is a fine-grained Sentiment Analysis task.
We propose a Multi-grained Multi-curriculum Denoising Framework (M2DF) which can achieve denoising by adjusting the order of training data.
Our framework consistently outperforms state-of-the-art work on three sub-tasks of MABSA.
- Score: 32.9772577419091
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Aspect-based Sentiment Analysis (MABSA) is a fine-grained
Sentiment Analysis task that has attracted growing research interest recently.
Existing work mainly utilizes image information to improve the performance of
the MABSA task. However, most studies overestimate the importance of images,
since the datasets contain many noisy images unrelated to the text, which
negatively affects model learning. Although some work attempts to filter out
low-quality noisy images by setting thresholds, relying on thresholds
inevitably discards much useful image information. Therefore, in this work, we
focus on whether the negative impact of noisy images can be reduced without
modifying the data. To achieve this goal, we borrow the idea of Curriculum
Learning and propose a Multi-grained Multi-curriculum Denoising Framework
(M2DF), which achieves denoising by adjusting the order of the training data.
Extensive experimental results show that our framework consistently
outperforms state-of-the-art work on three sub-tasks of MABSA.
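The curriculum idea above can be sketched in a few lines. This is an illustrative example, not the authors' implementation: the per-example image-text relevance scores and the three-stage schedule are hypothetical placeholders standing in for M2DF's multi-grained noise measures. The key point it demonstrates is that noisy pairs are deferred to later stages rather than filtered out by a threshold.

```python
def curriculum_batches(examples, scores, num_stages=3):
    """Yield training stages from cleanest to noisiest.

    examples: list of training samples (e.g. text-image pairs)
    scores:   per-example image-text relevance; higher = less noisy
              (hypothetical stand-in for a learned noise measure)
    """
    # Order examples from most to least relevant image.
    order = sorted(range(len(examples)), key=lambda i: -scores[i])
    stage_size = (len(order) + num_stages - 1) // num_stages
    seen = []
    for s in range(num_stages):
        seen.extend(order[s * stage_size:(s + 1) * stage_size])
        # Each stage trains on all data exposed so far, so noisy
        # samples are postponed, never discarded.
        yield [examples[i] for i in seen]

data = ["a", "b", "c", "d", "e", "f"]
relevance = [0.9, 0.1, 0.8, 0.4, 0.7, 0.2]
stages = list(curriculum_batches(data, relevance))
print(stages[0])  # cleanest third first: ['a', 'c']
```

Unlike threshold-based filtering, the final stage still contains every training example; only the order of exposure changes.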
Related papers
- Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning [53.766434746801366]
Multimodal contrastive learning (MCL) has shown remarkable advances in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet.
Hackers may exploit image-text data for model training without authorization, potentially including personal and privacy-sensitive information.
Recent works propose generating unlearnable examples by adding imperceptible perturbations to training images to build shortcuts for protection.
We propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples.
arXiv Detail & Related papers (2024-07-23T09:00:52Z)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs)
In particular, we study the importance of various architecture components and data choices.
We demonstrate that large-scale multimodal pre-training benefits from a careful mix of image-caption, interleaved image-text, and text-only data.
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
- ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation [36.43428388918294]
Web-scale training on paired text-image data is becoming increasingly central to multimodal learning.
Standard data filtering approaches fail to remove mismatched text-image pairs.
We propose a new metric, image caption concreteness, that measures how concrete a caption is without requiring a reference image.
arXiv Detail & Related papers (2024-03-02T20:36:10Z)
- Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages [29.416563233407892]
The study investigates the effectiveness of utilizing multimodal information in Neural Machine Translation (NMT).
Surprisingly, the study finds that images might be redundant in this context.
Experiments on English-to-Hindi, Bengali, and Malayalam translation significantly outperform state-of-the-art benchmarks.
arXiv Detail & Related papers (2023-08-30T14:52:14Z)
- Generalizable Denoising of Microscopy Images using Generative Adversarial Networks and Contrastive Learning [0.0]
We propose a novel framework for few-shot microscopy image denoising.
Our approach combines a generative adversarial network (GAN) trained via contrastive learning (CL) with two structure preserving loss terms.
We demonstrate the effectiveness of our method on three well-known microscopy imaging datasets.
arXiv Detail & Related papers (2023-03-27T13:55:07Z)
- Masked Image Training for Generalizable Deep Image Denoising [53.03126421917465]
We present a novel approach to enhance the generalization performance of denoising networks.
Our method involves masking random pixels of the input image and reconstructing the missing information during training.
Our approach exhibits better generalization ability than other deep learning models and is directly applicable to real-world scenarios.
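The masking step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the mask ratio, array shapes, and loss comment are assumptions made for the example.

```python
import numpy as np

def mask_random_pixels(image, mask_ratio=0.5, rng=None):
    """Zero out a random fraction of pixels; return the masked image
    and the boolean mask (True = pixel was masked out)."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(image.shape) < mask_ratio
    masked = np.where(mask, 0.0, image)
    return masked, mask

img = np.ones((4, 4))
masked, mask = mask_random_pixels(img, mask_ratio=0.5)
# During training, the network would reconstruct the missing pixels;
# a simple reconstruction loss restricted to masked positions could be
# loss = ((prediction - img)[mask] ** 2).mean()
```

The denoiser never sees the masked pixels, so it must learn to infer image structure from context, which is the source of the generalization the summary describes.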
arXiv Detail & Related papers (2023-03-23T09:33:44Z)
- Deep Semantic Statistics Matching (D2SM) Denoising Network [70.01091467628068]
We introduce the Deep Semantic Statistics Matching (D2SM) Denoising Network.
It exploits the semantic features of pretrained classification networks and implicitly matches the probability distribution of clean images in the semantic feature space.
By learning to preserve the semantic distribution of denoised images, we empirically find our method significantly improves the denoising capabilities of networks.
arXiv Detail & Related papers (2022-07-19T14:35:42Z)
- Deformed2Self: Self-Supervised Denoising for Dynamic Medical Imaging [0.0]
We propose Deformed2Self, an end-to-end self-supervised deep learning framework for dynamic imaging denoising.
It combines single-image and multi-image denoising to improve image quality and uses a spatial transformer network to model motion between different slices.
arXiv Detail & Related papers (2021-06-23T05:50:19Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.