CMSBERT-CLR: Context-driven Modality Shifting BERT with Contrastive
Learning for linguistic, visual, acoustic Representations
- URL: http://arxiv.org/abs/2209.07424v1
- Date: Sun, 21 Aug 2022 08:21:43 GMT
- Title: CMSBERT-CLR: Context-driven Modality Shifting BERT with Contrastive
Learning for linguistic, visual, acoustic Representations
- Authors: Junghun Kim, Jihie Kim
- Abstract summary: We present a Context-driven Modality Shifting BERT with Contrastive Learning for linguistic, visual, acoustic Representations (CMSBERT-CLR)
CMSBERT-CLR incorporates the whole context's non-verbal and verbal information and aligns modalities more effectively through contrastive learning.
In our experiments, we demonstrate that our approach achieves state-of-the-art results.
- Score: 0.7081604594416336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal sentiment analysis has become an increasingly popular research
area as the demand for multimodal online content is growing. For multimodal
sentiment analysis, words can have different meanings depending on the
linguistic context and non-verbal information, so it is crucial to understand
the meaning of the words accordingly. In addition, the word meanings should be
interpreted within the whole utterance context that includes non-verbal
information. In this paper, we present a Context-driven Modality Shifting BERT
with Contrastive Learning for linguistic, visual, acoustic Representations
(CMSBERT-CLR), which incorporates the whole context's non-verbal and verbal
information and aligns modalities more effectively through contrastive
learning. First, we introduce a Context-driven Modality Shifting (CMS) to
incorporate the non-verbal and verbal information within the whole context of
the sentence utterance. Then, to improve the alignment of different
modalities within a common embedding space, we apply contrastive learning.
Furthermore, we use an exponential moving average parameter and label smoothing
as optimization strategies, which can make the convergence of the network more
stable and increase the flexibility of the alignment. In our experiments, we
demonstrate that our approach achieves state-of-the-art results.
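The abstract gives no equations for the CMS step, so the following is only a minimal sketch of the generic gated "modality shifting" pattern from the broader multimodal-BERT literature: each word representation is displaced by a gated combination of the aligned visual and acoustic features. The module name, gating form, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Generic "modality shifting" sketch (an assumption based on the wider
# multimodal-BERT literature, NOT the paper's exact CMS formulation):
# each word representation is shifted by a gated combination of the
# visual and acoustic features for the same utterance context.
import torch
import torch.nn as nn


class ModalityShift(nn.Module):
    def __init__(self, d_text, d_visual, d_acoustic, beta=1.0):
        super().__init__()
        self.gate_v = nn.Linear(d_text + d_visual, d_text)
        self.gate_a = nn.Linear(d_text + d_acoustic, d_text)
        self.proj_v = nn.Linear(d_visual, d_text)
        self.proj_a = nn.Linear(d_acoustic, d_text)
        self.beta = beta  # scaling of the shift (illustrative value)

    def forward(self, h_text, h_visual, h_acoustic):
        # h_text: (batch, seq, d_text); h_visual / h_acoustic: (batch, seq, d_*)
        g_v = torch.sigmoid(self.gate_v(torch.cat([h_text, h_visual], dim=-1)))
        g_a = torch.sigmoid(self.gate_a(torch.cat([h_text, h_acoustic], dim=-1)))
        shift = g_v * self.proj_v(h_visual) + g_a * self.proj_a(h_acoustic)
        return h_text + self.beta * shift
```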
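For the contrastive-alignment step, a common realization is a symmetric InfoNCE-style objective over paired modality embeddings in the shared space. The sketch below is one such hedged approximation; the function name, temperature, and the use of in-batch negatives are assumptions rather than details taken from the paper.

```python
# Minimal sketch (not the authors' code): aligning two modality embeddings
# with a symmetric InfoNCE-style contrastive loss in a common space.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(text_emb, other_emb, temperature=0.07):
    """Pull matching (text, non-verbal) pairs together and push apart
    non-matching pairs within the batch.

    text_emb, other_emb: (batch, dim) projections of the linguistic and the
    visual/acoustic representations into a common embedding space.
    """
    # L2-normalize so the dot product is a cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)

    logits = text_emb @ other_emb.t() / temperature        # (batch, batch)
    targets = torch.arange(text_emb.size(0), device=text_emb.device)

    # Symmetric cross-entropy: text-to-other and other-to-text directions
    loss_t2o = F.cross_entropy(logits, targets)
    loss_o2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2o + loss_o2t)
```

With utterance-level embeddings for each modality, the same loss could be applied pairwise (text-visual and text-acoustic); how the paper actually forms positive and negative pairs is not specified in the abstract.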
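The two optimization strategies named in the abstract, an exponential moving average of the parameters and label smoothing, are standard techniques. A minimal sketch, assuming a PyTorch setup and typical hyperparameter values, could look like this.

```python
# Illustrative sketch (not the paper's code): an exponential moving average
# (EMA) copy of model parameters plus a label-smoothed classification loss.
# The `decay` and `smoothing` values are assumptions.
import copy
import torch
import torch.nn.functional as F


class EMA:
    """Keeps a slowly moving copy of a model's parameters."""

    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)


def smoothed_classification_loss(logits, labels, smoothing=0.1):
    # PyTorch >= 1.10 supports label smoothing in cross_entropy directly.
    return F.cross_entropy(logits, labels, label_smoothing=smoothing)
```

In a training loop, `ema.update(model)` would be called after each optimizer step, and the shadow copy used for evaluation, which is the usual way an EMA stabilizes convergence.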
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
arXiv Detail & Related papers (2024-01-22T11:04:55Z)
- Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition [29.523405624632378]
We introduce a token-level contrastive learning method with modality-aware prompting (TCL-MAP) to address the challenges of multimodal intent recognition.
Based on the modality-aware prompt and ground truth labels, the proposed TCL constructs augmented samples and employs NT-Xent loss on the label token.
Our method achieves remarkable improvements compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-12-22T13:03:23Z)
- Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining? [34.609984453754656]
We aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment.
Specifically, we design and release the SNARE, the first large-scale multimodal alignment probing benchmark.
arXiv Detail & Related papers (2023-08-24T16:17:40Z)
- A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues [23.743431157431893]
Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
We conduct extensive experiments on two corresponding datasets, and the results show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- MCSE: Multimodal Contrastive Learning of Sentence Embeddings [23.630041603311923]
We propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective.
We show that our approach consistently improves the performance across various datasets and pre-trained encoders.
arXiv Detail & Related papers (2022-04-22T21:19:24Z)
- EDS-MEMBED: Multi-sense embeddings based on enhanced distributional semantic structures via a graph walk over word senses [0.0]
We leverage the rich semantic structures in WordNet to enhance the quality of multi-sense embeddings.
We derive new distributional semantic similarity measures for M-SE from prior ones.
We report evaluation results on 11 benchmark datasets involving WSD and Word Similarity tasks.
arXiv Detail & Related papers (2021-02-27T14:36:55Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embeddings with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
- Improving Machine Reading Comprehension with Contextualized Commonsense Knowledge [62.46091695615262]
We aim to extract commonsense knowledge to improve machine reading comprehension.
We propose to represent relations implicitly by situating structured knowledge in a context.
We employ a teacher-student paradigm to inject multiple types of contextualized knowledge into a student machine reader.
arXiv Detail & Related papers (2020-09-12T17:20:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.