Understanding Dark Scenes by Contrasting Multi-Modal Observations
- URL: http://arxiv.org/abs/2308.12320v2
- Date: Sat, 18 Nov 2023 07:19:51 GMT
- Title: Understanding Dark Scenes by Contrasting Multi-Modal Observations
- Authors: Xiaoyu Dong and Naoto Yokoya
- Abstract summary: We introduce a supervised multi-modal contrastive learning approach to increase the semantic discriminability of the learned multi-modal feature spaces.
The cross-modal contrast encourages same-class embeddings from the two modalities to be closer and pushes different-class ones apart.
The intra-modal contrast pulls same-class embeddings within each modality together and pushes different-class ones apart.
- Score: 20.665687608385625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding dark scenes based on multi-modal image data is challenging, as
both the visible and auxiliary modalities provide limited semantic information
for the task. Previous methods focus on fusing the two modalities but neglect
the correlations among semantic classes when minimizing losses to align pixels
with labels, resulting in inaccurate class predictions. To address these
issues, we introduce a supervised multi-modal contrastive learning approach to
increase the semantic discriminability of the learned multi-modal feature
spaces by jointly performing cross-modal and intra-modal contrast under the
supervision of the class correlations. The cross-modal contrast encourages
same-class embeddings from across the two modalities to be closer and pushes
different-class ones apart. The intra-modal contrast forces same-class or
different-class embeddings within each modality to be together or apart. We
validate our approach on a variety of tasks that cover diverse light conditions
and image modalities. Experiments show that our approach can effectively
enhance dark scene understanding based on multi-modal images with limited
semantics by shaping semantic-discriminative feature spaces. Comparisons with
previous methods demonstrate our state-of-the-art performance. Code and
pretrained models are available at https://github.com/palmdong/SMMCL.
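Below is a minimal sketch, assuming per-pixel embeddings and integer class labels, of how the supervised cross-modal and intra-modal contrast described above could be computed. Tensor shapes, the temperature value, and all function names are illustrative assumptions, not the authors' implementation; see the repository above for the official code.

```python
import torch
import torch.nn.functional as F

def supervised_contrast(anchor, contrast, labels, temperature=0.1, exclude_self=False):
    """Supervised InfoNCE-style contrast: for each anchor embedding, same-class
    rows of `contrast` are positives and different-class rows are negatives.

    anchor, contrast: (N, D) embeddings (e.g. pixels sampled from feature maps).
    labels:           (N,) integer class labels shared by both embedding sets.
    """
    anchor = F.normalize(anchor, dim=1)
    contrast = F.normalize(contrast, dim=1)
    logits = anchor @ contrast.t() / temperature                   # (N, N) similarities
    pos_mask = (labels[:, None] == labels[None, :]).float()
    if exclude_self:                                               # intra-modal case:
        eye = torch.eye(len(anchor), device=anchor.device)         # drop the trivial i-i pair
        pos_mask = pos_mask * (1.0 - eye)
        logits = logits.masked_fill(eye.bool(), -1e9)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Mean log-probability over each anchor's same-class (positive) pairs.
    loss = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1.0)
    return loss.mean()

def joint_contrast(feat_visible, feat_auxiliary, labels):
    """Cross-modal contrast between the two modalities plus intra-modal contrast
    within each modality, all supervised by the same class labels."""
    cross = supervised_contrast(feat_visible, feat_auxiliary, labels)
    intra = (supervised_contrast(feat_visible, feat_visible, labels, exclude_self=True)
             + supervised_contrast(feat_auxiliary, feat_auxiliary, labels, exclude_self=True))
    return cross + intra
```

In this sketch the same class labels supervise both contrasts: same-class embeddings act as positives and different-class ones as negatives, whether the pair spans the two modalities (cross-modal) or stays within one modality (intra-modal).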
Related papers
- Turbo your multi-modal classification with contrastive learning [17.983460380784337]
In this paper, we propose a novel contrastive learning strategy, called Turbo, to promote multi-modal understanding.
Specifically, multi-modal data pairs are sent through the forward pass twice with different hidden dropout masks to get two different representations for each modality.
With these representations, we obtain multiple in-modal and cross-modal contrastive objectives for training; a rough sketch of this dropout-twice setup is given below.
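The following is a minimal, hedged sketch of the dropout-twice idea summarized above, not the Turbo implementation; the encoder interfaces, temperature, and the particular pairing of objectives are illustrative assumptions.

```python
# Hedged sketch: two forward passes with dropout active yield two views per
# modality, which feed in-modal and cross-modal InfoNCE objectives.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """In-batch InfoNCE: a[i] and b[i] are positives, other rows are negatives."""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.t() / temperature
    targets = torch.arange(len(a), device=a.device)
    return F.cross_entropy(logits, targets)

def turbo_style_objectives(image_encoder, text_encoder, images, texts):
    image_encoder.train(); text_encoder.train()              # keep dropout masks active
    v1, v2 = image_encoder(images), image_encoder(images)    # two dropout views (image)
    t1, t2 = text_encoder(texts), text_encoder(texts)        # two dropout views (text)
    in_modal = info_nce(v1, v2) + info_nce(t1, t2)
    cross_modal = info_nce(v1, t1) + info_nce(v2, t2)
    return in_modal + cross_modal
```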
arXiv Detail & Related papers (2024-09-14T03:15:34Z)
- Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning [71.14084801851381]
Change captioning aims to succinctly describe the semantic change between a pair of similar images.
Most existing methods directly capture the difference between them, which risks obtaining error-prone difference features.
We propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations.
arXiv Detail & Related papers (2024-07-16T13:00:33Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Deep Intra-Image Contrastive Learning for Weakly Supervised One-Step Person Search [98.2559247611821]
We present a novel deep intra-image contrastive learning method using a Siamese network.
Our method achieves state-of-the-art performance among weakly supervised one-step person search approaches.
arXiv Detail & Related papers (2023-02-09T12:45:20Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment due to their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix [59.25846149124199]
This paper proposes a data augmentation method, namely cross-modal CutMix.
CMC transforms natural sentences from the textual view into a multi-modal view.
By attaching cross-modal noise to uni-modal data, it guides models to learn token-level interactions across modalities for better denoising.
arXiv Detail & Related papers (2022-06-17T17:56:47Z)
- Contrastive Learning of Visual-Semantic Embeddings [4.7464518249313805]
We propose two loss functions based on normalized cross-entropy for learning a joint visual-semantic embedding.
We compare our results with existing visual-semantic embedding methods on cross-modal image-to-text and text-to-image retrieval tasks.
arXiv Detail & Related papers (2021-10-17T17:28:04Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.