MultiSiam: Self-supervised Multi-instance Siamese Representation
Learning for Autonomous Driving
- URL: http://arxiv.org/abs/2108.12178v1
- Date: Fri, 27 Aug 2021 08:47:01 GMT
- Title: MultiSiam: Self-supervised Multi-instance Siamese Representation
Learning for Autonomous Driving
- Authors: Kai Chen, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung
- Abstract summary: Self-supervised learning might be a promising way to improve model performance.
Existing SSL methods usually rely on the single-centric-object guarantee.
We propose Multi-instance Siamese Network (MultiSiam) to improve generalization ability and achieve state-of-the-art transfer performance.
- Score: 45.23708547617418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous driving has attracted much attention over the years but turns out
to be harder than expected, probably due to the difficulty of labeled data
collection for model training. Self-supervised learning (SSL), which leverages
unlabeled data only for representation learning, might be a promising way to
improve model performance. Existing SSL methods, however, usually rely on the
single-centric-object guarantee, which may not be applicable for multi-instance
datasets such as street scenes. To alleviate this limitation, we raise two
issues to solve: (1) how to define positive samples for cross-view consistency
and (2) how to measure similarity in multi-instance circumstances. We first
adopt an IoU threshold during random cropping to transfer global-inconsistency
to local-consistency. Then, we propose two feature alignment methods to enable
2D feature maps for multi-instance similarity measurement. Additionally, we
adopt intra-image clustering with self-attention for further mining intra-image
similarity and translation-invariance. Experiments show that, when pre-trained
on the Waymo dataset, our method, Multi-instance Siamese Network (MultiSiam),
remarkably improves generalization ability and achieves state-of-the-art
transfer performance on autonomous driving benchmarks, including Cityscapes and
BDD100K, while existing SSL counterparts like MoCo, MoCo-v2, and BYOL show
significant performance drop. By pre-training on SODA10M, a large-scale
autonomous driving dataset, MultiSiam exceeds the ImageNet pre-trained MoCo-v2,
demonstrating the potential of domain-specific pre-training. Code will be
available at https://github.com/KaiChen1998/MultiSiam.
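
The two issues raised in the abstract map onto two concrete mechanisms: an IoU constraint on the paired random crops, and an alignment step that extracts the shared region from each view's 2D feature map before measuring similarity. The sketches below illustrate both; they are based only on the abstract, not on the released code, so function names such as sample_iou_constrained_crops and align_overlap_features, and the 0.2 threshold, are hypothetical placeholders rather than the authors' implementation.

```python
import random

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (left, top, right, bottom)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def sample_crop(img_w, img_h, crop_w, crop_h):
    """Sample one random crop box fully inside an img_w x img_h image."""
    x = random.randint(0, img_w - crop_w)
    y = random.randint(0, img_h - crop_h)
    return (x, y, x + crop_w, y + crop_h)

def sample_iou_constrained_crops(img_w, img_h, crop_w, crop_h,
                                 iou_thresh=0.2, max_tries=100):
    """Re-sample a pair of crops until their IoU exceeds the threshold,
    so the two views are guaranteed to share a local region."""
    for _ in range(max_tries):
        box1 = sample_crop(img_w, img_h, crop_w, crop_h)
        box2 = sample_crop(img_w, img_h, crop_w, crop_h)
        if box_iou(box1, box2) >= iou_thresh:
            return box1, box2
    return box1, box1  # fallback: identical views if no overlapping pair found
```

For example, sample_iou_constrained_crops(1920, 1280, 640, 640) draws two 640x640 views of a 1920x1280 street-scene image that overlap by at least 20% IoU, so cross-view consistency is enforced only on content both views actually contain.

The second issue, measuring similarity between 2D feature maps, requires mapping that shared image region into each view's feature coordinates and resampling both to a common grid. The abstract mentions two feature alignment methods without describing them; the function below is one plausible interpolation-based alignment, again an assumption rather than the paper's method.

```python
import torch
import torch.nn.functional as F

def align_overlap_features(feat1, feat2, box1, box2, out_size=7):
    """Crop the image region shared by both views out of each 2D feature map
    and resize both crops to a common out_size x out_size grid, so similarity
    can be measured cell by cell instead of image by image.
    feat1, feat2: (C, H, W) feature maps of view 1 and view 2.
    box1, box2:   the two crop boxes in original-image coordinates
                  (assumed to overlap, e.g. sampled with the IoU constraint above)."""
    # Overlap rectangle in original-image coordinates.
    ox1, oy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ox2, oy2 = min(box1[2], box2[2]), min(box1[3], box2[3])

    def crop_and_resize(feat, box):
        _, h, w = feat.shape
        # Map the overlap rectangle into this view's feature-map coordinates.
        sx, sy = w / float(box[2] - box[0]), h / float(box[3] - box[1])
        fx1, fy1 = int((ox1 - box[0]) * sx), int((oy1 - box[1]) * sy)
        fx2, fy2 = int((ox2 - box[0]) * sx), int((oy2 - box[1]) * sy)
        fx2, fy2 = max(fx2, fx1 + 1), max(fy2, fy1 + 1)  # keep at least one cell
        patch = feat[:, fy1:fy2, fx1:fx2].unsqueeze(0)   # (1, C, h', w')
        patch = F.interpolate(patch, size=(out_size, out_size),
                              mode="bilinear", align_corners=False)
        return patch[0]                                   # (C, out_size, out_size)

    return crop_and_resize(feat1, box1), crop_and_resize(feat2, box2)
```

A MultiSiam-style loss would then compare these aligned grids; the intra-image clustering with self-attention mentioned in the abstract is a further step on top of this and is not sketched here.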
Related papers
- MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching [54.740256498985026]
Keypoint detection and description methods often struggle with multimodal data.
We propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching.
arXiv Detail & Related papers (2025-01-20T06:56:30Z) - FusionSAM: Visual Multi-Modal Learning with Segment Anything [37.61598617788102]
We introduce the Segment Anything Model (SAM) into multimodal image segmentation for the first time.
We propose a novel framework that combines Latent Space Token Generation (LSTG) and Fusion Mask Prompting (FMP) modules.
Our method significantly outperforms SAM and SAM2 in multimodal autonomous driving scenarios.
arXiv Detail & Related papers (2024-08-26T02:20:55Z) - FedUV: Uniformity and Variance for Heterogeneous Federated Learning [5.9330433627374815]
Federated learning is a promising framework to train neural networks with widely distributed data.
Recent work has shown that the performance degradation under heterogeneous (non-IID) client data is largely due to the final layer of the network being most prone to local bias.
We investigate the training dynamics of the classifier by applying SVD to its weights, motivated by the observation that freezing the weights results in constant singular values (a minimal sketch of this SVD check appears after this list).
arXiv Detail & Related papers (2024-02-27T15:53:15Z) - Task-customized Masked AutoEncoder via Mixture of Cluster-conditional
Experts [104.9871176044644]
Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training.
We propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE).
MoCE trains each expert only with semantically relevant images by using cluster-conditional gates.
arXiv Detail & Related papers (2024-02-08T03:46:32Z) - Drive Anywhere: Generalizable End-to-end Autonomous Driving with
Multi-modal Foundation Models [114.69732301904419]
We present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models by augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - UniVIP: A Unified Framework for Self-Supervised Visual Pre-training [50.87603616476038]
We propose a novel self-supervised framework to learn versatile visual representations on either single-centric-object or non-iconic datasets.
Massive experiments show that UniVIP pre-trained on non-iconic COCO achieves state-of-the-art transfer performance.
Our method can also exploit single-centric-object datasets such as ImageNet and outperforms BYOL by 2.5% with the same pre-training epochs in linear probing.
arXiv Detail & Related papers (2022-03-14T10:04:04Z) - Co-training for Deep Object Detection: Comparing Single-modal and
Multi-modal Approaches [0.0]
We focus on the use of co-training, a semi-supervised learning (SSL) method, for obtaining self-labeled object bounding boxes (BBs).
In particular, we assess the goodness of multi-modal co-training by relying on two different views of an image, namely, appearance (RGB) and estimated depth (D).
Our results suggest that in a standard SSL setting (no domain shift, a few human-labeled data) and under virtual-to-real domain shift (many virtual-world labeled data, no human-labeled data), multi-modal co-training outperforms single-modal co-training.
arXiv Detail & Related papers (2021-04-23T14:13:59Z) - Learning Modality-Specific Representations with Self-Supervised
Multi-Task Learning for Multimodal Sentiment Analysis [11.368438990334397]
We develop a self-supervised learning strategy to acquire independent unimodal supervisions.
We conduct extensive experiments on three public multimodal baseline datasets.
Our method achieves performance comparable to that of human-annotated unimodal labels.
arXiv Detail & Related papers (2021-02-09T14:05:02Z) - Unsupervised Feature Learning by Cross-Level Instance-Group
Discrimination [68.83098015578874]
We integrate between-instance similarity into contrastive learning, not directly by instance grouping, but by cross-level discrimination.
CLD effectively brings unsupervised learning closer to natural data and real-world applications.
CLD sets a new state-of-the-art on self-supervision, semi-supervision, and transfer learning benchmarks, and beats MoCo v2 and SimCLR on every reported metric.
arXiv Detail & Related papers (2020-08-09T21:13:13Z)
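
Regarding the FedUV item above: inspecting the classifier through the singular values of its final-layer weight matrix can be reproduced in a few lines. The sketch below is based only on the one-sentence summary and is not the FedUV implementation; the toy linear layer and the freezing step are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the final classification layer of a federated model.
classifier = nn.Linear(128, 10)

def classifier_spectrum(layer: nn.Linear) -> torch.Tensor:
    """Singular values of the layer's weight matrix, the quantity the FedUV
    summary says is used to study the classifier's training dynamics."""
    return torch.linalg.svdvals(layer.weight.detach())

spectrum_before = classifier_spectrum(classifier)

# Freezing the layer (no gradient updates) leaves the weights, and hence the
# singular values, unchanged no matter how the rest of the model is trained.
for p in classifier.parameters():
    p.requires_grad = False

spectrum_after = classifier_spectrum(classifier)
print(torch.allclose(spectrum_before, spectrum_after))  # True: constant spectrum
```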
This list is automatically generated from the titles and abstracts of the papers in this site.