MultiSiam: Self-supervised Multi-instance Siamese Representation
Learning for Autonomous Driving
- URL: http://arxiv.org/abs/2108.12178v1
- Date: Fri, 27 Aug 2021 08:47:01 GMT
- Title: MultiSiam: Self-supervised Multi-instance Siamese Representation
Learning for Autonomous Driving
- Authors: Kai Chen, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung
- Abstract summary: Self-supervised learning might be a promising way to improve model performance.
Existing SSL methods usually rely on the single-centric-object guarantee.
We propose Multi-instance Siamese Network (MultiSiam) to improve generalization ability and achieve state-of-the-art transfer performance.
- Score: 45.23708547617418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous driving has attracted much attention over the years but turns out
to be harder than expected, probably due to the difficulty of labeled data
collection for model training. Self-supervised learning (SSL), which leverages
unlabeled data only for representation learning, might be a promising way to
improve model performance. Existing SSL methods, however, usually rely on the
single-centric-object guarantee, which may not be applicable for multi-instance
datasets such as street scenes. To alleviate this limitation, we raise two
issues to solve: (1) how to define positive samples for cross-view consistency
and (2) how to measure similarity in multi-instance circumstances. We first
adopt an IoU threshold during random cropping to transfer global-inconsistency
to local-consistency. Then, we propose two feature alignment methods to enable
2D feature maps for multi-instance similarity measurement. Additionally, we
adopt intra-image clustering with self-attention for further mining intra-image
similarity and translation-invariance. Experiments show that, when pre-trained
on the Waymo dataset, our method, Multi-instance Siamese Network (MultiSiam),
remarkably improves generalization ability and achieves state-of-the-art
transfer performance on autonomous driving benchmarks, including Cityscapes and
BDD100K, while existing SSL counterparts like MoCo, MoCo-v2, and BYOL show
significant performance drop. By pre-training on SODA10M, a large-scale
autonomous driving dataset, MultiSiam exceeds the ImageNet pre-trained MoCo-v2,
demonstrating the potential of domain-specific pre-training. Code will be
available at https://github.com/KaiChen1998/MultiSiam.
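
The two issues raised in the abstract map onto two concrete mechanisms: an IoU constraint on the paired random crops, and an alignment step that extracts the shared region from each view's 2D feature map before measuring similarity. The sketches below illustrate both; they are based only on the abstract, not on the released code, so function names such as sample_iou_constrained_crops and align_overlap_features, and the 0.2 threshold, are hypothetical placeholders rather than the authors' implementation.

```python
import random

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (left, top, right, bottom)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def sample_crop(img_w, img_h, crop_w, crop_h):
    """Sample one random crop box fully inside an img_w x img_h image."""
    x = random.randint(0, img_w - crop_w)
    y = random.randint(0, img_h - crop_h)
    return (x, y, x + crop_w, y + crop_h)

def sample_iou_constrained_crops(img_w, img_h, crop_w, crop_h,
                                 iou_thresh=0.2, max_tries=100):
    """Re-sample a pair of crops until their IoU exceeds the threshold,
    so the two views are guaranteed to share a local region."""
    for _ in range(max_tries):
        box1 = sample_crop(img_w, img_h, crop_w, crop_h)
        box2 = sample_crop(img_w, img_h, crop_w, crop_h)
        if box_iou(box1, box2) >= iou_thresh:
            return box1, box2
    return box1, box1  # fallback: identical views if no overlapping pair found
```

For example, sample_iou_constrained_crops(1920, 1280, 640, 640) draws two 640x640 views of a 1920x1280 street-scene image that overlap by at least 20% IoU, so cross-view consistency is enforced only on content both views actually contain.

The second issue, measuring similarity between 2D feature maps, requires mapping that shared image region into each view's feature coordinates and resampling both to a common grid. The abstract mentions two feature alignment methods without describing them; the function below is one plausible interpolation-based alignment, again an assumption rather than the paper's method.

```python
import torch
import torch.nn.functional as F

def align_overlap_features(feat1, feat2, box1, box2, out_size=7):
    """Crop the image region shared by both views out of each 2D feature map
    and resize both crops to a common out_size x out_size grid, so similarity
    can be measured cell by cell instead of image by image.
    feat1, feat2: (C, H, W) feature maps of view 1 and view 2.
    box1, box2:   the two crop boxes in original-image coordinates
                  (assumed to overlap, e.g. sampled with the IoU constraint above)."""
    # Overlap rectangle in original-image coordinates.
    ox1, oy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ox2, oy2 = min(box1[2], box2[2]), min(box1[3], box2[3])

    def crop_and_resize(feat, box):
        _, h, w = feat.shape
        # Map the overlap rectangle into this view's feature-map coordinates.
        sx, sy = w / float(box[2] - box[0]), h / float(box[3] - box[1])
        fx1, fy1 = int((ox1 - box[0]) * sx), int((oy1 - box[1]) * sy)
        fx2, fy2 = int((ox2 - box[0]) * sx), int((oy2 - box[1]) * sy)
        fx2, fy2 = max(fx2, fx1 + 1), max(fy2, fy1 + 1)  # keep at least one cell
        patch = feat[:, fy1:fy2, fx1:fx2].unsqueeze(0)   # (1, C, h', w')
        patch = F.interpolate(patch, size=(out_size, out_size),
                              mode="bilinear", align_corners=False)
        return patch[0]                                   # (C, out_size, out_size)

    return crop_and_resize(feat1, box1), crop_and_resize(feat2, box2)
```

A MultiSiam-style loss would then compare these aligned grids; the intra-image clustering with self-attention mentioned in the abstract is a further step on top of this and is not sketched here.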
Related papers
- MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching [54.740256498985026]
Keypoint detection and description methods often struggle with multimodal data.
We propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching.
arXiv Detail & Related papers (2025-01-20T06:56:30Z) - FusionSAM: Visual Multi-Modal Learning with Segment Anything [37.61598617788102]
We introduce the Segment Anything Model (SAM) into multimodal image segmentation for the first time.
We propose a novel framework that combines Latent Space Token Generation (LSTG) and Fusion Mask Prompting (FMP) modules.
Our method significantly outperforms SAM and SAM2 in multimodal autonomous driving scenarios.
arXiv Detail & Related papers (2024-08-26T02:20:55Z) - FedUV: Uniformity and Variance for Heterogeneous Federated Learning [5.9330433627374815]
Federated learning is a promising framework to train neural networks with widely distributed data.
Recent work has shown that the performance degradation under heterogeneous (non-IID) client data is largely due to the final layer of the network being most prone to local bias.
We investigate the training dynamics of the classifier by applying SVD to its weights, motivated by the observation that freezing the weights results in constant singular values (a minimal sketch of this SVD check appears after this list).
arXiv Detail & Related papers (2024-02-27T15:53:15Z) - Task-customized Masked AutoEncoder via Mixture of Cluster-conditional
Experts [104.9871176044644]
Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training.
We propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE).
MoCE trains each expert only with semantically relevant images by using cluster-conditional gates.
arXiv Detail & Related papers (2024-02-08T03:46:32Z) - Drive Anywhere: Generalizable End-to-end Autonomous Driving with
Multi-modal Foundation Models [114.69732301904419]
We present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models by augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - UniVIP: A Unified Framework for Self-Supervised Visual Pre-training [50.87603616476038]
We propose a novel self-supervised framework to learn versatile visual representations on either single-centric-object or non-iconic datasets.
Massive experiments show that UniVIP pre-trained on non-iconic COCO achieves state-of-the-art transfer performance.
Our method can also exploit single-centric-object datasets such as ImageNet and outperforms BYOL by 2.5% with the same pre-training epochs in linear probing.
arXiv Detail & Related papers (2022-03-14T10:04:04Z) - Co-training for Deep Object Detection: Comparing Single-modal and
Multi-modal Approaches [0.0]
We focus on the use of co-training, a semi-supervised learning (SSL) method, for obtaining self-labeled object bounding boxes (BBs).
In particular, we assess the goodness of multi-modal co-training by relying on two different views of an image, namely, appearance (RGB) and estimated depth (D).
Our results suggest that in a standard SSL setting (no domain shift, a few human-labeled data) and under virtual-to-real domain shift (many virtual-world labeled data, no human-labeled data), multi-modal co-training outperforms single-modal co-training.
arXiv Detail & Related papers (2021-04-23T14:13:59Z) - Learning Modality-Specific Representations with Self-Supervised
Multi-Task Learning for Multimodal Sentiment Analysis [11.368438990334397]
We develop a self-supervised learning strategy to acquire independent unimodal supervisions.
We conduct extensive experiments on three public multimodal baseline datasets.
Our method achieves performance comparable to that of human-annotated unimodal labels.
arXiv Detail & Related papers (2021-02-09T14:05:02Z) - Unsupervised Feature Learning by Cross-Level Instance-Group
Discrimination [68.83098015578874]
We integrate between-instance similarity into contrastive learning, not directly by instance grouping, but by cross-level discrimination.
CLD effectively brings unsupervised learning closer to natural data and real-world applications.
CLD sets a new state-of-the-art on self-supervision, semi-supervision, and transfer learning benchmarks, and beats MoCo v2 and SimCLR on every reported metric.
arXiv Detail & Related papers (2020-08-09T21:13:13Z)
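
Regarding the FedUV item above: inspecting the classifier through the singular values of its final-layer weight matrix can be reproduced in a few lines. The sketch below is based only on the one-sentence summary and is not the FedUV implementation; the toy linear layer and the freezing step are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the final classification layer of a federated model.
classifier = nn.Linear(128, 10)

def classifier_spectrum(layer: nn.Linear) -> torch.Tensor:
    """Singular values of the layer's weight matrix, the quantity the FedUV
    summary says is used to study the classifier's training dynamics."""
    return torch.linalg.svdvals(layer.weight.detach())

spectrum_before = classifier_spectrum(classifier)

# Freezing the layer (no gradient updates) leaves the weights, and hence the
# singular values, unchanged no matter how the rest of the model is trained.
for p in classifier.parameters():
    p.requires_grad = False

spectrum_after = classifier_spectrum(classifier)
print(torch.allclose(spectrum_before, spectrum_after))  # True: constant spectrum
```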
This list is automatically generated from the titles and abstracts of the papers in this site.