Should we pre-train a decoder in contrastive learning for dense prediction tasks?
- URL: http://arxiv.org/abs/2503.17526v1
- Date: Fri, 21 Mar 2025 20:19:13 GMT
- Title: Should we pre-train a decoder in contrastive learning for dense prediction tasks?
- Authors: Sébastien Quetin, Tapotosh Ghosh, Farhad Maleki
- Abstract summary: We propose a framework-agnostic adaptation to convert an encoder-only self-supervised learning (SSL) contrastive approach to an efficient encoder-decoder framework. We first update the existing architecture to accommodate a decoder and its respective contrastive loss. We then introduce a weighted encoder-decoder contrastive loss with non-competing objectives that facilitates joint pre-training of the encoder-decoder architecture.
- Score: 0.7237068561453082
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive learning in self-supervised settings primarily focuses on pre-training encoders, while decoders are typically introduced and trained separately for downstream dense prediction tasks. This conventional approach, however, overlooks the potential benefits of jointly pre-training both the encoder and decoder. In this paper, we propose DeCon: a framework-agnostic adaptation to convert an encoder-only self-supervised learning (SSL) contrastive approach to an efficient encoder-decoder framework that can be pre-trained in a contrastive manner. We first update the existing architecture to accommodate a decoder and its respective contrastive loss. We then introduce a weighted encoder-decoder contrastive loss with non-competing objectives that facilitates the joint encoder-decoder architecture pre-training. We adapt two established contrastive SSL frameworks tailored for dense prediction tasks, achieve new state-of-the-art results in COCO object detection and instance segmentation, and match state-of-the-art performance on Pascal VOC semantic segmentation. We show that our approach allows for pre-training a decoder and enhances the representation power of the encoder and its performance in dense prediction tasks. This benefit holds across heterogeneous decoder architectures between pre-training and fine-tuning and persists in out-of-domain, limited-data scenarios.
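To make the weighted encoder-decoder objective concrete, the sketch below combines an encoder-level and a decoder-level contrastive loss under a scalar weight. The InfoNCE formulation, the projection inputs, and the name `decon_loss` are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Standard InfoNCE loss between two batches of matched views, shape (B, D)."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def decon_loss(enc_z1, enc_z2, dec_z1, dec_z2, alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of an encoder-level and a decoder-level contrastive loss.

    Each loss operates on its own projections, so the two terms do not directly
    compete for the same representation (a hypothetical reading of the paper's
    "non-competing objectives").
    """
    return alpha * info_nce(enc_z1, enc_z2) + (1.0 - alpha) * info_nce(dec_z1, dec_z2)
```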
Related papers
- Is Pre-training Applicable to the Decoder for Dense Prediction? [13.542355644833544]
We introduce $\times$Net, which facilitates a "pre-trained encoder $\times$ pre-trained decoder" collaboration through three innovative designs. By simply coupling a pre-trained encoder with a pre-trained decoder, $\times$Net distinguishes itself as a highly promising approach. Despite its streamlined design, $\times$Net outperforms advanced methods in tasks such as monocular depth estimation and semantic segmentation.
arXiv Detail & Related papers (2025-03-05T05:16:28Z)
- Downstream-agnostic Adversarial Examples [66.8606539786026]
AdvEncoder is the first framework for generating downstream-agnostic universal adversarial examples based on a pre-trained encoder.
Unlike traditional adversarial example works, the pre-trained encoder only outputs feature vectors rather than classification labels.
Our results show that an attacker can successfully attack downstream tasks without knowing either the pre-training dataset or the downstream dataset.
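The core idea of attacking features rather than labels can be sketched as optimizing a single universal perturbation against a frozen encoder. Everything below (the cosine-similarity objective, Adam, the clamp to an L-infinity ball) is an assumed simplification; the actual AdvEncoder method is generator-based and more involved.

```python
import torch
import torch.nn.functional as F

def universal_perturbation(encoder, loader, epsilon=8 / 255, steps=100, lr=1e-2, device="cpu"):
    """Learn one image-agnostic perturbation that disrupts encoder features."""
    encoder.eval()
    delta = torch.zeros(1, 3, 224, 224, device=device, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _, (x, _) in zip(range(steps), loader):
        x = x.to(device)
        with torch.no_grad():
            clean = encoder(x)                       # reference features
        adv = encoder((x + delta).clamp(0, 1))       # perturbed features
        # Minimizing similarity pushes perturbed features away from clean ones.
        loss = F.cosine_similarity(adv, clean, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)          # keep the perturbation imperceptible
    return delta.detach()
```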
arXiv Detail & Related papers (2023-07-23T10:16:47Z)
- Challenging Decoder helps in Masked Auto-Encoder Pre-training for Dense Passage Retrieval [10.905033385938982]
The masked auto-encoder (MAE) pre-training architecture has emerged as the most promising approach for dense passage retrieval.
We propose a novel token importance aware masking strategy based on pointwise mutual information to intensify the challenge of the decoder.
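A rough illustration of importance-aware masking: estimate each token's pointwise mutual information with its passage and preferentially mask the most informative tokens. The PMI estimate and the top-k selection below are simplified assumptions, not the paper's exact strategy.

```python
import math
from collections import Counter

def pmi_scores(tokens, corpus_freq, corpus_size):
    """PMI(token; passage) ~ log(p(token | passage) / p(token))."""
    local = Counter(tokens)
    scores = {}
    for tok, cnt in local.items():
        p_local = cnt / len(tokens)
        p_global = corpus_freq.get(tok, 1) / corpus_size  # smoothed global probability
        scores[tok] = math.log(p_local / p_global)
    return scores

def importance_aware_mask(tokens, corpus_freq, corpus_size, mask_ratio=0.5):
    """Mask the top-k most informative tokens instead of a uniform random set."""
    scores = pmi_scores(tokens, corpus_freq, corpus_size)
    k = int(len(tokens) * mask_ratio)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[tokens[i]], reverse=True)
    masked = set(ranked[:k])
    return ["[MASK]" if i in masked else t for i, t in enumerate(tokens)]
```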
arXiv Detail & Related papers (2023-05-22T16:27:10Z)
- Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving [74.28510044056706]
Existing methods usually adopt the decoupled encoder-decoder paradigm.
In this work, we aim to alleviate the problem with two principles.
We first predict a coarse-grained future position and action based on the encoder features.
Then, conditioned on the position and action, the future scene is imagined to check the ramification if we drive accordingly.
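The two-step decoding can be sketched as a coarse prediction head followed by a refinement head that conditions on an imagined future scene. The module sizes and the linear "imagination" model below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ThinkTwiceDecoder(nn.Module):
    """Coarse prediction, then refinement conditioned on an imagined future."""

    def __init__(self, feat_dim: int = 256, action_dim: int = 2):
        super().__init__()
        self.coarse_head = nn.Linear(feat_dim, action_dim)          # first guess
        self.imagine = nn.Linear(feat_dim + action_dim, feat_dim)   # imagined future scene
        self.refine_head = nn.Linear(feat_dim * 2, action_dim)      # second, informed guess

    def forward(self, enc_feat: torch.Tensor) -> torch.Tensor:
        coarse = self.coarse_head(enc_feat)
        # "Drive accordingly" in imagination: predict the resulting scene feature.
        future = self.imagine(torch.cat([enc_feat, coarse], dim=-1))
        # Refine the action with knowledge of the imagined consequence.
        return self.refine_head(torch.cat([enc_feat, future], dim=-1))
```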
arXiv Detail & Related papers (2023-05-10T15:22:02Z)
- Decoder Denoising Pretraining for Semantic Segmentation [46.23441959230505]
We propose a decoder pretraining approach based on denoising.
We find that decoder denoising pretraining on the ImageNet dataset strongly outperforms encoder-only supervised pretraining.
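A minimal sketch of the training step implied here: corrupt the input with Gaussian noise and train the decoder, on top of the encoder, to predict that noise. The noise level and the noise-prediction target are assumptions borrowed from standard denoising objectives, not necessarily the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def decoder_denoising_step(encoder, decoder, images, optimizer, sigma=0.2):
    """One pretraining step: the decoder must recover the noise added to the input."""
    noise = torch.randn_like(images) * sigma
    pred = decoder(encoder(images + noise))   # decoder output has image shape
    loss = F.mse_loss(pred, noise)            # predict the injected noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```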
arXiv Detail & Related papers (2022-05-23T16:08:31Z)
- Adversarial Neural Networks for Error Correcting Codes [76.70040964453638]
We introduce a general framework to boost the performance and applicability of machine learning (ML) models.
We propose to combine ML decoders with a competing discriminator network that tries to distinguish between codewords and noisy words.
Our framework is game-theoretic, motivated by generative adversarial networks (GANs).
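The game can be sketched as a standard GAN-style alternating update: the discriminator learns to separate true codewords from decoder outputs, and the decoder learns to reconstruct codewords while fooling the discriminator. The specific losses below are common GAN choices assumed for illustration.

```python
import torch
import torch.nn.functional as F

def adversarial_decoding_step(decoder, discriminator, codewords, noisy, d_opt, g_opt):
    # Discriminator: tell true codewords apart from decoded (noisy) words.
    decoded = decoder(noisy).detach()
    d_real = discriminator(codewords)
    d_fake = discriminator(decoded)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Decoder: reconstruct the codeword while trying to fool the discriminator.
    decoded = decoder(noisy)
    g_adv = discriminator(decoded)
    g_loss = (F.mse_loss(decoded, codewords)
              + F.binary_cross_entropy_with_logits(g_adv, torch.ones_like(g_adv)))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.detach(), g_loss.detach()
```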
arXiv Detail & Related papers (2021-12-21T19:14:44Z)
- EncoderMI: Membership Inference against Pre-trained Encoders in Contrastive Learning [27.54202989524394]
We propose EncoderMI, the first membership inference method against image encoders pre-trained by contrastive learning.
We evaluate EncoderMI on image encoders pre-trained on multiple datasets by ourselves, as well as on the Contrastive Language-Image Pre-training (CLIP) image encoder, which is pre-trained on 400 million (image, text) pairs collected from the Internet and released by OpenAI.
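The underlying signal can be illustrated simply: augmented views of a training member tend to receive unusually similar features from an overfitted contrastive encoder. The augmentation set and the single similarity score below are assumptions; EncoderMI itself builds inference classifiers rather than thresholding one statistic.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Hypothetical contrastive-style augmentations (crops, flips, color jitter).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
])

def membership_score(encoder, image: torch.Tensor, n_views: int = 10) -> float:
    """Mean pairwise feature similarity over augmented views (higher = more member-like)."""
    with torch.no_grad():
        views = torch.stack([augment(image) for _ in range(n_views)])
        feats = F.normalize(encoder(views), dim=-1)            # (n_views, D)
    sim = feats @ feats.t()                                    # pairwise cosine similarities
    off_diag = sim[~torch.eye(n_views, dtype=torch.bool)]      # drop self-similarity
    return off_diag.mean().item()
```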
arXiv Detail & Related papers (2021-08-25T03:00:45Z)
- Less is More: Pre-training a Strong Siamese Encoder Using a Weak Decoder [75.84152924972462]
Many real-world applications use Siamese networks to efficiently match text sequences at scale.
This paper pre-trains language models dedicated to sequence matching in Siamese architectures.
arXiv Detail & Related papers (2021-02-18T08:08:17Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes.
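A bare-bones sketch of the recipe: patchify the image, run a transformer encoder so every layer sees global context, reshape the tokens back into a grid, and decode with a deliberately simple head. The dimensions and the 1x1-conv-plus-bilinear-upsampling decoder are illustrative choices, not SETR's exact decoder variants.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySETR(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, n_classes=21):
        super().__init__()
        self.patch, self.grid = patch, img_size // patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patchify
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))      # positional embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Conv2d(dim, n_classes, kernel_size=1)              # simple decoder

    def forward(self, x):
        b = x.size(0)
        tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos      # (B, N, D)
        tokens = self.encoder(tokens)                # global context at every layer
        grid = tokens.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        return F.interpolate(self.head(grid), scale_factor=self.patch,
                             mode="bilinear", align_corners=False)
```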
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
- Jointly Optimizing State Operation Prediction and Value Generation for Dialogue State Tracking [23.828348485513043]
We investigate the problem of multi-domain Dialogue State Tracking (DST) with open vocabulary.
Existing approaches exploit BERT encoder and copy-based RNN decoder, where the encoder predicts the state operation, and the decoder generates new slot values.
We propose a purely Transformer-based framework, where a single BERT works as both the encoder and the decoder.
arXiv Detail & Related papers (2020-10-24T04:54:52Z)
- Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training a sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)