Revisiting Multi-Task Visual Representation Learning
- URL: http://arxiv.org/abs/2601.13886v1
- Date: Tue, 20 Jan 2026 11:59:19 GMT
- Title: Revisiting Multi-Task Visual Representation Learning
- Authors: Shangzhe Di, Zhonghua Zhai, Weidi Xie
- Abstract summary: We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
- Score: 52.93947931352643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves "best-of-both-worlds" performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.
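To make the joint objective concrete, below is a minimal PyTorch sketch of how the three pretraining signals described in the abstract might be combined into one weighted loss. Everything here (class name, loss weights, tensor shapes) is an illustrative assumption, not MTV's actual implementation; the depth pseudo-labels stand in for outputs of an expert model such as Depth Anything V2.

```python
# Minimal sketch of a joint multi-task pretraining objective in the spirit
# of MTV. All names and weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskPretrainLoss(nn.Module):
    """Combines a CLIP-style contrastive loss, an MAE-style masked
    reconstruction loss, and a dense depth-regression loss computed
    against expert pseudo-labels."""

    def __init__(self, w_clip=1.0, w_mae=1.0, w_depth=0.5):
        super().__init__()
        self.w_clip, self.w_mae, self.w_depth = w_clip, w_mae, w_depth
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07)

    def forward(self, img_emb, txt_emb, pred_patches, target_patches,
                mask, pred_depth, pseudo_depth):
        # (1) Global vision-language contrastive term (InfoNCE, both directions).
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = self.logit_scale.exp() * img_emb @ txt_emb.t()
        labels = torch.arange(logits.size(0), device=logits.device)
        l_clip = 0.5 * (F.cross_entropy(logits, labels) +
                        F.cross_entropy(logits.t(), labels))

        # (2) Self-supervised reconstruction, averaged over masked patches only.
        #     mask is a float (B, N) tensor with 1 for masked patches.
        l_mae = (F.mse_loss(pred_patches, target_patches, reduction="none")
                 .mean(dim=-1) * mask).sum() / mask.sum().clamp(min=1)

        # (3) Dense spatial term: regress per-pixel depth pseudo-labels
        #     synthesized by an expert model (e.g., Depth Anything V2).
        l_depth = F.l1_loss(pred_depth, pseudo_depth)

        return self.w_clip * l_clip + self.w_mae * l_mae + self.w_depth * l_depth
```

A training step would then call `loss = criterion(...)` followed by the usual backward pass; in practice each term typically draws on its own data stream (image-text pairs vs. unlabeled images), with the per-task weights tuned to balance synergy against interference.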
Related papers
- VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization [87.26383908243878]
We show that vision encoders within Multimodal Large Language Models exhibit deficiencies in their dense feature representations. We propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training.
arXiv Detail & Related papers (2026-02-10T16:08:19Z)
- CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks [76.00315860962885]
We propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. A mixture-of-experts (MoE) module dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability.
arXiv Detail & Related papers (2026-01-19T15:19:28Z)
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z)
- Patch-Level Kernel Alignment for Dense Self-Supervised Learning [7.5866326278176075]
We introduce Patch-level Kernel Alignment (PaKA), a non-parametric, kernel-based approach that improves the dense representations of pretrained vision encoders by conducting a lightweight post-training stage on top of a pretrained model. With only 14 hours of additional training on a single GPU, our method achieves state-of-the-art performance across a range of dense vision benchmarks.
arXiv Detail & Related papers (2025-09-06T05:42:32Z)
- GoViG: Goal-Conditioned Visual Navigation Instruction Generation [69.79110149746506]
We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions. GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments.
arXiv Detail & Related papers (2025-08-13T07:05:17Z)
- MoSAiC: Multi-Modal Multi-Label Supervision-Aware Contrastive Learning for Remote Sensing [10.207026975603503]
We introduce MoSAiC, a unified framework that jointly optimizes intra- and inter-modality contrastive learning with a multi-label supervised contrastive loss (a generic sketch of such a loss appears after this list). MoSAiC consistently outperforms both fully supervised and self-supervised baselines in terms of accuracy, cluster coherence, and generalization.
arXiv Detail & Related papers (2025-07-11T15:33:51Z)
- COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training [49.2684130383925]
We propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training. It integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. It consistently outperforms previous strong baselines on various zero-shot downstream tasks.
arXiv Detail & Related papers (2024-12-02T18:56:06Z)
- Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference [43.474068248379815]
We propose a unified encoder trained on multiple computer vision tasks crucial for urban driving. By integrating diverse visual cues, similar to human perceptual mechanisms, the encoder captures rich features that enhance navigation-related predictions. Our experiments highlight two key findings: (1) the unified encoder achieves competitive performance across all visual perception tasks, demonstrating strong generalization capabilities; and (2) for steering estimation, the frozen unified encoder, leveraging dense latent representations, outperforms both its fine-tuned counterpart and the same frozen model pretrained on generic datasets like ImageNet.
arXiv Detail & Related papers (2024-09-16T08:54:03Z)
- Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations [6.990891188823598]
We present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision. Our framework is specifically designed to work on web-scraped data by not relying on negative examples in the self-supervised learning path. We evaluate Harmony across various vision downstream tasks and find that it significantly outperforms the baseline CLIP.
arXiv Detail & Related papers (2024-05-23T07:18:08Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models perform on par with or better than state-of-the-art (SOTA) methods on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- Semantics-Depth-Symbiosis: Deeply Coupled Semi-Supervised Learning of Semantics and Depth [83.94528876742096]
We tackle the MTL problem of two dense tasks, i.e., semantic segmentation and depth estimation, and present a novel attention module called the Cross-Channel Attention Module (CCAM).
In a true symbiotic spirit, we then formulate a novel data augmentation for the semantic segmentation task using predicted depth, called AffineMix, and a simple depth augmentation using predicted semantics, called ColorAug.
Finally, we validate the performance gain of the proposed method on the Cityscapes dataset, which helps us achieve state-of-the-art results for a semi-supervised joint model based on depth and semantic segmentation.
arXiv Detail & Related papers (2022-06-21T17:40:55Z)
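For reference, the multi-label supervised contrastive objective mentioned in the MoSAiC entry above can be pictured as a SupCon-style loss whose positive set is defined by label overlap. The sketch below is a generic illustration in PyTorch; the function name, temperature, and the "any shared label" positive rule are assumptions, not details from the MoSAiC paper.

```python
# Generic multi-label supervised contrastive loss (SupCon extended to
# multi-label targets). An illustration, not MoSAiC's implementation.
import torch
import torch.nn.functional as F


def multilabel_supcon_loss(feats: torch.Tensor, labels: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """feats: (B, D) embeddings; labels: (B, C) binary multi-label matrix.
    Two samples count as positives if they share at least one label."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature                 # (B, B) similarities
    # Positive mask: nonzero label overlap, excluding self-pairs.
    pos = (labels.float() @ labels.float().t() > 0).float()
    eye = torch.eye(len(feats), device=feats.device)
    pos = pos * (1 - eye)
    # Log-softmax over all non-self pairs, then average over positives.
    logits = sim - 1e9 * eye                              # mask out self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    denom = pos.sum(dim=1).clamp(min=1)                   # avoid div-by-zero
    return -(pos * log_prob).sum(dim=1).div(denom).mean()
```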