Revisiting Multi-Task Visual Representation Learning
- URL: http://arxiv.org/abs/2601.13886v1
- Date: Tue, 20 Jan 2026 11:59:19 GMT
- Title: Revisiting Multi-Task Visual Representation Learning
- Authors: Shangzhe Di, Zhonghua Zhai, Weidi Xie
- Abstract summary: We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
- Score: 52.93947931352643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves "best-of-both-worlds" performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.
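To make the joint objective concrete, below is a minimal PyTorch sketch of how the three pretraining signals described in the abstract might be combined into one weighted loss. Everything here (class name, loss weights, tensor shapes) is an illustrative assumption, not MTV's actual implementation; the depth pseudo-labels stand in for outputs of an expert model such as Depth Anything V2.

```python
# Minimal sketch of a joint multi-task pretraining objective in the spirit
# of MTV. All names and weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskPretrainLoss(nn.Module):
    """Combines a CLIP-style contrastive loss, an MAE-style masked
    reconstruction loss, and a dense depth-regression loss computed
    against expert pseudo-labels."""

    def __init__(self, w_clip=1.0, w_mae=1.0, w_depth=0.5):
        super().__init__()
        self.w_clip, self.w_mae, self.w_depth = w_clip, w_mae, w_depth
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07)

    def forward(self, img_emb, txt_emb, pred_patches, target_patches,
                mask, pred_depth, pseudo_depth):
        # (1) Global vision-language contrastive term (InfoNCE, both directions).
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = self.logit_scale.exp() * img_emb @ txt_emb.t()
        labels = torch.arange(logits.size(0), device=logits.device)
        l_clip = 0.5 * (F.cross_entropy(logits, labels) +
                        F.cross_entropy(logits.t(), labels))

        # (2) Self-supervised reconstruction, averaged over masked patches only.
        #     mask is a float (B, N) tensor with 1 for masked patches.
        l_mae = (F.mse_loss(pred_patches, target_patches, reduction="none")
                 .mean(dim=-1) * mask).sum() / mask.sum().clamp(min=1)

        # (3) Dense spatial term: regress per-pixel depth pseudo-labels
        #     synthesized by an expert model (e.g., Depth Anything V2).
        l_depth = F.l1_loss(pred_depth, pseudo_depth)

        return self.w_clip * l_clip + self.w_mae * l_mae + self.w_depth * l_depth
```

A training step would then call `loss = criterion(...)` followed by the usual backward pass; in practice each term typically draws on its own data stream (image-text pairs vs. unlabeled images), with the per-task weights tuned to balance synergy against interference.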
Related papers
- VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization [87.26383908243878]
We show that vision encoders within Multimodal Large Language Models exhibit deficiencies in their dense feature representations. We propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training.
arXiv Detail & Related papers (2026-02-10T16:08:19Z)
- CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks [76.00315860962885]
We propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. A mixture-of-experts (MoE) module dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability.
arXiv Detail & Related papers (2026-01-19T15:19:28Z)
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z)
- Patch-Level Kernel Alignment for Dense Self-Supervised Learning [7.5866326278176075]
We introduce Patch-level Kernel Alignment (PaKA), a non-parametric, kernel-based approach that improves the dense representations of pretrained vision encoders by conducting a lightweight post-training stage on top of a pretrained model. With only 14 hours of additional training on a single GPU, our method achieves state-of-the-art performance across a range of dense vision benchmarks.
arXiv Detail & Related papers (2025-09-06T05:42:32Z)
- GoViG: Goal-Conditioned Visual Navigation Instruction Generation [69.79110149746506]
We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions. GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments.
arXiv Detail & Related papers (2025-08-13T07:05:17Z)
- MoSAiC: Multi-Modal Multi-Label Supervision-Aware Contrastive Learning for Remote Sensing [10.207026975603503]
We introduce MoSAiC, a unified framework that jointly optimizes intra- and inter-modality contrastive learning with a multi-label supervised contrastive loss (a generic sketch of such a loss appears after this list). MoSAiC consistently outperforms both fully supervised and self-supervised baselines in terms of accuracy, cluster coherence, and generalization.
arXiv Detail & Related papers (2025-07-11T15:33:51Z)
- COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training [49.2684130383925]
We propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training. It integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. It consistently outperforms previous strong baselines on various zero-shot downstream tasks.
arXiv Detail & Related papers (2024-12-02T18:56:06Z)
- Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference [43.474068248379815]
We propose a unified encoder trained on multiple computer vision tasks crucial for urban driving. By integrating diverse visual cues, similar to human perceptual mechanisms, the encoder captures rich features that enhance navigation-related predictions. Our experiments highlight two key findings: (1) the unified encoder achieves competitive performance across all visual perception tasks, demonstrating strong generalization capabilities; and (2) for steering estimation, the frozen unified encoder, leveraging dense latent representations, outperforms both its fine-tuned counterpart and the same frozen model pretrained on generic datasets like ImageNet.
arXiv Detail & Related papers (2024-09-16T08:54:03Z)
- Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations [6.990891188823598]
We present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision. Our framework is specifically designed to work on web-scraped data by not relying on negative examples in the self-supervised learning path. We evaluate Harmony across various vision downstream tasks and find that it significantly outperforms the baseline CLIP.
arXiv Detail & Related papers (2024-05-23T07:18:08Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models perform on par with or better than state-of-the-art (SOTA) methods on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- Semantics-Depth-Symbiosis: Deeply Coupled Semi-Supervised Learning of Semantics and Depth [83.94528876742096]
We tackle the MTL problem of two dense tasks, i.e., semantic segmentation and depth estimation, and present a novel attention module called the Cross-Channel Attention Module (CCAM).
In a true symbiotic spirit, we then formulate a novel data augmentation for the semantic segmentation task using predicted depth, called AffineMix, and a simple depth augmentation using predicted semantics, called ColorAug.
Finally, we validate the performance gain of the proposed method on the Cityscapes dataset, which helps us achieve state-of-the-art results for a semi-supervised joint model based on depth and semantic segmentation.
arXiv Detail & Related papers (2022-06-21T17:40:55Z)
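For reference, the multi-label supervised contrastive objective mentioned in the MoSAiC entry above can be pictured as a SupCon-style loss whose positive set is defined by label overlap. The sketch below is a generic illustration in PyTorch; the function name, temperature, and the "any shared label" positive rule are assumptions, not details from the MoSAiC paper.

```python
# Generic multi-label supervised contrastive loss (SupCon extended to
# multi-label targets). An illustration, not MoSAiC's implementation.
import torch
import torch.nn.functional as F


def multilabel_supcon_loss(feats: torch.Tensor, labels: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """feats: (B, D) embeddings; labels: (B, C) binary multi-label matrix.
    Two samples count as positives if they share at least one label."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature                 # (B, B) similarities
    # Positive mask: nonzero label overlap, excluding self-pairs.
    pos = (labels.float() @ labels.float().t() > 0).float()
    eye = torch.eye(len(feats), device=feats.device)
    pos = pos * (1 - eye)
    # Log-softmax over all non-self pairs, then average over positives.
    logits = sim - 1e9 * eye                              # mask out self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    denom = pos.sum(dim=1).clamp(min=1)                   # avoid div-by-zero
    return -(pos * log_prob).sum(dim=1).div(denom).mean()
```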