VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models
- URL: http://arxiv.org/abs/2510.20994v1
- Date: Thu, 23 Oct 2025 20:44:28 GMT
- Title: VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models
- Authors: Jesimon Barreto, Carlos Caetano, André Araujo, William Robson Schwartz
- Abstract summary: Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. We introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models.
- Score: 0.18665975431697424
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA's training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques - otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions, without the need of annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks, compared to the base models and previous adaptation methods. Code is publicly available at https://github.com/jesimonbarreto/VESSA.
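The abstract describes the training recipe only at a high level: self-distillation, carefully tuned prediction heads, parameter-efficient adaptation, and two views of the same object taken from different frames of a short video. The following is a minimal, hypothetical PyTorch sketch of such a self-distillation adaptation step, not the authors' implementation (the linked repository has the real code): the `LowRankAdapter`, prototype head size, toy backbone, and EMA momentum are all placeholder assumptions.

```python
# Hypothetical sketch of video-based self-distillation adaptation in the spirit of VESSA.
# Assumptions: a DINO-style student/teacher pair, two frames of the same object-centric
# video as the two "views", and a small trainable adapter standing in for
# parameter-efficient tuning. Not the released implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankAdapter(nn.Module):
    """Hypothetical LoRA-style residual adapter: only these weights are trained."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping

    def forward(self, x):
        return x + self.up(self.down(x))

class AdaptedEncoder(nn.Module):
    """Frozen pretrained backbone (placeholder here) plus a trainable adapter and prediction head."""
    def __init__(self, backbone, dim=384, n_prototypes=4096):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False               # keep the pretrained weights frozen
        self.adapter = LowRankAdapter(dim)
        self.head = nn.Linear(dim, n_prototypes)  # prediction head over prototypes

    def forward(self, x):
        return self.head(self.adapter(self.backbone(x)))

def self_distillation_loss(student_logits, teacher_logits, t_s=0.1, t_t=0.04):
    """Cross-entropy between the sharpened teacher and student prototype distributions."""
    teacher = F.softmax(teacher_logits / t_t, dim=-1).detach()
    return -(teacher * F.log_softmax(student_logits / t_s, dim=-1)).sum(-1).mean()

# Toy backbone and data so the sketch runs end to end; in practice the backbone would be
# a ViT foundation model and the two batches would be frames sampled from the same clip.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 384))
student = AdaptedEncoder(backbone)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False

opt = torch.optim.AdamW([p for p in student.parameters() if p.requires_grad], lr=1e-4)
frame_a, frame_b = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)  # two views of one object

loss = self_distillation_loss(student(frame_a), teacher(frame_b))
loss.backward()
opt.step()
with torch.no_grad():  # EMA update of the teacher
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(0.996).add_(ps, alpha=0.004)
```

Freezing the backbone and training only the adapter and prediction head reflects the abstract's point that parameter-efficient adaptation and careful head tuning are needed to keep the model from forgetting its pretrained knowledge.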
Related papers
- Visual Autoregressive Modelling for Monocular Depth Estimation [69.01449528371916]
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets.
arXiv Detail & Related papers (2025-12-27T17:08:03Z) - No Labels Needed: Zero-Shot Image Classification with Collaborative Self-Learning [0.0]
Vision-language models (VLMs) and transfer learning with pre-trained visual models appear as promising techniques to deal with this problem. This paper proposes a novel zero-shot image classification framework that combines a VLM and a pre-trained visual model within a self-learning cycle.
arXiv Detail & Related papers (2025-09-23T12:54:52Z) - DINOv3 [62.31809406012177]
Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks.
arXiv Detail & Related papers (2025-08-13T18:00:55Z) - Simplifying Traffic Anomaly Detection with Video Foundation Models [12.999050785284725]
Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. We investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance.
arXiv Detail & Related papers (2025-07-12T16:36:49Z) - Sparse autoencoders reveal selective remapping of visual concepts during adaptation [54.82630842681845]
Adapting foundation models for specific purposes has become a standard approach to build machine learning systems. We develop a new Sparse Autoencoder (SAE) for the CLIP vision transformer, named PatchSAE, to extract interpretable concepts.
arXiv Detail & Related papers (2024-12-06T18:59:51Z) - ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. We propose a gradual backbone reversal approach founded on model merging.
arXiv Detail & Related papers (2024-09-23T17:47:59Z) - Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning [0.0]
Video Annotator (VA) is a framework for annotating, managing, and iterating on video classification datasets.
VA allows for a continuous annotation process, seamlessly integrating data collection and model training.
VA achieves a median 6.8 point improvement in Average Precision relative to the most competitive baseline.
arXiv Detail & Related papers (2024-02-09T17:19:05Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter [21.41170708560114]
A growing number of applications based on visual foundation models are emerging.
In situations involving system upgrades, it becomes essential to re-train all downstream modules to adapt to the new foundation model.
We introduce a parameter-efficient and task-agnostic adapter, dubbed TaCA, that facilitates compatibility across distinct foundation models.
arXiv Detail & Related papers (2023-06-22T03:00:24Z) - Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
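The last entry above proposes generating classification targets with a pretrained language model. A common instantiation of that idea replaces a randomly initialized classifier with class-name text embeddings and scores video features by cosine similarity against them. The sketch below illustrates this generic recipe only; the stand-in text and video encoders, dimensions, and temperature are assumptions rather than that paper's actual components.

```python
# Hedged sketch of "semantic targets from a language model": class-name embeddings
# from a pretrained text encoder act as a fixed similarity-based classifier, so only
# the video encoder needs tuning for a new label set. Encoders here are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class_names = ["archery", "bowling", "surfing"]

def encode_class_names(names, dim=512):
    # Placeholder for a real text encoder (e.g., CLIP's); returns one unit-norm
    # embedding per class name. Swap in actual text features in practice.
    torch.manual_seed(0)
    return F.normalize(torch.randn(len(names), dim), dim=-1)

semantic_targets = encode_class_names(class_names)            # (num_classes, dim)

# A video encoder produces one pooled feature per clip (toy stand-in here).
video_encoder = nn.Sequential(nn.Flatten(), nn.Linear(8 * 3 * 16 * 16, 512))
clips = torch.randn(4, 8, 3, 16, 16)                          # (batch, frames, C, H, W)
features = F.normalize(video_encoder(clips), dim=-1)          # (batch, dim)

# Classification is cosine similarity against the semantic targets.
logits = features @ semantic_targets.t() / 0.07
labels = torch.tensor([0, 1, 2, 0])
loss = F.cross_entropy(logits, labels)
loss.backward()
```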
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.