ViT-VS: On the Applicability of Pretrained Vision Transformer Features for Generalizable Visual Servoing
- URL: http://arxiv.org/abs/2503.04545v1
- Date: Thu, 06 Mar 2025 15:33:19 GMT
- Title: ViT-VS: On the Applicability of Pretrained Vision Transformer Features for Generalizable Visual Servoing
- Authors: Alessandro Scherl, Stefan Thalhammer, Bernhard Neuberger, Wilfried Wöber, José Gracía-Rodríguez
- Abstract summary: We present a visual servoing approach that leverages pretrained vision transformers for semantic feature extraction. Our approach surpasses classical image-based visual servoing by up to 31.2% relative improvement in perturbed scenarios. Real-world evaluations confirm robust performance in end-effector positioning, industrial box manipulation, and grasping of unseen objects.
- Score: 40.67171259494469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual servoing enables robots to precisely position their end-effector relative to a target object. While classical methods rely on hand-crafted features and thus are universally applicable without task-specific training, they often struggle with occlusions and environmental variations, whereas learning-based approaches improve robustness but typically require extensive training. We present a visual servoing approach that leverages pretrained vision transformers for semantic feature extraction, combining the advantages of both paradigms while also being able to generalize beyond the provided sample. Our approach achieves full convergence in unperturbed scenarios and surpasses classical image-based visual servoing by up to 31.2% relative improvement in perturbed scenarios. Even the convergence rates of learning-based methods are matched despite requiring no task- or object-specific training. Real-world evaluations confirm robust performance in end-effector positioning, industrial box manipulation, and grasping of unseen objects using only a reference from the same category. Our code and simulation environment are available at: https://alessandroscherl.github.io/ViT-VS/
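For readers unfamiliar with the classical image-based visual servoing baseline mentioned in the abstract, the following is a minimal sketch of the textbook point-feature control law, v = -gain * pinv(L) @ e. It is not taken from the ViT-VS code; the function names `interaction_matrix` and `ibvs_velocity` are hypothetical, and the sketch assumes normalized image coordinates and known (or estimated) point depths. How the matched points are obtained, for example from hand-crafted or pretrained features, is outside its scope.

```python
import numpy as np


def interaction_matrix(x, y, Z):
    """Interaction matrix (2x6) of one normalized image point (x, y) at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])


def ibvs_velocity(points, targets, depths, gain=0.5):
    """Camera twist (vx, vy, vz, wx, wy, wz) that drives points toward targets.

    points, targets: (N, 2) arrays of current and desired normalized coordinates.
    depths: length-N sequence of assumed depths of the current points.
    """
    points = np.asarray(points, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Stack one 2x6 block per point into a (2N, 6) matrix.
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(points, depths)])
    # Feature error, flattened as (x1, y1, x2, y2, ...).
    e = (points - targets).reshape(-1)
    # Least-squares solution that imposes an exponential decrease of the error.
    return -gain * np.linalg.pinv(L) @ e


if __name__ == "__main__":
    # Hypothetical example: four points shifted slightly along x from their goal.
    targets = np.array([[0.1, 0.1], [-0.1, 0.1], [-0.1, -0.1], [0.1, -0.1]])
    points = targets + np.array([0.02, 0.0])
    print(ibvs_velocity(points, targets, depths=[1.0] * 4))
```

In a closed loop, the returned twist would be sent to the robot at each iteration and the points re-measured, until the feature error falls below a chosen threshold.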
Related papers
- Test-Time Canonicalization by Foundation Models for Robust Perception [33.00574202314593]
FOCAL is a test-time, data-driven framework for robust perception. It enhances robustness without re-training or architectural changes. Our experiments demonstrate improved robustness of CLIP and SAM across challenging transformations.
arXiv Detail & Related papers (2025-07-14T15:14:38Z)
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences.
Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations.
Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
- ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration [10.558622685760346]
We present a simple yet effective approach for achieving object generalization through Vision-Language-Action models.
Our method provides a lightweight and scalable way to inject knowledge about the target object.
We evaluate ObjectVLA on a real robotic platform, demonstrating its ability to generalize across 100 novel objects with a 64% success rate.
arXiv Detail & Related papers (2025-02-26T15:56:36Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
arXiv Detail & Related papers (2024-07-09T15:45:04Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Contrastive Learning for Enhancing Robust Scene Transfer in Vision-based Agile Flight [21.728935597793473]
This work proposes an adaptive multi-pair contrastive learning strategy for visual representation learning that enables zero-shot scene transfer and real-world deployment.
We demonstrate the performance of our approach on the task of agile, vision-based quadrotor flight.
arXiv Detail & Related papers (2023-09-18T15:25:59Z)
- CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow [22.161967080759993]
Self-supervised pre-training methods have not yet delivered on dense geometric vision tasks such as stereo matching or optical flow.
We build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene.
We show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques.
arXiv Detail & Related papers (2022-11-18T18:18:53Z)
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
- Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation [115.4071729927011]
We study the effects of using mid-level visual representations as generic and easy-to-decode perceptual state in an end-to-end RL framework.
We show that they aid generalization, improve sample complexity, and lead to a higher final performance.
In practice, this means that mid-level representations could be used to successfully train policies for tasks where domain randomization and learning-from-scratch failed.
arXiv Detail & Related papers (2020-11-13T00:16:05Z)
- Learning View and Target Invariant Visual Servoing for Navigation [9.873635079670093]
We learn viewpoint-invariant and target-invariant visual servoing for local mobile robot navigation.
We train a deep convolutional network controller to reach the desired goal.
arXiv Detail & Related papers (2020-03-04T20:36:43Z)