Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
- URL: http://arxiv.org/abs/2404.12139v1
- Date: Thu, 18 Apr 2024 12:41:33 GMT
- Title: Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
- Authors: Shouwei Ruan, Yinpeng Dong, Hanqing Liu, Yao Huang, Hang Su, Xingxing Wei
- Abstract summary: We build a dataset of over four million multi-view image-text pairs across more than 100K objects.
We design a novel fine-tuning framework named Omniview-Tuning (OVT).
OVT introduces a Cross-Viewpoint Alignment objective through a minimax-like optimization strategy.
- Score: 32.83187649097727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Pre-training (VLP) models like CLIP have achieved remarkable success in computer vision and, in particular, demonstrate superior robustness to distribution shifts of 2D images. However, their robustness under 3D viewpoint variations is still limited, which can hinder their deployment in real-world applications. This paper addresses this concern while preserving VLPs' original performance by breaking through two primary obstacles: 1) the scarcity of training data and 2) the suboptimal fine-tuning paradigms. To combat data scarcity, we build the Multi-View Caption (MVCap) dataset -- a comprehensive collection of over four million multi-view image-text pairs across more than 100K objects, providing more potential for VLP models to develop generalizable viewpoint-invariant representations. To address the limitations of existing paradigms in performance trade-offs and training efficiency, we design a novel fine-tuning framework named Omniview-Tuning (OVT). Specifically, OVT introduces a Cross-Viewpoint Alignment objective through a minimax-like optimization strategy, which effectively aligns representations of identical objects from diverse viewpoints without causing overfitting. Additionally, OVT fine-tunes VLP models in a parameter-efficient manner, incurring minimal computational cost. Extensive experiments on various VLP models with different architectures validate that OVT significantly improves the models' resilience to viewpoint shifts while keeping their original performance, establishing a pioneering standard for boosting the viewpoint invariance of VLP models.
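To make the Cross-Viewpoint Alignment idea concrete, below is a minimal PyTorch sketch under assumed interfaces; the encoder call, the anchor-view choice, and the hard-viewpoint selection are illustrative and are not taken from the paper or its released code. The inner step picks, per object, the viewpoint whose embedding is least aligned with a reference view, and the outer optimizer step minimizes that worst-case misalignment, mirroring the minimax-like strategy described in the abstract.

```python
import torch
import torch.nn.functional as F

def cross_viewpoint_alignment_loss(image_encoder, views, anchor_idx=0):
    """Sketch of a cross-viewpoint alignment objective (names are illustrative).

    views: (B, V, C, H, W) tensor holding V viewpoints of the same B objects.
    image_encoder: any VLP image tower returning (N, D) features.
    """
    B, V = views.shape[:2]
    feats = image_encoder(views.flatten(0, 1))            # (B*V, D)
    feats = F.normalize(feats, dim=-1).view(B, V, -1)     # (B, V, D), unit norm

    anchor = feats[:, anchor_idx]                          # reference view, (B, D)
    others = torch.cat(
        [feats[:, :anchor_idx], feats[:, anchor_idx + 1:]], dim=1
    )                                                      # remaining views, (B, V-1, D)

    # Misalignment = 1 - cosine similarity to the reference view.
    misalignment = 1.0 - torch.einsum("bd,bvd->bv", anchor, others)

    # Inner "max": pick the hardest (least aligned) viewpoint per object.
    worst_case, _ = misalignment.max(dim=1)

    # Outer "min": the optimizer step on this loss pulls the worst-case
    # viewpoint embedding toward the reference view.
    return worst_case.mean()
```

In a parameter-efficient setup, one would typically freeze the VLP backbone and let only a lightweight adapter or projection head receive gradients from this loss, alongside the usual image-text contrastive term that preserves the original zero-shot performance.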
Related papers
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average accuracy margin of 0.77% on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- ED-ViT: Splitting Vision Transformer for Distributed Inference on Edge Devices [13.533267828812455]
We propose a novel Vision Transformer splitting framework, ED-ViT, to execute complex models across multiple edge devices efficiently.
Specifically, we partition Vision Transformer models into several sub-models, where each sub-model is tailored to handle a specific subset of data classes.
We conduct extensive experiments on five datasets with three model structures, demonstrating that our approach significantly reduces inference latency on edge devices.
arXiv Detail & Related papers (2024-10-15T14:38:14Z)
- Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality [69.76121008898677]
Fine-grained Selective Calibrated CLIP (FSC-CLIP) integrates local hard negative loss and selective calibrated regularization.
Our evaluations show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.
arXiv Detail & Related papers (2024-10-07T17:16:20Z)
- ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models.
Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios.
We propose a gradual backbone reversal approach founded on model merging.
arXiv Detail & Related papers (2024-09-23T17:47:59Z)
- Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities [11.53488611812612]
Recent advancements in Vision-Language (VL) models have sparked interest in their deployment on edge devices.
We introduce EdgeVL, a novel framework that seamlessly integrates dual-modality knowledge distillation and quantization-aware contrastive learning.
Our work represents the first systematic effort to adapt large VL models for edge deployment, showcasing up to 15.4% accuracy improvements on multiple datasets and up to 93-fold reduction in model size.
arXiv Detail & Related papers (2024-03-07T21:34:40Z)
- VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for pre-trained vision model (PVM) finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z)
- Towards Viewpoint-Invariant Visual Recognition via Adversarial Training [28.424131496622497]
We propose Viewpoint-Invariant Adversarial Training (VIAT) to improve the viewpoint robustness of common image classifiers.
VIAT is formulated as a minimax optimization problem, where the inner maximization characterizes diverse adversarial viewpoints.
To further improve the generalization performance, a distribution sharing strategy is introduced.
arXiv Detail & Related papers (2023-07-16T07:55:42Z)
- Position-guided Text Prompt for Vision-Language Pre-training [121.15494549650548]
We propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with Vision-Language Pre-Training.
PTP reformulates the visual grounding task into a fill-in-the-blank problem, encouraging the model to predict the objects in given image blocks or to regress the blocks containing a given object.
PTP achieves results comparable to object-detector-based methods with much faster inference, since PTP discards the object detector at inference time while the latter cannot.
arXiv Detail & Related papers (2022-12-19T18:55:43Z)
- Towards a Unified View on Visual Parameter-Efficient Transfer Learning [96.99924127527002]
We propose a framework with a unified view, called visual-PETL (V-PETL), to investigate the different aspects affecting the trade-off between performance and parameter efficiency.
An effective scheme Swin-BAPAT derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin.
arXiv Detail & Related papers (2022-10-03T09:54:39Z)