BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up
Patch Summarization
- URL: http://arxiv.org/abs/2307.08504v2
- Date: Sat, 24 Feb 2024 03:54:37 GMT
- Title: BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up
Patch Summarization
- Authors: Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan,
Bin Bi, Shikun Zhang, Fei Huang, Songfang Huang
- Abstract summary: We propose a Bottom-Up Patch Summarization approach named BUS to learn a concise summary of lengthy visual token sequences efficiently.
We incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform coarse-grained visual token extraction.
This bottom-up collaboration enables our BUS to yield high training efficiency while maintaining or even improving effectiveness.
- Score: 89.52943129132217
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformer (ViT) based Vision-Language Pre-training (VLP) models have
demonstrated impressive performance in various tasks. However, the lengthy
visual token sequences fed into ViT can lead to training inefficiency and
ineffectiveness. Existing efforts address this challenge either through
bottom-level patch extraction inside the ViT backbone or through top-level patch
abstraction outside it, but neither balances training efficiency and
effectiveness well. Inspired by text
summarization in natural language processing, we propose a Bottom-Up Patch
Summarization approach named BUS, coordinating bottom-level extraction and
top-level abstraction to learn a concise summary of lengthy visual token
sequences efficiently. Specifically, we incorporate a Text-Semantics-Aware
Patch Selector (TSPS) into the ViT backbone to perform coarse-grained visual
token extraction and then attach a flexible Transformer-based Patch Abstraction
Decoder (PAD) upon the backbone for top-level visual abstraction. This
bottom-up collaboration enables our BUS to yield high training efficiency while
maintaining or even improving effectiveness. We evaluate our approach on
various vision-language understanding and generation tasks and show competitive
downstream task performance while boosting training efficiency by 50%.
Additionally, our model achieves state-of-the-art performance on many
downstream tasks by increasing input image resolution without increasing
computational costs over baselines.
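The architecture described in the abstract amounts to a two-stage pipeline: a selector inside the ViT backbone keeps only the patch tokens relevant to the paired text, and a small Transformer decoder then abstracts the kept tokens into an even shorter summary sequence. Below is a minimal PyTorch-style sketch of that idea; the module names, scoring rule, keep ratio, and dimensions are illustrative assumptions, not the paper's actual TSPS/PAD implementation.

```python
# Minimal sketch of bottom-up patch summarization: text-aware top-k patch
# selection inside the backbone, followed by a small decoder that abstracts
# the kept patches into a fixed-length visual summary. All details below are
# assumptions for illustration only.
import torch
import torch.nn as nn


class TextAwarePatchSelector(nn.Module):
    """Scores visual tokens against a pooled text feature and keeps the top-k."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Linear(dim, 1)        # token-wise relevance head (assumed)
        self.text_proj = nn.Linear(dim, dim)  # projection of the text feature (assumed)

    def forward(self, patch_tokens, text_feat):
        # patch_tokens: (B, N, D), text_feat: (B, D)
        text = self.text_proj(text_feat).unsqueeze(1)             # (B, 1, D)
        relevance = self.score(patch_tokens * text).squeeze(-1)   # (B, N) fused score
        k = max(1, int(patch_tokens.size(1) * self.keep_ratio))
        topk = relevance.topk(k, dim=1).indices                   # indices of kept patches
        idx = topk.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
        return patch_tokens.gather(1, idx)                        # (B, k, D)


class PatchAbstractionDecoder(nn.Module):
    """Cross-attends a small set of learnable queries to the kept patch tokens,
    producing a fixed-length visual summary sequence."""

    def __init__(self, dim: int, num_summary: int = 16, num_layers: int = 2):
        super().__init__()
        self.summary_queries = nn.Parameter(torch.randn(num_summary, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, kept_tokens):
        # kept_tokens: (B, k, D) -> summary: (B, num_summary, D)
        queries = self.summary_queries.unsqueeze(0).expand(kept_tokens.size(0), -1, -1)
        return self.decoder(queries, kept_tokens)


if __name__ == "__main__":
    B, N, D = 2, 196, 768                  # e.g. 14x14 patches from a ViT-B/16
    patches = torch.randn(B, N, D)         # stand-in for backbone patch tokens
    text = torch.randn(B, D)               # stand-in for a pooled text embedding
    kept = TextAwarePatchSelector(D, keep_ratio=0.5)(patches, text)  # (2, 98, 768)
    summary = PatchAbstractionDecoder(D, num_summary=16)(kept)       # (2, 16, 768)
    print(kept.shape, summary.shape)
```

In this picture, the efficiency gain comes from downstream layers operating on the much shorter summary sequence rather than on all patch tokens, which matches the abstract's motivation for coordinating bottom-level extraction with top-level abstraction.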
Related papers
- Patch Ranking: Efficient CLIP by Learning to Rank Local Patches [11.225834286969283]
Current strategies to boost ViT efficiency focus on pruning patch tokens but fall short in addressing the multimodal nature of CLIP.
We propose greedy search methods to establish a "Golden Ranking" and introduce a lightweight predictor specifically trained to approximate this Ranking.
We reduce the number of patch tokens in CLIP's ViT by 40% while suffering only a minimal average accuracy loss of 0.3 across seven datasets.
arXiv Detail & Related papers (2024-09-22T22:04:26Z) - VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z) - Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection [66.72992463712299]
Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training models.
Previous research has demonstrated the efficacy of ViTs, but they still struggle with computational inefficiencies caused by lengthy visual sequences.
We introduce TRIPS, which reduces the visual sequence using a text-guided patch-selection layer in the visual backbone.
Our experimental results reveal that TRIPS delivers a 40% speedup, while maintaining competitive or superior performance on downstream tasks.
arXiv Detail & Related papers (2024-01-11T14:31:30Z) - ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions.
arXiv Detail & Related papers (2023-08-16T15:19:52Z) - Learning Expressive Prompting With Residuals for Vision Transformers [11.342913284654706]
We present Expressive Prompts with Residuals (EXPRES), which modifies the prompt learning paradigm specifically for effective adaptation of vision transformers (ViT).
We apply EXPRES to image classification, few-shot learning, and semantic segmentation, and show our method achieves state-of-the-art prompt tuning on 3/3 categories of the VTAB benchmark.
arXiv Detail & Related papers (2023-03-27T20:47:01Z) - Exploiting the Textual Potential from Vision-Language Pre-training for
Text-based Person Search [17.360982091304137]
Text-based Person Search (TPS) aims to retrieve pedestrians that match text descriptions rather than query images.
Recent Vision-Language Pre-training models can bring transferable knowledge to downstream TPS tasks, resulting in more efficient performance gains.
However, existing TPS methods only utilize pre-trained visual encoders, neglecting the corresponding textual representation.
arXiv Detail & Related papers (2023-03-08T10:41:22Z) - Patch-level Representation Learning for Self-supervised Vision
Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z) - Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z) - Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)