TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
- URL: http://arxiv.org/abs/2305.04474v4
- Date: Sun, 28 Sep 2025 08:22:30 GMT
- Title: TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
- Authors: Chaoya Jiang, Haiyang Xu, Chenliang Li, Ming Yan, Wei Ye, Shikun Zhang, Bin Bi, Songfang Huang
- Abstract summary: We propose an efficient vision-and-language pre-training model with Text-Relevant Image Patch Selection, namely TRIPS. TRIPS reduces the visual sequence progressively with a text-guided patch-selection layer in the visual backbone for efficient training and inference.
- Score: 61.0662744915659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have been widely used in large-scale Vision and Language Pre-training (VLP) models. Though previous VLP works have proved the effectiveness of ViTs, they still suffer from the computational inefficiency caused by long visual sequences. To tackle this problem, in this paper we propose an efficient vision-and-language pre-training model with Text-Relevant Image Patch Selection, namely TRIPS, which reduces the visual sequence progressively with a text-guided patch-selection layer in the visual backbone for efficient training and inference. The patch-selection layer can dynamically compute text-dependent visual attention to identify the attentive image tokens with text guidance and fuse inattentive ones in an end-to-end manner. Meanwhile, TRIPS does not introduce extra parameters to ViTs. Experimental results on a variety of popular benchmark datasets demonstrate that TRIPS gains a speedup of 40% over previous similar VLP models, yet with competitive or better downstream task performance.
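As a rough illustration of the mechanism described above, the following PyTorch-style sketch scores patches with a pooled text query, keeps the top-scoring ones, and fuses the rest into a single token. The pooled-query scoring, the keep ratio, and the attention-weighted fusion are assumptions for illustration, not the paper's exact layer.

```python
import torch


def text_guided_patch_selection(patch_tokens, text_feat, keep_ratio=0.5):
    """Keep the image patches most attended by the text and fuse the rest.

    patch_tokens: (B, N, D) patch embeddings from the ViT backbone.
    text_feat:    (B, D)    a pooled text representation used as the query.
    keep_ratio:   fraction of patches retained as "attentive" tokens.

    Returns (B, K + 1, D): the K selected patches plus one fused token that
    summarizes the inattentive patches, so no patch is discarded outright.
    """
    B, N, D = patch_tokens.shape
    K = max(1, int(N * keep_ratio))

    # Text-conditioned attention scores over patches (scaled dot product).
    scores = torch.einsum("bd,bnd->bn", text_feat, patch_tokens) / D ** 0.5
    attn = scores.softmax(dim=-1)                               # (B, N)

    # Indices of the top-K attentive patches.
    keep_idx = attn.topk(K, dim=-1).indices                     # (B, K)
    keep_mask = torch.zeros_like(attn, dtype=torch.bool).scatter(1, keep_idx, True)
    kept = patch_tokens[keep_mask].view(B, K, D)

    # Fuse the inattentive patches into one token, weighted by attention.
    rest_attn = attn.masked_fill(keep_mask, 0.0)
    rest_attn = rest_attn / rest_attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    fused = torch.einsum("bn,bnd->bd", rest_attn, patch_tokens).unsqueeze(1)

    return torch.cat([kept, fused], dim=1)                      # (B, K + 1, D)
```

Because selection and fusion are only attention and top-k over existing embeddings, a layer of this form adds no new parameters, which is consistent with the claim in the abstract.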
Related papers
- Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs [88.68484904214142]
We introduce Patch-as-Decodable Token (PaDT) to generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images. We show PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLMs.
arXiv Detail & Related papers (2025-10-02T12:23:57Z) - HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models [23.98782884568504]
We propose Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models (HiViS). HiViS is an explicit-implicit input decomposition framework that alleviates the inefficiency of speculative decoding in vision-language models. Our approach compresses the prefill sequence length of the drafter to only 0.7%-1.3% of the target VLM's input.
arXiv Detail & Related papers (2025-09-28T15:05:21Z) - Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training [78.60953331455565]
PRIOR is a vision-language pre-training approach that prioritizes image-related tokens through differential weighting in the next-token prediction (NTP) loss. We observe 19% and 8% average relative improvements on several vision-language benchmarks compared to standard NTP.
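A minimal sketch of differentially weighted next-token prediction along these lines is shown below; how the per-token image-relatedness weights are obtained is left as an input and is an assumption, not PRIOR's actual weighting scheme.

```python
import torch
import torch.nn.functional as F


def weighted_ntp_loss(logits, targets, image_related_weight, alpha=2.0):
    """Next-token-prediction loss that up-weights image-related tokens.

    logits:  (B, T, V) language-model logits.
    targets: (B, T)    next-token targets (-100 marks positions to ignore).
    image_related_weight: (B, T) in [0, 1], a per-token estimate of how much
        the token depends on the image (assumed to be precomputed here).
    alpha: extra weight applied to fully image-related tokens.
    """
    B, T, V = logits.shape
    token_loss = F.cross_entropy(
        logits.reshape(B * T, V), targets.reshape(B * T),
        ignore_index=-100, reduction="none",
    ).view(B, T)

    weights = 1.0 + alpha * image_related_weight        # plain-text tokens keep weight 1
    mask = (targets != -100).float()
    return (weights * token_loss * mask).sum() / mask.sum().clamp_min(1.0)
```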
arXiv Detail & Related papers (2025-05-13T21:27:52Z) - Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z) - Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection [66.72992463712299]
Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training models.
Previous research has demonstrated the efficacy of ViTs, but they still struggle with computational inefficiencies caused by lengthy visual sequences.
We introduce TRIPS, which reduces the visual sequence using a text-guided patch-selection layer in the visual backbone.
Our experimental results reveal that TRIPS delivers a 40% speedup, while maintaining competitive or superior performance on downstream tasks.
arXiv Detail & Related papers (2024-01-11T14:31:30Z) - Improving Contrastive Learning of Sentence Embeddings with Focal-InfoNCE [13.494159547236425]
This study introduces an unsupervised contrastive learning framework that combines SimCSE with hard negative mining.
The proposed Focal-InfoNCE function introduces self-paced modulation terms in the contrastive objective, down-weighting the loss associated with easy negatives and encouraging the model to focus on hard negatives.
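The sketch below shows one plausible way to modulate InfoNCE so that easy negatives are down-weighted; the specific modulation (scaling each negative by its detached softmax probability raised to a focusing power) is an assumption for illustration, not the paper's exact Focal-InfoNCE formula.

```python
import torch
import torch.nn.functional as F


def focal_infonce(z1, z2, temperature=0.05, gamma=2.0):
    """InfoNCE with a focal-style modulation that down-weights easy negatives.

    z1, z2: (B, D) embeddings of two views of the same sentences; positives
            lie on the diagonal of the similarity matrix, as in SimCSE.
    gamma:  focusing parameter; larger gamma suppresses easy negatives more.
    """
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / temperature                      # (B, B)
    B = sim.size(0)

    # How "hard" each candidate currently is for the model.
    prob = sim.softmax(dim=-1).detach()
    eye = torch.eye(B, device=sim.device, dtype=torch.bool)

    # Modulate negatives only; the positive term is left untouched.
    neg_weight = prob.pow(gamma).clamp_min(1e-6).masked_fill(eye, 1.0)
    logits = sim + neg_weight.log()                      # equivalent to scaling exp(sim)

    targets = torch.arange(B, device=sim.device)
    return F.cross_entropy(logits, targets)
```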
arXiv Detail & Related papers (2023-10-10T18:15:24Z) - Identical and Fraternal Twins: Fine-Grained Semantic Contrastive Learning of Sentence Representations [6.265789210037749]
We introduce a novel Identical and Fraternal Twins of Contrastive Learning framework, capable of simultaneously adapting to various positive pairs generated by different augmentation techniques.
We also present proof-of-concept experiments combined with the contrastive objective to prove the validity of the proposed Twins Loss.
arXiv Detail & Related papers (2023-07-20T15:02:42Z) - BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization [89.52943129132217]
We propose a Bottom-Up Patch Summarization approach named BUS to learn a concise summary of lengthy visual token sequences efficiently.
We incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform a coarse-grained visual token extraction.
This bottom-up collaboration enables our BUS to yield high training efficiency while maintaining or even improving effectiveness.
arXiv Detail & Related papers (2023-07-17T14:08:17Z) - Exploiting Pseudo Image Captions for Multimodal Summarization [26.033681302592207]
Cross-modal contrastive learning in vision language pretraining faces the challenge of (partial) false negatives.
We propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, to more accurately optimize mutual information (MI) between an image/text anchor and its negative texts/images.
arXiv Detail & Related papers (2023-05-09T14:47:25Z) - Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search [17.360982091304137]
Text-based Person Search (TPS) aims at retrieving pedestrians that match text descriptions instead of query images.
Recent Vision-Language Pre-training models can bring transferable knowledge to downstream TPS tasks, resulting in more efficient performance gains.
However, existing TPS methods only utilize pre-trained visual encoders, neglecting the corresponding textual representation.
arXiv Detail & Related papers (2023-03-08T10:41:22Z) - Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast [34.58856143210749]
We present an approach to learn voice-face representations from talking face videos, without any identity labels.
Previous works employ cross-modal instance discrimination tasks to establish the correlation of voice and face.
We propose cross-modal prototype contrastive learning (CMPC), which takes advantage of contrastive methods and resists the adverse effects of false negatives and deviated positives.
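A minimal sketch of contrasting instances against prototypes of the other modality is given below; the offline clustering that produces the prototypes and assignments is assumed rather than shown, and the details differ from the paper's full CMPC objective.

```python
import torch
import torch.nn.functional as F


def prototype_contrast(anchors, prototypes, assignments, temperature=0.1):
    """Cross-modal prototype contrastive loss (illustrative sketch).

    anchors:     (B, D) embeddings from one modality (e.g. voice).
    prototypes:  (C, D) cluster centroids computed from the other modality
                 (e.g. face embeddings clustered offline with k-means).
    assignments: (B,)   index of the prototype each anchor belongs to.

    Contrasting against prototypes instead of individual instances softens
    the effect of false negatives: samples sharing a cluster are never
    pushed away from their common centroid.
    """
    anchors = F.normalize(anchors, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    logits = anchors @ prototypes.t() / temperature      # (B, C)
    return F.cross_entropy(logits, assignments)
```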
arXiv Detail & Related papers (2022-04-28T07:28:56Z) - Robust Contrastive Learning against Noisy Views [79.71880076439297]
We propose a new contrastive loss function that is robust against noisy views.
We show that our approach provides consistent improvements over the state-of-the-art image, video, and graph contrastive learning benchmarks.
arXiv Detail & Related papers (2022-01-12T05:24:29Z) - Max-Margin Contrastive Learning [120.32963353348674]
We present max-margin contrastive learning (MMCL) for unsupervised representation learning.
Our approach selects negatives as the sparse support vectors obtained via a quadratic optimization problem.
We validate our approach on standard vision benchmark datasets, demonstrating better performance in unsupervised representation learning.
arXiv Detail & Related papers (2021-12-21T18:56:54Z) - Revisiting Contrastive Learning through the Lens of Neighborhood Component Analysis: an Integrated Framework [70.84906094606072]
We show a new methodology to design integrated contrastive losses that could simultaneously achieve good accuracy and robustness on downstream tasks.
With the integrated framework, we achieve up to 6% improvement on the standard accuracy and 17% improvement on the adversarial accuracy.
arXiv Detail & Related papers (2021-12-08T18:54:11Z) - Incremental False Negative Detection for Contrastive Learning [95.68120675114878]
We introduce a novel incremental false negative detection for self-supervised contrastive learning.
During contrastive learning, we discuss two strategies to explicitly remove the detected false negatives.
Our proposed method outperforms other self-supervised contrastive learning frameworks on multiple benchmarks with limited compute.
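The sketch below illustrates the elimination strategy: negatives flagged as false are masked out of the softmax so they are neither attracted nor repelled. How the false-negative mask is detected (e.g., by clustering embeddings) is assumed here; attraction would be the other strategy mentioned above.

```python
import torch
import torch.nn.functional as F


def contrastive_loss_without_false_negatives(z1, z2, false_neg_mask, temperature=0.2):
    """InfoNCE in which detected false negatives are simply eliminated.

    z1, z2:         (B, D) embeddings of two augmented views; positives are
                    on the diagonal of the similarity matrix.
    false_neg_mask: (B, B) boolean, True where a "negative" is believed to
                    depict the same semantic content as the anchor.
    """
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / temperature
    B = sim.size(0)
    eye = torch.eye(B, device=sim.device, dtype=torch.bool)

    # Never mask the true positive on the diagonal.
    drop = false_neg_mask & ~eye
    sim = sim.masked_fill(drop, float("-inf"))

    targets = torch.arange(B, device=sim.device)
    return F.cross_entropy(sim, targets)
```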
arXiv Detail & Related papers (2021-06-07T15:29:14Z) - Solving Inefficiency of Self-supervised Representation Learning [87.30876679780532]
Existing contrastive learning methods suffer from very low learning efficiency.
Under-clustering and over-clustering problems are major obstacles to learning efficiency.
We propose a novel self-supervised learning framework using a median triplet loss.
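As an illustration of the idea, the sketch below replaces the hardest negative in a triplet loss with the median-similarity negative; this is an interpretation of the summary above, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F


def median_triplet_loss(anchor, positive, negatives, margin=0.2):
    """Triplet loss that uses the median-hard negative instead of the hardest.

    anchor, positive: (B, D) embeddings of an anchor and its positive view.
    negatives:        (B, N, D) candidate negatives for each anchor.

    Picking the negative at the median similarity avoids both extremes: the
    easiest negatives (which teach nothing, causing under-clustering) and the
    very hardest ones (often false negatives, causing over-clustering).
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = (anchor * positive).sum(dim=-1)                      # (B,)
    neg_sim = torch.einsum("bd,bnd->bn", anchor, negatives)        # (B, N)
    median_neg_sim = neg_sim.median(dim=-1).values                 # (B,)

    return F.relu(median_neg_sim - pos_sim + margin).mean()
```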
arXiv Detail & Related papers (2021-04-18T07:47:10Z) - ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision [10.584604416749965]
We present a minimal Vision-and-Language Transformer (ViLT) model for vision-and-language downstream tasks.
ViLT is monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which textual inputs are processed.
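A toy single-stream model in this spirit is sketched below: flattened image patches are linearly projected, tagged with a modality embedding, and concatenated with word embeddings into one transformer. All dimensions and the text length are illustrative assumptions, not ViLT's actual configuration.

```python
import torch
import torch.nn as nn


class MinimalViLT(nn.Module):
    """Toy single-stream vision-and-language transformer in the spirit of ViLT.

    Image patches are flattened and linearly projected (no CNN backbone, no
    region detector) and run through the same transformer as word embeddings.
    """

    def __init__(self, vocab_size=30522, dim=768, patch=32, img=384, max_text=40, layers=12):
        super().__init__()
        self.patch = patch
        num_patches = (img // patch) ** 2
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.patch_proj = nn.Linear(3 * patch * patch, dim)      # linear patch embedding
        self.pos_text = nn.Parameter(torch.zeros(1, max_text, dim))
        self.pos_img = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.modality_emb = nn.Parameter(torch.zeros(2, dim))    # text vs. image type embedding
        layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, token_ids, image):
        # token_ids: (B, T <= max_text); image: (B, 3, img, img)
        p = self.patch
        t = self.word_emb(token_ids) + self.pos_text[:, : token_ids.size(1)] + self.modality_emb[0]
        # Cut the image into non-overlapping patches and flatten each one.
        patches = image.unfold(2, p, p).unfold(3, p, p)          # (B, 3, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)
        v = self.patch_proj(patches) + self.pos_img + self.modality_emb[1]
        # One multimodal sequence, one transformer: no modality-specific encoder.
        return self.encoder(torch.cat([t, v], dim=1))
```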
arXiv Detail & Related papers (2021-02-05T18:36:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.