ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
- URL: http://arxiv.org/abs/2505.21465v1
- Date: Tue, 27 May 2025 17:36:23 GMT
- Title: ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
- Authors: Bozhou Li, Wentao Zhang,
- Abstract summary: A prevalent approach for enhancing Vision-Language Models (VLMs) performance is to encode both the high-resolution version and the thumbnail of an image simultaneously.<n>We propose ID-Align, which alleviates these problems by reordering position IDs.<n>Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements.
- Score: 24.087014423545067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Currently, a prevalent approach for enhancing Vision-Language Models (VLMs) performance is to encode both the high-resolution version and the thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When combined with the widely used Rotary Position Embedding (RoPE), its long-term decay property hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text and image. To address these issues, we propose ID-Align, which alleviates these problems by reordering position IDs. In this method, high-resolution tokens inherit IDs from their corresponding thumbnail token while constraining the overexpansion of positional indices. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a 6.09% enhancement on MMBench's relation reasoning tasks and notable gains across multiple benchmarks. Our code is available at the following link: https://github.com/zooblastlbz/ID-Align.
Related papers
- TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement [87.82338951215131]
TokenAR is a simple but effective token-level enhancement mechanism to address reference identity confusion problem.<n>Instruct Token Injection plays as a role of extra visual feature container to inject detailed and complementary priors for reference tokens.<n>The identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity.
arXiv Detail & Related papers (2025-10-18T03:36:26Z) - ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution [71.69364653858447]
Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs.<n>We propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying complexities using different numbers of vision tokens.<n> Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model's perception, reasoning, and OCR capabilities.
arXiv Detail & Related papers (2025-10-14T17:58:10Z) - Personalized Face Super-Resolution with Identity Decoupling and Fitting [50.473357681579664]
In extreme degradation scenarios, critical attributes and ID information are often severely lost in the input image.<n>Existing methods tend to generate hallucinated faces under such conditions, producing restored images lacking authentic ID constraints.<n>We propose a novel FSR method with Identity Decoupling and Fitting (IDFSR) to enhance ID restoration under large scaling factors.
arXiv Detail & Related papers (2025-08-13T02:33:11Z) - VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [95.89543460132413]
Vision-language models (VLMs) have improved performance by increasing the number of visual tokens.<n>However, most real-world scenarios do not require such an extensive number of visual tokens.<n>We present a new paradigm for visual token compression, namely, VisionThink.
arXiv Detail & Related papers (2025-07-17T17:59:55Z) - Images are Worth Variable Length of Representations [13.136831256070343]
Most vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information.<n>We propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens to reconstruct each image.<n>Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality.
arXiv Detail & Related papers (2025-06-04T07:40:33Z) - Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models [35.471513870514585]
Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models.<n>RoPE variants enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments.<n>We introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory to the linear path of text token indices, forming a cone-like structure.
arXiv Detail & Related papers (2025-05-22T09:05:01Z) - Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformer.<n>Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.<n>In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID [29.560370412849874]
This paper introduces Semantic ID prefix ngram, a novel token parameterization technique that significantly improves the performance of the original Semantic ID.<n>We show that Semantic ID prefix ngram not only addresses embedding instability but also significantly improves tail id modeling, reduces overfitting, and mitigates representation shifts.<n>We also report our experience of integrating Semantic ID into Meta production Ads Ranking system, leading to notable performance gains and enhanced prediction stability in live deployments.
arXiv Detail & Related papers (2025-04-02T21:28:38Z) - When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning [31.696397337675847]
Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images.<n>We propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration.<n>Our method outperforms existing high-resolution strategies on four datasets using the same data.
arXiv Detail & Related papers (2025-03-10T17:51:16Z) - ID-Patch: Robust ID Association for Group Photo Personalization [29.38844265790726]
ID-Patch is a novel method that provides robust association between identities and 2D positions.<n>Our approach generates an ID patch and ID embeddings from the same facial features.
arXiv Detail & Related papers (2024-11-20T18:55:28Z) - FlexAttention for Efficient High-Resolution Vision-Language Models [67.82024785677801]
We propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models.
A high-resolution image is encoded both as high-resolution tokens and low-resolution tokens, where only the low-resolution tokens and a few selected high-resolution tokens are utilized.
Experiments on multimodal benchmarks prove that our FlexAttention outperforms existing high-resolution VLMs.
arXiv Detail & Related papers (2024-07-29T17:59:05Z) - Synthesizing Efficient Data with Diffusion Models for Person Re-Identification Pre-Training [51.87027943520492]
We present a novel paradigm Diffusion-ReID to efficiently augment and generate diverse images based on known identities.
Benefiting from our proposed paradigm, we first create a new large-scale person Re-ID dataset Diff-Person, which consists of over 777K images from 5,183 identities.
arXiv Detail & Related papers (2024-06-10T06:26:03Z) - CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z) - PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding [102.07914175196817]
PhotoMaker is an efficient personalized text-to-image generation method.
It encodes an arbitrary number of input ID images into a stack ID embedding for preserving ID information.
arXiv Detail & Related papers (2023-12-07T17:32:29Z) - MultiRes-NetVLAD: Augmenting Place Recognition Training with
Low-Resolution Imagery [28.875236694573815]
We augment NetVLAD representation learning with low-resolution image pyramid encoding.
The resultant multi-resolution feature pyramid can be conveniently aggregated through VLAD into a single compact representation.
We show that the underlying learnt feature tensor can be combined with existing multi-scale approaches to improve their baseline performance.
arXiv Detail & Related papers (2022-02-18T11:53:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.