ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
- URL: http://arxiv.org/abs/2505.21465v1
- Date: Tue, 27 May 2025 17:36:23 GMT
- Title: ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
- Authors: Bozhou Li, Wentao Zhang
- Abstract summary: A prevalent approach for enhancing Vision-Language Model (VLM) performance is to encode both the high-resolution version and the thumbnail of an image simultaneously. We propose ID-Align, which alleviates the resulting problems by reordering position IDs. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements.
- Score: 24.087014423545067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Currently, a prevalent approach for enhancing the performance of Vision-Language Models (VLMs) is to encode both the high-resolution version and the thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When combined with the widely used Rotary Position Embedding (RoPE), its long-term decay property hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text and image tokens. To address these issues, we propose ID-Align, which alleviates these problems by reordering position IDs. In this method, high-resolution tokens inherit the IDs of their corresponding thumbnail tokens, which constrains the overexpansion of positional indices. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a 6.09% enhancement on MMBench's relation reasoning tasks and notable gains across multiple benchmarks. Our code is available at the following link: https://github.com/zooblastlbz/ID-Align.
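The core remapping idea in the abstract can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' implementation: it assumes a square thumbnail grid with row-major position IDs and a high-resolution grid that is an integer multiple of the thumbnail's side length, and assigns each high-resolution patch the ID of the thumbnail patch covering the same spatial region. Under RoPE's long-term decay, this keeps the relative distance between a high-resolution token and its thumbnail counterpart at zero, and the maximum ID no longer grows with resolution.

```python
# Hypothetical sketch of ID-Align-style position remapping (illustrative only;
# see the paper's repository for the actual implementation). Each high-res
# patch inherits the position ID of the thumbnail patch it falls inside, so
# the ID range stays bounded by the thumbnail grid size.

def remap_position_ids(thumb_side: int, scale: int, id_offset: int = 0) -> list[int]:
    """Return position IDs for the (thumb_side * scale)^2 high-resolution
    tokens in row-major order, inherited from the thumb_side^2 thumbnail
    tokens (whose IDs start at id_offset)."""
    hi_side = thumb_side * scale
    ids = []
    for row in range(hi_side):
        for col in range(hi_side):
            # Thumbnail patch that spatially covers this high-res patch.
            t_row, t_col = row // scale, col // scale
            ids.append(id_offset + t_row * thumb_side + t_col)
    return ids

if __name__ == "__main__":
    # A 2x2 thumbnail upscaled by 2 gives a 4x4 high-res grid that reuses
    # IDs 0..3 instead of the naive 0..15.
    print(remap_position_ids(thumb_side=2, scale=2))
```

With a naive row-major assignment the 4x4 grid would consume 16 fresh positional indices; the remapped grid reuses the thumbnail's 4 IDs, which is the "constraining the overexpansion of positional indices" behavior the abstract describes.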
Related papers
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [95.89543460132413]
Vision-language models (VLMs) have improved performance by increasing the number of visual tokens. However, most real-world scenarios do not require such an extensive number of visual tokens. We present a new paradigm for visual token compression, namely, VisionThink.
arXiv Detail & Related papers (2025-07-17T17:59:55Z)
- Images are Worth Variable Length of Representations [13.136831256070343]
Most vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. We propose DOVE, a dynamic vision encoder that produces a variable number of visual tokens to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality.
arXiv Detail & Related papers (2025-06-04T07:40:33Z)
- Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models [35.471513870514585]
Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models. RoPE variants enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. We introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure.
arXiv Detail & Related papers (2025-05-22T09:05:01Z)
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformers. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID [29.560370412849874]
This paper introduces Semantic ID prefix ngram, a novel token parameterization technique that significantly improves the performance of the original Semantic ID. We show that Semantic ID prefix ngram not only addresses embedding instability but also significantly improves tail ID modeling, reduces overfitting, and mitigates representation shifts. We also report our experience of integrating Semantic ID into Meta's production Ads Ranking system, leading to notable performance gains and enhanced prediction stability in live deployments.
arXiv Detail & Related papers (2025-04-02T21:28:38Z)
- ID-Patch: Robust ID Association for Group Photo Personalization [29.38844265790726]
ID-Patch is a novel method that provides robust association between identities and 2D positions. Our approach generates an ID patch and ID embeddings from the same facial features.
arXiv Detail & Related papers (2024-11-20T18:55:28Z)
- FlexAttention for Efficient High-Resolution Vision-Language Models [67.82024785677801]
We propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models.
A high-resolution image is encoded as both high-resolution tokens and low-resolution tokens, of which only the low-resolution tokens and a few selected high-resolution tokens are utilized.
Experiments on multimodal benchmarks prove that our FlexAttention outperforms existing high-resolution VLMs.
arXiv Detail & Related papers (2024-07-29T17:59:05Z)
- Synthesizing Efficient Data with Diffusion Models for Person Re-Identification Pre-Training [51.87027943520492]
We present Diffusion-ReID, a novel paradigm for efficiently augmenting and generating diverse images based on known identities.
Benefiting from our proposed paradigm, we create a new large-scale person Re-ID dataset, Diff-Person, which consists of over 777K images from 5,183 identities.
arXiv Detail & Related papers (2024-06-10T06:26:03Z)
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z)
- PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding [102.07914175196817]
PhotoMaker is an efficient personalized text-to-image generation method.
It encodes an arbitrary number of input ID images into a stacked ID embedding to preserve identity information.
arXiv Detail & Related papers (2023-12-07T17:32:29Z)
- MultiRes-NetVLAD: Augmenting Place Recognition Training with Low-Resolution Imagery [28.875236694573815]
We augment NetVLAD representation learning with low-resolution image pyramid encoding.
The resultant multi-resolution feature pyramid can be conveniently aggregated through VLAD into a single compact representation.
We show that the underlying learned feature tensor can be combined with existing multi-scale approaches to improve their baseline performance.
arXiv Detail & Related papers (2022-02-18T11:53:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.