Memory Efficient Matting with Adaptive Token Routing
- URL: http://arxiv.org/abs/2412.10702v2
- Date: Tue, 17 Dec 2024 14:37:34 GMT
- Title: Memory Efficient Matting with Adaptive Token Routing
- Authors: Yiheng Lin, Yihan Hu, Chenyi Zhang, Ting Liu, Xiaochao Qu, Luoqi Liu, Yao Zhao, Yunchao Wei
- Abstract summary: Transformer-based models have recently achieved outstanding performance in image matting.
MEMatte is a memory-efficient matting framework for processing high-resolution images.
- Score: 73.09131141304984
- Abstract: Transformer-based models have recently achieved outstanding performance in image matting. However, their application to high-resolution images remains challenging due to the quadratic complexity of global self-attention. To address this issue, we propose MEMatte, a \textbf{m}emory-\textbf{e}fficient \textbf{m}atting framework for processing high-resolution images. MEMatte incorporates a router before each global attention block, directing informative tokens to the global attention while routing other tokens to a Lightweight Token Refinement Module (LTRM). Specifically, the router employs a local-global strategy to predict the routing probability of each token, and the LTRM utilizes efficient modules to simulate global attention. Additionally, we introduce a Batch-constrained Adaptive Token Routing (BATR) mechanism, which allows each router to dynamically route tokens based on image content and the stage of the attention block in the network. Furthermore, we construct an ultra high-resolution image matting dataset, UHR-395, comprising 35,500 training images and 1,000 test images, with an average resolution of $4872\times6017$. This dataset is created by compositing 395 different alpha mattes across 11 categories onto various backgrounds, all with high-quality manual annotation. Extensive experiments demonstrate that MEMatte outperforms existing methods on both high-resolution and real-world datasets, significantly reducing memory usage by approximately 88% and latency by 50% on the Composition-1K benchmark. Our code is available at https://github.com/linyiheng123/MEMatte.
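As a rough illustration of the routing idea described in the abstract, here is a minimal PyTorch sketch. It is not the authors' implementation: the names TokenRouter, LTRM, SelfAttn, and routed_block, the mean-pooled "global context", the MLP refinement, and the fixed top-k budget (loosely standing in for BATR's batch-constrained routing) are all assumptions.

```python
import torch
import torch.nn as nn


class TokenRouter(nn.Module):
    # Predicts a routing logit per token. The paper's "local-global strategy"
    # is approximated here by mixing each token with a mean-pooled context.
    def __init__(self, dim):
        super().__init__()
        self.local = nn.Linear(dim, dim)
        self.ctx = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                         # x: (B, N, C)
        ctx = x.mean(dim=1, keepdim=True)         # crude global context (assumption)
        h = torch.tanh(self.local(x) + self.ctx(ctx))
        return self.score(h).squeeze(-1)          # (B, N) routing logits


class LTRM(nn.Module):
    # Stand-in for the Lightweight Token Refinement Module: a cheap MLP
    # with cost linear in the number of tokens.
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.net(x)


class SelfAttn(nn.Module):
    # Ordinary quadratic self-attention, applied only to the routed tokens.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        return x + self.mha(x, x, x, need_weights=False)[0]


def routed_block(x, router, attn, ltrm, keep_ratio=0.25):
    # Send the top-k tokens (a fixed per-batch budget) through full
    # attention, route the rest through the LTRM, then merge in place.
    B, N, C = x.shape
    k = max(1, int(N * keep_ratio))
    keep_idx = router(x).topk(k, dim=1).indices               # (B, k)
    mask = torch.zeros(B, N, dtype=torch.bool, device=x.device)
    mask.scatter_(1, keep_idx, True)

    out = torch.empty_like(x)
    out[mask] = attn(x[mask].view(B, k, C)).reshape(-1, C)    # O(k^2) attention
    out[~mask] = ltrm(x[~mask].view(B, N - k, C)).reshape(-1, C)
    return out


# Usage: route a quarter of 1024 tokens through attention.
x = torch.randn(2, 1024, 256)
y = routed_block(x, TokenRouter(256), SelfAttn(256), LTRM(256))
```

The memory savings come from running quadratic self-attention on only k of the N tokens: with keep_ratio=0.25 and 1024 tokens, attention sees 256 tokens while the remaining 768 take the linear-cost refinement path.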
Related papers
- MINIMA: Modality Invariant Image Matching [52.505282811925454]
We present MINIMA, a unified image matching framework for multiple cross-modal cases.
We scale up the modalities from cheap but rich RGB-only matching data by means of generative models.
With the resulting synthetic multimodal dataset, MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability.
arXiv Detail & Related papers (2024-12-27T02:39:50Z) - Efficient and Discriminative Image Feature Extraction for Universal Image Retrieval [1.907072234794597]
We develop a framework for a universal feature extractor that provides strong semantic image representations across various domains.
We achieve near state-of-the-art results on the Google Universal Image Embedding Challenge, with a mMP@5 of 0.721.
Compared to methods with similar computational requirements, we outperform the previous state of the art by 3.3 percentage points.
arXiv Detail & Related papers (2024-09-20T13:53:13Z) - Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models [44.437693135170576]
We propose a new framework, LMM with Sophisticated Tasks, Local image compression, and Mixture of global Experts (SliME).
We extract contextual information from the global view using a mixture of adapters, based on the observation that different adapters excel at different tasks.
The proposed method achieves leading performance across various benchmarks with only 2 million training samples.
arXiv Detail & Related papers (2024-06-12T17:59:49Z) - Toward Real Text Manipulation Detection: New Dataset and New Solution [58.557504531896704]
High costs associated with professional text manipulation limit the availability of real-world datasets.
We present the Real Text Manipulation dataset, encompassing 14,250 text images.
Our contributions aim to propel advancements in real-world text tampering detection.
arXiv Detail & Related papers (2023-12-12T02:10:16Z) - Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution [22.60056946339325]
We propose the Pixel Adapter Module (PAM) based on graph attention to address pixel distortion caused by upsampling.
The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features.
We demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy.
arXiv Detail & Related papers (2023-09-16T08:12:12Z) - DiT: Efficient Vision Transformers with Dynamic Token Routing [37.808078064528374]
We propose a data-dependent token routing strategy that selects the routing path of each image token for a Dynamic Vision Transformer, dubbed DiT.
The proposed framework generates a data-dependent path per token, adapting to the object scales and visual discrimination of tokens.
In experiments, DiT achieves superior performance and more favorable complexity/accuracy trade-offs than many SoTA methods on ImageNet classification, object detection, instance segmentation, and semantic segmentation (a minimal sketch of this style of per-token routing appears after this list).
arXiv Detail & Related papers (2023-08-07T08:55:48Z) - Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation [66.15246197473897]
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation.
We propose a Multi-interactive Feature learning architecture for image fusion and Segmentation.
arXiv Detail & Related papers (2023-08-04T01:03:58Z) - Vision Transformer with Super Token Sampling [93.70963123497327]
Vision transformers have achieved impressive performance on many vision tasks.
However, they may suffer from high redundancy when capturing local features in shallow layers.
Super tokens attempt to provide a semantically meaningful tessellation of visual content.
arXiv Detail & Related papers (2022-11-21T03:48:13Z) - Bridging Composite and Real: Towards End-to-end Deep Image Matting [88.79857806542006]
We study the roles of semantics and details for image matting.
We propose a novel Glance and Focus Matting network (GFM), which employs a shared encoder and two separate decoders.
Comprehensive empirical studies have demonstrated that GFM outperforms state-of-the-art methods.
arXiv Detail & Related papers (2020-10-30T10:57:13Z) - High-Resolution Deep Image Matting [39.72708676319803]
HDMatt is the first deep-learning-based image matting approach for high-resolution inputs.
Our proposed method sets new state-of-the-art performance on Adobe Image Matting and AlphaMatting benchmarks.
arXiv Detail & Related papers (2020-09-14T17:53:15Z)
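The DiT entry above describes per-token, data-dependent path selection. The following is a hypothetical sketch of that style of routing, not the DiT architecture itself: the name TokenPathRouter, the two-path layout (heavy MLP vs. identity skip), and the soft gate are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TokenPathRouter(nn.Module):
    # Illustrative per-token router: each token softly mixes two paths,
    # a heavy MLP transform and a cheap identity skip. A hard per-token
    # path (as in dynamic routing) would replace the softmax with a
    # straight-through argmax at inference time.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 2)
        self.heavy = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):                     # x: (B, N, C)
        p = self.gate(x).softmax(dim=-1)      # (B, N, 2) path probabilities
        return p[..., :1] * self.heavy(x) + p[..., 1:] * x


# Usage: tokens the gate scores low effectively skip the heavy path.
x = torch.randn(2, 196, 384)
y = TokenPathRouter(384)(x)
```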