Lightweight Portrait Matting via Regional Attention and Refinement
- URL: http://arxiv.org/abs/2311.03770v1
- Date: Tue, 7 Nov 2023 07:14:28 GMT
- Title: Lightweight Portrait Matting via Regional Attention and Refinement
- Authors: Yatao Zhong and Ilya Zharkov
- Abstract summary: We present a lightweight model for high resolution portrait matting.
The model does not use any auxiliary inputs such as trimaps or background captures.
It achieves real time performance for HD videos and near real time for 4K.
- Score: 7.206702064210176
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a lightweight model for high resolution portrait matting. The
model does not use any auxiliary inputs such as trimaps or background captures
and achieves real time performance for HD videos and near real time for 4K. Our
model is built upon a two-stage framework with a low resolution network for
coarse alpha estimation followed by a refinement network for local region
improvement. However, a naive implementation of the two-stage model suffers
from poor matting quality when no auxiliary inputs are used. We address the
performance gap by leveraging the vision transformer (ViT) as the backbone of
the low resolution network, motivated by the observation that the tokenization
step of ViT can reduce spatial resolution while retaining as much pixel
information as possible. To inform local regions of the context, we propose a
novel cross region attention (CRA) module in the refinement network to
propagate the contextual information across the neighboring regions. We
demonstrate that our method achieves superior results and outperforms other
baselines on three benchmark datasets while using only $1/20$ of the FLOPS
compared to the existing state-of-the-art model.
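As a rough illustration of the two-stage design described in the abstract, the PyTorch-style sketch below shows one way a coarse low-resolution pass could be combined with region-wise refinement and a cross-region attention step. All module names, shapes, and the attention layout are assumptions made for illustration; they are not the authors' released architecture.

```python
# Hypothetical sketch of a two-stage matting pipeline with a cross-region
# attention (CRA) step. Shapes, module names, and the attention layout are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossRegionAttention(nn.Module):
    """Lets tokens of a local region attend to features pooled from neighboring regions."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, region_tokens, neighbor_tokens):
        # region_tokens:   (num_regions, tokens_per_region, dim)
        # neighbor_tokens: (num_regions, num_neighbor_tokens, dim), pooled from adjacent regions
        out, _ = self.attn(region_tokens, neighbor_tokens, neighbor_tokens)
        return region_tokens + out  # residual: context flows in from neighboring regions

class TwoStageMatting(nn.Module):
    def __init__(self, coarse_net, refine_net, scale=0.25):
        super().__init__()
        self.coarse_net = coarse_net  # e.g. a ViT-backed low-resolution alpha estimator
        self.refine_net = refine_net  # region-wise refinement that uses CRA internally
        self.scale = scale

    def forward(self, image):
        # Stage 1: coarse alpha from a downsampled input.
        low = F.interpolate(image, scale_factor=self.scale,
                            mode="bilinear", align_corners=False)
        coarse_alpha, context = self.coarse_net(low)  # assumed to return alpha + context features
        coarse_up = F.interpolate(coarse_alpha, size=image.shape[-2:],
                                  mode="bilinear", align_corners=False)
        # Stage 2: refine local regions at full resolution, conditioned on propagated context.
        return self.refine_net(image, coarse_up, context)
```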
Related papers
- DRCT: Saving Image Super-resolution away from Information Bottleneck [7.765333471208582]
Vision Transformer-based approaches for low-level vision tasks have achieved widespread success.
Dense-residual-connected Transformer (DRCT) is proposed to mitigate the loss of spatial information.
Our approach surpasses state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2024-03-31T15:34:45Z)
- FocusTune: Tuning Visual Localization through Focus-Guided Sampling [61.79440120153917]
FocusTune is a focus-guided sampling technique to improve the performance of visual localization algorithms.
We demonstrate that FocusTune improves or matches state-of-the-art performance while keeping ACE's appealingly low storage and compute requirements.
This combination of high performance and low compute and storage requirements is particularly promising for applications in areas like mobile robotics and augmented reality.
arXiv Detail & Related papers (2023-11-06T04:58:47Z)
- Optimization Efficient Open-World Visual Region Recognition [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model.
Experiments in open-world object recognition show that our RegionSpot achieves significant performance gain over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z)
- Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention [100.81495948184649]
We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text.
Our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models.
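The linear scaling described here typically comes from cross-attending a small, fixed set of learned latent vectors to the long input sequence, so cost grows with sequence length rather than its square. A minimal sketch of that pattern follows, with illustrative names and sizes rather than the Perceiver-VL implementation.

```python
# Minimal sketch of latent cross-attention: a fixed number of learned latents
# attends to an arbitrarily long multimodal sequence, so compute is
# O(num_latents * seq_len) instead of O(seq_len ** 2). Illustrative only.
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    def __init__(self, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, inputs):
        # inputs: (batch, seq_len, dim), e.g. video patch tokens concatenated with text tokens
        batch = inputs.shape[0]
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.attn(latents, inputs, inputs)  # queries are the latents
        return out  # (batch, num_latents, dim) compact summary of the long input
```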
arXiv Detail & Related papers (2022-11-21T18:22:39Z)
- CLONeR: Camera-Lidar Fusion for Occupancy Grid-aided Neural Representations [77.90883737693325]
This paper proposes CLONeR, which significantly improves upon NeRF by allowing it to model large outdoor driving scenes observed from sparse input sensor views.
This is achieved by decoupling occupancy and color learning within the NeRF framework into separate Multi-Layer Perceptrons (MLPs) trained using LiDAR and camera data, respectively.
In addition, this paper proposes a novel method to build differentiable 3D Occupancy Grid Maps (OGM) alongside the NeRF model, and leverage this occupancy grid for improved sampling of points along a ray for rendering in metric space.
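A rough sketch of the decoupling described above, with two independent MLPs standing in for the occupancy and color branches; layer sizes and encodings are placeholders rather than the paper's actual networks.

```python
# Rough sketch of decoupled occupancy and color learning: one MLP predicts
# occupancy/density (which LiDAR depth could supervise) and a separate MLP
# predicts color (which camera pixels could supervise). Sizes are placeholders.
import torch
import torch.nn as nn

def make_mlp(in_dim, out_dim, hidden=256, depth=4):
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU(inplace=True)]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class DecoupledNeRF(nn.Module):
    def __init__(self, pos_dim=63, view_dim=27):
        super().__init__()
        self.occupancy_mlp = make_mlp(pos_dim, 1)         # geometry branch
        self.color_mlp = make_mlp(pos_dim + view_dim, 3)  # appearance branch

    def forward(self, pos_enc, view_enc):
        # pos_enc/view_enc: positionally encoded sample points and view directions
        sigma = self.occupancy_mlp(pos_enc)
        rgb = torch.sigmoid(self.color_mlp(torch.cat([pos_enc, view_enc], dim=-1)))
        return sigma, rgb
```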
arXiv Detail & Related papers (2022-09-02T17:44:50Z)
- SALISA: Saliency-based Input Sampling for Efficient Video Object Detection [58.22508131162269]
We propose SALISA, a novel non-uniform SALiency-based Input SAmpling technique for video object detection.
We show that SALISA significantly improves the detection of small objects.
arXiv Detail & Related papers (2022-04-05T17:59:51Z)
- Monocular Real-Time Volumetric Performance Capture [28.481131687883256]
We present the first approach to volumetric performance capture and novel-view rendering at real-time speed from monocular video.
Our system reconstructs a fully textured 3D human from each frame by leveraging Pixel-Aligned Implicit Function (PIFu).
We also introduce an Online Hard Example Mining (OHEM) technique that effectively suppresses failure modes due to the rare occurrence of challenging examples.
arXiv Detail & Related papers (2020-07-28T04:45:13Z)
- Spatial-Spectral Residual Network for Hyperspectral Image Super-Resolution [82.1739023587565]
We propose a novel spectral-spatial residual network for hyperspectral image super-resolution (SSRNet).
Our method can effectively explore spatial-spectral information by using 3D convolution instead of 2D convolution, which enables the network to better extract potential information.
In each unit, we employ spatial and spectral separable 3D convolution to extract spatial and spectral information, which not only reduces unaffordable memory usage and high computational cost, but also makes the network easier to train.
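A minimal sketch of a spatial/spectral separable 3D convolution block of the kind described above, where a (1, k, k) kernel mixes spatial neighbors and a (k, 1, 1) kernel mixes spectral bands; channel counts and kernel size are placeholders, not the SSRNet configuration.

```python
# Illustrative spatial/spectral separable 3D convolution: a (1, k, k) kernel
# mixes spatial neighbors and a (k, 1, 1) kernel mixes spectral bands, which is
# much cheaper than a dense (k, k, k) kernel. Placeholder sizes, not SSRNet's.
import torch
import torch.nn as nn

class SeparableConv3d(nn.Module):
    def __init__(self, channels=32, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.spectral = nn.Conv3d(channels, channels, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, bands, height, width) hyperspectral feature cube
        return self.act(self.spectral(self.act(self.spatial(x))))
```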
arXiv Detail & Related papers (2020-01-14T03:34:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.