EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature
Refinement and Regularized Image-Text Alignment
- URL: http://arxiv.org/abs/2312.08548v1
- Date: Wed, 13 Dec 2023 22:20:45 GMT
- Title: EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature
Refinement and Regularized Image-Text Alignment
- Authors: Mykola Lavreniuk, Shariq Farooq Bhat, Matthias Müller, Peter Wonka
- Abstract summary: This work builds on the previous work VPD, which paved the way for using the Stable Diffusion network in computer vision tasks.
First, we develop the Inverse Multi-Attentive Feature Refinement (IMAFR) module, which enhances feature learning capabilities.
Second, we propose a novel image-text alignment module for improved feature extraction of the Stable Diffusion backbone.
- Score: 40.328294121805456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents the network architecture EVP (Enhanced Visual Perception).
EVP builds on the previous work VPD, which paved the way for using the Stable
Diffusion network in computer vision tasks. We propose two major enhancements.
First, we develop the Inverse Multi-Attentive Feature Refinement (IMAFR) module
which enhances feature learning capabilities by aggregating spatial information
from higher pyramid levels. Second, we propose a novel image-text alignment
module for improved feature extraction of the Stable Diffusion backbone. The
resulting architecture is suitable for a wide variety of tasks and we
demonstrate its performance in the context of single-image depth estimation
with a specialized decoder using classification-based bins and referring
segmentation with an off-the-shelf decoder. Comprehensive experiments conducted
on established datasets show that EVP achieves state-of-the-art results in
single-image depth estimation for indoor (NYU Depth v2, 11.8% RMSE improvement
over VPD) and outdoor (KITTI) environments, as well as referring segmentation
(RefCOCO, 2.53 IoU improvement over ReLA). The code and pre-trained models are
publicly available at https://github.com/Lavreniuk/EVP.
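The abstract's depth decoder relies on classification-based bins. EVP's exact decoder is not described in this listing, so the following is only a minimal sketch of the general bin-based idea (predict adaptive bin widths per image and per-pixel probabilities over the resulting bin centers, then take the probability-weighted sum as the depth); the class name BinsDepthHead and all layer sizes and depth ranges are assumptions for illustration, not EVP's implementation.

```python
import torch
import torch.nn as nn

class BinsDepthHead(nn.Module):
    """Illustrative classification-based bins head (hypothetical, not EVP's decoder).

    Predicts (i) adaptive bin widths shared across the whole image and
    (ii) per-pixel probabilities over those bins; depth is then the
    probability-weighted sum of bin centers.
    """

    def __init__(self, in_channels: int, n_bins: int = 64,
                 min_depth: float = 1e-3, max_depth: float = 10.0):
        super().__init__()
        self.min_depth, self.max_depth = min_depth, max_depth
        # Image-level bin widths regressed from globally pooled features.
        self.bin_regressor = nn.Sequential(
            nn.Linear(in_channels, 256), nn.ReLU(inplace=True),
            nn.Linear(256, n_bins),
        )
        # Per-pixel logits over the bins.
        self.classifier = nn.Conv2d(in_channels, n_bins, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) refined decoder features.
        pooled = feats.mean(dim=(2, 3))                            # (B, C)
        widths = torch.softmax(self.bin_regressor(pooled), dim=1)  # normalized widths
        widths = widths * (self.max_depth - self.min_depth)
        edges = self.min_depth + torch.cumsum(widths, dim=1)
        edges = torch.cat(
            [torch.full_like(edges[:, :1], self.min_depth), edges], dim=1)
        centers = 0.5 * (edges[:, :-1] + edges[:, 1:])             # (B, n_bins)

        probs = torch.softmax(self.classifier(feats), dim=1)       # (B, n_bins, H, W)
        depth = (probs * centers[:, :, None, None]).sum(dim=1, keepdim=True)
        return depth                                               # (B, 1, H, W)
```

Used as, for example, depth = BinsDepthHead(in_channels=256)(decoder_features); in EVP such a head would sit on top of features refined by the IMAFR and image-text alignment modules described above.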
Related papers
- Virtually Enriched NYU Depth V2 Dataset for Monocular Depth Estimation: Do We Need Artificial Augmentation? [61.234412062595155]
We present ANYU, a new virtually augmented version of the NYU depth v2 dataset, designed for monocular depth estimation.
In contrast to the well-known approach where full 3D scenes of a virtual world are utilized to generate artificial datasets, ANYU was created by incorporating RGB-D representations of virtual reality objects.
We show that ANYU improves the monocular depth estimation performance and generalization of deep neural networks with considerably different architectures.
arXiv Detail & Related papers (2024-04-15T05:44:03Z)
- BEV$^2$PR: BEV-Enhanced Visual Place Recognition with Structural Cues [44.96177875644304]
We propose a new image-based visual place recognition (VPR) framework by exploiting the structural cues in bird's-eye view (BEV) from a single camera.
The BEV$^2$PR framework generates a composite descriptor with both visual cues and spatial awareness based on a single camera.
arXiv Detail & Related papers (2024-03-11T10:46:43Z)
- Mask-adaptive Gated Convolution and Bi-directional Progressive Fusion Network for Depth Completion [3.8558637038709622]
We propose a new model for depth completion based on an encoder-decoder structure.
Our model introduces two key components: the Mask-adaptive Gated Convolution architecture and the Bi-directional Progressive Fusion module (a generic gated-convolution sketch appears after this list).
We achieve strong performance in completing depth maps and outperform existing approaches in terms of accuracy and reliability.
arXiv Detail & Related papers (2024-01-15T02:58:06Z)
- Generating Aligned Pseudo-Supervision from Non-Aligned Data for Image Restoration in Under-Display Camera [84.41316720913785]
We revisit the classic stereo setup for training data collection -- capturing two images of the same scene with one UDC and one standard camera.
The key idea is to "copy" details from a high-quality reference image and "paste" them on the UDC image.
A novel Transformer-based framework generates well-aligned yet high-quality target data for the corresponding UDC input.
arXiv Detail & Related papers (2023-04-12T17:56:42Z)
- Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth [24.897377434844266]
We propose a novel structure and training strategy for monocular depth estimation.
We deploy a hierarchical transformer encoder to capture and convey the global context, and design a lightweight yet powerful decoder.
Our network achieves state-of-the-art performance on the challenging NYU Depth V2 depth dataset.
arXiv Detail & Related papers (2022-01-19T06:37:21Z)
- Deep Direct Volume Rendering: Learning Visual Feature Mappings From Exemplary Images [57.253447453301796]
We introduce Deep Direct Volume Rendering (DeepDVR), a generalization of Direct Volume Rendering (DVR) that allows for the integration of deep neural networks into the DVR algorithm.
We conceptualize the rendering in a latent color space, thus enabling the use of deep architectures to learn implicit mappings for feature extraction and classification.
Our generalization serves to derive novel volume rendering architectures that can be trained end-to-end directly from examples in image space.
arXiv Detail & Related papers (2021-06-09T23:03:00Z)
- Early Bird: Loop Closures from Opposing Viewpoints for Perceptually-Aliased Indoor Environments [35.663671249819124]
We present novel research that simultaneously addresses viewpoint change and perceptual aliasing.
We show that our integration of VPR with SLAM significantly boosts the performance of VPR, feature correspondence, and pose graph submodules.
For the first time, we demonstrate a localization system capable of state-of-the-art performance despite perceptual aliasing and extreme 180-degree-rotated viewpoint change.
arXiv Detail & Related papers (2020-10-03T20:18:55Z)
- Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt graph propagation to capture the observed spatial contexts.
We then apply an attention mechanism to the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z)
- Two-shot Spatially-varying BRDF and Shape Estimation [89.29020624201708]
We propose a novel deep learning architecture with a stage-wise estimation of shape and SVBRDF.
We create a large-scale synthetic training dataset with domain-randomized geometry and realistic materials.
Experiments on both synthetic and real-world datasets show that our network trained on a synthetic dataset can generalize well to real-world images.
arXiv Detail & Related papers (2020-04-01T12:56:13Z)
- Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network [35.932447204088845]
We develop an effective feature alignment network tailored to video saliency prediction.
The network learns to align the features of the neighboring frames to the reference one in a coarse-to-fine manner.
The proposed model is trained end-to-end without any post processing.
arXiv Detail & Related papers (2020-01-02T02:05:35Z)
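The Mask-adaptive Gated Convolution entry above gives no implementation details; as a rough, hedged illustration of the gating idea it builds on, a generic gated convolution that modulates features with a learned, mask-conditioned gate could look like the following. The class name GatedConv2d and the way the validity mask is injected are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Generic gated convolution sketch (illustrative; not the paper's exact
    Mask-adaptive Gated Convolution). A second convolution predicts a soft gate
    that can suppress features at unreliable (e.g. unobserved-depth) locations."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        self.feature = nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding)
        # The gate also sees the validity mask so it can adapt to missing depth.
        self.gate = nn.Conv2d(in_channels + 1, out_channels, kernel_size, padding=padding)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) features; mask: (B, 1, H, W), 1 where input depth is observed.
        feat = torch.relu(self.feature(x))
        gate = torch.sigmoid(self.gate(torch.cat([x, mask], dim=1)))
        return feat * gate  # gated features
```

Gating of this kind lets a depth-completion network down-weight activations at pixels where the sparse input depth is missing, which is the usual motivation for mask-aware convolutions.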