Related papers: Efficient stereo matching on embedded GPUs with zero-means cross correlation

Efficient stereo matching on embedded GPUs with zero-means cross correlation

URL: http://arxiv.org/abs/2212.00476v1
Date: Thu, 1 Dec 2022 13:03:38 GMT
Title: Efficient stereo matching on embedded GPUs with zero-means cross correlation
Authors: Qiong Chang, Aolong Zha, Weimin Wang, Xin Liu, Masaki Onishi, Lei Lei, Meng Joo Er, Tsutomu Maruyama
Abstract summary: We propose a novel acceleration approach for the zero-means normalized cross correlation (ZNCC) matching cost calculation algorithm on a Jetson Tx2 embedded GPU. In our method for accelerating ZNCC, target images are scanned in a zigzag fashion to efficiently reuse one pixel's computation for its neighboring pixels. Our system show real-time processing speed of 32 fps, on a Jetson Tx2 GPU for 1,280x384 pixel images with a maximum disparity of 128.
Score: 8.446808526407738
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mobile stereo-matching systems have become an important part of many applications, such as automated-driving vehicles and autonomous robots. Accurate stereo-matching methods usually lead to high computational complexity; however, mobile platforms have only limited hardware resources to keep their power consumption low; this makes it difficult to maintain both an acceptable processing speed and accuracy on mobile platforms. To resolve this trade-off, we herein propose a novel acceleration approach for the well-known zero-means normalized cross correlation (ZNCC) matching cost calculation algorithm on a Jetson Tx2 embedded GPU. In our method for accelerating ZNCC, target images are scanned in a zigzag fashion to efficiently reuse one pixel's computation for its neighboring pixels; this reduces the amount of data transmission and increases the utilization of on-chip registers, thus increasing the processing speed. As a result, our method is 2X faster than the traditional image scanning method, and 26% faster than the latest NCC method. By combining this technique with the domain transformation (DT) algorithm, our system show real-time processing speed of 32 fps, on a Jetson Tx2 GPU for 1,280x384 pixel images with a maximum disparity of 128. Additionally, the evaluation results on the KITTI 2015 benchmark show that our combined system is more accurate than the same algorithm combined with census by 7.26%, while maintaining almost the same processing speed.

Related papers

Speedy MASt3R [68.47052557089631]
MASt3R redefines image matching as a 3D task by leveraging DUSt3R and introducing a fast reciprocal matching scheme. Fast MASt3R achieves a 54% reduction in inference time (198 ms to 91 ms per image pair) without sacrificing accuracy. This advancement enables real-time 3D understanding, benefiting applications like mixed reality navigation and large-scale 3D scene reconstruction.
arXiv Detail & Related papers (2025-03-13T03:56:22Z)
EDCSSM: Edge Detection with Convolutional State Space Model [3.649463841174485]
Edge detection in images is the foundation of many complex tasks in computer graphics. Due to the feature loss caused by multi-layer convolution and pooling architectures, learning-based edge detection models often produce thick edges. This paper presents an edge detection algorithm which effectively addresses the aforementioned issues.
arXiv Detail & Related papers (2024-09-03T05:13:25Z)
SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications [98.90623605283564]
We introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. We build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2.
arXiv Detail & Related papers (2023-03-27T17:59:58Z)
CoordFill: Efficient High-Resolution Image Inpainting via Parameterized Coordinate Querying [52.91778151771145]
In this paper, we try to break the limitations for the first time thanks to the recent development of continuous implicit representation. Experiments show that the proposed method achieves real-time performance on the 2048$times$2048 images using a single GTX 2080 Ti GPU.
arXiv Detail & Related papers (2023-03-15T11:13:51Z)
Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras. Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation. We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25$ times.
arXiv Detail & Related papers (2022-07-13T02:44:05Z)
UHD Image Deblurring via Multi-scale Cubic-Mixer [12.402054374952485]
transformer-based algorithms are making a splash in the domain of image deblurring. These algorithms depend on the self-attention mechanism with CNN stem to model long range dependencies between tokens.
arXiv Detail & Related papers (2022-06-08T05:04:43Z)
GPU-accelerated Faster Mean Shift with euclidean distance metrics [1.3507758562554621]
Mean-shift algorithm is widely used to solve clustering problems. In previous research, we proposed a novel GPU-accelerated Faster Mean-shift algorithm. In this study, we extend and improve the previous algorithm to handle Euclidean distance metrics.
arXiv Detail & Related papers (2021-12-27T20:18:24Z)
Parallel Discrete Convolutions on Adaptive Particle Representations of Images [2.362412515574206]
We present data structures and algorithms for native implementations of discrete convolution operators over Adaptive Particle Representations. The APR is a content-adaptive image representation that locally adapts the sampling resolution to the image signal. We show that APR convolution naturally leads to scale-adaptive algorithms that efficiently parallelize on multi-core CPU and GPU architectures.
arXiv Detail & Related papers (2021-12-07T09:40:05Z)
CNNs for JPEGs: A Study in Computational Cost [49.97673761305336]
Convolutional neural networks (CNNs) have achieved astonishing advances over the past decade. CNNs are capable of learning robust representations of the data directly from the RGB pixels. Deep learning methods capable of learning directly from the compressed domain have been gaining attention in recent years.
arXiv Detail & Related papers (2020-12-26T15:00:10Z)
Displacement-Invariant Cost Computation for Efficient Stereo Matching [122.94051630000934]
Deep learning methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy. But their inference time is typically slow, on the order of seconds for a pair of 540p images. We propose a emphdisplacement-invariant cost module to compute the matching costs without needing a 4D feature volume.
arXiv Detail & Related papers (2020-12-01T23:58:16Z)
Faster Mean-shift: GPU-accelerated clustering for cosine embedding-based cell segmentation and tracking [12.60841328582138]
We propose a novel Faster Mean-shift algorithm, which tackles the computational bottleneck of embedding based cell segmentation and tracking. The proposed Faster Mean-shift algorithm achieved 7-10 times speedup compared to the state-of-the-art embedding based cell instance segmentation and tracking algorithm. Our Faster Mean-shift algorithm also achieved the highest computational speed compared to other GPU benchmarks with optimized memory consumption.
arXiv Detail & Related papers (2020-07-28T14:52:51Z)
Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach. We propose an Efficient Video(EVS) pipeline that combines: (i) On the CPU, a very fast optical flow method, that is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next. On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.