TransiT: Transient Transformer for Non-line-of-sight Videography
- URL: http://arxiv.org/abs/2503.11328v1
- Date: Fri, 14 Mar 2025 11:56:37 GMT
- Title: TransiT: Transient Transformer for Non-line-of-sight Videography
- Authors: Ruiqian Li, Siyuan Shen, Suan Xia, Ziheng Wang, Xingyue Peng, Chengxuan Song, Yingsheng Zhu, Tao Wu, Shiying Li, Jingyi Yu
- Abstract summary: We propose a new Transient Transformer architecture called TransiT to achieve real-time NLOS recovery under fast scans. TransiT directly compresses the temporal dimension of input transients to extract features, reducing computation costs and meeting high frame rate requirements. In real experiments, TransiT reconstructs $64 \times 64$ NLOS videos at 10 frames per second from sparse $16 \times 16$ transients measured at an exposure time of 0.4 ms per point.
- Score: 28.571430723113117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality, high-speed videography using Non-Line-of-Sight (NLOS) imaging benefits autonomous navigation, collision prevention, and post-disaster search and rescue tasks. Current solutions have to trade off frame rate against image quality: high frame rates, for example, can be achieved by reducing either per-point scanning time or scanning density, but at the cost of lowering the information density of individual frames. A fast scanning process further reduces the signal-to-noise ratio, and different scanning systems exhibit different distortion characteristics. In this work, we design and employ a new Transient Transformer architecture called TransiT to achieve real-time NLOS recovery under fast scans. TransiT directly compresses the temporal dimension of input transients to extract features, reducing computation costs and meeting high frame rate requirements. It further adopts a feature fusion mechanism and employs a spatial-temporal Transformer to help capture features of NLOS transient videos. Moreover, TransiT applies transfer learning to bridge the gap between synthetic and real-measured data. In real experiments, TransiT reconstructs $64 \times 64$ NLOS videos at 10 frames per second from sparse $16 \times 16$ transients measured at an exposure time of 0.4 ms per point. We will make our code and dataset available to the community.
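As a rough illustration of the pipeline described above (the authors' code is not yet released), the sketch below shows how a temporal-compression front end followed by spatial-then-temporal attention might look. All module names, layer sizes, and tensor shapes here are assumptions for illustration, not the actual TransiT implementation.

```python
import torch
import torch.nn as nn

class TemporalCompressor(nn.Module):
    """Compress the time-of-flight axis of each scanned point into one feature vector."""
    def __init__(self, feat=64):
        super().__init__()
        # Strided 1D convolutions shrink the temporal axis before any attention runs.
        self.net = nn.Sequential(
            nn.Conv1d(1, feat, kernel_size=7, stride=4, padding=3), nn.GELU(),
            nn.Conv1d(feat, feat, kernel_size=7, stride=4, padding=3), nn.GELU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, x):                       # x: (B, F, H, W, T) transient video
        B, F, H, W, T = x.shape
        h = self.net(x.reshape(B * F * H * W, 1, T))
        return h.reshape(B, F, H * W, -1)       # (B, F, N, C) per-point tokens

class SpatioTemporalBlock(nn.Module):
    """Attention over scan points within a frame, then over frames for each point."""
    def __init__(self, feat=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(feat, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(feat, heads, batch_first=True)

    def forward(self, tok):                     # tok: (B, F, N, C)
        B, F, N, C = tok.shape
        s = tok.reshape(B * F, N, C)
        s = s + self.spatial(s, s, s)[0]                      # spatial attention
        t = s.reshape(B, F, N, C).transpose(1, 2).reshape(B * N, F, C)
        t = t + self.temporal(t, t, t)[0]                     # temporal attention
        return t.reshape(B, N, F, C).transpose(1, 2)          # back to (B, F, N, C)

# A 16x16 scan with 512 time bins over 4 frames, as in the paper's setup.
tokens = SpatioTemporalBlock()(TemporalCompressor()(torch.rand(1, 4, 16, 16, 512)))
```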
Related papers
- DDT: Decoupled Diffusion Transformer [51.84206763079382]
Diffusion transformers encode noisy inputs to extract the semantic component and decode the higher-frequency detail with identical modules.
We propose the Decoupled Diffusion Transformer (DDT), which decouples semantic encoding from high-frequency decoding.
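A minimal sketch of what such a decoupling could look like, assuming an encoder that distills a semantic summary and a decoder of the same block type that predicts the denoising target; all names and sizes are hypothetical, not the DDT authors' design:

```python
import torch
import torch.nn as nn

class DDTSketch(nn.Module):
    """Toy decoupling: an encoder distills a low-frequency semantic summary from
    the noisy tokens; a decoder built from the same block type reconstructs the
    high-frequency content conditioned on that summary."""
    def __init__(self, dim=64, depth=2):
        super().__init__()
        block = lambda: nn.TransformerEncoderLayer(dim, 4, 4 * dim, batch_first=True)
        self.encoder = nn.ModuleList(block() for _ in range(depth))
        self.decoder = nn.ModuleList(block() for _ in range(depth))
        self.head = nn.Linear(dim, dim)

    def forward(self, x_t):                  # x_t: (B, N, dim) noisy patch tokens
        z = x_t
        for blk in self.encoder:
            z = blk(z)
        sem = z.mean(dim=1, keepdim=True)    # semantic summary token
        h = x_t + sem                        # decoder sees noisy tokens + semantics
        for blk in self.decoder:
            h = blk(h)
        return self.head(h)                  # e.g. predicted denoising target

out = DDTSketch()(torch.rand(2, 16, 64))
```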
arXiv Detail & Related papers (2025-04-08T07:17:45Z) - Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring [14.839956958725883]
We propose BSSTNet, a Blur-aware Spatio-temporal Sparse Transformer Network.
The proposed BSSTNet outperforms the state-of-the-art methods on the GoPro and DVD datasets.
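A toy sketch of blur-aware sparse attention, under the assumption that a per-token blur score selects which spatio-temporal tokens serve as attention keys; the scoring and top-k scheme here are illustrative guesses, not BSSTNet's actual mechanism:

```python
import torch

def blur_aware_sparse_attention(tokens, blur_score, k=128):
    """Every query attends only to the k tokens with the highest blur score,
    shrinking the spatio-temporal key set that attention must cover.
    tokens: (B, N, C) flattened video tokens; blur_score: (B, N)."""
    B, N, C = tokens.shape
    idx = blur_score.topk(k, dim=1).indices                   # (B, k) kept tokens
    keys = tokens.gather(1, idx.unsqueeze(-1).expand(B, k, C))
    attn = (tokens @ keys.transpose(1, 2) / C ** 0.5).softmax(dim=-1)
    return attn @ keys                                        # (B, N, C)

out = blur_aware_sparse_attention(torch.rand(2, 1024, 64), torch.rand(2, 1024))
```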
arXiv Detail & Related papers (2024-06-11T17:59:56Z) - Streaming quanta sensors for online, high-performance imaging and vision [34.098174669870126]
Quanta image sensors (QIS) have demonstrated remarkable imaging capabilities in many challenging scenarios.
Despite their potential, the adoption of these sensors is severely hampered by (a) high data rates and (b) the need for new computational pipelines to handle the unconventional raw data.
We introduce a simple, low-bandwidth computational pipeline to address these challenges.
Our approach results in significant data bandwidth reductions (around 100X) and enables real-time image reconstruction and computer vision.
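One simple way such a low-bandwidth front end could work, shown as a hedged sketch: binning windows of 1-bit quanta frames into multi-bit images near the sensor, which alone accounts for a large bandwidth reduction. The window size and NumPy interface below are assumptions, not the paper's pipeline:

```python
import numpy as np

def stream_binary_frames(frames, window=256):
    """Sum each window of 1-bit quanta frames into one multi-bit image on the
    fly, so 256 binary frames leave the sensor as a single image (a roughly
    100-300X bandwidth reduction, depending on the window size)."""
    acc, out = None, []
    for i, f in enumerate(frames):               # f: (H, W) array of 0/1 photons
        acc = f.astype(np.uint16) if acc is None else acc + f
        if (i + 1) % window == 0:
            out.append(acc)                      # emit the binned frame
            acc = None
    return out

binary = (np.random.rand(512, 64, 64) < 0.05).astype(np.uint8)
binned = stream_binary_frames(binary)            # 512 binary -> 2 binned frames
```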
arXiv Detail & Related papers (2024-06-02T20:30:49Z) - LIPT: Latency-aware Image Processing Transformer [17.802838753201385]
We present a latency-aware image processing transformer, termed LIPT.
We devise a low-latency LIPT block that substitutes memory-intensive operators with a combination of self-attention and convolutions to achieve a practical speedup.
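A hypothetical rendering of such a block, pairing a depthwise convolution branch with plain self-attention; the specific operator mix and dimensions are assumptions rather than LIPT's published design:

```python
import torch
import torch.nn as nn

class LIPTStyleBlock(nn.Module):
    """A cheap depthwise-convolution branch runs alongside plain self-attention,
    standing in for the memory-bound operators the paper replaces."""
    def __init__(self, dim=48, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise 3x3
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, C, H, W) feature map
        B, C, H, W = x.shape
        local = self.conv(x)                    # local detail, cache-friendly
        seq = self.norm(x.flatten(2).transpose(1, 2))          # (B, H*W, C)
        glob = self.attn(seq, seq, seq)[0]                     # global context
        return local + glob.transpose(1, 2).reshape(B, C, H, W)

out = LIPTStyleBlock()(torch.rand(1, 48, 32, 32))
```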
arXiv Detail & Related papers (2024-04-09T07:25:30Z) - Progressive Learning with Visual Prompt Tuning for Variable-Rate Image Compression [60.689646881479064]
We propose a progressive learning paradigm for transformer-based variable-rate image compression.
Inspired by visual prompt tuning, we use an LPM to extract prompts from input images at the encoder side and from hidden features at the decoder side.
Our model outperforms all current variable-rate image compression methods in terms of rate-distortion performance and approaches the state-of-the-art fixed-rate image compression methods trained from scratch.
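A minimal sketch of rate-conditioned prompting in this spirit, assuming the LPM amounts to learnable prompt tokens selected by a target-rate index and prepended to the token sequence; the module name and interface are hypothetical:

```python
import torch
import torch.nn as nn

class RatePromptModule(nn.Module):
    """Toy learnable prompt module: each target rate index selects a small set
    of prompt tokens that are prepended to the image tokens, so one codec
    backbone can serve multiple rates."""
    def __init__(self, num_rates=6, n_prompts=4, dim=64):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_rates, n_prompts, dim) * 0.02)

    def forward(self, tokens, rate_idx):         # tokens: (B, N, dim)
        p = self.prompts[rate_idx].expand(tokens.size(0), -1, -1)
        return torch.cat([p, tokens], dim=1)     # (B, n_prompts + N, dim)

lpm = RatePromptModule()
x = lpm(torch.rand(2, 196, 64), rate_idx=3)      # condition on the 4th rate point
```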
arXiv Detail & Related papers (2023-11-23T08:29:32Z) - CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z) - Spatiotemporal Attention-based Semantic Compression for Real-time Video Recognition [117.98023585449808]
We propose a spatiotemporal attention-based autoencoder (STAE) architecture to evaluate the importance of frames and of pixels within each frame.
We develop a lightweight decoder that leverages a combination of 3D and 2D CNNs to reconstruct missing information.
Experimental results show that ViT_STAE can compress the video dataset H51 by 104x with only 5% accuracy loss.
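A toy sketch of this encode-drop-reconstruct idea, with a simplified pooled score standing in for the paper's attention-based importance and a small 3D-then-2D CNN decoder; every name and size here is an assumption:

```python
import torch
import torch.nn as nn

class STAESketch(nn.Module):
    """Toy pipeline: score the frames, zero out the low-importance ones before
    transmission, and let a light 3D-then-2D CNN decoder fill information back in."""
    def __init__(self, ch=3, keep=0.5):
        super().__init__()
        self.score = nn.Sequential(nn.AdaptiveAvgPool3d((None, 1, 1)), nn.Flatten(2))
        self.dec3d = nn.Conv3d(ch, 16, 3, padding=1)
        self.dec2d = nn.Conv2d(16, ch, 3, padding=1)
        self.keep = keep

    def forward(self, video):                    # video: (B, C, T, H, W)
        B, C, T, H, W = video.shape
        w = self.score(video).mean(1).softmax(-1)            # (B, T) frame importance
        k = max(1, int(self.keep * T))
        mask = torch.zeros_like(w).scatter_(1, w.topk(k, 1).indices, 1.0)
        sparse = video * mask.view(B, 1, T, 1, 1)            # drop unimportant frames
        h = torch.relu(self.dec3d(sparse))                   # 3D conv over time
        return self.dec2d(h.mean(2))                         # 2D conv on fused frames

out = STAESketch()(torch.rand(1, 3, 8, 32, 32))
```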
arXiv Detail & Related papers (2023-05-22T07:47:27Z) - ITSRN++: Stronger and Better Implicit Transformer Network for Continuous Screen Content Image Super-Resolution [32.441761727608856]
The proposed method achieves state-of-the-art performance for SCI SR (outperforming SwinIR by 0.74 dB for x3 SR) and also works well for natural image SR.
We construct a large-scale SCI2K dataset to facilitate research on SCI SR.
arXiv Detail & Related papers (2022-10-17T07:47:34Z) - Rolling Shutter Inversion: Bring Rolling Shutter Images to High Framerate Global Shutter Video [111.08121952640766]
This paper presents a novel deep-learning based solution to the RS temporal super-resolution problem.
By leveraging the multi-view geometry relationship of the RS imaging process, our framework successfully achieves high framerate GS generation.
Our method can produce high-quality GS image sequences with rich details, outperforming the state-of-the-art methods.
arXiv Detail & Related papers (2022-10-06T16:47:12Z) - TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video.
TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy.
The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
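A hedged sketch of a spatial-temporal graph layer in this spirit: masked attention restricts message passing to objects connected in a per-frame graph, then attention runs along time for each object slot. The layer is illustrative, not TransMOT's implementation:

```python
import torch
import torch.nn as nn

class SpatialTemporalGraphLayer(nn.Module):
    """Objects in the same frame exchange messages through masked (graph)
    attention; each object slot then attends across the T frames."""
    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, adj):           # feats: (T, M, dim), adj: (M, M) graph
        mask = adj == 0                      # True entries are blocked from attending
        s = feats + self.spatial(feats, feats, feats, attn_mask=mask)[0]
        t = s.transpose(0, 1)                # (M, T, dim): one sequence per object
        t = t + self.temporal(t, t, t)[0]
        return t.transpose(0, 1)             # back to (T, M, dim)

layer = SpatialTemporalGraphLayer()
out = layer(torch.rand(8, 5, 32), torch.eye(5))  # 8 frames, 5 tracked objects
```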
arXiv Detail & Related papers (2021-04-01T01:49:05Z) - Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition [8.046120977786702]
Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR).
The original Transformer, with encoder-decoder architecture, is only suitable for offline ASR.
We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on LibriSpeech dataset (3.6% WER on test-clean) without external language models.
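A minimal sketch of the conv-then-causal-attention idea behind such a streamable, low-frame-rate encoder; the 4x subsampling, dimensions, and interface are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ConvTransformerEncoder(nn.Module):
    """Strided convolutions cut the frame rate (here 4x) before causally
    masked self-attention, so attention only sees past frames and the
    encoder can run as audio arrives."""
    def __init__(self, feat=80, dim=144, heads=4):
        super().__init__()
        self.sub = nn.Sequential(
            nn.Conv1d(feat, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                    # x: (B, T, feat) filterbank frames
        h = self.sub(x.transpose(1, 2)).transpose(1, 2)      # (B, T/4, dim)
        L = h.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        return h + self.attn(h, h, h, attn_mask=causal)[0]

enc = ConvTransformerEncoder()
out = enc(torch.rand(2, 160, 80))            # 160 frames in, 40 frames out
```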
arXiv Detail & Related papers (2020-08-13T08:20:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.