ViFiT: Reconstructing Vision Trajectories from IMU and Wi-Fi Fine Time
Measurements
- URL: http://arxiv.org/abs/2310.03140v1
- Date: Wed, 4 Oct 2023 20:05:40 GMT
- Title: ViFiT: Reconstructing Vision Trajectories from IMU and Wi-Fi Fine Time
Measurements
- Authors: Bryan Bo Cao, Abrar Alali, Hansi Liu, Nicholas Meegan, Marco Gruteser,
Kristin Dana, Ashwin Ashok, Shubham Jain
- Abstract summary: We propose ViFiT, a transformer-based model that reconstructs vision bounding box trajectories from phone data (IMU and Fine Time Measurements).
ViFiT achieves an MRFR of 0.65, outperforming the state-of-the-art LSTM Encoder-Decoder approach for cross-modal reconstruction.
- Score: 6.632056181867312
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Tracking subjects in videos is one of the most widely used functions in
camera-based IoT applications such as security surveillance, smart city traffic
safety enhancement, and vehicle-to-pedestrian communication. In the computer vision
domain, tracking is usually achieved by first detecting subjects with bounding boxes
and then associating the detected bounding boxes across video frames. In many IoT
systems, images captured by cameras are sent over the network to be processed at a
different site with more powerful computing resources than the edge devices. However,
sending entire frames over the network causes significant bandwidth consumption that
may exceed the system's bandwidth constraints. To tackle this problem, we propose
ViFiT, a transformer-based model that reconstructs vision bounding box trajectories
from phone data (IMU and Fine Time Measurements). It leverages the transformer's
ability to better model long-term time series data. ViFiT is evaluated on the Vi-Fi
Dataset, a large-scale multimodal dataset spanning 5 diverse real-world scenes,
including indoor and outdoor environments. To fill the gap in metrics that jointly
capture both tracking quality and video bandwidth reduction, we propose a novel
evaluation framework dubbed Minimum Required Frames (MRF) and Minimum Required Frames
Ratio (MRFR). ViFiT achieves an MRFR of 0.65, outperforming the state-of-the-art
cross-modal reconstruction approach X-Translator, an LSTM Encoder-Decoder
architecture, which achieves 0.98, and yields a high frame reduction rate of 97.76%.
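The abstract introduces Minimum Required Frames (MRF) and Minimum Required Frames Ratio (MRFR) only at a high level. The sketch below is a minimal illustration of how such a metric could be computed, not the paper's exact procedure: it assumes that real frames are kept as a prefix of each tracking window, that reconstruction quality is judged by mean IoU against ground-truth boxes, and that a hypothetical `reconstruct` callable stands in for ViFiT's IMU/FTM-driven decoder.

```python
# Hedged sketch of the MRF / MRFR evaluation idea described in the abstract.
# ASSUMPTIONS (not from the paper): real frames are kept as a prefix of the
# window, and a window "passes" when the mean IoU between reconstructed and
# ground-truth boxes reaches a threshold. Boxes are (x1, y1, x2, y2) tuples.
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def minimum_required_frames(
    gt_boxes: Sequence[Box],
    reconstruct: Callable[[Sequence[Box], int], List[Box]],
    iou_threshold: float = 0.5,
) -> int:
    """Smallest number k of real (transmitted) frames such that boxes
    reconstructed for the remaining frames keep mean IoU >= threshold.
    `reconstruct(observed_prefix, window_len)` is a hypothetical stand-in
    for ViFiT's IMU/FTM-based decoder."""
    n = len(gt_boxes)
    for k in range(1, n + 1):
        pred = reconstruct(gt_boxes[:k], n)
        mean_iou = sum(iou(p, g) for p, g in zip(pred, gt_boxes)) / n
        if mean_iou >= iou_threshold:
            return k
    return n

def mrfr(mrf: int, window_len: int) -> float:
    """Minimum Required Frames Ratio for one window (lower is better)."""
    return mrf / window_len
```

Under this reading, a lower MRFR means fewer camera frames must be transmitted per window, which is how the metric ties tracking quality to video bandwidth reduction.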
Related papers
- Unveiling the Limits of Alignment: Multi-modal Dynamic Local Fusion Network and A Benchmark for Unaligned RGBT Video Object Detection [5.068440399797739]
Current RGB-Thermal Video Object Detection (RGBT VOD) methods depend on manually aligning data at the image level.
We propose a Multi-modal Dynamic Local fusion Network (MDLNet) designed to handle unaligned RGBT pairs.
We conduct a comprehensive evaluation comparing MDLNet with state-of-the-art (SOTA) models, demonstrating the superior effectiveness of MDLNet.
arXiv Detail & Related papers (2024-10-16T01:06:12Z)
- VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization [115.64739269488965]
VimTS enhances the generalization ability of the model by achieving better synergy among different tasks.
We propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm.
For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2.
arXiv Detail & Related papers (2024-04-30T15:49:03Z)
- Distributed Radiance Fields for Edge Video Compression and Metaverse Integration in Autonomous Driving [13.536641570721798]
The metaverse is a virtual space that combines physical and digital elements, creating immersive and connected digital worlds.
Digital twins (DTs) offer virtual prototyping, prediction, and more.
DTs can be created with 3D scene reconstruction methods that capture the real world's geometry, appearance, and dynamics.
arXiv Detail & Related papers (2024-02-22T15:39:58Z)
- Federated Multi-View Synthesizing for Metaverse [52.59476179535153]
The metaverse is expected to provide immersive entertainment, education, and business applications.
Virtual reality (VR) transmission over wireless networks is data- and computation-intensive.
We have developed a novel multi-view synthesizing framework that can efficiently provide synthesizing, storage, and communication resources for wireless content delivery in the metaverse.
arXiv Detail & Related papers (2023-12-18T13:51:56Z)
- No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection [52.03562682785128]
Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video.
A significant challenge in TVG is the low Semantic Noise Ratio (SNR): lower SNR results in worse performance.
We propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection.
arXiv Detail & Related papers (2023-07-20T04:12:10Z)
- Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z)
- IDO-VFI: Identifying Dynamics via Optical Flow Guidance for Video Frame Interpolation with Events [14.098949778274733]
Event cameras are ideal for capturing inter-frame dynamics with their extremely high temporal resolution.
We propose an event-and-frame-based video frame interpolation method named IDO-VFI that assigns varying amounts of computation to different sub-regions.
Our proposed method maintains high-quality performance while reducing computation time and computational effort by 10% and 17% respectively on the Vimeo90K dataset.
arXiv Detail & Related papers (2023-05-17T13:22:21Z)
- ViFiCon: Vision and Wireless Association Via Self-Supervised Contrastive Learning [5.5232283752707785]
ViFiCon is a self-supervised contrastive learning scheme which uses synchronized information across vision and wireless modalities to perform cross-modal association.
We show that ViFiCon achieves high-performance vision-to-wireless association, finding which bounding box corresponds to which smartphone device (a minimal sketch of this kind of cross-modal contrastive objective appears after this list).
arXiv Detail & Related papers (2022-10-11T15:04:05Z)
- Video Frame Interpolation with Transformer [55.12620857638253]
We introduce a novel framework, which takes advantage of Transformer to model long-range pixel correlation among video frames.
Our network is equipped with a novel cross-scale window-based attention mechanism, where cross-scale windows interact with each other.
arXiv Detail & Related papers (2022-05-15T09:30:28Z)
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Transformer for Video Super-Resolution (TTVSR)
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
- FILM: Frame Interpolation for Large Motion [20.04001872133824]
We present a frame interpolation algorithm that synthesizes multiple intermediate frames from two input images with large in-between motion.
Our approach outperforms state-of-the-art methods on the Xiph large motion benchmark.
arXiv Detail & Related papers (2022-02-10T08:48:18Z)
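ViFiCon, listed above, is described only as a self-supervised contrastive scheme that associates synchronized vision and wireless data. The following is a minimal sketch of a generic cross-modal InfoNCE-style objective under that reading; the embedding shapes, temperature, and the choice of time-synchronized pairs as positives are illustrative assumptions, not ViFiCon's actual design.

```python
# Hedged sketch of a cross-modal contrastive (InfoNCE-style) objective, in the
# spirit of what the ViFiCon summary describes. The pairing of time-synchronized
# (vision, wireless) windows as positives is an assumption for illustration.
import numpy as np

def info_nce_loss(vision_emb: np.ndarray, wireless_emb: np.ndarray,
                  temperature: float = 0.07) -> float:
    """vision_emb, wireless_emb: (N, D) embeddings of N time-synchronized
    windows; row i of each modality is treated as a positive pair and all
    other rows as negatives."""
    # L2-normalize so dot products are cosine similarities.
    v = vision_emb / np.linalg.norm(vision_emb, axis=1, keepdims=True)
    w = wireless_emb / np.linalg.norm(wireless_emb, axis=1, keepdims=True)
    logits = v @ w.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Softmax cross-entropy with the diagonal (matched pairs) as targets.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# At association time, a bounding-box track would be matched to the smartphone
# whose wireless embedding is most similar to the track's vision embedding.
```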