ViFiT: Reconstructing Vision Trajectories from IMU and Wi-Fi Fine Time
Measurements
- URL: http://arxiv.org/abs/2310.03140v1
- Date: Wed, 4 Oct 2023 20:05:40 GMT
- Title: ViFiT: Reconstructing Vision Trajectories from IMU and Wi-Fi Fine Time
Measurements
- Authors: Bryan Bo Cao, Abrar Alali, Hansi Liu, Nicholas Meegan, Marco Gruteser,
Kristin Dana, Ashwin Ashok, Shubham Jain
- Abstract summary: We propose ViFiT, a transformer-based model that reconstructs vision bounding box trajectories from phone data (IMU and Fine Time Measurements).
ViFiT achieves an MRFR of 0.65, outperforming X-Translator, the state-of-the-art LSTM Encoder-Decoder approach for cross-modal reconstruction, which scores 0.98.
- Score: 6.632056181867312
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Tracking subjects in videos is one of the most widely used functions in
camera-based IoT applications such as security surveillance, smart city traffic
safety enhancement, and vehicle-to-pedestrian communication. In the
computer vision domain, tracking is usually achieved by first detecting
subjects with bounding boxes, then associating detected bounding boxes across
video frames. For many IoT systems, images captured by cameras are usually sent
over the network to be processed at a different site that has more powerful
computing resources than edge devices. However, sending entire frames through
the network causes significant bandwidth consumption that may exceed the system
bandwidth constraints. To tackle this problem, we propose ViFiT, a
transformer-based model that reconstructs vision bounding box trajectories from
phone data (IMU and Fine Time Measurements). It leverages the transformer's
ability to better model long-term time series data. ViFiT is evaluated on the
Vi-Fi Dataset, a large-scale multimodal dataset covering five diverse real-world
scenes, including indoor and outdoor environments. To fill the gap in metrics
that jointly capture the system characteristics of both tracking quality and
video bandwidth reduction, we propose a novel evaluation framework comprising
Minimum Required Frames (MRF) and Minimum Required Frames Ratio (MRFR). ViFiT
achieves an MRFR of 0.65, outperforming X-Translator, the state-of-the-art
LSTM Encoder-Decoder approach for cross-modal reconstruction, which scores
0.98, and yielding a high frame reduction rate of 97.76%.
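The core idea in the abstract, reconstructing a bounding box trajectory from synchronized phone measurements with a transformer, can be illustrated with a minimal sketch. This is not the authors' released ViFiT code: the feature sizes, window length, and the use of the first frame's box as an anchor broadcast to every step are illustrative assumptions.

```python
# Minimal sketch (not the authors' ViFiT implementation) of a transformer that
# maps a window of phone measurements (IMU + Wi-Fi FTM) to a window of boxes.
# Dimensions and the anchor-box trick are assumptions for illustration only.
import torch
import torch.nn as nn

class CrossModalBoxReconstructor(nn.Module):
    def __init__(self, imu_dim=9, ftm_dim=2, box_dim=4, d_model=128,
                 nhead=4, num_layers=4, window=30):
        super().__init__()
        # Project concatenated IMU + FTM samples (plus the anchor box,
        # broadcast to every time step) into the transformer width.
        self.in_proj = nn.Linear(imu_dim + ftm_dim + box_dim, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, window, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.out_proj = nn.Linear(d_model, box_dim)  # (x, y, w, h) per frame

    def forward(self, imu, ftm, first_box):
        # imu: (B, T, imu_dim), ftm: (B, T, ftm_dim), first_box: (B, box_dim)
        T = imu.size(1)
        anchor = first_box.unsqueeze(1).expand(-1, T, -1)
        x = self.in_proj(torch.cat([imu, ftm, anchor], dim=-1))
        x = x + self.pos_emb[:, :T]
        h = self.encoder(x)
        return self.out_proj(h)  # (B, T, 4): reconstructed box trajectory

# Example: reconstruct 30 frames of boxes from a 30-sample phone window.
model = CrossModalBoxReconstructor()
boxes = model(torch.randn(2, 30, 9), torch.randn(2, 30, 2), torch.randn(2, 4))
print(boxes.shape)  # torch.Size([2, 30, 4])
```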
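The MRF/MRFR evaluation framework can likewise be sketched in a few lines. This is a hedged approximation: the quality measure, threshold, window length, frame-selection order, and the exact relationship between MRFR and the reported frame reduction rate are assumptions here, not the paper's definitions.

```python
# Hedged sketch of a Minimum-Required-Frames style evaluation: per window, find
# the smallest number of real camera frames the reconstructor needs before the
# reconstructed track clears a quality threshold (e.g. mean IoU). The paper's
# exact MRF/MRFR protocol may differ from this illustration.
from typing import Callable, Sequence

def minimum_required_frames(window_len: int,
                            quality_with_k_frames: Callable[[int], float],
                            threshold: float = 0.5) -> int:
    """Smallest k in 1..window_len whose reconstruction quality clears the threshold."""
    for k in range(1, window_len + 1):
        if quality_with_k_frames(k) >= threshold:
            return k
    return window_len  # fall back to sending every frame in the window

def mrfr(mrf_per_window: Sequence[int], window_len: int) -> float:
    """Minimum Required Frames Ratio: average fraction of frames that must be sent."""
    return sum(mrf_per_window) / (len(mrf_per_window) * window_len)

# Example: a toy quality curve that clears a 0.5 threshold at k = 3 frames.
toy_quality = lambda k: 0.2 * k
print(minimum_required_frames(30, toy_quality, threshold=0.5))  # 3
print(mrfr([1, 2, 30], window_len=30))  # ~0.367, the average fraction of frames needed
```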
Related papers
- Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields [39.214857326425204]
Video Frame Interpolation (VFI) aims to generate intermediate video frames between consecutive input frames.
We propose a novel event-based VFI framework with cross-modal asymmetric bidirectional motion field estimation.
Our method shows significant performance improvement over the state-of-the-art VFI methods on various datasets.
arXiv Detail & Related papers (2025-02-19T13:40:43Z) - Track-On: Transformer-based Online Point Tracking with Memory [34.744546679670734]
We introduce Track-On, a simple transformer-based model designed for online long-term point tracking.
Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames.
At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy.
arXiv Detail & Related papers (2025-01-30T17:04:11Z) - VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization [115.64739269488965]
VimTS enhances the generalization ability of the model by achieving better synergy among different tasks.
We propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm.
For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2.
arXiv Detail & Related papers (2024-04-30T15:49:03Z) - Distributed Radiance Fields for Edge Video Compression and Metaverse
Integration in Autonomous Driving [13.536641570721798]
The metaverse is a virtual space that combines physical and digital elements, creating immersive and connected digital worlds.
Digital twins (DTs) offer virtual prototyping, prediction, and more.
DTs can be created with 3D scene reconstruction methods that capture the real world's geometry, appearance, and dynamics.
arXiv Detail & Related papers (2024-02-22T15:39:58Z) - No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention
and Zoom-in Boundary Detection [52.03562682785128]
Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video.
A significant challenge in TVG is the low "Semantic Noise Ratio (SNR)" of untrimmed videos: the lower the SNR, the worse the performance.
We propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection.
arXiv Detail & Related papers (2023-07-20T04:12:10Z) - Referred by Multi-Modality: A Unified Temporal Transformer for Video
Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z) - A Unified Framework for Event-based Frame Interpolation with Ad-hoc Deblurring in the Wild [72.0226493284814]
We propose a unified framework for event-based frame interpolation that performs ad-hoc deblurring.
Our network consistently outperforms previous state-of-the-art methods on frame interpolation, single-image deblurring, and the joint task of both.
arXiv Detail & Related papers (2023-01-12T18:19:00Z) - ViFiCon: Vision and Wireless Association Via Self-Supervised Contrastive
Learning [5.5232283752707785]
ViFiCon is a self-supervised contrastive learning scheme which uses synchronized information across vision and wireless modalities to perform cross-modal association.
We show that ViFiCon achieves high-performance vision-to-wireless association, finding which bounding box corresponds to which smartphone device.
arXiv Detail & Related papers (2022-10-11T15:04:05Z) - Video Frame Interpolation with Transformer [55.12620857638253]
We introduce a novel framework, which takes advantage of Transformer to model long-range pixel correlation among video frames.
Our network is equipped with a novel cross-scale window-based attention mechanism, where cross-scale windows interact with each other.
arXiv Detail & Related papers (2022-05-15T09:30:28Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - FILM: Frame Interpolation for Large Motion [20.04001872133824]
We present a frame interpolation algorithm that synthesizes multiple intermediate frames from two input images with large in-between motion.
Our approach outperforms state-of-the-art methods on the Xiph large motion benchmark.
arXiv Detail & Related papers (2022-02-10T08:48:18Z)