DETACH : Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
- URL: http://arxiv.org/abs/2512.20409v1
- Date: Tue, 23 Dec 2025 14:55:53 GMT
- Title: DETACH : Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
- Authors: Junho Yoon, Jaemo Jung, Hyunju Kim, Dongman Lee
- Abstract summary: Aligning egocentric video with wearable sensors has shown promise for human action recognition, but faces practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. Comprehensive experiments with downstream tasks on the Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
- Score: 7.149401911329968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aligning egocentric video with wearable sensors has shown promise for human action recognition, but this pairing faces practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions that share similar temporal patterns but differ in spatio-semantic context. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features, discovered via online clustering, provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach first establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on the Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
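The abstract describes the spatial-temporal weighted contrastive loss only at a high level. Below is a minimal PyTorch-style sketch of how such adaptive negative weighting could look; the spatial-similarity proxy for detecting false negatives, the softmax-based hard-negative weighting, and all hyperparameters (temperature, threshold) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): a cross-modal contrastive loss that
# down-weights easy negatives, up-weights hard negatives, and masks out probable
# false negatives using a spatial-similarity prior. All design choices here are
# assumptions made for illustration.
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(video_emb, sensor_emb, spatial_sim,
                              temperature=0.07, false_neg_thresh=0.8):
    """video_emb, sensor_emb: (B, D) embeddings of paired video/sensor clips.
    spatial_sim: (B, B) similarity of sensor-spatial context across samples,
    used as a hypothetical proxy for spotting false negatives."""
    v = F.normalize(video_emb, dim=-1)
    s = F.normalize(sensor_emb, dim=-1)
    logits = v @ s.t() / temperature                      # (B, B) cross-modal similarities
    B = logits.size(0)
    pos_mask = torch.eye(B, dtype=torch.bool, device=logits.device)

    with torch.no_grad():
        # Hard negatives (high similarity) receive weight > 1, easy negatives < 1.
        neg_weight = torch.softmax(
            logits.masked_fill(pos_mask, float('-inf')), dim=1) * (B - 1)
        # Negatives whose spatial context matches the anchor are treated as
        # probable false negatives and dropped from the denominator.
        false_neg = (spatial_sim > false_neg_thresh) & ~pos_mask
        neg_weight = neg_weight.masked_fill(false_neg, 0.0)

    exp_logits = torch.exp(logits)
    denom = exp_logits[pos_mask] + (exp_logits * neg_weight * (~pos_mask).float()).sum(dim=1)
    loss = -torch.log(exp_logits[pos_mask] / denom)
    return loss.mean()
```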
Related papers
- A Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness [8.202209362704494]
We propose a Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness, termed ANet.
ANet integrates global semantics, local details, temporal reasoning, and boundary awareness, achieving state-of-the-art performance.
arXiv Detail & Related papers (2026-02-12T00:54:22Z) - RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning [61.84363374647606]
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions.
These descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning.
We propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding.
arXiv Detail & Related papers (2026-01-29T12:35:57Z) - Strip-Fusion: Spatiotemporal Fusion for Multispectral Pedestrian Detection [0.27528170226206433]
Multispectral modalities (visible light and thermal) can boost pedestrian detection performance by providing complementary visual information.
Existing approaches primarily focus on spatial fusion and often neglect temporal information.
This work proposes Strip-Fusion, a spatial-temporal fusion network that is robust to misalignment in input images.
arXiv Detail & Related papers (2026-01-25T21:58:07Z) - TaCo: Capturing Spatio-Temporal Semantic Consistency in Remote Sensing Change Detection [54.22717266034045]
Ta-Co is a consistent semantic network for temporal semantic transitions.
We show that Ta-Co consistently achieves SOTA performance on remote sensing detection tasks.
This design can yield substantial gains without any additional computational overhead during inference.
arXiv Detail & Related papers (2025-11-25T13:44:29Z) - Belief-Conditioned One-Step Diffusion: Real-Time Trajectory Planning with Just-Enough Sensing [1.6984211127623137]
We present Belief-Conditioned One-Step Diffusion (B-COD), the first planner that, in a 10 ms forward pass, returns a short-horizon trajectory.
We show that this single proxy suffices for a soft-actor-critic to choose sensors online, optimising energy while bounding pose-co growth.
arXiv Detail & Related papers (2025-08-16T21:34:16Z) - Delving into Dynamic Scene Cue-Consistency for Robust 3D Multi-Object Tracking [16.366398265001422]
3D multi-object tracking is a critical and challenging task in the field of autonomous driving.
We introduce the Dynamic Scene Cue-Consistency Tracker (DSC-Track) to implement this principle.
arXiv Detail & Related papers (2025-08-15T08:48:13Z) - Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [131.33758144860988]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity.
Current end-to-end frameworks suffer a critical spatial-temporal trade-off.
We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z) - STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing [2.231167375820083]
We argue that aligning the semantic features from spatial and temporal domains is a promising approach to stabilizing facial motion.
We propose a Spatial-Temporal Semantic Alignment (STSA) method, which introduces a dual-path alignment mechanism and a differentiable semantic representation.
arXiv Detail & Related papers (2025-03-29T11:04:10Z) - EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting [108.15136508964011]
EgoSplat is a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding.
EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets.
arXiv Detail & Related papers (2025-03-14T12:21:26Z) - Dynamic Position Transformation and Boundary Refinement Network for Left Atrial Segmentation [17.09918110723713]
Left atrial (LA) segmentation is a crucial technique for irregular heartbeat (i.e., atrial fibrillation) diagnosis.
Most current methods for LA segmentation strictly assume that the input data is acquired using object-oriented center cropping.
We propose a novel Dynamic Position transformation and Boundary refinement Network (DPBNet) to tackle these issues.
arXiv Detail & Related papers (2024-07-07T22:09:35Z) - Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z) - A Spatial-Temporal Attentive Network with Spatial Continuity for Trajectory Prediction [74.00750936752418]
We propose a novel model named spatial-temporal attentive network with spatial continuity (STAN-SC)
First, a spatial-temporal attention mechanism is presented to explore the most useful and important information.
Second, we construct a joint feature sequence from the sequence and instant state information so that the generated trajectories maintain spatial continuity.
arXiv Detail & Related papers (2020-03-13T04:35:50Z)