Related papers: SlotFM: A Motion Foundation Model with Slot Attention for Diverse Downstream Tasks

URL: http://arxiv.org/abs/2509.21673v1
Date: Thu, 25 Sep 2025 22:41:43 GMT
Title: SlotFM: A Motion Foundation Model with Slot Attention for Diverse Downstream Tasks
Authors: Junyong Park, Oron Levy, Rebecca Adaimi, Asaf Liberman, Gierad Laput, Abdelkareem Bedri,
Abstract summary: We present SlotFM, an accelerometer foundation model that generalizes across diverse downstream tasks. We evaluate SlotFM on 16 classification and regression downstream tasks that extend beyond standard human activity recognition.
Score: 2.0906347671401018
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Wearable accelerometers are used for a wide range of applications, such as gesture recognition, gait analysis, and sports monitoring. Yet most existing foundation models focus primarily on classifying common daily activities such as locomotion and exercise, limiting their applicability to the broader range of tasks that rely on other signal characteristics. We present SlotFM, an accelerometer foundation model that generalizes across diverse downstream tasks. SlotFM uses Time-Frequency Slot Attention, an extension of Slot Attention that processes both time and frequency representations of the raw signals. It generates multiple small embeddings (slots), each capturing different signal components, enabling task-specific heads to focus on the most relevant parts of the data. We also introduce two loss regularizers that capture local structure and frequency patterns, which improve reconstruction of fine-grained details and help the embeddings preserve task-relevant information. We evaluate SlotFM on 16 classification and regression downstream tasks that extend beyond standard human activity recognition. It outperforms existing self-supervised approaches on 13 of these tasks and achieves comparable results to the best-performing approaches on the remaining tasks. On average, our method yields a 4.5% performance gain, demonstrating strong generalization for sensing foundation models.
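The abstract's idea of slots competing over time and frequency representations can be illustrated with a small NumPy sketch. This is not the SlotFM implementation: the tokenization (`time_frequency_tokens`), the window size, the number of slots, and the omission of learned projections, the GRU update, and the MLP are all simplifying assumptions made here for illustration only.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def time_frequency_tokens(signal, window=16):
    """Hypothetical tokenizer: one time token and one frequency token per window.

    Time tokens are the raw samples of each window; frequency tokens are
    rFFT magnitudes, zero-padded to the window length so both share a size.
    """
    n = len(signal) // window
    frames = signal[: n * window].reshape(n, window)
    spectra = np.abs(np.fft.rfft(frames, axis=1))        # (n, window // 2 + 1)
    spectra = np.pad(spectra, ((0, 0), (0, window - spectra.shape[1])))
    return np.concatenate([frames, spectra], axis=0)     # (2 * n, window)


def slot_attention(tokens, num_slots=4, iters=3, seed=0):
    """Simplified slot attention without learned parameters (an assumption).

    Slots compete for tokens via a softmax over the slot axis, then each
    slot is replaced by the attention-weighted mean of the tokens.
    """
    rng = np.random.default_rng(seed)
    dim = tokens.shape[1]
    slots = rng.normal(size=(num_slots, dim))
    for _ in range(iters):
        logits = tokens @ slots.T / np.sqrt(dim)         # (tokens, slots)
        attn = softmax(logits, axis=1)                   # normalize over slots
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ tokens                       # weighted mean per slot
    return slots, attn


# Synthetic two-tone signal standing in for one accelerometer axis.
t = np.arange(256) / 50.0
signal = np.sin(2 * np.pi * 1.5 * t) + 0.3 * np.sin(2 * np.pi * 12.0 * t)
tokens = time_frequency_tokens(signal)
slots, attn = slot_attention(tokens)
print(slots.shape, attn.shape)
```

Each row of `slots` is one small embedding; in this sketch a downstream head could read any subset of them, which mirrors the abstract's description of task-specific heads focusing on the most relevant slots.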

Related papers

FusAD: Time-Frequency Fusion with Adaptive Denoising for General Time Series Analysis [92.23551599659186]
Time series analysis plays a vital role in fields such as finance, healthcare, industry, and meteorology. FusAD is a unified analysis framework designed for diverse time series tasks.
arXiv Detail & Related papers (2025-12-16T04:34:27Z)
OmniFD: A Unified Model for Versatile Face Forgery Detection [45.17431538516313]
We introduce OmniFD, a unified framework that jointly addresses four core forgery detection tasks within a single model. Our architecture consists of three principal components: (1) a shared Swin Transformer that extracts unified 4D-temporal representations from both images and video inputs, (2) a cross-task interaction module with learnable queries, and (3) lightweight decoding heads that transform refined representations into corresponding predictions.
arXiv Detail & Related papers (2025-11-30T22:36:42Z)
FAIM: Frequency-Aware Interactive Mamba for Time Series Classification [87.84511960413715]
Time series classification (TSC) is crucial in numerous real-world applications, such as environmental monitoring, medical diagnosis, and posture recognition. We propose FAIM, a lightweight Frequency-Aware Interactive Mamba model. We show that FAIM consistently outperforms existing state-of-the-art (SOTA) methods, achieving a superior trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2025-11-26T08:36:33Z)
Beyond Hearing: Learning Task-agnostic ExG Representations from Earphones via Physiology-informed Tokenization [24.30887065380288]
We introduce an approach for scalable, task-agnostic ExG monitoring in the wild. At the core of our approach is Physiology-informed Multi-band Tokenization (PiMT), which decomposes ExG signals into 12 physiology-informed tokens. Experiments on our new DailySense dataset (the first to enable ExG-based analysis across five human senses) demonstrate that PiMT consistently outperforms state-of-the-art methods across diverse tasks.
arXiv Detail & Related papers (2025-10-22T05:11:02Z)
RefAM: Attention Magnets for Zero-Shot Referral Segmentation [103.98022860792504]
We introduce a new method that exploits features, attention scores, from diffusion transformers for downstream tasks. Key insight is that stop words act as attention magnets. We propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters.
arXiv Detail & Related papers (2025-09-26T17:59:57Z)
FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models. We introduce FreSca, a novel framework that decomposes noise difference into low- and high-frequency components. FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
arXiv Detail & Related papers (2025-04-02T22:03:11Z)
FDDet: Frequency-Decoupling for Boundary Refinement in Temporal Action Detection [4.015022008487465]
Large-scale pre-trained video encoders tend to introduce background clutter and irrelevant semantics, leading to context confusion and boundaries. We propose a frequency-aware decoupling network that improves action discriminability by filtering out noisy semantics captured by pre-trained models. Our method achieves state-of-the-art performance on temporal action detection benchmarks.
arXiv Detail & Related papers (2025-04-01T10:57:37Z)
Probing Fine-Grained Action Understanding and Cross-View Generalization of Foundation Models [13.972809192907931]
Foundation models (FMs) are large neural networks trained on broad datasets. Human activity recognition in video has advanced with FMs, driven by competition among different architectures. This paper empirically evaluates how perspective changes affect different FMs in fine-grained human activity recognition.
arXiv Detail & Related papers (2024-07-22T12:59:57Z)
XTrack: Multimodal Training Boosts RGB-X Video Object Trackers [88.72203975896558]
It is crucial to ensure that knowledge gained from multimodal sensing is effectively shared. Similar samples across different modalities have more knowledge to share than otherwise. We propose a method for RGB-X tracker during inference, with an average +3% precision improvement over the current SOTA.
arXiv Detail & Related papers (2024-05-28T03:00:58Z)
Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos. Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras. We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z)
Adaptive Frequency Learning in Two-branch Face Forgery Detection [66.91715092251258]
We propose Adaptively learn Frequency information in the two-branch Detection framework, dubbed AFD. We liberate our network from the fixed frequency transforms, and achieve better performance with our data- and task-dependent transform layers.
arXiv Detail & Related papers (2022-03-27T14:25:52Z)
Sequence-to-Sequence Modeling for Action Identification at High Temporal Resolution [9.902223920743872]
We introduce a new action-recognition benchmark that includes subtle short-duration actions labeled at a high temporal resolution. We show that current state-of-the-art models based on segmentation produce noisy predictions when applied to these data. We propose a novel approach for high-resolution action identification, inspired by speech-recognition techniques.
arXiv Detail & Related papers (2021-11-03T21:06:36Z)
Learnable Multi-level Frequency Decomposition and Hierarchical Attention Mechanism for Generalized Face Presentation Attack Detection [7.324459578044212]
Face presentation attack detection (PAD) is attracting a lot of attention and playing a key role in securing face recognition systems. We propose a dual-stream convolutional neural network (CNN) framework to deal with unseen scenarios. We successfully prove the design of our proposed PAD solution in a step-wise ablation study.
arXiv Detail & Related papers (2021-09-16T13:06:43Z)
