Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation
- URL: http://arxiv.org/abs/2512.11654v1
- Date: Fri, 12 Dec 2025 15:32:28 GMT
- Title: Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation
- Authors: Luca Cazzola, Ahed Alboody
- Abstract summary: We propose KineMIC, a transfer learning framework for few-shot action synthesis. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers a +23.1 percentage-point improvement in accuracy.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The acquisition cost for large, annotated motion datasets remains a critical bottleneck for skeletal-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives, which emphasize general artistic motion, and dataset structures fundamentally differ from HAR's requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, making generalist T2M models ill-equipped for generating motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers a +23.1 percentage-point improvement in accuracy. Animated illustrations and supplementary materials are available at https://lucazzola.github.io/publications/kinemic.
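The paper's own mining code is not reproduced here; as a rough illustration of the kinetic-mining idea described in the abstract, the sketch below scores T2M captions against sparse HAR action labels by CLIP text-embedding similarity and keeps the top matches per class. The checkpoint name, example labels, captions, and top-k value are all illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical sketch of CLIP-based kinetic mining: rank T2M captions
# against HAR action labels by text-embedding cosine similarity.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

MODEL = "openai/clip-vit-base-patch32"  # assumed CLIP checkpoint
tokenizer = CLIPTokenizer.from_pretrained(MODEL)
text_encoder = CLIPTextModelWithProjection.from_pretrained(MODEL)

def embed(texts):
    """Return L2-normalized CLIP text embeddings for a list of strings."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        emb = text_encoder(**inputs).text_embeds
    return emb / emb.norm(dim=-1, keepdim=True)

har_labels = ["drink water", "throw", "sit down"]   # sparse HAR class names
t2m_captions = [                                    # HumanML3D-style captions
    "a person sips from a cup",
    "someone hurls an object overhead",
    "a figure lowers themselves onto a chair",
    "a person walks forward slowly",
]

har_emb, t2m_emb = embed(har_labels), embed(t2m_captions)
sim = har_emb @ t2m_emb.T             # cosine similarity, labels x captions
scores, idx = sim.topk(k=2, dim=-1)   # mined T2M candidates per HAR class
for label, caption_ids in zip(har_labels, idx):
    print(label, "->", [t2m_captions[int(i)] for i in caption_ids])
```

In the paper's pipeline, the motion clips paired with the mined captions would then serve as soft supervision when fine-tuning the T2M diffusion backbone on the few-shot HAR classes.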
Related papers
- T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation [3.6564162676635363]
Text-to-motion generation has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction. Existing models treat motion periodicity and saliency as independent factors, overlooking their coupling and causing generation drift in long sequences. We propose T2M Mamba to address these limitations, introducing a Periodicity-Saliency Aware Mamba.
arXiv Detail & Related papers (2026-02-01T17:42:53Z) - A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition [87.12969639957441]
Action recognition has been dominated by transformer-based methods, thanks to their contextual aggregation capacities. We propose to integrate those effective motion modeling properties into the existing transformer in a unified and neat way. Our method performs better than existing state-of-the-art approaches, especially on motion-sensitive datasets.
arXiv Detail & Related papers (2025-10-21T15:01:48Z) - DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models [103.18486625853099]
DEFT, Decompositional Efficient Fine-Tuning, adapts a pre-trained weight matrix by decomposing its update into two components. We conduct experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework.
arXiv Detail & Related papers (2025-09-26T18:01:15Z) - Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation [66.66243874361103]
Dataset generation faces two key challenges: 1) aligning generated samples with the target domain and 2) producing informative samples beyond the training data. We propose Concept-Aware LoRA, a novel fine-tuning approach that selectively identifies and updates only the weights associated with necessary concepts for domain alignment. We demonstrate its effectiveness in generating datasets for urban-scene segmentation, outperforming baseline and state-of-the-art methods in in-domain settings.
arXiv Detail & Related papers (2025-03-28T06:23:29Z) - Matching Skeleton-based Activity Representations with Heterogeneous Signals for HAR [30.418663483793804]
We propose SKELAR, a novel framework that pretrains activity representations from skeleton data and matches them with heterogeneous HAR signals. SKELAR achieves state-of-the-art performance in both full-shot and few-shot settings. We also demonstrate that SKELAR can effectively leverage synthetic skeleton data to extend its use in scenarios without skeleton collections.
arXiv Detail & Related papers (2025-03-17T18:43:06Z) - Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation [19.094098673523263]
We propose a novel framework for fine-grained text-driven human motion generation. Fg-T2M++ consists of: (1) an LLM-based semantic parsing module to extract body part descriptions and semantics from text, (2) a hyperbolic text representation module to encode relational information between text units, and (3) a multi-modal fusion module to hierarchically fuse text and motion features.
arXiv Detail & Related papers (2025-02-08T11:38:12Z) - KETA: Kinematic-Phrases-Enhanced Text-to-Motion Generation via Fine-grained Alignment [5.287416596074742]
State-of-the-art T2M techniques mainly leverage diffusion models to generate motions with text prompts as guidance. We propose KETA, which decomposes the given text into several sub-texts via a language model. Experiments demonstrate that KETA achieves up to 1.19x and 2.34x better R-precision and FID, respectively, on both backbones of the base motion diffusion model.
arXiv Detail & Related papers (2025-01-25T03:43:33Z) - T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data [6.6240820702899565]
Existing methods only generate body motion data, excluding facial expressions and hand movements.
Recent attempts to create such a dataset have resulted in motion inconsistency among different body parts.
We propose T2M-X, a two-stage method that learns expressive text-to-motion generation from partially annotated data.
arXiv Detail & Related papers (2024-09-20T06:20:00Z) - Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution. It undergoes finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z) - TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts [20.336481832461168]
Inspired by the strong ties between vision and language, our paper aims to explore the generation of 3D human full-body motions from texts.
We propose the use of motion tokens, a discrete and compact motion representation (a generic sketch of this idea appears after this list). Our approach is flexible and can be used for both text2motion and motion2text tasks.
arXiv Detail & Related papers (2022-07-04T19:52:18Z) - Style-Hallucinated Dual Consistency Learning for Domain Generalized Semantic Segmentation [117.3856882511919]
We propose the Style-HAllucinated Dual consistEncy learning (SHADE) framework to handle domain shift.
Our SHADE yields significant improvement and outperforms state-of-the-art methods by 5.07% and 8.35% on the average mIoU of three real-world datasets.
arXiv Detail & Related papers (2022-04-06T02:49:06Z) - Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy.
MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z)
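The TM2T entry above describes motion tokens as a discrete, compact motion representation. As a minimal, generic illustration (not TM2T's actual tokenizer; the codebook size, feature dimension, and sequence length below are assumptions), a nearest-neighbor vector-quantization lookup turns continuous pose features into discrete tokens:

```python
# Generic vector-quantization sketch of a "motion token" representation:
# each frame's continuous feature vector is replaced by the index of its
# nearest entry in a learned codebook. Shapes here are illustrative.
import torch

codebook = torch.randn(512, 64)        # 512 learned code vectors, dim 64
pose_features = torch.randn(100, 64)   # 100 frames of encoded motion

dists = torch.cdist(pose_features, codebook)  # (100, 512) pairwise distances
tokens = dists.argmin(dim=-1)                 # (100,) integer motion tokens
reconstructed = codebook[tokens]              # decode tokens back to features
print(tokens[:10], reconstructed.shape)
```

Discretizing motion this way lets sequence models treat motion generation like language modeling, which is what makes the reciprocal text2motion and motion2text tasks tractable in one framework.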
This list is automatically generated from the titles and abstracts of the papers on this site.