Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation
- URL: http://arxiv.org/abs/2512.11654v1
- Date: Fri, 12 Dec 2025 15:32:28 GMT
- Title: Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation
- Authors: Luca Cazzola, Ahed Alboody
- Abstract summary: We propose KineMIC, a transfer learning framework for few-shot action synthesis. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers a +23.1 percentage-point improvement in accuracy.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The acquisition cost for large, annotated motion datasets remains a critical bottleneck for skeletal-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives, which emphasize general artistic motion, and dataset structures fundamentally differ from HAR's requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, making generalist T2M models ill-equipped for generating motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers a +23.1 percentage-point improvement in accuracy. Animated illustrations and supplementary materials are available at https://lucazzola.github.io/publications/kinemic.
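The paper's own mining code is not reproduced here; as a rough illustration of the kinetic-mining idea described in the abstract, the sketch below scores T2M captions against sparse HAR action labels by CLIP text-embedding similarity and keeps the top matches per class. The checkpoint name, example labels, captions, and top-k value are all illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical sketch of CLIP-based kinetic mining: rank T2M captions
# against HAR action labels by text-embedding cosine similarity.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

MODEL = "openai/clip-vit-base-patch32"  # assumed CLIP checkpoint
tokenizer = CLIPTokenizer.from_pretrained(MODEL)
text_encoder = CLIPTextModelWithProjection.from_pretrained(MODEL)

def embed(texts):
    """Return L2-normalized CLIP text embeddings for a list of strings."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        emb = text_encoder(**inputs).text_embeds
    return emb / emb.norm(dim=-1, keepdim=True)

har_labels = ["drink water", "throw", "sit down"]   # sparse HAR class names
t2m_captions = [                                    # HumanML3D-style captions
    "a person sips from a cup",
    "someone hurls an object overhead",
    "a figure lowers themselves onto a chair",
    "a person walks forward slowly",
]

har_emb, t2m_emb = embed(har_labels), embed(t2m_captions)
sim = har_emb @ t2m_emb.T             # cosine similarity, labels x captions
scores, idx = sim.topk(k=2, dim=-1)   # mined T2M candidates per HAR class
for label, caption_ids in zip(har_labels, idx):
    print(label, "->", [t2m_captions[int(i)] for i in caption_ids])
```

In the paper's pipeline, the motion clips paired with the mined captions would then serve as soft supervision when fine-tuning the T2M diffusion backbone on the few-shot HAR classes.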
Related papers
- T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation [3.6564162676635363]
Text-to-motion generation has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction. Existing models treat motion periodicity and saliency as independent factors, overlooking their coupling and causing generation drift in long sequences. We propose T2M Mamba to address these limitations, introducing a Periodicity-Saliency Aware Mamba.
arXiv Detail & Related papers (2026-02-01T17:42:53Z) - A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition [87.12969639957441]
Action recognition has been dominated by transformer-based methods, thanks to their contextual aggregation capacities. We propose to integrate those effective motion modeling properties into the existing transformer in a unified and neat way. Our method performs better than existing state-of-the-art approaches, especially on motion-sensitive datasets.
arXiv Detail & Related papers (2025-10-21T15:01:48Z) - DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models [103.18486625853099]
DEFT, Decompositional Efficient Fine-Tuning, adapts a pre-trained weight matrix by decomposing its update into two components. We conduct experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework.
arXiv Detail & Related papers (2025-09-26T18:01:15Z) - Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation [66.66243874361103]
Dataset generation faces two key challenges: 1) aligning generated samples with the target domain and 2) producing informative samples beyond the training data. We propose Concept-Aware LoRA, a novel fine-tuning approach that selectively identifies and updates only the weights associated with necessary concepts for domain alignment. We demonstrate its effectiveness in generating datasets for urban-scene segmentation, outperforming baseline and state-of-the-art methods in in-domain settings.
arXiv Detail & Related papers (2025-03-28T06:23:29Z) - Matching Skeleton-based Activity Representations with Heterogeneous Signals for HAR [30.418663483793804]
We propose SKELAR, a novel framework that pretrains activity representations from skeleton data and matches them with heterogeneous HAR signals. SKELAR achieves state-of-the-art performance in both full-shot and few-shot settings. We also demonstrate that SKELAR can effectively leverage synthetic skeleton data to extend its use in scenarios without skeleton collections.
arXiv Detail & Related papers (2025-03-17T18:43:06Z) - Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation [19.094098673523263]
We propose a novel framework for fine-grained text-driven human motion generation. Fg-T2M++ consists of: (1) an LLM-based semantic parsing module to extract body part descriptions and semantics from text, (2) a hyperbolic text representation module to encode relational information between text units, and (3) a multi-modal fusion module to hierarchically fuse text and motion features.
arXiv Detail & Related papers (2025-02-08T11:38:12Z) - KETA: Kinematic-Phrases-Enhanced Text-to-Motion Generation via Fine-grained Alignment [5.287416596074742]
State-of-the-art T2M techniques mainly leverage diffusion models to generate motions with text prompts as guidance. We propose KETA, which decomposes the given text into several sub-texts via a language model. Experiments demonstrate that KETA achieves up to 1.19x and 2.34x better R-precision and FID, respectively, on both backbones of the base motion diffusion model.
arXiv Detail & Related papers (2025-01-25T03:43:33Z) - T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data [6.6240820702899565]
Existing methods only generate body motion data, excluding facial expressions and hand movements.
Recent attempts to create such a dataset have resulted in motion inconsistency among different body parts.
We propose T2M-X, a two-stage method that learns expressive text-to-motion generation from partially annotated data.
arXiv Detail & Related papers (2024-09-20T06:20:00Z) - Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution. It undergoes finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z) - TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts [20.336481832461168]
Inspired by the strong ties between vision and language, our paper aims to explore the generation of 3D human full-body motions from texts.
We propose the use of motion tokens, a discrete and compact motion representation (a generic sketch of this idea appears after this list). Our approach is flexible and can be used for both text2motion and motion2text tasks.
arXiv Detail & Related papers (2022-07-04T19:52:18Z) - Style-Hallucinated Dual Consistency Learning for Domain Generalized Semantic Segmentation [117.3856882511919]
We propose the Style-HAllucinated Dual consistEncy learning (SHADE) framework to handle domain shift.
Our SHADE yields significant improvement and outperforms state-of-the-art methods by 5.07% and 8.35% on the average mIoU of three real-world datasets.
arXiv Detail & Related papers (2022-04-06T02:49:06Z) - Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy.
MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z)
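The TM2T entry above describes motion tokens as a discrete, compact motion representation. As a minimal, generic illustration (not TM2T's actual tokenizer; the codebook size, feature dimension, and sequence length below are assumptions), a nearest-neighbor vector-quantization lookup turns continuous pose features into discrete tokens:

```python
# Generic vector-quantization sketch of a "motion token" representation:
# each frame's continuous feature vector is replaced by the index of its
# nearest entry in a learned codebook. Shapes here are illustrative.
import torch

codebook = torch.randn(512, 64)        # 512 learned code vectors, dim 64
pose_features = torch.randn(100, 64)   # 100 frames of encoded motion

dists = torch.cdist(pose_features, codebook)  # (100, 512) pairwise distances
tokens = dists.argmin(dim=-1)                 # (100,) integer motion tokens
reconstructed = codebook[tokens]              # decode tokens back to features
print(tokens[:10], reconstructed.shape)
```

Discretizing motion this way lets sequence models treat motion generation like language modeling, which is what makes the reciprocal text2motion and motion2text tasks tractable in one framework.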
This list is automatically generated from the titles and abstracts of the papers on this site.