Related papers: Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches

Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches

URL: http://arxiv.org/abs/2405.04771v1
Date: Wed, 8 May 2024 02:42:27 GMT
Title: Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches
Authors: Qing Yu, Mikihiro Tanaka, Kent Fujiwara,
Abstract summary: We introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning. These motion patches, created by dividing and sorting skeleton joints based on motion sequences, are robust to varying skeleton structures. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis.
Score: 12.221087476416056
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: To build a cross-modal latent space between 3D human motion and language, acquiring large-scale and high-quality human motion data is crucial. However, unlike the abundance of image data, the scarcity of motion data has limited the performance of existing motion-language models. To counter this, we introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches, created by dividing and sorting skeleton joints based on body parts in motion sequences, are robust to varying skeleton structures, and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis, presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches, used jointly with ViT, achieve state-of-the-art performance in the benchmarks of text-to-motion retrieval, and other novel challenging tasks, such as cross-skeleton recognition, zero-shot motion classification, and human interaction recognition, which are currently impeded by the lack of data.

Related papers

IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation [58.297199313494]
Implicit methods capture motion semantics directly from driving video, but suffer from identity leakage and entanglement between motion and appearance.<n>We propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens.<n>Our methodology employs a three-stage training strategy to enhance the training efficiency and ensure high fidelity.
arXiv Detail & Related papers (2026-02-07T11:17:20Z)
MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization [73.07309070257162]
MotionAdapter is a content-aware motion transfer framework that enables robust and semantically aligned motion transfer.<n>Our key insight is that effective motion transfer requires explicit disentanglement of motion from appearance.<n> MotionAdapter naturally supports complex motion transfer and motion editing tasks such as zooming.
arXiv Detail & Related papers (2026-01-05T10:01:27Z)
VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models [110.32291962407078]
VimoRAG is a video-based retrieval-augmented motion generation framework for motion large language models.<n>We develop an effective motion-centered video retrieval model and mitigate the issue of error propagation caused by suboptimal retrieval results.<n> Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input.
arXiv Detail & Related papers (2025-08-16T15:31:14Z)
SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation [56.90807453045657]
SynMotion is a motion-customized video generation model that jointly leverages semantic guidance and visual adaptation.<n>At the semantic level, we introduce the dual-em semantic comprehension mechanism which disentangles subject and motion representations.<n>At the visual level, we integrate efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence.
arXiv Detail & Related papers (2025-06-30T10:09:32Z)
In-2-4D: Inbetweening from Two Single-View Images to 4D Generation [63.68181731564576]
We propose a new problem, Inbetween-2-4D, for generative 4D (i.e., 3D + motion) in interpolate two single-view images.<n>In contrast to video/4D generation from only text or a single image, our interpolative task can leverage more precise motion control to better constrain the generation.
arXiv Detail & Related papers (2025-04-11T09:01:09Z)
Instance-Level Moving Object Segmentation from a Single Image with Events [84.12761042512452]
Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects. Previous methods encounter difficulties in distinguishing whether pixel displacements of an object are caused by camera motion or object motion. Recent advances exploit the motion sensitivity of novel event cameras to counter conventional images' inadequate motion modeling capabilities. We propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues.
arXiv Detail & Related papers (2025-02-18T15:56:46Z)
Articulate That Object Part (ATOP): 3D Part Articulation via Text and Motion Personalization [12.944411575346528]
ATOP (Articulate That Object Part) is a novel few-shot method based on motion personalization to articulate a static 3D object with respect to a part and its motion as prescribed in a text prompt.<n>We show that our method can generate realistic motion samples with higher accuracy, leading to more generalizable 3D motion predictions.
arXiv Detail & Related papers (2025-02-11T05:47:16Z)
Sitcom-Crafter: A Plot-Driven Human Motion Generation System in 3D Scenes [83.55301458112672]
Sitcom-Crafter is a system for human motion generation in 3D space. Central to the function generation modules is our novel 3D scene-aware human-human interaction module. Augmentation modules encompass plot comprehension for command generation, motion synchronization for seamless integration of different motion types.
arXiv Detail & Related papers (2024-10-14T17:56:19Z)
T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data [6.6240820702899565]
Existing methods only generate body motion data, excluding facial expressions and hand movements. Recent attempts to create such a dataset have resulted in either motion inconsistency among different body parts. We propose T2M-X, a two-stage method that learns expressive text-to-motion generation from partially annotated data.
arXiv Detail & Related papers (2024-09-20T06:20:00Z)
MotionFix: Text-Driven 3D Human Motion Editing [52.11745508960547]
Key challenges include the scarcity of training data and the need to design a model that accurately edits the source motion. We propose a methodology to semi-automatically collect a dataset of triplets comprising (i) a source motion, (ii) a target motion, and (iii) an edit text. Access to this data allows us to train a conditional diffusion model, TMED, that takes both the source motion and the edit text as input.
arXiv Detail & Related papers (2024-08-01T16:58:50Z)
Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment [45.74813582690906]
Learning 3D human motion from 2D inputs is a fundamental task in the realms of computer vision and computer graphics. We present the Video-to-Motion Generator (VTM), which leverages motion priors through cross-modal latent feature space alignment. The VTM showcases state-of-the-art performance in reconstructing 3D human motion from monocular videos.
arXiv Detail & Related papers (2024-04-15T06:38:09Z)
Realistic Human Motion Generation with Cross-Diffusion Models [30.854425772128568]
Cross Human Motion Diffusion Model (CrossDiff) Method integrates 3D and 2D information using a shared transformer network within the training of the diffusion model. CrossDiff effectively combines the strengths of both representations to generate more realistic motion sequences.
arXiv Detail & Related papers (2023-12-18T07:44:40Z)
MotionTrack: Learning Motion Predictor for Multiple Object Tracking [68.68339102749358]
We introduce a novel motion-based tracker, MotionTrack, centered around a learnable motion predictor. Our experimental results demonstrate that MotionTrack yields state-of-the-art performance on datasets such as Dancetrack and SportsMOT.
arXiv Detail & Related papers (2023-06-05T04:24:11Z)
Human MotionFormer: Transferring Human Motions with Vision Transformers [73.48118882676276]
Human motion transfer aims to transfer motions from a target dynamic person to a source static one for motion synthesis. We propose Human MotionFormer, a hierarchical ViT framework that leverages global and local perceptions to capture large and subtle motion matching. Experiments show that our Human MotionFormer sets the new state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-02-22T11:42:44Z)
MotionBERT: A Unified Perspective on Learning Human Motion Representations [46.67364057245364]
We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources. We propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations. We implement motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network.
arXiv Detail & Related papers (2022-10-12T19:46:25Z)
Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy. MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z)
Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation. An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder. In this way, the encoder becomes deeply internative, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.