Related papers: A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation

URL: http://arxiv.org/abs/2507.00676v1
Date: Tue, 01 Jul 2025 11:18:23 GMT
Title: A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation
Authors: Edward Effendy, Kuan-Wei Tseng, Rei Kawakami
Abstract summary: We present a novel transformer-based framework for whole-body grasping. It addresses pose generation and motion infilling, enabling realistic and stable object interactions. Our method outperforms state-of-the-art baselines in terms of coherence, stability, and visual realism.
Score: 6.465569743109499
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Accepted at ICIP 2025. We present a novel transformer-based framework for whole-body grasping that addresses both pose generation and motion infilling, enabling realistic and stable object interactions. Our pipeline comprises three stages: Grasp Pose Generation for full-body grasp generation, Temporal Infilling for smooth motion continuity, and a LiftUp Transformer that refines downsampled joints back to high-resolution markers. To overcome the scarcity of hand-object interaction data, we introduce a data-efficient Generalized Pretraining stage on large, diverse motion datasets, yielding robust spatio-temporal representations transferable to grasping tasks. Experiments on the GRAB dataset show that our method outperforms state-of-the-art baselines in terms of coherence, stability, and visual realism. The modular design also supports easy adaptation to other human-motion applications.

Related papers

SILK: Smooth InterpoLation frameworK for motion in-betweening A Simplified Computational Approach [1.7812314225208412]
Motion in-betweening is a crucial tool for animators, enabling control over pose-level details in each pose. Recent machine learning solutions for motion in-betweening rely on complex models or skeleton-aware architectures, or require multiple modules and training steps. We introduce a simple yet effective Transformer-based framework, employing a single encoder to synthesize realistic motions for motion in-betweening tasks.
arXiv Detail & Related papers (2025-06-09T19:26:27Z)
Absolute Coordinates Make Motion Generation Easy [8.153961351540834]
State-of-the-art text-to-motion generation models rely on the kinematic-aware, local-relative motion representation popularized by HumanML3D. We propose a radically simplified and long-abandoned alternative for text-to-motion generation: absolute joint coordinates in global space.
arXiv Detail & Related papers (2025-05-26T00:36:00Z)
ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z)
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
Motion-Aware Generative Frame Interpolation [23.380470636851022]
Flow-based frame interpolation methods ensure motion stability through estimated intermediate flow but often introduce severe artifacts in complex motion regions. Recent generative approaches, boosted by large-scale pre-trained video generation models, show promise in handling intricate scenes. We propose Motion-aware Generative frame interpolation (MoG), which synergizes intermediate flow guidance with generative capacities to enhance fidelity.
arXiv Detail & Related papers (2025-01-07T11:03:43Z)
Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)
Combo: Co-speech holistic 3D human motion generation and efficient customizable adaptation in harmony [55.26315526382004]
We propose a novel framework, Combo, for co-speech holistic 3D human motion generation. In particular, we identify one fundamental challenge as the multiple-input-multiple-output nature of the generative model of interest. Combo is not only highly effective in generating high-quality motions but also efficient in transferring identity and emotion.
arXiv Detail & Related papers (2024-08-18T07:48:49Z)
TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control. A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects. Video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z)
Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling [13.284947022380404]
We propose a two-stage framework that can obtain accurate and smooth full-body motions with three tracking signals of head and hands only. Our framework explicitly models the joint-level features in the first stage and utilizes them as temporal tokens for alternating spatial and temporal transformer blocks to capture joint-level correlations in the second stage. With extensive experiments on the AMASS motion dataset and real-captured data, we show our proposed method can achieve more accurate and smooth motion compared to existing approaches.
arXiv Detail & Related papers (2023-08-17T08:27:55Z)
Blur Interpolation Transformer for Real-World Motion from Blur [52.10523711510876]
We propose an encoded blur transformer (BiT) to unravel the underlying temporal correlation in blur. Based on multi-scale residual Swin transformer blocks, we introduce dual-end temporal supervision and temporally symmetric ensembling strategies. In addition, we design a hybrid camera system to collect the first real-world dataset of one-to-many blur-sharp video pairs.
arXiv Detail & Related papers (2022-11-21T13:10:10Z)
End-to-end Contextual Perception and Prediction with Interaction Transformer [79.14001602890417]
We tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving. To capture their spatial-temporal dependencies, we propose a recurrent neural network with a novel Transformer architecture. Our model can be trained end-to-end and runs in real time.
arXiv Detail & Related papers (2020-08-13T14:30:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.