Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation
- URL: http://arxiv.org/abs/2411.16810v1
- Date: Mon, 25 Nov 2024 15:06:49 GMT
- Title: Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation
- Authors: Shengeng Tang, Jiayi He, Lechao Cheng, Jingjing Wu, Dan Guo, Richang Hong
- Abstract summary: We propose a conditional diffusion model to synthesize contextually smooth transition frames.
Our approach transforms the unsupervised problem of transition frame generation into a supervised training task.
Experiments on the PHOENIX14T, USTC-CSL100, and USTC-SLR500 datasets demonstrate the effectiveness of our method.
- Score: 45.214169930573775
- Abstract: Generating continuous sign language videos from discrete segments is challenging due to the need for smooth transitions that preserve natural flow and meaning. Traditional approaches that simply concatenate isolated signs often result in abrupt transitions, disrupting video coherence. To address this, we propose a novel framework, Sign-D2C, that employs a conditional diffusion model to synthesize contextually smooth transition frames, enabling the seamless construction of continuous sign language sequences. Our approach transforms the unsupervised problem of transition frame generation into a supervised training task by simulating the absence of transition frames through random masking of segments in long-duration sign videos. The model learns to predict these masked frames by denoising Gaussian noise, conditioned on the surrounding sign observations, allowing it to handle complex, unstructured transitions. During inference, we apply a linearly interpolating padding strategy that initializes missing frames through interpolation between boundary frames, providing a stable foundation for iterative refinement by the diffusion model. Extensive experiments on the PHOENIX14T, USTC-CSL100, and USTC-SLR500 datasets demonstrate the effectiveness of our method in producing continuous, natural sign language videos.
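As a rough illustration of the pipeline described in the abstract, the sketch below shows how missing transition frames could be simulated by random masking during training and, at inference, initialized by linear interpolation between the observed boundary frames before the diffusion model's iterative refinement (omitted here). This is a minimal sketch based only on the abstract; the function names, pose array layout, and masking parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def simulate_transition_masking(seq_len, num_spans=2, span_len=8, rng=None):
    """Training-time simulation (assumed parameters): randomly mask short
    segments of a long sign video so the model can be trained to recover
    the 'missing' transition frames in a supervised fashion."""
    rng = rng if rng is not None else np.random.default_rng()
    mask = np.zeros(seq_len, dtype=bool)
    for _ in range(num_spans):
        start = int(rng.integers(0, max(1, seq_len - span_len)))
        mask[start:start + span_len] = True
    return mask

def linear_interpolation_padding(poses, mask):
    """Inference-time initialization: fill each masked span by linearly
    interpolating between the nearest observed boundary frames, giving the
    diffusion model a stable starting point for iterative refinement.

    poses: (T, J, D) array of per-frame pose keypoints (illustrative layout).
    mask:  (T,) boolean array, True where a transition frame is missing.
    """
    poses = poses.copy()
    T = len(mask)
    t = 0
    while t < T:
        if not mask[t]:
            t += 1
            continue
        start = t                      # first masked frame of the span
        while t < T and mask[t]:
            t += 1
        end = t                        # first observed frame after the span
        left, right = start - 1, end
        if left >= 0 and right < T:
            gap = end - start
            for i, s in enumerate(range(start, end), start=1):
                alpha = i / (gap + 1)  # fraction of the way from left to right
                poses[s] = (1 - alpha) * poses[left] + alpha * poses[right]
        else:
            # Span touches the sequence boundary: repeat the nearest observed frame.
            ref = poses[right] if left < 0 else poses[left]
            poses[start:end] = ref
    return poses

# Example usage with random poses (T frames, J keypoints, 2-D coordinates).
if __name__ == "__main__":
    T, J, D = 64, 17, 2
    poses = np.random.randn(T, J, D)
    mask = simulate_transition_masking(T)
    init = linear_interpolation_padding(poses, mask)  # then refined by the diffusion model
```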
Related papers
- Transformer with Controlled Attention for Synchronous Motion Captioning [0.0]
This paper addresses synchronous motion captioning, a challenging task that aims to generate a language description synchronized with human motion sequences.
Our method introduces mechanisms to control self- and cross-attention distributions of the Transformer, allowing interpretability and time-aligned text generation.
We demonstrate the superior performance of our approach through evaluation on the two available benchmark datasets, KIT-ML and HumanML3D.
arXiv Detail & Related papers (2024-09-13T20:30:29Z) - Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z) - A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z) - Human Video Translation via Query Warping [38.9185553719231]
We present QueryWarp, a novel framework for temporally coherent human motion video translation.
We use appearance flows to warp the previous frame's query token, aligning it with the current frame's query.
This query warping imposes explicit constraints on the outputs of self-attention layers, effectively guaranteeing temporally coherent translation.
arXiv Detail & Related papers (2024-02-19T12:28:45Z) - Motion-Aware Video Frame Interpolation [49.49668436390514]
We introduce a Motion-Aware Video Frame Interpolation (MA-VFI) network, which directly estimates intermediate optical flow from consecutive frames.
It not only extracts global semantic relationships and spatial details from input frames with different receptive fields, but also effectively reduces the required computational cost and complexity.
arXiv Detail & Related papers (2024-02-05T11:00:14Z) - Training-Free Semantic Video Composition via Pre-trained Diffusion Model [96.0168609879295]
Current approaches, predominantly trained on videos with adjusted foreground color and lighting, struggle to address deep semantic disparities beyond superficial adjustments.
We propose a training-free pipeline employing a pre-trained diffusion model imbued with semantic prior knowledge.
Experimental results reveal that our pipeline successfully ensures the visual harmony and inter-frame coherence of the outputs.
arXiv Detail & Related papers (2024-01-17T13:07:22Z) - Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling [74.62570964142063]
Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions.
We propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods.
Our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream.
arXiv Detail & Related papers (2023-08-03T16:18:32Z) - Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild [19.5702895176141]
We propose a method for capturing discriminative features within each frame.
We utilize the CNN to translate each frame into a visual feature sequence.
Experiments indicate that our method provides an effective way to make use of the spatial and temporal dependencies.
arXiv Detail & Related papers (2022-05-10T08:47:15Z) - Multi-Stage Raw Video Denoising with Adversarial Loss and Gradient Mask [14.265454188161819]
We propose a learning-based approach for denoising raw videos captured under low lighting conditions.
We first explicitly align the neighboring frames to the current frame using a convolutional neural network (CNN).
We then fuse the registered frames using another CNN to obtain the final denoised frame.
arXiv Detail & Related papers (2021-03-04T06:57:48Z)