Attention and Encoder-Decoder based models for transforming articulatory
movements at different speaking rates
- URL: http://arxiv.org/abs/2006.03107v2
- Date: Thu, 20 Aug 2020 05:00:07 GMT
- Title: Attention and Encoder-Decoder based models for transforming articulatory
movements at different speaking rates
- Authors: Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh
- Abstract summary: We propose an encoder-decoder architecture using LSTMs which generates smoother predicted articulatory trajectories.
We analyze the amplitude of the transformed articulatory movements at different rates compared to their original counterparts.
We observe that AstNet could model both duration and extent of articulatory movements better than the existing transformation techniques.
- Score: 60.02121449986413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While speaking at different rates, articulators (like tongue, lips) tend to
move differently and the enunciations are also of different durations. In the
past, affine transformation and DNN have been used to transform articulatory
movements from neutral to fast (N2F) and neutral to slow (N2S) speaking rates
[1]. In this work, we improve over the existing transformation techniques by
modeling rate specific durations and their transformation using AstNet, an
encoder-decoder framework with attention. In the current work, we propose an
encoder-decoder architecture using LSTMs which generates smoother predicted
articulatory trajectories. For modeling duration variations across speaking
rates, we deploy an attention network, which eliminates the need to align
trajectories at different rates using DTW. We perform a phoneme-specific
duration analysis to examine how well duration is transformed using the
proposed AstNet. As the range of articulatory motions is correlated with
speaking rate, we also analyze the amplitude of the transformed articulatory
movements at different rates compared to their original counterparts, to
examine how well the proposed AstNet predicts the extent of articulatory
movements in N2F and N2S. We observe that AstNet could model both duration and
extent of articulatory movements better than the existing transformation
techniques resulting in more accurate transformed articulatory trajectories.
Related papers
- Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning [30.51005522218133]
We introduce a novel Spiking Tucker Fusion Transformer (STFT) for audio-visual zero-shot learning (ZSL)
The STFT leverages the temporal and semantic information from different time steps to generate robust representations.
We propose a global-local pooling (GLP) which combines the max and average pooling operations.
arXiv Detail & Related papers (2024-07-11T02:01:26Z) - A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection [7.202931445597171]
We present a novel network that detects actions in untrimmed videos.
The network encodes the locations of action semantics in video frames utilizing motion-aware 2D positional encoding.
The approach outperforms the state-of-the-art solutions on four proposed datasets.
arXiv Detail & Related papers (2024-05-13T21:47:35Z) - Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms.
SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics.
Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
arXiv Detail & Related papers (2024-03-22T14:47:18Z) - Audio2Gestures: Generating Diverse Gestures from Audio [28.026220492342382]
We propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code.
Our method generates more realistic and diverse motions than previous state-of-the-art methods.
arXiv Detail & Related papers (2023-01-17T04:09:58Z) - Diverse Dance Synthesis via Keyframes with Transformer Controllers [10.23813069057791]
We propose a novel keyframe-based motion generation network based on multiple constraints, which can achieve diverse dance synthesis via learned knowledge.
The backbone of our network is a hierarchical RNN module composed of two long short-term memory (LSTM) units, in which the first LSTM is utilized to embed the posture information of the historical frames into a latent space.
Our framework contains two Transformer-based controllers, which are used to model the constraints of the root trajectory and the velocity factor respectively.
arXiv Detail & Related papers (2022-07-13T00:56:46Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose
Estimation in Video [75.23812405203778]
Recent solutions have been introduced to estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation.
We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to model inter-joint spatial correlation.
In addition, the network output is extended from the central frame to the entire frames of the input video, improving the coherence between the input and output sequences.
arXiv Detail & Related papers (2022-03-02T04:20:59Z) - Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy.
MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z) - Augmented Transformer with Adaptive Graph for Temporal Action Proposal
Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping a snippet actionness loss and a front block, dubbed augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z) - Robust Motion In-betweening [17.473287573543065]
We present a novel, robust transition generation technique that can serve as a new tool for 3D animators.
The system synthesizes high-quality motions that use temporally-sparse keyframes as animation constraints.
We present a custom MotionBuilder plugin that uses our trained model to perform in-betweening in production scenarios.
arXiv Detail & Related papers (2021-02-09T16:52:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.