Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning
- URL: http://arxiv.org/abs/2508.14574v1
- Date: Wed, 20 Aug 2025 09:52:51 GMT
- Title: Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning
- Authors: Guilhem Fauré, Mostafa Sadeghi, Sam Bigeard, Slim Ouni
- Abstract summary: We propose two enhancements to the standard Progressive Transformers (PT) architecture. First, we encode poses using bone rotations in quaternion space and train with a geodesic loss to improve the accuracy and clarity of angular joint movements. Second, we introduce a contrastive loss to structure decoder embeddings by semantic similarity, using either gloss overlap or SBERT-based sentence similarity.
- Score: 7.740338361213371
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the main challenges in neural sign language production (SLP) lies in the high intra-class variability of signs, arising from signer morphology and stylistic variety in the training data. To improve robustness to such variations, we propose two enhancements to the standard Progressive Transformers (PT) architecture (Saunders et al., 2020). First, we encode poses using bone rotations in quaternion space and train with a geodesic loss to improve the accuracy and clarity of angular joint movements. Second, we introduce a contrastive loss to structure decoder embeddings by semantic similarity, using either gloss overlap or SBERT-based sentence similarity, aiming to filter out anatomical and stylistic features that do not convey relevant semantic information. On the Phoenix14T dataset, the contrastive loss alone yields a 16% improvement in Probability of Correct Keypoint over the PT baseline. When combined with quaternion-based pose encoding, the model achieves a 6% reduction in Mean Bone Angle Error. These results point to the benefit of incorporating skeletal structure modeling and semantically guided contrastive objectives on sign pose representations into the training of Transformer-based SLP models.
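The geodesic loss mentioned in the abstract measures angular distance between predicted and ground-truth bone rotations directly on the unit-quaternion manifold, rather than comparing raw coordinates. A minimal NumPy sketch of such a loss (the function name and array layout are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def quat_geodesic_loss(q_pred, q_true):
    """Mean geodesic angle (radians) between corresponding unit quaternions.

    q_pred, q_true: arrays of shape (..., 4) in (w, x, y, z) order.
    """
    # Normalize in case the network output is not exactly unit-norm.
    q_pred = q_pred / np.linalg.norm(q_pred, axis=-1, keepdims=True)
    q_true = q_true / np.linalg.norm(q_true, axis=-1, keepdims=True)
    # |dot| handles the double cover: q and -q encode the same rotation.
    dots = np.clip(np.abs(np.sum(q_pred * q_true, axis=-1)), 0.0, 1.0)
    return float(np.mean(2.0 * np.arccos(dots)))
```

Taking the absolute value of the dot product is what makes the loss well-behaved under the quaternion sign ambiguity: both q and -q incur zero loss against q, while a 90° rotation about any axis yields a loss of pi/2.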
Related papers
- A$^{2}$V-SLP: Alignment-Aware Variational Modeling for Disentangled Sign Language Production [0.9384603486206738]
A$^{2}$V-SLP learns articulator-wise disentangled latent distributions rather than deterministic embeddings. A disentangled Variational Autoencoder encodes ground-truth sign pose sequences and extracts articulator-specific mean and variance vectors.
arXiv Detail & Related papers (2026-02-12T12:07:32Z) - TIP: Resisting Gradient Inversion via Targeted Interpretable Perturbation in Federated Learning [8.156452885913108]
Federated Learning (FL) facilitates collaborative model training while preserving data locality. The exchange of gradients renders the system vulnerable to Gradient Inversion Attacks (GIAs). We propose Targeted Interpretable Perturbation (TIP), a novel defense framework that integrates model interpretability with frequency domain analysis.
arXiv Detail & Related papers (2026-02-12T06:32:49Z) - SA$^{2}$Net: Scale-Adaptive Structure-Affinity Transformation for Spine Segmentation from Ultrasound Volume Projection Imaging [21.660042213751794]
We propose a novel structure-aware network (SA$^{2}$Net) for effective spine segmentation. First, we propose a scale-adaptive complementary strategy to learn the cross-dimensional long-distance correlation features for spinal images. Second, we transform semantic features with class-specific affinity and combine them with a Transformer decoder for structure-aware reasoning.
arXiv Detail & Related papers (2025-10-30T14:58:16Z) - Addressing Gradient Misalignment in Data-Augmented Training for Robust Speech Deepfake Detection [60.515439134387755]
We propose a dual-path data-augmented (DPDA) training framework with gradient alignment for speech deepfake detection (SDD). In our framework, each training utterance is processed through two input paths: one using the original speech and the other its augmented version. Our method achieves up to an 18.69% relative reduction in Equal Error Rate on the In-the-Wild dataset compared to the baseline.
arXiv Detail & Related papers (2025-09-25T02:31:54Z) - Exploring Pose-based Sign Language Translation: Ablation Studies and Attention Insights [0.5277756703318045]
Sign Language Translation (SLT) has evolved significantly, moving from isolated recognition approaches to complex, continuous gloss-free translation systems. This paper explores the impact of pose-based data preprocessing techniques on SLT performance. We employ a transformer-based architecture, adapting a modified T5 encoder-decoder model to process pose representations.
arXiv Detail & Related papers (2025-07-02T09:36:26Z) - Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization [1.8024397171920885]
We train a pose autoencoder that encodes sign poses into a compact latent space using an articulator-based disentanglement strategy. Next, a non-autoregressive transformer decoder is trained to predict these latent representations from sentence-level text embeddings. Our approach does not rely on gloss supervision or pretrained models, and achieves state-of-the-art results on the PHOENIX14T and CSL-Daily datasets.
arXiv Detail & Related papers (2025-04-09T06:14:19Z) - It Takes Two: Accurate Gait Recognition in the Wild via Cross-granularity Alignment [72.75844404617959]
This paper proposes a novel cross-granularity alignment gait recognition method, named XGait.
To achieve this goal, the XGait first contains two branches of backbone encoders to map the silhouette sequences and the parsing sequences into two latent spaces.
Comprehensive experiments on two large-scale gait datasets show that XGait achieves Rank-1 accuracy of 80.5% on Gait3D and 88.3% on CCPG.
arXiv Detail & Related papers (2024-11-16T08:54:27Z) - PseudoNeg-MAE: Self-Supervised Point Cloud Learning using Conditional Pseudo-Negative Embeddings [55.55445978692678]
PseudoNeg-MAE enhances global feature representation of point cloud masked autoencoders by making them both discriminative and sensitive to transformations. We propose a novel loss that explicitly penalizes invariant collapse, enabling the network to capture richer transformation cues while preserving discriminative representations.
arXiv Detail & Related papers (2024-09-24T07:57:21Z) - Dual-scale Enhanced and Cross-generative Consistency Learning for Semi-supervised Medical Image Segmentation [49.57907601086494]
Medical image segmentation plays a crucial role in computer-aided diagnosis.
We propose a novel Dual-scale Enhanced and Cross-generative consistency learning framework for semi-supervised medical image segmentation (DEC-Seg).
arXiv Detail & Related papers (2023-12-26T12:56:31Z) - Domain Adaptive Synapse Detection with Weak Point Annotations [63.97144211520869]
We present AdaSyn, a framework for domain adaptive synapse detection with weak point annotations.
In the WASPSYN challenge at ISBI 2023, our method ranks 1st place.
arXiv Detail & Related papers (2023-08-31T05:05:53Z) - BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization [135.73436686653315]
We are dedicated to leveraging the BERT pre-training success and modeling the domain-specific statistics to fertilize the sign language recognition (SLR) model.
Considering the dominance of hand and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone.
Pre-training is performed via reconstructing the masked triplet unit from the corrupted input sequence.
It adaptively extracts the discrete pseudo label from the pose triplet unit, which represents the semantic gesture/body state.
arXiv Detail & Related papers (2023-02-10T06:23:44Z) - Translation Consistent Semi-supervised Segmentation for 3D Medical Images [25.126639911618994]
3D medical image segmentation methods have been successful, but their dependence on large amounts of voxel-level data is a disadvantage. Semi-supervised learning (SSL) solves this issue by training models with a large unlabelled and a small labelled dataset. We introduce Translation Consistent Co-training (TraCoCo), a consistency learning SSL method.
arXiv Detail & Related papers (2022-03-28T06:31:39Z) - Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z) - Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy.
MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.