Generation and Detection of Sign Language Deepfakes - A Linguistic and Visual Analysis
- URL: http://arxiv.org/abs/2404.01438v2
- Date: Mon, 17 Feb 2025 18:22:03 GMT
- Title: Generation and Detection of Sign Language Deepfakes - A Linguistic and Visual Analysis
- Authors: Shahzeb Naeem, Muhammad Riyyan Khan, Usman Tariq, Abhinav Dhall, Carlos Ivan Colon, Hasan Al-Nashash
- Abstract summary: This research explores the positive application of deepfake technology for upper body generation, specifically sign language for the Deaf and Hard of Hearing (DHoH) community. We construct a reliable deepfake dataset, evaluating its technical and visual credibility using computer vision and natural language processing models. The dataset, consisting of over 1200 videos featuring both seen and unseen individuals, is also used to detect deepfake videos targeting vulnerable individuals.
- Score: 6.189190729240752
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This research explores the positive application of deepfake technology for upper body generation, specifically sign language for the Deaf and Hard of Hearing (DHoH) community. Given the complexity of sign language and the scarcity of experts, the generated videos are vetted by a sign language expert for accuracy. We construct a reliable deepfake dataset, evaluating its technical and visual credibility using computer vision and natural language processing models. The dataset, consisting of over 1200 videos featuring both seen and unseen individuals, is also used to detect deepfake videos targeting vulnerable individuals. Expert annotations confirm that the generated videos are comparable to real sign language content. Linguistic analysis, using textual similarity scores and interpreter evaluations, shows that the interpretation of generated videos is at least 90% similar to authentic sign language. Visual analysis demonstrates that convincingly realistic deepfakes can be produced, even for new subjects. Using a pose/style transfer model, we pay close attention to detail, ensuring hand movements are accurate and align with the driving video. We also apply machine learning algorithms to establish a baseline for deepfake detection on this dataset, contributing to the detection of fraudulent sign language videos.
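The linguistic analysis above compares an interpreter's transcription of a generated clip against the transcription of its authentic counterpart. The exact similarity metric is not spelled out in this summary, so the following is a minimal illustrative sketch assuming TF-IDF cosine similarity; the function name and example sentences are placeholders, not the authors' evaluation code.

```python
# Illustrative sketch only: one common way to score textual similarity between
# an interpreter's transcription of an authentic clip and of its generated
# counterpart. The paper's exact metric is not specified in this summary.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def interpretation_similarity(real_text: str, generated_text: str) -> float:
    """Return a 0-1 TF-IDF cosine similarity between two transcriptions."""
    tfidf = TfidfVectorizer(lowercase=True).fit_transform([real_text, generated_text])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

if __name__ == "__main__":
    real = "the weather tomorrow will be sunny with a light wind"
    fake = "tomorrow the weather will be sunny with light wind"
    print(f"similarity: {interpretation_similarity(real, fake):.2f}")
```

Under a metric of this kind, scores close to 1.0 for matched real/generated pairs would be consistent with the reported interpretation similarity of at least 90%.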
Related papers
- Detecting Deepfake Talking Heads from Facial Biometric Anomalies [12.369423169349673]
Deepfake video impersonations are often used to power frauds, scams, and political disinformation. We propose a novel machine learning technique for the detection of deepfake video impersonations that leverages unnatural patterns in facial biometrics. We evaluate this technique across a large dataset of deepfake techniques and impersonations, and assess its robustness to video laundering and its generalization to previously unseen video deepfake generators.
arXiv Detail & Related papers (2025-07-11T16:29:25Z) - Tell me Habibi, is it Real or Fake? [15.344187517040508]
Code-switching, especially between Arabic and English, is common in the Arab world and is widely used in digital communication. We introduce ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset featuring intra-utterance code-switching, dialectal variation, and monolingual Arabic content. Our dataset is generated using a novel pipeline integrating four Text-To-Speech and two lip-sync models, enabling comprehensive analysis of multilingual multimodal deepfake detection.
arXiv Detail & Related papers (2025-05-28T16:54:36Z) - ExDDV: A New Dataset for Explainable Deepfake Detection in Video [23.169975307069066]
We introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video.
We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies.
Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos.
arXiv Detail & Related papers (2025-03-18T16:55:07Z) - Understanding Audiovisual Deepfake Detection: Techniques, Challenges, Human Factors and Perceptual Insights [49.81915942821647]
Deep Learning has been successfully applied in diverse fields, and its impact on deepfake detection is no exception.
Deepfakes are fake yet realistic synthetic content that can be used deceitfully for political impersonation, phishing, slandering, or spreading misinformation.
This paper aims to improve the effectiveness of deepfake detection strategies and guide future research in cybersecurity and media integrity.
arXiv Detail & Related papers (2024-11-12T09:02:11Z) - Scaling up Multimodal Pre-training for Sign Language Understanding [96.17753464544604]
Sign language serves as the primary means of communication for the deaf-mute community.
To facilitate communication between the deaf-mute and hearing people, a series of sign language understanding (SLU) tasks have been studied.
These tasks investigate sign language topics from diverse perspectives and raise challenges in learning effective representation of sign language videos.
arXiv Detail & Related papers (2024-08-16T06:04:25Z) - Sign Language Recognition Based On Facial Expression and Hand Skeleton [2.5879170041667523]
We propose a sign language recognition network that integrates hand skeleton features with facial expressions.
By incorporating facial expression information, the accuracy and robustness of sign language recognition are improved.
arXiv Detail & Related papers (2024-07-02T13:02:51Z) - Unmasking Illusions: Understanding Human Perception of Audiovisual Deepfakes [49.81915942821647]
This paper aims to evaluate the human ability to discern deepfake videos through a subjective study.
We present our findings by comparing human observers to five state-of-the-art audiovisual deepfake detection models.
We found that all AI models performed better than humans when evaluated on the same 40 videos.
arXiv Detail & Related papers (2024-05-07T07:57:15Z) - Testing MediaPipe Holistic for Linguistic Analysis of Nonmanual Markers in Sign Languages [0.6691151987390147]
We compare MediaPipe Holistic (MPH) to an older solution (OpenFace, OF).
MPH does not perform well enough for linguistic analysis of eyebrow movement.
We reiterate a previous proposal to train additional correction models to overcome these limitations.
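Since this entry evaluates how reliably off-the-shelf trackers capture nonmanual markers such as eyebrow movement, a rough sketch of the kind of per-frame signal under study is shown below. It extracts eyebrow landmarks with MediaPipe Holistic; the landmark indices and video path are illustrative assumptions, not the paper's protocol.

```python
# Sketch: per-frame eyebrow landmark extraction with MediaPipe Holistic.
# The landmark indices below are common face-mesh eyebrow points; they are an
# illustrative assumption, not the exact set analysed in the paper.
import cv2
import mediapipe as mp

EYEBROW_IDX = [70, 63, 105, 66, 107, 336, 296, 334, 293, 300]

def eyebrow_tracks(video_path: str):
    tracks = []
    holistic = mp.solutions.holistic.Holistic(refine_face_landmarks=True)
    cap = cv2.VideoCapture(video_path)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        result = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.face_landmarks is None:
            tracks.append(None)  # tracking failed on this frame
            continue
        lms = result.face_landmarks.landmark
        tracks.append([(lms[i].x, lms[i].y) for i in EYEBROW_IDX])
    cap.release()
    holistic.close()
    return tracks
```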
arXiv Detail & Related papers (2024-03-15T14:59:21Z) - Video as the New Language for Real-World Decision Making [100.68643056416394]
Video data captures important information about the physical world that is difficult to express in language.
Video can serve as a unified interface that can absorb internet knowledge and represent diverse tasks.
We identify major impact opportunities in domains such as robotics, self-driving, and science.
arXiv Detail & Related papers (2024-02-27T02:05:29Z) - Vulnerability of Automatic Identity Recognition to Audio-Visual
Deepfakes [13.042731289687918]
We present SWAN-DF, the first realistic audio-visual deepfake database in which lips and speech are well synchronized.
We demonstrate the vulnerability of a state-of-the-art speaker recognition system, the ECAPA-TDNN-based model from SpeechBrain.
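As a rough illustration of the vulnerability being probed, the sketch below scores a deepfake clip against a genuine enrolment recording using SpeechBrain's pretrained ECAPA-TDNN verifier; the file paths are hypothetical, and a high acceptance score for such a pair would indicate the weakness described above.

```python
# Sketch: probing a pretrained speaker verifier with a deepfake test clip.
# File paths are hypothetical; if the verifier accepts the (real enrolment,
# deepfake test) pair with a high score, the identity system is vulnerable.
from speechbrain.pretrained import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_ecapa",
)
score, decision = verifier.verify_files("enrolment_real.wav", "test_deepfake.wav")
print(f"score={float(score):.3f}, accepted={bool(decision)}")
```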
arXiv Detail & Related papers (2023-11-29T14:18:04Z) - DiffSLVA: Harnessing Diffusion Models for Sign Language Video
Anonymization [33.18321022815901]
We introduce DiffSLVA, a novel methodology for text-guided sign language video anonymization.
We develop a specialized module dedicated to capturing facial expressions, which are critical for conveying linguistic information in signed languages.
This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications.
arXiv Detail & Related papers (2023-11-27T18:26:19Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In
Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science on persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z) - Neural Sign Reenactor: Deep Photorealistic Sign Language Retargeting [28.012212656892746]
We introduce a neural rendering pipeline for transferring the facial expressions, head pose, and body movements of one person in a source video to another in a target video.
Our method can be used for Sign Language Anonymization, Sign Language Production (synthesis module), as well as for reenacting other types of full body activities.
arXiv Detail & Related papers (2022-09-03T18:04:50Z) - Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z) - Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded
Language from Percepts and Raw Speech [26.076534338576234]
Learning to understand grounded language, which connects natural language to percepts, is a critical research area.
In this work we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs.
arXiv Detail & Related papers (2021-12-27T16:12:30Z) - VidLanKD: Improving Language Understanding via Video-Distilled Knowledge
Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z) - Skeleton Based Sign Language Recognition Using Whole-body Keypoints [71.97020373520922]
Sign language is used by deaf or speech-impaired people to communicate.
Skeleton-based recognition is becoming popular because it can be further ensembled with RGB-D-based methods to achieve state-of-the-art performance.
Inspired by the recent development of whole-body pose estimation (Jin et al., 2020), we propose recognizing sign language based on whole-body key points and features.
arXiv Detail & Related papers (2021-03-16T03:38:17Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - A Comprehensive Study on Deep Learning-based Methods for Sign Language
Recognition [14.714669469867871]
The aim of the present study is to provide insights on sign language recognition, focusing on mapping non-segmented video streams to glosses.
To the best of our knowledge, this is the first sign language dataset where sentence and gloss level annotations are provided for a video capture.
arXiv Detail & Related papers (2020-07-24T14:07:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.