Generation and Detection of Sign Language Deepfakes - A Linguistic and Visual Analysis
- URL: http://arxiv.org/abs/2404.01438v2
- Date: Mon, 17 Feb 2025 18:22:03 GMT
- Title: Generation and Detection of Sign Language Deepfakes - A Linguistic and Visual Analysis
- Authors: Shahzeb Naeem, Muhammad Riyyan Khan, Usman Tariq, Abhinav Dhall, Carlos Ivan Colon, Hasan Al-Nashash
- Abstract summary: This research explores the positive application of deepfake technology for upper body generation, specifically sign language for the Deaf and Hard of Hearing (DHoH) community.
We construct a reliable deepfake dataset, evaluating its technical and visual credibility using computer vision and natural language processing models.
The dataset, consisting of over 1200 videos featuring both seen and unseen individuals, is also used to detect deepfake videos targeting vulnerable individuals.
- Score: 6.189190729240752
- Abstract: This research explores the positive application of deepfake technology for upper body generation, specifically sign language for the Deaf and Hard of Hearing (DHoH) community. Given the complexity of sign language and the scarcity of experts, the generated videos are vetted by a sign language expert for accuracy. We construct a reliable deepfake dataset, evaluating its technical and visual credibility using computer vision and natural language processing models. The dataset, consisting of over 1200 videos featuring both seen and unseen individuals, is also used to detect deepfake videos targeting vulnerable individuals. Expert annotations confirm that the generated videos are comparable to real sign language content. Linguistic analysis, using textual similarity scores and interpreter evaluations, shows that the interpretation of generated videos is at least 90% similar to authentic sign language. Visual analysis demonstrates that convincingly realistic deepfakes can be produced, even for new subjects. Using a pose/style transfer model, we pay close attention to detail, ensuring hand movements are accurate and align with the driving video. We also apply machine learning algorithms to establish a baseline for deepfake detection on this dataset, contributing to the detection of fraudulent sign language videos.
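The abstract reports textual similarity scores of at least 90% between interpretations of generated and authentic videos. As a rough illustration of that kind of scoring, the sketch below compares an interpreter's transcript of a generated video against a reference transcript; the word-level difflib ratio is an assumption for illustration, since the paper does not specify its exact metric.

```python
from difflib import SequenceMatcher

def transcript_similarity(reference: str, interpreted: str) -> float:
    """Return a [0, 1] similarity score between two transcripts.

    Hypothetical metric for illustration; the paper reports scores of
    at least 0.90 but does not commit to this exact measure.
    """
    ref = reference.lower().split()
    hyp = interpreted.lower().split()
    return SequenceMatcher(None, ref, hyp).ratio()

# Example: a generated video whose interpretation closely matches the source text.
print(transcript_similarity(
    "the weather will be sunny tomorrow",
    "the weather will be sunny tomorrow afternoon",
))
```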
Related papers
- Signs as Tokens: An Autoregressive Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs.
To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts.
These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets.
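A minimal sketch of the vector-quantization step such a decoupled tokenizer might use: continuous per-part pose features are mapped to the nearest codebook entry, yielding discrete token ids that can be appended to an LM vocabulary. The part names, codebook sizes, and feature dimensions below are assumptions, not SOKE's actual configuration.

```python
import numpy as np

# Hypothetical per-part codebooks (entries x feature_dim), e.g. learned by VQ training.
codebooks = {
    "body":       np.random.randn(256, 64),
    "left_hand":  np.random.randn(256, 64),
    "right_hand": np.random.randn(256, 64),
}

def quantize(part: str, feature: np.ndarray) -> int:
    """Map a continuous pose feature to the id of its nearest codebook entry."""
    distances = np.linalg.norm(codebooks[part] - feature, axis=1)
    return int(np.argmin(distances))

def sign_tokens(frame_features: dict) -> list:
    """Turn one frame's per-part features into tokens addable to an LM vocabulary."""
    return [f"<{part}_{quantize(part, feat)}>" for part, feat in frame_features.items()]

frame = {part: np.random.randn(64) for part in codebooks}
print(sign_tokens(frame))  # e.g. ['<body_17>', '<left_hand_203>', '<right_hand_88>']
```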
arXiv Detail & Related papers (2024-11-26T18:28:09Z)
- Hindi audio-video-Deepfake (HAV-DF): A Hindi language-based Audio-video Deepfake Dataset [11.164272928464879]
Fake videos or speeches in Hindi can have an enormous impact on rural and semi-urban communities.
This paper aims to create the first Hindi deepfake dataset, named "Hindi audio-video-Deepfake" (HAV-DF).
arXiv Detail & Related papers (2024-11-23T05:18:43Z)
- Unmasking Illusions: Understanding Human Perception of Audiovisual Deepfakes [49.81915942821647]
This paper aims to evaluate the human ability to discern deepfake videos through a subjective study.
We present our findings by comparing human observers to five state-of-the-art audiovisual deepfake detection models.
We found that all AI models performed better than humans when evaluated on the same 40 videos.
arXiv Detail & Related papers (2024-05-07T07:57:15Z)
- Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes [13.042731289687918]
We present SWAN-DF, the first realistic audio-visual database of deepfakes, in which lips and speech are well synchronized.
We demonstrate the vulnerability of a state-of-the-art speaker recognition system, the ECAPA-TDNN-based model from SpeechBrain.
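This kind of vulnerability probe can be reproduced with SpeechBrain's pretrained ECAPA-TDNN verifier; the sketch below scores a deepfake utterance against genuine enrollment audio. The file paths are placeholders, and the attack succeeds when the verifier accepts the spoofed probe.

```python
from speechbrain.pretrained import SpeakerRecognition

# Pretrained ECAPA-TDNN speaker verifier from SpeechBrain (VoxCeleb-trained).
verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Compare a genuine enrollment recording against a deepfake probe.
score, prediction = verifier.verify_files("enroll_real.wav", "probe_deepfake.wav")
print(float(score), bool(prediction))  # prediction=True means the spoof is accepted
```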
arXiv Detail & Related papers (2023-11-29T14:18:04Z)
- DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization [33.18321022815901]
We introduce DiffSLVA, a novel methodology for text-guided sign language video anonymization.
We develop a specialized module dedicated to capturing facial expressions, which are critical for conveying linguistic information in signed languages.
This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications.
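DiffSLVA's full pipeline combines diffusion with a dedicated facial-expression module; as a loose, simplified sketch of text-guided appearance change, one can run an off-the-shelf image-to-image diffusion pipeline per video frame. The model id, prompt, and strength below are placeholders, and this omits the paper's preservation of linguistic features.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def anonymize_frame(frame: Image.Image, prompt: str) -> Image.Image:
    # Low strength keeps pose and hand shape while altering identity cues.
    return pipe(prompt=prompt, image=frame, strength=0.35).images[0]

frame = Image.open("frame_0001.png").convert("RGB")
out = anonymize_frame(frame, "a different person signing, photorealistic")
out.save("anon_0001.png")
```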
arXiv Detail & Related papers (2023-11-27T18:26:19Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for a crucial task in computational social science: persuasion strategy identification.
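The verbalize-then-reason recipe can be approximated with any image captioner plus a text-only model: caption sampled frames, join the captions into a "story", and run downstream tasks on the text. The captioning model below is an assumption for illustration, not the paper's exact component.

```python
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def verbalize(frame_paths: list) -> str:
    """Caption sampled frames and join them into a textual 'story' of the video."""
    captions = [captioner(path)[0]["generated_text"] for path in frame_paths]
    return " ".join(f"Scene {i + 1}: {c}." for i, c in enumerate(captions))

story = verbalize(["shot_00.jpg", "shot_10.jpg", "shot_20.jpg"])
# The story can now be fed to any text-only model for zero-shot video understanding.
print(story)
```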
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Neural Sign Reenactor: Deep Photorealistic Sign Language Retargeting [28.012212656892746]
We introduce a neural rendering pipeline for transferring the facial expressions, head pose, and body movements of one person in a source video to another in a target video.
Our method can be used for Sign Language Anonymization, Sign Language Production (synthesis module), as well as for reenacting other types of full body activities.
arXiv Detail & Related papers (2022-09-03T18:04:50Z)
- Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
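A minimal InfoNCE-style sketch of the contrastive paradigm described above: embeddings of the same identity (e.g. a moving-face clip and an audio segment) are pulled together while other identities in the batch are pushed apart. The temperature and embedding sizes are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Contrastive loss: each anchor embedding should match the positive
    embedding of the same identity and repel all other identities in the batch."""
    anchors = F.normalize(anchors, dim=1)
    positives = F.normalize(positives, dim=1)
    logits = anchors @ positives.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(anchors.size(0))       # diagonal = matching identity
    return F.cross_entropy(logits, targets)

a = torch.randn(8, 128)  # e.g. moving-face embeddings
p = torch.randn(8, 128)  # e.g. audio-segment embeddings of the same identities
print(info_nce(a, p).item())
```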
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
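A minimal sketch of the distillation step behind this kind of teacher-to-student transfer: the student is trained to match the teacher's softened output distribution alongside the usual hard-label loss. The temperature and weighting below are illustrative assumptions, not VidLanKD's actual objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL distillation with the standard hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(4, 30522)            # student vocabulary logits
t = torch.randn(4, 30522)            # teacher logits (from the video-text model)
y = torch.randint(0, 30522, (4,))    # hard labels
print(distillation_loss(s, t, y).item())
```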
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- Skeleton Based Sign Language Recognition Using Whole-body Keypoints [71.97020373520922]
Sign language is used by deaf or speech-impaired people to communicate.
Skeleton-based recognition is becoming popular because it can be further ensembled with RGB-D based methods to achieve state-of-the-art performance.
Inspired by the recent development of whole-body pose estimation (Jin et al., 2020), we propose recognizing sign language based on whole-body keypoints and features.
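Whole-body keypoints of this kind can be obtained from off-the-shelf estimators; the sketch below uses MediaPipe Holistic as a stand-in for the paper's estimator, extracting pose and hand landmarks from one frame and flattening them into a feature vector for a downstream classifier.

```python
import cv2
import mediapipe as mp
import numpy as np

mp_holistic = mp.solutions.holistic

def frame_keypoints(bgr_frame: np.ndarray) -> np.ndarray:
    """Extract pose + both-hand landmarks and flatten to one feature vector.
    Missing parts (e.g. an occluded hand) are zero-filled."""
    with mp_holistic.Holistic(static_image_mode=True) as holistic:
        results = holistic.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    parts = [
        (results.pose_landmarks, 33),        # body joints
        (results.left_hand_landmarks, 21),   # left-hand joints
        (results.right_hand_landmarks, 21),  # right-hand joints
    ]
    feats = []
    for landmarks, n in parts:
        if landmarks is None:
            feats.append(np.zeros(n * 3))
        else:
            feats.append(np.array([[p.x, p.y, p.z] for p in landmarks.landmark]).ravel())
    return np.concatenate(feats)  # shape (225,) per frame

frame = cv2.imread("sign_frame.png")
print(frame_keypoints(frame).shape)
```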
arXiv Detail & Related papers (2021-03-16T03:38:17Z)
- A Comprehensive Study on Deep Learning-based Methods for Sign Language Recognition [14.714669469867871]
The aim of the present study is to provide insights on sign language recognition, focusing on mapping non-segmented video streams to glosses.
To the best of our knowledge, this is the first sign language dataset where sentence and gloss level annotations are provided for a video capture.
arXiv Detail & Related papers (2020-07-24T14:07:01Z)