Synthetic Speaking Children -- Why We Need Them and How to Make Them
- URL: http://arxiv.org/abs/2311.06307v1
- Date: Wed, 8 Nov 2023 22:58:22 GMT
- Title: Synthetic Speaking Children -- Why We Need Them and How to Make Them
- Authors: Muhammad Ali Farooq and Dan Bigioi and Rishabh Jain and Wang Yao and
Mariam Yiwere and Peter Corcoran
- Abstract summary: We show how StyleGAN2 can be finetuned to create a gender-balanced dataset of children's faces.
By combining generative text-to-speech models for child voice synthesis with a 3D landmark-based talking-heads pipeline, we can generate highly realistic, entirely synthetic video clips of talking children.
- Score: 3.1367597377725502
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contemporary Human-Computer Interaction (HCI) research relies primarily on
neural network models for machine vision and speech understanding of a system
user. Such models require extensively annotated training datasets for optimal
performance, and when building interfaces for users from a vulnerable population
such as young children, GDPR introduces significant complexities in data
collection, management, and processing. Motivated by the training needs of an
Edge AI smart toy platform, this research explores the latest advances in
generative neural technologies and provides a working proof of concept of a
controllable data generation pipeline for speech-driven facial training data at
scale. In this context, we demonstrate how StyleGAN2 can be finetuned to create
a gender-balanced dataset of children's faces. This dataset includes a variety
of controllable factors such as facial expressions, age variations, facial
poses, and even speech-driven animations with realistic lip synchronization. By
combining generative text-to-speech models for child voice synthesis with a 3D
landmark-based talking-heads pipeline, we can generate highly realistic,
entirely synthetic video clips of talking children. These clips provide
valuable, controllable synthetic training data for neural network models,
bridging the gap when real data is scarce or restricted by privacy
regulations.
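The pipeline's first stage is transfer learning: a StyleGAN2 generator pretrained on adult faces is finetuned on a smaller child-face set. Below is a minimal sketch of such a finetuning loop using StyleGAN2's non-saturating logistic loss; the tiny fully connected networks and the checkpoint path are illustrative stand-ins, not the paper's actual StyleGAN2 models or weights.

```python
# Minimal sketch of finetuning a pretrained GAN on a new face domain,
# using StyleGAN2's non-saturating logistic loss. The toy fully connected
# Generator/Discriminator and the checkpoint path are illustrative
# stand-ins, not the paper's actual StyleGAN2 networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):  # stand-in for StyleGAN2 G
    def __init__(self, z_dim=64, img_dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, img_dim), nn.Tanh())

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):  # stand-in for StyleGAN2 D
    def __init__(self, img_dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1))

    def forward(self, x):
        return self.net(x)

G, D = Generator(), Discriminator()
# In practice, load adult-face (e.g. FFHQ) pretrained weights here:
# G.load_state_dict(torch.load("ffhq_generator.pt"))  # hypothetical path
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.0, 0.99))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.0, 0.99))

def finetune_step(real):
    z = torch.randn(real.size(0), 64)
    # Discriminator update: push real scores up, fake scores down
    fake = G(z).detach()
    loss_d = F.softplus(-D(real)).mean() + F.softplus(D(fake)).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator update: non-saturating loss on fresh fakes
    loss_g = F.softplus(-D(G(z))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

real_batch = torch.rand(8, 3 * 32 * 32) * 2 - 1  # placeholder child-face batch
print(finetune_step(real_batch))
```

In the full pipeline, the finetuned generator is then sampled to build the gender-balanced child-face dataset that the speech-driven talking-heads stage animates.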
Related papers
- Speech Emotion Recognition under Resource Constraints with Data Distillation [64.36799373890916]
Speech emotion recognition (SER) plays a crucial role in human-computer interaction.
The emergence of edge devices in the Internet of Things presents challenges in constructing intricate deep learning models.
We propose a data distillation framework to facilitate efficient development of SER models in IoT applications.
arXiv Detail & Related papers (2024-06-21T13:10:46Z)
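To make the data distillation idea above concrete, here is a minimal gradient-matching sketch, one common formulation of dataset distillation; the cited paper's exact method may differ, and all dimensions and data below are placeholders. A small learnable synthetic set is optimized so that a model's gradients on it match its gradients on real data, yielding a compact training set for small on-device SER models.

```python
# Minimal dataset-distillation sketch via gradient matching (a generic
# formulation, not necessarily the cited paper's method). Synthetic
# examples are optimized so the model's gradient on them matches its
# gradient on real data. All shapes and data are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(40, 32), nn.ReLU(), nn.Linear(32, 4))  # toy SER classifier

real_x, real_y = torch.randn(64, 40), torch.randint(0, 4, (64,))  # placeholder speech features
syn_x = torch.randn(8, 40, requires_grad=True)  # 2 learnable distilled examples per class
syn_y = torch.arange(4).repeat(2)
opt = torch.optim.SGD([syn_x], lr=0.1)

def grads(x, y):
    loss = F.cross_entropy(model(x), y)
    return torch.autograd.grad(loss, model.parameters(), create_graph=True)

for step in range(100):
    g_real = [g.detach() for g in grads(real_x, real_y)]
    g_syn = grads(syn_x, syn_y)
    # Match gradients layer-wise; the distilled set then trains small models cheaply
    match = sum(F.mse_loss(gs, gr) for gs, gr in zip(g_syn, g_real))
    opt.zero_grad(); match.backward(); opt.step()
```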
- ChildDiffusion: Unlocking the Potential of Generative AI and Controllable Augmentations for Child Facial Data using Stable Diffusion and Large Language Models [1.1470070927586018]
The framework is validated by rendering high-quality child faces representing ethnicity data, micro expressions, face pose variations, eye blinking effects, different hair colours and styles, aging, and multiple child subjects of different genders in a single frame.
The proposed method circumvents common issues encountered in generative AI tools, such as temporal inconsistency and limited control over the rendered outputs.
arXiv Detail & Related papers (2024-06-17T14:37:14Z)
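The controllable augmentations described above can be approximated with attribute-templated prompts driving an off-the-shelf text-to-image pipeline. The sketch below is in that spirit only, not the authors' released code; the base model ID, attribute lists, and prompt template are assumptions.

```python
# Rough sketch of attribute-controlled synthetic face generation with an
# off-the-shelf Stable Diffusion pipeline (not the ChildDiffusion code).
# The model ID and attribute templates are assumptions for illustration.
import itertools
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # assumed base model
).to("cuda")

ages = ["4 year old", "8 year old", "12 year old"]
expressions = ["smiling", "surprised", "neutral expression"]
poses = ["frontal view", "three-quarter view"]

# Cartesian product of attributes gives a controllable, balanced grid of samples
for i, (age, expr, pose) in enumerate(itertools.product(ages, expressions, poses)):
    prompt = f"studio portrait photo of a {age} child, {expr}, {pose}, photorealistic"
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"child_{i:03d}.png")  # attribute-tagged synthetic sample
```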
- Video as the New Language for Real-World Decision Making [100.68643056416394]
Video data captures important information about the physical world that is difficult to express in language.
Video can serve as a unified interface that can absorb internet knowledge and represent diverse tasks.
We identify major impact opportunities in domains such as robotics, self-driving, and science.
arXiv Detail & Related papers (2024-02-27T02:05:29Z)
- Learning Human Action Recognition Representations Without Real Humans [66.61527869763819]
We present a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model.
We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks.
Our approach outperforms previous baselines by up to 5%.
arXiv Detail & Related papers (2023-11-10T18:38:14Z)
- Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
- Image Captions are Natural Prompts for Text-to-Image Models [70.30915140413383]
We analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts.
We propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data.
Our method significantly improves the performance of models trained on synthetic training data.
arXiv Detail & Related papers (2023-07-17T14:38:11Z)
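The core recipe above is simple enough to sketch: captions describing real images become generation prompts, and the source image's class label carries over to the synthetic sample. The captions, labels, and model ID below are illustrative assumptions, not the paper's data.

```python
# Sketch of captions-as-prompts synthetic data generation: each (caption,
# label) pair yields a labeled training image. Captions, labels, and the
# model ID here are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

captioned = [  # stand-ins for captions drawn from real labeled images
    ("a tabby cat sleeping on a windowsill in the sun", "cat"),
    ("a golden retriever catching a frisbee on a lawn", "dog"),
]
for i, (caption, label) in enumerate(captioned):
    # The caption conditions the generation; the label supervises downstream training
    pipe(caption).images[0].save(f"synth_{label}_{i}.png")
```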
- Procedural Humans for Computer Vision [1.9550079119934403]
We build a parametric model of the face and body, including articulated hands, to generate realistic images of humans based on this body model.
We show that this can be extended to include the full body by building on the pipeline of Wood et al. to generate synthetic images of humans in their entirety.
arXiv Detail & Related papers (2023-01-03T15:44:48Z)
- Knowledge Transfer For On-Device Speech Emotion Recognition with Neural Structured Learning [19.220263739291685]
Speech emotion recognition (SER) has been a popular research topic in human-computer interaction (HCI).
We propose a neural structured learning (NSL) framework through building synthesized graphs.
Our experiments demonstrate that training a lightweight SER model on the target dataset with speech samples and graphs can not only produce small SER models, but also enhance the model performance.
arXiv Detail & Related papers (2022-10-26T18:38:42Z)
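The graph-based idea above sketches compactly: alongside the task loss, an extra term pulls together the embeddings of samples linked in the synthesized graph. This is a generic graph-regularization sketch in the spirit of neural structured learning, not the paper's exact setup; the features, labels, edges, and weight alpha are placeholders.

```python
# Generic graph-regularized training sketch (in the spirit of neural
# structured learning, not the paper's exact method). Samples linked in a
# synthesized graph are encouraged to have similar embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(40, 32), nn.ReLU())
head = nn.Linear(32, 4)  # toy classifier over 4 emotion classes
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(32, 40)                        # placeholder speech features
y = torch.randint(0, 4, (32,))
edges = [(i, i + 1) for i in range(0, 31, 2)]  # synthesized graph: pairs of similar samples
alpha = 0.1                                    # weight of the graph regularizer

for step in range(200):
    z = encoder(x)
    task_loss = F.cross_entropy(head(z), y)
    # Penalize embedding distance between graph neighbors
    graph_loss = torch.stack([(z[i] - z[j]).pow(2).sum() for i, j in edges]).mean()
    loss = task_loss + alpha * graph_loss
    opt.zero_grad(); loss.backward(); opt.step()
```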
- Fake It Till You Make It: Face analysis in the wild using synthetic data alone [9.081019005437309]
We show that it is possible to perform face-related computer vision in the wild using synthetic data alone.
We describe how to combine a procedurally-generated 3D face model with a comprehensive library of hand-crafted assets to render training images with unprecedented realism.
arXiv Detail & Related papers (2021-09-30T13:07:04Z)
- Visual Distant Supervision for Scene Graph Generation [66.10579690929623]
Scene graph models usually require supervised learning on large quantities of labeled data with intensive human annotation.
We propose visual distant supervision, a novel paradigm of visual relation learning, which can train scene graph models without any human-labeled data.
Comprehensive experimental results show that our distantly supervised model outperforms strong weakly supervised and semi-supervised baselines.
arXiv Detail & Related papers (2021-03-29T06:35:24Z)
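A toy sketch of the distant-supervision step above: a small knowledge base of (subject class, predicate, object class) triples assigns candidate relation labels to object-detector outputs, with no human relation annotation. The KB entries and detections below are illustrative assumptions; the real approach must also cope with the noise such labels carry.

```python
# Toy distant supervision for visual relations: knowledge-base triples
# auto-label detected object pairs. KB entries and detections here are
# illustrative assumptions, not the paper's data.
kb = {("person", "dog"): "walking", ("person", "horse"): "riding",
      ("cup", "table"): "on"}

detections = ["person", "dog", "table", "cup"]  # classes from an object detector

pseudo_labels = []
for subj in detections:
    for obj in detections:
        if subj != obj and (subj, obj) in kb:
            pseudo_labels.append((subj, kb[(subj, obj)], obj))

print(pseudo_labels)  # noisy triples used to train the scene graph model
# -> [('person', 'walking', 'dog'), ('cup', 'on', 'table')]
```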
- A Visuospatial Dataset for Naturalistic Verb Learning [18.654373173232205]
We introduce a new dataset for training and evaluating grounded language models.
Our data is collected within a virtual reality environment and is designed to emulate the quality of language data to which a pre-verbal child is likely to have access.
We use the collected data to compare several distributional semantics models for verb learning.
arXiv Detail & Related papers (2020-10-28T20:47:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.