LipLearner: Customizable Silent Speech Interactions on Mobile Devices
- URL: http://arxiv.org/abs/2302.05907v2
- Date: Tue, 14 Feb 2023 07:56:45 GMT
- Title: LipLearner: Customizable Silent Speech Interactions on Mobile Devices
- Authors: Zixiong Su, Shitao Fang, Jun Rekimoto
- Abstract summary: We leverage contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort.
Our model exhibits high robustness to different lighting, posture, and gesture conditions on an in-the-wild dataset.
A user study demonstrated that with LipLearner, users could define their own commands with high reliability guaranteed by an online incremental learning scheme.
- Score: 15.445920726854595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A silent speech interface is a promising technology that enables private
communication in natural language. However, previous approaches support only a
small and inflexible vocabulary, which limits expressiveness. We
leverage contrastive learning to learn efficient lipreading representations,
enabling few-shot command customization with minimal user effort. Our model
exhibits high robustness to different lighting, posture, and gesture conditions
on an in-the-wild dataset. For 25-command classification, an F1-score of 0.8947
is achievable using only one shot, and performance can be further boosted
by adaptively learning from more data. This generalizability allowed us to
develop a mobile silent speech interface empowered with on-device fine-tuning
and visual keyword spotting. A user study demonstrated that with LipLearner,
users could define their own commands with high reliability guaranteed by an
online incremental learning scheme. Subjective feedback indicated that our
system provides essential functionalities for customizable silent speech
interactions with high usability and learnability.
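To make the few-shot customization idea concrete, here is a minimal sketch of a nearest-prototype classifier over contrastive lipreading embeddings, with a running-mean update per command that loosely mirrors the online incremental learning idea. The encoder and all names are hypothetical placeholders for illustration, not the paper's released implementation.

```python
# Minimal sketch of few-shot command classification over lipreading embeddings.
# `encode_lip_sequence` is a hypothetical stand-in for a contrastive encoder;
# it is NOT part of any released LipLearner API.
import numpy as np

def encode_lip_sequence(frames: np.ndarray) -> np.ndarray:
    """Hypothetical encoder: maps a lip-region clip to an L2-normalized embedding."""
    emb = frames.reshape(-1).astype(np.float32)[:128]        # placeholder featurization
    emb = np.pad(emb, (0, max(0, 128 - emb.size)))
    return emb / (np.linalg.norm(emb) + 1e-8)

class FewShotCommandClassifier:
    """Nearest-prototype classifier: one (or more) enrollment shots per command."""

    def __init__(self):
        self.sums = {}    # command -> running sum of embeddings
        self.counts = {}  # command -> number of enrolled shots

    def enroll(self, command: str, frames: np.ndarray) -> None:
        emb = encode_lip_sequence(frames)
        self.sums[command] = self.sums.get(command, np.zeros_like(emb)) + emb
        self.counts[command] = self.counts.get(command, 0) + 1

    def predict(self, frames: np.ndarray):
        emb = encode_lip_sequence(frames)
        best, best_sim = None, -1.0
        for cmd, total in self.sums.items():
            proto = total / self.counts[cmd]
            proto = proto / (np.linalg.norm(proto) + 1e-8)
            sim = float(emb @ proto)                          # cosine similarity
            if sim > best_sim:
                best, best_sim = cmd, sim
        return best, best_sim

# Usage: enroll one shot per command, then classify new clips; confident
# predictions could be enrolled back in to adapt the prototypes over time.
clf = FewShotCommandClassifier()
clf.enroll("open camera", np.random.rand(30, 64, 64))
clf.enroll("play music", np.random.rand(30, 64, 64))
cmd, sim = clf.predict(np.random.rand(30, 64, 64))
```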
Related papers
- SpeechLess: Micro-utterance with Personalized Spatial Memory-aware Assistant in Everyday Augmented Reality [6.523396381538382]
SpeechLess is a wearable AR assistant that introduces a speech-based intent control paradigm grounded in personalized spatial memory.
Our results indicate that SpeechLess can improve everyday information access, reduce articulation effort, and support socially acceptable use without substantially degrading perceived usability or intent resolution accuracy across diverse everyday environments.
arXiv Detail & Related papers (2026-01-31T16:01:32Z) - BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs [84.59993864748195]
We propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation.
We introduce BatonVoice, a framework where an LLM acts as a "conductor", understanding user instructions.
A separate TTS model, the "orchestra", then generates the speech from these features.
arXiv Detail & Related papers (2025-09-30T16:52:14Z) - Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio [52.859261069569165]
We propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation.
We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or better than state-of-the-art models specialized for individual tasks.
arXiv Detail & Related papers (2025-08-28T06:51:42Z) - Real-Time Sign Language Gestures to Speech Transcription using Deep Learning [0.0]
This project introduces a real-time assistive technology solution that leverages advanced deep learning techniques to translate sign language gestures into textual and audible speech.
By employing convolutional neural networks (CNNs) trained on the Sign Language MNIST dataset, the system accurately classifies hand gestures captured live via webcam.
arXiv Detail & Related papers (2025-08-18T08:25:18Z) - ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models [70.56468982313834]
We propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody.
We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone.
arXiv Detail & Related papers (2025-07-27T00:59:01Z) - In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties [24.74769794165231]
We introduce a scalable framework that allows for in-context learning (ICL) in Phi-4 Multimodal.
We find that as few as 12 example utterances at inference time reduce word error rates by a relative 19.7%.
Overall, we find that our novel ICL adaptation scheme reveals a performance profile similar to that of human listeners.
arXiv Detail & Related papers (2025-05-20T20:20:37Z) - Towards Developmentally Plausible Rewards: Communicative Success as a Learning Signal for Interactive Language Models [49.22720751953838]
We propose a method for training language models in an interactive setting inspired by child language acquisition.
In our setting, a speaker attempts to communicate some information to a listener in a single-turn dialogue and receives a reward if communicative success is achieved.
arXiv Detail & Related papers (2025-05-09T11:48:36Z) - Self-Powered LLM Modality Expansion for Large Speech-Text Models [62.27700381806554]
Large language models (LLMs) exhibit remarkable performance across diverse tasks.
This study aims to refine the use of speech datasets for large speech-text model (LSM) training by addressing the limitations of vanilla instruction tuning.
We introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning.
arXiv Detail & Related papers (2024-10-04T04:34:24Z) - Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [152.41217651729738]
GPT-4o is an omni-modal model that enables vocal conversations with diverse emotions and tones.
We propose EMOVA to enable Large Language Models with end-to-end speech capabilities.
For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks.
arXiv Detail & Related papers (2024-09-26T16:44:02Z) - A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech
Enhancement [16.900731393703648]
Self-supervised learning (SSL) models have been found to be very effective for certain speech tasks.
In this paper, we investigate the uses of SSL representations for single-channel speech enhancement in challenging conditions.
arXiv Detail & Related papers (2024-03-03T02:05:17Z) - Instruction-Following Speech Recognition [21.591086644665197]
We introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions.
Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring Large Language Models or pre-trained speech modules.
arXiv Detail & Related papers (2023-09-18T14:59:10Z) - Bootstrapping meaning through listening: Unsupervised learning of spoken
sentence embeddings [4.582129557845177]
This study tackles the unsupervised learning of semantic representations for spoken utterances.
We propose WavEmbed, a sequential autoencoder that predicts hidden units from a dense representation of speech.
We also propose S-HuBERT to induce meaning through knowledge distillation.
arXiv Detail & Related papers (2022-10-23T21:16:09Z) - Learning to Decompose Visual Features with Latent Textual Prompts [140.2117637223449]
We propose Decomposed Feature Prompting (DeFo) to improve vision-language models.
Our empirical study shows that DeFo significantly improves vision-language models.
arXiv Detail & Related papers (2022-10-09T15:40:13Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen
Language Models [57.557319372969495]
Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks.
Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings.
We propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model.
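As a rough illustration of the prefix-embedding idea summarized above, the PyTorch sketch below projects pooled audio features into a language model's embedding space and prepends them to the text token embeddings. All module names and dimensions are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the prefix-embedding idea: audio features are projected into
# the language model's embedding space and prepended to text token embeddings.
import torch
import torch.nn as nn

class AudioPrefixAdapter(nn.Module):
    def __init__(self, audio_dim: int = 768, lm_dim: int = 1024, n_prefix: int = 8):
        super().__init__()
        self.proj = nn.Linear(audio_dim, lm_dim)     # trainable bridge to the LM space
        self.pool = nn.AdaptiveAvgPool1d(n_prefix)   # compress variable-length audio

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim), e.g. frame-level speech features
        x = self.pool(audio_feats.transpose(1, 2)).transpose(1, 2)  # (batch, n_prefix, audio_dim)
        return self.proj(x)                                          # (batch, n_prefix, lm_dim)

# Prepend the audio prefix embeddings to text embeddings before a (frozen) LM:
adapter = AudioPrefixAdapter()
audio_feats = torch.randn(2, 200, 768)   # stand-in for speech encoder outputs
text_embeds = torch.randn(2, 16, 1024)   # stand-in for LM token embeddings
lm_inputs = torch.cat([adapter(audio_feats), text_embeds], dim=1)
```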
arXiv Detail & Related papers (2022-03-29T19:08:55Z) - Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning
for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z) - Self-Supervised Learning for Personalized Speech Enhancement [25.05285328404576]
Speech enhancement systems can show improved performance by adapting the model towards a single test-time speaker.
A test-time user might provide only a small amount of noise-free speech data, likely insufficient for traditional fully supervised learning.
We propose self-supervised methods that are designed specifically to learn personalized and discriminative features from abundant in-the-wild noisy, but still personal speech recordings.
arXiv Detail & Related papers (2021-04-05T17:12:51Z) - Improving Device Directedness Classification of Utterances with Semantic
Lexical Features [9.43023411463796]
We propose a directedness classifier that combines semantic lexical features with a lightweight acoustic feature.
The mixed-domain lexical and acoustic feature model achieves a 14% relative reduction in EER over a state-of-the-art acoustic-only baseline model.
arXiv Detail & Related papers (2020-09-29T20:13:58Z)
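For intuition about combining lexical and acoustic evidence as in the entry above, here is a small illustrative sketch using a bag-of-words lexical representation concatenated with a single acoustic score and a linear classifier. The specific features and data are invented for the example and do not reflect the paper's exact setup.

```python
# Illustrative sketch: concatenate lexical (bag-of-words) and acoustic features
# for device-directedness classification with a simple linear model.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

transcripts = ["play the next song", "I think we should leave soon",
               "set a timer for ten minutes", "did you see that movie"]
acoustic = np.array([[0.9], [0.2], [0.8], [0.3]])   # stand-in acoustic directedness score
labels = np.array([1, 0, 1, 0])                     # 1 = device-directed, 0 = not

vec = CountVectorizer()
lexical = vec.fit_transform(transcripts).toarray()  # bag-of-words lexical features
features = np.hstack([lexical, acoustic])           # mixed-domain feature vector

clf = LogisticRegression().fit(features, labels)
new_lex = vec.transform(["stop the music"]).toarray()
print(clf.predict(np.hstack([new_lex, [[0.85]]])))  # classify a new utterance
```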
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.