Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition
- URL: http://arxiv.org/abs/2508.00391v1
- Date: Fri, 01 Aug 2025 07:40:39 GMT
- Title: Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition
- Authors: Guanjie Huang, Danny H. K. Tsang, Shan Yang, Guangzhi Lei, Li Liu
- Abstract summary: Cued Speech (CS) is a visual communication system that combines lip-reading with hand coding to facilitate communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI-driven methods. We propose the first collaborative multi-agent system for ACSR, named Cued-Agent.
- Score: 17.451829471077858
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Cued Speech (CS) is a visual communication system that combines lip-reading with hand coding to facilitate communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI-driven methods. Traditionally, the temporal asynchrony between hand and lip movements requires the design of complex modules to facilitate effective multimodal fusion. However, constrained by limited data availability, current methods demonstrate insufficient capacity for adequately training these fusion mechanisms, resulting in suboptimal performance. Recently, multi-agent systems have shown promising capabilities in handling complex tasks with limited data availability. To this end, we propose the first collaborative multi-agent system for ACSR, named Cued-Agent. It integrates four specialized sub-agents: a Multimodal Large Language Model-based Hand Recognition agent that employs keyframe screening and CS expert prompt strategies to decode hand movements, a pretrained Transformer-based Lip Recognition agent that extracts lip features from the input video, a Hand Prompt Decoding agent that dynamically integrates hand prompts with lip features during inference in a training-free manner, and a Self-Correction Phoneme-to-Word agent that enables post-process and end-to-end conversion from phoneme sequences to natural language sentences for the first time through semantic refinement. To support this study, we expand the existing Mandarin CS dataset by collecting data from eight hearing-impaired cuers, establishing a mixed dataset of fourteen subjects. Extensive experiments demonstrate that our Cued-Agent performs superbly in both normal and hearing-impaired scenarios compared with state-of-the-art methods. The implementation is available at https://github.com/DennisHgj/Cued-Agent.
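For readers skimming the abstract, the sketch below shows, in schematic Python, how the four sub-agents described above could be wired together. All class, function, and variable names here are illustrative placeholders rather than the released code; the actual implementation is at https://github.com/DennisHgj/Cued-Agent.

```python
# Minimal, illustrative orchestration of the four sub-agents described in the
# abstract. Every name and interface here is a hypothetical placeholder.
from dataclasses import dataclass
from typing import List


@dataclass
class HandPrompt:
    """Hand-cue hypotheses produced by the hand recognition agent for one keyframe."""
    frame_index: int
    candidates: List[str]  # e.g. candidate consonant groups for this keyframe


def hand_recognition_agent(video_frames: List[str]) -> List[HandPrompt]:
    """MLLM-based agent (placeholder): screen keyframes, then decode hand cues
    with a CS-expert prompt. Here we simply emit a dummy hypothesis per keyframe."""
    keyframes = video_frames[::2]  # naive stand-in for keyframe screening
    return [HandPrompt(i, ["b", "p"]) for i, _ in enumerate(keyframes)]


def lip_recognition_agent(video_frames: List[str]) -> List[List[float]]:
    """Pretrained Transformer agent (placeholder): return per-frame lip features."""
    return [[0.0, 1.0, 0.0] for _ in video_frames]  # dummy feature vectors


def hand_prompt_decoding_agent(lip_feats, hand_prompts) -> List[str]:
    """Training-free fusion (placeholder): bias phoneme decoding with hand prompts."""
    return ["b", "a"] * max(1, len(hand_prompts) // 2)  # dummy phoneme sequence


def phoneme_to_word_agent(phonemes: List[str]) -> str:
    """Self-correction P2W agent (placeholder): map phonemes to a sentence and
    refine it semantically; a real system would call an LLM here."""
    return "".join(phonemes)


def cued_agent_pipeline(video_frames: List[str]) -> str:
    hand_prompts = hand_recognition_agent(video_frames)
    lip_feats = lip_recognition_agent(video_frames)
    phonemes = hand_prompt_decoding_agent(lip_feats, hand_prompts)
    return phoneme_to_word_agent(phonemes)


if __name__ == "__main__":
    print(cued_agent_pipeline([f"frame_{i}" for i in range(8)]))
```

The design point carried over from the abstract is that the hand prompts are injected at inference time, so the hand-lip fusion step itself requires no additional training data.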
Related papers
- AnyMAC: Cascading Flexible Multi-Agent Collaboration via Next-Agent Prediction [70.60422261117816]
We propose a new framework that rethinks multi-agent coordination through a sequential structure rather than a graph structure. Our method focuses on two key directions: (1) Next-Agent Prediction, which selects the most suitable agent role at each step, and (2) Next-Context Selection, which enables each agent to selectively access relevant information from any previous step.
arXiv Detail & Related papers (2025-06-21T18:34:43Z)
- Chain-of-Thought Training for Open E2E Spoken Dialogue Systems [57.77235760292348]
End-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information. We propose a chain-of-thought (CoT) formulation to ensure that training on conversational data remains closely aligned with the multimodal language model. Our method achieves over 1.5 ROUGE-1 improvement over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets.
arXiv Detail & Related papers (2025-05-31T21:43:37Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired. We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses a cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition [48.84506301960988]
Cued Speech (CS) is a pure visual coding method used by hearing-impaired people.
Automatic CS recognition (ACSR) seeks to transcribe visual cues of speech into text.
arXiv Detail & Related papers (2024-01-31T05:20:29Z)
- SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems [53.94772445896213]
Large Language Model (LLM)-based multi-agent systems have demonstrated promising performance in simulating human society.
We propose SpeechAgents, a multi-modal LLM based multi-agent system designed for simulating human communication.
arXiv Detail & Related papers (2024-01-08T15:01:08Z)
- TESS: A Multi-intent Parser for Conversational Multi-Agent Systems with Decentralized Natural Language Understanding Models [6.470108226184637]
Multi-agent systems complicate the natural language understanding of user intents.
We propose an efficient parsing and orchestration pipeline algorithm to service multi-intent utterances from the user.
arXiv Detail & Related papers (2023-12-19T03:39:23Z)
- Cross-Modal Mutual Learning for Cued Speech Recognition [10.225972737967249]
We propose a transformer-based cross-modal mutual learning framework to prompt multi-modal interaction.
Our model forces modality-specific information of different modalities to pass through a modality-invariant codebook (a generic sketch of this quantization step appears after this list).
We establish a novel large-scale multi-speaker CS dataset for Mandarin Chinese.
arXiv Detail & Related papers (2022-12-02T10:45:33Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
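As a companion to the Cross-Modal Mutual Learning entry above, the following sketch illustrates the general idea of a modality-invariant codebook: features from each modality are snapped to their nearest entry in a shared code table, so different modalities end up expressed over the same discrete tokens. This is a generic NumPy illustration with arbitrary sizes, not that paper's implementation.

```python
# Generic nearest-neighbour vector quantization against a shared codebook.
# It only illustrates routing modality-specific features through a
# modality-invariant codebook; sizes and inputs are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 128))  # 64 shared codes, 128-dim (arbitrary)


def quantize(features: np.ndarray):
    """Snap each feature vector in (T, 128) to its nearest codebook entry."""
    # Squared Euclidean distance between every feature and every code.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)  # discrete, modality-invariant tokens
    return codebook[indices], indices


# Lip and hand features from different encoders land in the same code space.
lip_tokens = quantize(rng.normal(size=(10, 128)))[1]
hand_tokens = quantize(rng.normal(size=(10, 128)))[1]
print(lip_tokens, hand_tokens)
```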