Fugu-MT Paper Translation (Abstract): Towards Continuous Sign Language Conversation from Isolated Signs

Paper overview: Towards Continuous Sign Language Conversation from Isolated Signs

arxiv url: http://arxiv.org/abs/2605.14705v1
Date: Thu, 14 May 2026 11:22:27 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.791961
Title: Towards Continuous Sign Language Conversation from Isolated Signs
Title (reference translation): Toward continuous sign language conversation from isolated signs
Authors: Youngmin Kim, Kyobin Choo, Jiwoo Park, Minseo Kim, Chanyoung Kim, Junhyeok Kim, Seong Jae Hwang
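Abstract summary: This paper introduces SignaVox-W, which the authors describe as the largest labeled isolated-sign vocabulary to date, and SignaVox-U, a continuous 3D sign conversation dataset. Using the resulting data, the authors train SignaVox, a direct sign-to-sign conversational model that generates 3D body, hand, and facial motion responses from prior signing context.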
Reference score (independently computed attention score): 15.139358499214529
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Sign language is the primary language for many Deaf and Hard-of-Hearing (DHH) signers, yet most conversational AI systems still mediate interaction through spoken or written language. This spoken-language-centered interface can limit access for signers for whom spoken or written language is not the most accessible medium, motivating direct sign-to-sign conversational modeling. However, sentence-level sign video data are expensive to collect and annotate, leaving existing sign translation and production models with limited vocabulary coverage and weak open-domain generalization. We address this bottleneck by constructing continuous sign conversations from isolated signs: large-scale labeled isolated clips are collected as lexically grounded motion primitives and recomposed into sign-language-ordered utterances derived from existing dialogue corpora. We introduce SignaVox-W, which provides, to our knowledge, the largest labeled isolated-sign vocabulary to date, and SignaVox-U, a continuous 3D sign conversation dataset built from SignaVox-W. To bridge structural mismatch between spoken and signed languages, we use a retrieval-guided spoken-to-gloss translator; to bridge independently collected isolated clips, we propose BRAID, a diffusion Transformer that performs duration alignment and co-articulatory boundary inpainting. With the resulting data, we train SignaVox, a direct sign-to-sign conversational model that generates 3D body, hand, and facial motion responses from prior signing context without spoken-language text or externally provided glosses at inference time. Quantitative and qualitative evaluations show improved isolated-to-continuous motion quality, stronger response-level semantic alignment, and scalable signer-centered interaction that better supports visual-spatial articulation.
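Abstract (reference translation): Sign language is the primary language for many Deaf and Hard-of-Hearing (DHH) signers, yet most conversational AI systems still mediate interaction through speech or text. Such a spoken-language-centered interface can restrict access for signers for whom spoken or written language is not the most accessible medium, which motivates direct sign-to-sign conversational modeling. However, sentence-level sign video data are costly to collect and annotate, so existing sign translation and production models remain limited in vocabulary coverage and weak in open-domain generalization. The authors address this bottleneck by building continuous sign conversations from isolated signs: large-scale labeled isolated clips are collected as lexically grounded motion primitives and recomposed into sign-language-ordered utterances derived from existing dialogue corpora. They introduce SignaVox-W, which to their knowledge provides the largest labeled isolated-sign vocabulary to date, and SignaVox-U, a continuous 3D sign conversation dataset built from SignaVox-W. To bridge the structural mismatch between spoken and signed languages, they use a retrieval-guided spoken-to-gloss translator; to bridge independently collected isolated clips, they propose BRAID, a diffusion Transformer that performs duration alignment and co-articulatory boundary inpainting. With the resulting data, they train SignaVox, a direct sign-to-sign conversational model that generates 3D body, hand, and facial motion responses from prior signing context, without spoken-language text or externally provided glosses at inference time. Quantitative and qualitative evaluations show improved isolated-to-continuous motion quality, stronger response-level semantic alignment, and scalable signer-centered interaction that better supports visual-spatial articulation.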

Related papers