Fugu-MT Paper Translation (Abstract): Corpus Augmentation for Sign Language Translation via LLM-Guided Video Stitching

Paper overview: Corpus Augmentation for Sign Language Translation via LLM-Guided Video Stitching

arxiv url: http://arxiv.org/abs/2606.11925v1
Date: Wed, 10 Jun 2026 10:56:48 GMT
Status: Translation complete
System update date: 2026-06-11 16:42:38.423109
Title: Corpus Augmentation for Sign Language Translation via LLM-Guided Video Stitching
Title (reference translation): Corpus Augmentation for Sign Language Translation via LLM-Guided Video Stitching
Authors: Zsolt Robotka, Ádám Rák, Jalal Al-Afandi, András Horváth, György Cserey
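Abstract summary: Sign language translation holds promise for improving accessibility and enabling communication between signing and non-signing communities. Large weakly-aligned datasets have enabled pre-training at scale, and gloss-free methods have reduced reliance on expert annotation. We propose a corpus augmentation approach that requires no additional human annotation, external sign-language video corpora, or generative video models. Our augmentation, applied within the same framework, achieves +2.92 BLEU-4 without any change to architecture or training protocol.
Reference score (independently computed attention score): 0.16792862237830142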
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sign language translation (SLT) converts sign language video into spoken language text and holds significant promise for improving accessibility and enabling communication between signing and non-signing communities. While large weakly-aligned datasets have enabled pre-training at scale and gloss-free methods have reduced reliance on expert annotation, high-quality parallel sign video-text pairs for fine-tuning remain scarce, limiting generalisation on long-tail vocabulary and unseen constructions. We propose a corpus augmentation approach that requires no additional human annotation, external sign-language video corpora, or generative video models, relying only on the existing gloss-annotated training corpus and an LLM for sentence generation: per-gloss clips are extracted from training videos via CTC forced-alignment, novel gloss-sentence pairs are generated by a corpus-anchored LLM, and synthetic sequences are assembled through random sentence sampling and clip assignment. The resulting synthetic RGB video-text pairs are architecture-agnostic at the downstream training stage and can be consumed directly by RGB-based SLT models, or converted into pose or feature representations by pipelines that derive such inputs from video. Sincan et al. re-evaluated five recent gloss-free methods under strictly identical conditions; the largest verified gain over the GFSLT-VLP baseline was only 0.98 BLEU-4. Our augmentation, applied within the same framework, achieves +2.92 BLEU-4 without any change to architecture or training protocol. We further identify that synthetic data harms vision-language pretraining despite improving its objectives, and that optimising clip transitions for visual smoothness is counter-productive under L2-based criteria; we propose that abrupt boundaries may act as a form of implicit regularisation. Code is available at https://github.com/robizso/slt-datagen.
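Abstract (reference translation): Sign language translation (SLT) converts sign language video into spoken language text and holds significant promise for improving accessibility and enabling communication between signing and non-signing communities. Large weakly-aligned datasets have enabled pre-training at scale, and gloss-free methods have reduced reliance on expert annotation, but high-quality parallel sign video-text pairs remain scarce, which limits generalisation to long-tail vocabulary and unseen constructions. We propose a corpus augmentation approach that requires no additional human annotation, external sign-language video corpora, or generative video models: per-gloss clips are extracted from training videos via CTC forced alignment, novel gloss-sentence pairs are generated by a corpus-anchored LLM, and synthetic sequences are assembled through random sentence sampling and clip assignment. The resulting synthetic RGB video-text pairs are architecture-agnostic at the downstream training stage: they can be consumed directly by RGB-based SLT models, or converted into pose or feature representations by pipelines that derive such inputs from video. Sincan et al. found that the largest verified gain over the GFSLT-VLP baseline was only 0.98 BLEU-4. Our augmentation, applied within the same framework, achieves +2.92 BLEU-4 without any change to architecture or training protocol. We further find that synthetic data harms vision-language pretraining despite improving its objectives, and that optimising clip transitions for visual smoothness is counter-productive under L2-based criteria; we propose that abrupt boundaries may act as a form of implicit regularisation. Code is available at https://github.com/robizso/slt-datagen.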
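The assembly step described in the abstract (random sentence sampling and clip assignment) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the authors' repository: clips are represented only as `(video_id, start, end)` references, the LLM output is assumed to be a list of (gloss sequence, sentence) pairs, and all function names, glosses, and sentences are hypothetical toy data.

```python
import random
from collections import defaultdict


def build_clip_bank(aligned_videos):
    """Map each gloss to every (video_id, start, end) span where it was aligned."""
    bank = defaultdict(list)
    for video_id, glosses, spans in aligned_videos:
        for gloss_idx, start, end in spans:
            bank[glosses[gloss_idx]].append((video_id, start, end))
    return dict(bank)


def stitch(glosses, text, bank, rng):
    """Assign one randomly chosen clip per gloss; None if a gloss has no clip."""
    if any(g not in bank for g in glosses):
        return None
    return {"text": text, "clips": [rng.choice(bank[g]) for g in glosses]}


def sample_synthetic_pairs(generated, bank, n, seed=0):
    """Draw n sentences at random (with replacement) and stitch each one."""
    rng = random.Random(seed)
    # Keep only generated sentences whose glosses all have at least one clip.
    usable = [(g, t) for g, t in generated if all(x in bank for x in g)]
    if not usable:
        return []
    pairs = []
    for _ in range(n):
        glosses, text = rng.choice(usable)
        pairs.append(stitch(glosses, text, bank, rng))
    return pairs


# Hypothetical alignment output: (video id, gloss sequence, per-gloss spans).
aligned = [
    ("vid_a", ["WEATHER", "TOMORROW", "RAIN"], [(0, 0, 12), (1, 12, 30), (2, 30, 41)]),
    ("vid_b", ["TOMORROW", "SUN"], [(0, 3, 20), (1, 20, 35)]),
]
# Hypothetical LLM output: (gloss sequence, spoken-language sentence).
generated = [
    (["TOMORROW", "RAIN"], "It will rain tomorrow."),
    (["WEATHER", "SUN"], "The weather is sunny."),
    (["SNOW"], "It is snowing."),  # dropped: no clip exists for SNOW
]

bank = build_clip_bank(aligned)
pairs = sample_synthetic_pairs(generated, bank, n=4, seed=0)
for pair in pairs:
    print(pair["text"], pair["clips"])
```

To produce actual video, the referenced frame ranges would be read and concatenated in order. The sketch leaves the boundaries as hard cuts, in line with the abstract's observation that smoothing clip transitions was counter-productive.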
