Fugu-MT Paper Translation (Abstract): CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment

Paper abstract: CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment

arxiv url: http://arxiv.org/abs/2303.05725v4
Date: Wed, 12 Apr 2023 10:07:11 GMT
Status: Translation complete
Last updated in system: 2023-04-13 17:51:58.706734
Title: CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment
Title (reference translation): CVT-SLR: Contrastive visual-textual transformation for sign language recognition using variational alignment
Authors: Jiangbin Zheng, Yile Wang, Cheng Tan, Siyuan Li, Ge Wang, Jun Xia, Yidong Chen, Stan Z. Li
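Abstract summary: Sign language recognition (SLR) is a weakly supervised task that annotates sign language videos with textual glosses. Recent studies show that insufficient training, caused by the lack of large-scale sign language datasets, is the main bottleneck for SLR. To fully explore the pretrained knowledge of both the visual and language modalities, we propose CVT-SLR, a novel contrastive visual-textual transformation for SLR.
Reference score (independently calculated attention score): 42.10603331311837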
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses. Recent studies show that insufficient training caused by the lack of large-scale available sign datasets becomes the main bottleneck for SLR. Most SLR works thereby adopt pretrained visual modules and develop two mainstream solutions. The multi-stream architectures extend multi-cue visual features, yielding the current SOTA performances but requiring complex designs and might introduce potential noise. Alternatively, the advanced single-cue SLR frameworks using explicit cross-modal alignment between visual and textual modalities are simple and effective, potentially competitive with the multi-cue framework. In this work, we propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities. Based on the single-cue cross-modal alignment framework, we propose a variational autoencoder (VAE) for pretrained contextual knowledge while introducing the complete pretrained language module. The VAE implicitly aligns visual and textual modalities while benefiting from pretrained contextual knowledge as the traditional contextual module. Meanwhile, a contrastive cross-modal alignment algorithm is designed to explicitly enhance the consistency constraints. Extensive experiments on public datasets (PHOENIX-2014 and PHOENIX-2014T) demonstrate that our proposed CVT-SLR consistently outperforms existing single-cue methods and even outperforms SOTA multi-cue methods.
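Abstract (reference translation): Sign language recognition (SLR) is a weakly supervised task that annotates sign language videos with textual glosses. Recent studies show that insufficient training, caused by the lack of large-scale sign language datasets, is the main bottleneck for SLR. Most SLR work therefore adopts pretrained visual modules and has developed two mainstream solutions. Multi-stream architectures extend multi-cue visual features and achieve the current SOTA performance, but they require complex designs and may introduce noise. Alternatively, advanced single-cue SLR frameworks that use explicit cross-modal alignment between the visual and textual modalities are simple and effective, and potentially competitive with multi-cue frameworks. In this work, we propose CVT-SLR, a novel contrastive visual-textual transformation for SLR, to fully explore the pretrained knowledge of both the visual and language modalities. Building on the single-cue cross-modal alignment framework, we propose a variational autoencoder (VAE) for pretrained contextual knowledge while introducing a complete pretrained language module. The VAE implicitly aligns the visual and textual modalities while benefiting from pretrained contextual knowledge, as a traditional contextual module does. Meanwhile, a contrastive cross-modal alignment algorithm is designed to explicitly strengthen the consistency constraints. Extensive experiments on public datasets (PHOENIX-2014 and PHOENIX-2014T) show that the proposed CVT-SLR consistently outperforms existing single-cue methods and even outperforms SOTA multi-cue methods.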

Related papers