Fugu-MT Paper Translation (Summary): Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?

Paper summary: Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?

arxiv url: http://arxiv.org/abs/2410.24019v1
Date: Thu, 31 Oct 2024 15:20:50 GMT
Status: Translation complete
Last updated in system: 2024-11-28 17:07:42.877412
Title: Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?
Title (reference translation): Speech is more than words: do speech-to-text translation systems leverage prosody?
Authors: Ioannis Tsiamas, Matthias Sperber, Andrew Finch, Sarthak Garg,
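Abstract summary: Prosody is rarely studied within speech-to-text translation systems. End-to-end (E2E) systems have direct access to the speech signal when making translation decisions. A main challenge is the difficulty of evaluating prosody awareness in translation.
Reference score (independently computed attention score): 7.682929772871941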
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The prosody of a spoken utterance, including features like stress, intonation and rhythm, can significantly affect the underlying semantics, and as a consequence can also affect its textual translation. Nevertheless, prosody is rarely studied within the context of speech-to-text translation (S2TT) systems. In particular, end-to-end (E2E) systems have been proposed as well-suited for prosody-aware translation because they have direct access to the speech signal when making translation decisions, but the understanding of whether this is successful in practice is still limited. A main challenge is the difficulty of evaluating prosody awareness in translation. To address this challenge, we introduce an evaluation methodology and a focused benchmark (named ContraProST) aimed at capturing a wide range of prosodic phenomena. Our methodology uses large language models and controllable text-to-speech (TTS) to generate contrastive examples. Through experiments in translating English speech into German, Spanish, and Japanese, we find that (a) S2TT models possess some internal representation of prosody, but the prosody signal is often not strong enough to affect the translations, (b) E2E systems outperform cascades of speech recognition and text translation systems, confirming their theoretical advantage in this regard, and (c) certain cascaded systems also capture prosodic information in the translation, but only to a lesser extent that depends on the particulars of the transcript's surface form.
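Abstract (reference translation): The prosody of a spoken utterance, which includes features such as stress, intonation, and rhythm, strongly affects the underlying semantics and, as a result, its textual translation. Nevertheless, prosody is rarely studied within speech-to-text translation (S2TT) systems. End-to-end (E2E) systems in particular are considered well suited to prosody-aware translation because they can access the speech signal directly when making translation decisions, but understanding of whether this succeeds in practice is limited. A main challenge is the difficulty of evaluating prosody awareness in translation. To address this challenge, we introduce an evaluation methodology and a focused benchmark (ContraProST) aimed at capturing a wide range of prosodic phenomena. The methodology uses large language models and controllable text-to-speech (TTS) to generate contrastive examples. Through experiments translating English speech into German, Spanish, and Japanese, we find that (a) S2TT models have some internal representation of prosody, but the prosody signal is often not strong enough to affect the translations, (b) E2E systems outperform cascades of speech recognition and text translation systems, confirming their theoretical advantage, and (c) some cascaded systems also capture prosodic information in the translation, but only to a lesser extent that depends on the particulars of the transcript's surface form.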
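
The abstract describes evaluation with contrastive examples but does not spell out the scoring rule. The following is a minimal sketch of one common way such an evaluation can be scored, not the paper's actual metric: a pair counts as correct only if the model prefers the matching translation for each of the two prosodic variants. All names here (`ContrastivePair`, `pair_is_correct`, `contrastive_accuracy`, `toy_score`) and the toy scores are hypothetical and for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ContrastivePair:
    """Two prosodic variants of one sentence and their matching translations.

    Hypothetical container: audio fields are identifiers, not real audio.
    """
    audio_a: str
    audio_b: str
    translation_a: str
    translation_b: str


# Assumed scoring interface: higher means the model finds the translation
# more likely given the audio (for example, a log-likelihood).
ScoreFn = Callable[[str, str], float]


def pair_is_correct(pair: ContrastivePair, score: ScoreFn) -> bool:
    """True if the model prefers the matching translation for both variants."""
    prefers_a = score(pair.audio_a, pair.translation_a) > score(pair.audio_a, pair.translation_b)
    prefers_b = score(pair.audio_b, pair.translation_b) > score(pair.audio_b, pair.translation_a)
    return prefers_a and prefers_b


def contrastive_accuracy(pairs: Sequence[ContrastivePair], score: ScoreFn) -> float:
    """Fraction of pairs for which the model is correct on both variants."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    return sum(pair_is_correct(p, score) for p in pairs) / len(pairs)


# Toy scores (made up): the first pair is handled in a prosody-aware way,
# the second gets the same ranking regardless of which audio is given.
TOY_TABLE = {
    ("a1", "ta1"): -1.0, ("a1", "tb1"): -2.0,
    ("b1", "ta1"): -3.0, ("b1", "tb1"): -1.0,
    ("a2", "ta2"): -1.0, ("a2", "tb2"): -2.0,
    ("b2", "ta2"): -1.0, ("b2", "tb2"): -2.0,
}


def toy_score(audio: str, translation: str) -> float:
    return TOY_TABLE[(audio, translation)]


PAIRS = [
    ContrastivePair("a1", "b1", "ta1", "tb1"),
    ContrastivePair("a2", "b2", "ta2", "tb2"),
]

if __name__ == "__main__":
    print(contrastive_accuracy(PAIRS, toy_score))
```

Requiring both directions to be correct keeps a model that ignores prosody and always picks the same translation from scoring above zero on a pair; a looser per-variant accuracy would be an equally valid design choice.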

Related papers

Representation Purification for End-to-End Speech Translation [16.967317436711113]
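Speech-to-text translation (ST) is the task of converting speech into text in another language. We conceptualize speech representations as a combination of content-agnostic and content-relevant factors.
Reference translation of the paper (metadata) (2024-12-05T15:50:44Z)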
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
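We propose TransVIP, a new model framework that leverages diverse datasets in a cascade fashion. We propose two separate encoders to preserve the speaker's voice characteristics and the isochrony of the source speech during translation. Experiments on the French-English language pair show that our model outperforms the current state-of-the-art speech-to-speech translation model.
Reference translation of the paper (metadata) (2024-05-28T04:11:37Z)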
Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases [79.07111754406841]
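This work proposes using contrastive evaluation to assess the ability of direct S2TT systems to disambiguate utterances in which prosody plays a crucial role. The results clearly demonstrate the value of direct translation systems over cascade translation models.
Reference translation of the paper (metadata) (2024-02-01T14:46:35Z)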
Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting [66.02718577386426]
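We give a simple assessment of idiomatic translation and related issues. We conduct a synthetic experiment to reveal the point at which Transformer-based machine translation models correctly default to idiomatic translations. To improve the translation of natural idioms, we introduce two simple yet effective techniques.
Reference translation of the paper (metadata) (2023-10-10T23:47:25Z)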
Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
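This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation. Treating speech units as pseudo-text makes it possible to focus on the linguistic content of the speech. We show that the proposed UTUT model is effective not only for speech-to-speech translation (S2ST) but also for multilingual text-to-speech synthesis (T2S) and text-to-speech translation (T2ST).
Reference translation of the paper (metadata) (2023-08-03T15:47:04Z)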
Textless Direct Speech-to-Speech Translation with Discrete Speech Representation [27.182170555234226]
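This work proposes Textless Translatotron, a new model for training end-to-end direct S2ST models without text supervision. When a speech encoder pre-trained on unsupervised speech data is used for both models, the proposed model achieves translation quality nearly on par with Translatotron 2.
Reference translation of the paper (metadata) (2022-10-31T19:48:38Z)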
Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
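End-to-end (E2E) speech-to-text translation (ST) often relies on pre-training its encoder and/or decoder with source transcripts via speech recognition or text translation tasks. This paper examines how far the quality of E2E ST trained on speech-translation pairs alone can be improved.
Reference translation of the paper (metadata) (2022-06-09T15:39:19Z)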
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
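TranSpeech is a speech-to-speech translation model with bilateral perturbation. We build a non-autoregressive S2ST technique that repeatedly masks and predicts unit choices. TranSpeech significantly improves inference latency, achieving up to a 21.4x speedup over the autoregressive technique.
Reference translation of the paper (metadata) (2022-05-25T06:34:14Z)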

The related papers list is generated automatically from the titles and abstracts of papers on this site.

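This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from its use.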