Dialogs Re-enacted Across Languages
- URL: http://arxiv.org/abs/2211.11584v2
- Date: Thu, 13 Jul 2023 02:01:13 GMT
- Title: Dialogs Re-enacted Across Languages
- Authors: Nigel G. Ward, Jonathan E. Avila, Emilia Rivas, Divette Marco
- Abstract summary: We present a protocol for collecting closely matched pairs of utterances across languages.
This report is intended for: people using this corpus, people extending this corpus, and people designing similar collections of bilingual dialog data.
- Score: 2.5425323889482336
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: To support machine learning of cross-language prosodic mappings and other
ways to improve speech-to-speech translation, we present a protocol for
collecting closely matched pairs of utterances across languages, a description
of the resulting data collection and its public release, and some observations
and musings. This report is intended for: people using this corpus, people
extending this corpus, and people designing similar collections of bilingual
dialog data.
Related papers
- Towards a Deep Understanding of Multilingual End-to-End Speech Translation [52.26739715012842]
We analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages.
We derive three major findings from our analysis.
(2023-10-31T13:50:55Z)
- Enhancing Cross-lingual Transfer via Phonemic Transcription Integration [57.109031654219294]
PhoneXL is a framework incorporating phonemic transcriptions as an additional linguistic modality for cross-lingual transfer.
Our pilot study reveals phonemic transcription provides essential information beyond the orthography to enhance cross-lingual transfer.
(2023-07-10T06:17:33Z)
- Towards cross-language prosody transfer for dialog [3.3758186776249928]
Speech-to-speech translation systems do not adequately support use for dialog purposes.
In particular, nuances of speaker intent and stance can be lost due to improper prosody transfer.
We develop a data collection protocol in which bilingual speakers re-enact utterances from an earlier conversation in their other language.
(2023-07-09T08:32:14Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
(2022-11-07T13:35:16Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
(2022-02-19T11:55:40Z)
- GupShup: An Annotated Corpus for Abstractive Summarization of Open-Domain Code-Switched Conversations [28.693328393260906]
We introduce abstractive summarization of Hindi-English code-switched conversations and develop the first code-switched conversation summarization dataset.
GupShup contains over 6,831 conversations in Hindi-English and their corresponding human-annotated summaries in English and Hindi-English.
We train state-of-the-art abstractive summarization models and report their performances using both automated metrics and human evaluation.
(2021-04-17T15:42:01Z)
- The Multilingual TEDx Corpus for Speech Recognition and Translation [30.993199499048824]
We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and speech translation (ST) research across many non-English source languages.
The corpus is a collection of audio recordings from TEDx talks in 8 source languages.
We segment transcripts into sentences and align them to the source-language audio and target-language translations.
(2021-02-02T21:16:25Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
(2020-10-28T12:33:04Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
(2020-04-30T16:25:39Z)
- Mapping Languages: The Corpus of Global Language Use [0.0]
This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping.
In total, the corpus contains 423 billion words representing 148 languages and 158 countries.
(2020-04-02T03:42:14Z)
- Improving cross-lingual model transfer by chunking [2.4967521096920686]
We present a guided cross-lingual model transfer approach to address the syntactic differences between source and target languages.
We treat the chunks or phrases in a sentence as transfer units, so that differences in the ordering of words within phrases and in the ordering of phrases within a sentence are addressed separately.
(2020-02-27T14:02:31Z)