Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings
- URL: http://arxiv.org/abs/2512.20204v1
- Date: Tue, 23 Dec 2025 09:56:23 GMT
- Title: Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings
- Authors: Marko Čechovič, Natália Komorníková, Dominik Macháček, Ondřej Bojar
- Abstract summary: We present a corpus of cross-lingual dialogues facilitated by automatic simultaneous speech translation. The corpus consists of 5 hours of speech recordings with ASR and gold transcripts in 12 original languages and automatic and corrected translations into English. For an overview of this task and its complexity, we attempt to quantify misunderstandings in cross-lingual meetings.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech processing and translation technology have the potential to facilitate meetings of individuals who do not share any common language. To evaluate automatic systems for such a task, a versatile and realistic evaluation corpus is needed. Therefore, we create and present a corpus of cross-lingual dialogues between individuals without a common language who were facilitated by automatic simultaneous speech translation. The corpus consists of 5 hours of speech recordings with ASR and gold transcripts in 12 original languages and automatic and corrected translations into English. For the purposes of research into cross-lingual summarization, our corpus also includes written summaries (minutes) of the meetings. Moreover, we propose automatic detection of misunderstandings. For an overview of this task and its complexity, we attempt to quantify misunderstandings in cross-lingual meetings. We annotate misunderstandings manually and also test the ability of current large language models to detect them automatically. The results show that the Gemini model is able to identify text spans with misunderstandings with recall of 77% and precision of 47%.
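The abstract reports span-level detection quality (recall of 77%, precision of 47% for the Gemini model). As a minimal sketch of how such span-level scores can be computed, the snippet below matches predicted spans against gold-annotated spans by character overlap. The overlap criterion, the helper names, and the toy data are illustrative assumptions, not the paper's actual evaluation protocol.

```python
# Minimal sketch (not the paper's protocol): span-level precision/recall
# for misunderstanding detection. Spans are (start, end) character offsets;
# a predicted span counts as correct if it overlaps any gold span.

def overlaps(a, b):
    """True if two (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def span_precision_recall(gold, predicted):
    """Precision: fraction of predicted spans overlapping some gold span.
    Recall: fraction of gold spans overlapped by some predicted span."""
    if not predicted or not gold:
        return 0.0, 0.0
    tp_pred = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
    tp_gold = sum(1 for g in gold if any(overlaps(g, p) for p in predicted))
    return tp_pred / len(predicted), tp_gold / len(gold)

# Toy example: 3 gold misunderstanding spans, 4 model predictions.
gold = [(10, 25), (40, 55), (80, 95)]
pred = [(12, 20), (50, 60), (70, 78), (100, 110)]
p, r = span_precision_recall(gold, pred)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Overlap-based matching is a lenient criterion; stricter variants (e.g. requiring a minimum overlap ratio) would lower both scores.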
Related papers
- Characterizing Language Use in a Collaborative Situated Game [47.38055058236005]
We collect a corpus of 11.5 hours of spoken human dialogue in the co-op mode of the popular Portal 2 virtual puzzle game. We analyze player language and behavior, identifying a number of linguistic phenomena that rarely appear in most existing chitchat or task-oriented dialogue corpora.
arXiv Detail & Related papers (2025-12-03T02:29:53Z) - LASPA: Language Agnostic Speaker Disentanglement with Prefix-Tuned Cross-Attention [2.199918533021483]
The overlap between vocal traits such as accent, vocal anatomy, and a language's phonetic structure complicates separating linguistic and speaker information. Disentangling these components can significantly improve speaker recognition accuracy. We propose a novel disentanglement learning strategy that integrates joint learning through prefix-tuned cross-attention.
arXiv Detail & Related papers (2025-06-02T10:59:31Z) - Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. It remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech.
arXiv Detail & Related papers (2025-05-26T07:21:20Z) - A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z) - ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z) - Multilingual Auxiliary Tasks Training: Bridging the Gap between Languages for Zero-Shot Transfer of Hate Speech Detection Models [3.97478982737167]
We show how hate speech detection models benefit from a cross-lingual knowledge proxy brought by auxiliary tasks fine-tuning.
We propose to train on multilingual auxiliary tasks to improve zero-shot transfer of hate speech detection models across languages.
arXiv Detail & Related papers (2022-10-24T08:26:51Z) - Contextual Semantic Parsing for Multilingual Task-Oriented Dialogues [7.8378818005171125]
Given a large-scale dialogue data set in one language, we can automatically produce an effective semantic parser for other languages using machine translation.
We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values.
We show that the succinct representation reduces the compounding effect of translation errors.
arXiv Detail & Related papers (2021-11-04T01:08:14Z) - A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - Lost in Interpreting: Speech Translation from Source or Interpreter? [0.0]
We release 10 hours of recordings and transcripts of European Parliament speeches in English, with simultaneous interpreting into Czech and German.
We evaluate quality and latency of speaker-based and interpreter-based spoken translation systems from English to Czech.
arXiv Detail & Related papers (2021-06-17T09:32:49Z) - UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z) - Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.