Building and curating conversational corpora for diversity-aware
language science and technology
- URL: http://arxiv.org/abs/2203.03399v2
- Date: Thu, 10 Mar 2022 09:01:56 GMT
- Title: Building and curating conversational corpora for diversity-aware language science and technology
- Authors: Andreas Liesenfeld, Mark Dingemanse
- Abstract summary: We build a maximally natural data set of conversational interaction that covers 66 languages and varieties from 32 phyla.
We describe the curation and compilation process moving from diverse language documentation corpora to a unified format.
We conclude with two case studies of how diverse data sets can inform interactional linguistics and speech recognition technology.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a pipeline and tools to build a maximally natural data set of
conversational interaction that covers 66 languages and varieties from 32
phyla. We describe the curation and compilation process moving from diverse
language documentation corpora to a unified format and describe an open-source
tool "convo-parse" to help in quality control and assessment of conversational
data. We conclude with two case studies of how diverse data sets can inform
interactional linguistics and speech recognition technology and thus contribute
to broadening the empirical foundations of language sciences and technologies
of the future.
Related papers
- A Survey on Spoken Italian Datasets and Corpora [0.3222802562733787]
This survey provides a comprehensive analysis of 66 spoken Italian datasets.
The datasets are categorized by speech type, source and context, and demographic and linguistic features.
Challenges related to dataset scarcity, representativeness, and accessibility are discussed.
arXiv Detail & Related papers (2025-01-11T14:33:57Z)
- The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings [0.0]
We present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages.
We focus on three Slavic languages, namely Croatian, Polish, and Serbian.
The results of this pilot run are three high-quality datasets that span more than 5,000 hours of speech and accompanying text transcripts.
arXiv Detail & Related papers (2024-09-23T10:12:18Z)
- Tamil Language Computing: the Present and the Future [0.0]
Language computing integrates linguistics, computer science, and cognitive psychology to create meaningful human-computer interactions.
Recent advancements in deep learning have made computers more accessible and capable of independent learning and adaptation.
The paper underscores the importance of building practical applications for languages like Tamil to address everyday communication needs.
arXiv Detail & Related papers (2024-07-11T15:56:02Z)
- Variationist: Exploring Multifaceted Variation and Bias in Written Language Data [3.666781404469562]
Exploring and understanding language data is a fundamental stage in all areas dealing with human language.
Yet, there is currently a lack of a unified, customizable tool to seamlessly inspect and visualize language variation and bias.
In this paper, we introduce Variationist, a highly-modular, descriptive, and task-agnostic tool that fills this gap.
arXiv Detail & Related papers (2024-06-25T15:41:07Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- Towards a Deep Understanding of Multilingual End-to-End Speech Translation [52.26739715012842]
We analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages.
We derive three major findings from our analysis.
arXiv Detail & Related papers (2023-10-31T13:50:55Z)
- Collecting Interactive Multi-modal Datasets for Grounded Language Understanding [66.30648042100123]
We formalized the task of a collaborative embodied agent using natural language.
We developed a tool for extensive and scalable data collection.
We collected the first dataset for interactive grounded language understanding.
arXiv Detail & Related papers (2022-11-12T02:36:32Z)
- Dialogue Term Extraction using Transfer Learning and Topological Data Analysis [0.8185867455104834]
We explore different features that can enable systems to discover realizations of domains, slots, and values in dialogues in a purely data-driven fashion.
To examine the utility of each feature set, we train a seed model based on the widely used MultiWOZ data-set.
Our method outperforms the previously proposed approach that relies solely on word embeddings.
arXiv Detail & Related papers (2022-08-22T17:04:04Z)
- Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation [66.99734491847076]
We propose a Chinese multi-domain knowledge-driven conversation dataset, KdConv, which grounds the topics in multi-turn conversations to knowledge graphs.
Our corpus contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0.
arXiv Detail & Related papers (2020-04-08T16:25:39Z)