Automatically Identifying Language Family from Acoustic Examples in Low
Resource Scenarios
- URL: http://arxiv.org/abs/2012.00876v1
- Date: Tue, 1 Dec 2020 22:44:42 GMT
- Title: Automatically Identifying Language Family from Acoustic Examples in Low
Resource Scenarios
- Authors: Peter Wu, Yifan Zhong, Alan W Black
- Abstract summary: We propose a method to analyze language similarity using deep learning.
Namely, we train a model on the Wilderness dataset and investigate how its latent space compares with classical language family findings.
- Score: 48.57072884674938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing multilingual speech NLP works focus on a relatively small subset of
languages, and thus current linguistic understanding of languages predominantly
stems from classical approaches. In this work, we propose a method to analyze
language similarity using deep learning. Namely, we train a model on the
Wilderness dataset and investigate how its latent space compares with classical
language family findings. Our approach provides a new direction for
cross-lingual data augmentation in any speech-based NLP task.
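The comparison the abstract describes could be sketched as follows: given one embedding per language (here hand-made 2-D toy vectors; in the paper these would come from a model trained on the Wilderness dataset), check whether each language's nearest neighbour in the latent space belongs to the same classical family. All names and vectors below are illustrative assumptions, not values from the paper.

```python
from math import sqrt

# Toy per-language latent vectors (hypothetical; stand-ins for model embeddings).
embeddings = {
    "spanish":    (0.9, 0.1),
    "portuguese": (0.85, 0.15),
    "hindi":      (0.1, 0.9),
    "marathi":    (0.15, 0.85),
}
# Classical family labels to compare the latent space against.
families = {
    "spanish": "Romance", "portuguese": "Romance",
    "hindi": "Indo-Aryan", "marathi": "Indo-Aryan",
}

def dist(a, b):
    # Euclidean distance between two embedding vectors.
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbour(lang):
    # Closest other language in the latent space.
    others = [l for l in embeddings if l != lang]
    return min(others, key=lambda l: dist(embeddings[lang], embeddings[l]))

# Fraction of languages whose nearest latent neighbour shares their family:
# a crude agreement score between the latent space and classical findings.
agreement = sum(
    families[l] == families[nearest_neighbour(l)] for l in embeddings
) / len(embeddings)
```

On this toy data the latent space agrees perfectly with the family labels; with real embeddings the score would quantify how well the learned space recovers classical groupings.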
Related papers
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
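The chain-based construction described above could be sketched like this: each language is mapped into the shared space of the previous link using a few anchor word pairs, so the low-resource target is reached transitively. The alignment here is a simple mean-offset translation on toy 2-D vectors — an illustrative assumption, not the paper's actual mapping.

```python
def fit_offset(anchor_pairs):
    # anchor_pairs: list of (new_lang_vec, shared_space_vec) for anchor words.
    # Returns the mean translation that maps the new language into the shared space.
    dims = len(anchor_pairs[0][0])
    return tuple(
        sum(t[d] - s[d] for s, t in anchor_pairs) / len(anchor_pairs)
        for d in range(dims)
    )

def apply_offset(vec, offset):
    return tuple(v + o for v, o in zip(vec, offset))

# Step 1 of a hypothetical chain source -> A -> target: align related
# language A to the source space using two anchor word pairs.
offset_a = fit_offset([((1.0, 1.0), (0.0, 0.0)), ((3.0, 3.0), (2.0, 2.0))])
a_word_in_source_space = apply_offset((2.0, 2.0), offset_a)

# Step 2: align the low-resource target to A's already-aligned vector,
# so the target lands in the shared source space transitively.
offset_t = fit_offset([((5.0, 5.0), a_word_in_source_space)])
```

The key design point the chain exploits is that anchors between *related* languages are easier to find than anchors between the target and a distant source.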
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Robust Open-Set Spoken Language Identification and the CU MultiLang Dataset [2.048226951354646]
Open-set spoken language identification systems can detect when an input exhibits none of the original languages.
We implement a novel approach to open-set spoken language identification that uses MFCC and pitch features.
We present a spoken language identification system that achieves 91.76% accuracy on trained languages and has the capability to adapt to unknown languages on the fly.
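The open-set decision described above can be sketched with a common recipe (illustrative, not necessarily the paper's exact method): run a closed-set classifier over the known languages, then reject the input as "unknown" when the top softmax probability falls below a confidence threshold.

```python
from math import exp

def softmax(logits):
    # Numerically stable softmax over raw classifier scores.
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def identify(logits, labels, threshold=0.7):
    # Closed-set prediction, with an open-set reject option:
    # low peak confidence means none of the known languages fits well.
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else "unknown"

labels = ["english", "mandarin", "spanish"]
confident = identify([4.0, 0.5, 0.2], labels)   # one clear winner
uncertain = identify([1.0, 0.9, 0.8], labels)   # near-uniform scores
```

The threshold trades off closed-set accuracy against false acceptance of unseen languages; in practice it would be tuned on held-out data containing out-of-set examples.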
arXiv Detail & Related papers (2023-08-29T00:44:27Z)
- Meta-Learning a Cross-lingual Manifold for Semantic Parsing [75.26271012018861]
Localizing a semantic parser to support new languages requires effective cross-lingual generalization.
We introduce a first-order meta-learning algorithm to train a semantic parser with maximal sample efficiency during cross-lingual transfer.
Results across six languages on ATIS demonstrate that our combination of steps yields accurate semantic parsers sampling $\le$10% of source training data in each new language.
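The first-order meta-learning idea can be illustrated with a Reptile-style update on a single toy parameter (an illustrative sketch, not the paper's algorithm or its parser): the meta-parameter is repeatedly nudged toward the parameters obtained after a few inner gradient steps on each "language" task, yielding an initialization that adapts quickly to all of them.

```python
def inner_sgd(theta, target, steps=5, lr=0.5):
    # Inner task loss: (theta - target)^2, gradient 2*(theta - target).
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - target)
    return theta

def reptile(theta, task_targets, meta_lr=0.5, rounds=20):
    # First-order meta-update: move toward the task-adapted parameters,
    # never differentiating through the inner optimization.
    for _ in range(rounds):
        for target in task_targets:
            adapted = inner_sgd(theta, target)
            theta += meta_lr * (adapted - theta)
    return theta

# Two toy "languages" pull the parameter toward 1.0 and 3.0;
# the meta-solution settles in between, close to both tasks.
theta_star = reptile(0.0, [1.0, 3.0])
```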
arXiv Detail & Related papers (2022-09-26T10:42:17Z)
- Zero-Shot Dependency Parsing with Worst-Case Aware Automated Curriculum Learning [5.865807597752895]
We adopt a method from multi-task learning, which relies on automated curriculum learning, to dynamically optimize for parsing performance on outlier languages.
We show that this approach is significantly better than uniform and size-proportional sampling in the zero-shot setting.
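Worst-case aware sampling of the kind described above can be sketched with an exponential-weights recipe (an illustrative assumption, not the paper's exact scheme): languages with higher dev loss receive higher sampling probability, so training keeps focusing on the outlier languages rather than sampling uniformly or by corpus size.

```python
from math import exp

def sampling_weights(losses, temperature=1.0):
    # Exponential weighting: higher loss -> higher sampling probability.
    scores = [exp(l / temperature) for l in losses]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical per-language dev losses (made-up numbers for illustration).
lang_losses = {"en": 0.2, "tr": 1.5, "ja": 0.9}
weights = sampling_weights(list(lang_losses.values()))
# The worst-performing language gets the largest sampling weight.
hardest = max(zip(weights, lang_losses), key=lambda p: p[0])[1]
```

Recomputing the weights as dev losses change over training makes the curriculum dynamic, which is the property the zero-shot setting exploits.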
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
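The teacher-student transfer described above follows the generic distillation recipe, which can be sketched as follows (illustrative; the paper's setup distills a pre-trained multilingual NLP teacher into a speech student): the student is trained to match the teacher's soft output distribution, typically via a KL-divergence loss.

```python
from math import exp, log

def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    # KL(teacher || student): zero exactly when the student matches the teacher.
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

teacher = softmax([3.0, 1.0, 0.5])              # teacher's soft labels
matched = kl_divergence(teacher, teacher)        # student equals teacher
mismatched = kl_divergence(teacher, softmax([0.5, 1.0, 3.0]))
```

Soft labels carry the teacher's relative confidence across intents, which is richer supervision than one-hot labels alone.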
arXiv Detail & Related papers (2020-04-08T05:42:26Z)
- Multilingual Chart-based Constituency Parse Extraction from Pre-trained Language Models [21.2879567125422]
We propose a novel method for extracting complete (binary) parses from pre-trained language models.
By applying our method to multilingual PLMs, it becomes possible to induce non-trivial parses for sentences from nine languages.
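Chart-based extraction of a complete binary parse can be sketched with a generic CKY-style search over span scores (the paper derives such scores from PLM representations; the recursive search and toy scores below are illustrative assumptions):

```python
def best_parse(score, i, j):
    # score(i, j): plausibility that words i..j-1 form a constituent.
    # Returns (tree, total_score); leaves are spans, internal nodes are
    # (span, left_subtree, right_subtree).
    if j - i == 1:
        return (i, j), score(i, j)
    best = None
    for k in range(i + 1, j):           # try every binary split point
        left, ls = best_parse(score, i, k)
        right, rs = best_parse(score, k, j)
        total = score(i, j) + ls + rs
        if best is None or total > best[1]:
            best = (((i, j), left, right), total)
    return best

# Hypothetical span scores for a 3-word sentence that favour
# grouping words 1..2 into a constituent.
def toy_score(i, j):
    return {(1, 3): 2.0}.get((i, j), 0.0)

tree, total = best_parse(toy_score, 0, 3)
```

A production version would memoize the chart (classic CKY is O(n^3)); the exhaustive recursion here keeps the sketch short.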
arXiv Detail & Related papers (2020-04-08T05:42:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.