Automatically Identifying Language Family from Acoustic Examples in Low
Resource Scenarios
- URL: http://arxiv.org/abs/2012.00876v1
- Date: Tue, 1 Dec 2020 22:44:42 GMT
- Title: Automatically Identifying Language Family from Acoustic Examples in Low
Resource Scenarios
- Authors: Peter Wu, Yifan Zhong, Alan W Black
- Abstract summary: We propose a method to analyze language similarity using deep learning.
Namely, we train a model on the Wilderness dataset and investigate how its latent space compares with classical language family findings.
- Score: 48.57072884674938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing multilingual speech NLP works focus on a relatively small subset of
languages, and thus current linguistic understanding of languages predominantly
stems from classical approaches. In this work, we propose a method to analyze
language similarity using deep learning. Namely, we train a model on the
Wilderness dataset and investigate how its latent space compares with classical
language family findings. Our approach provides a new direction for
cross-lingual data augmentation in any speech-based NLP task.
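The comparison the abstract describes could be sketched as follows: given one embedding per language (here hand-made 2-D toy vectors; in the paper these would come from a model trained on the Wilderness dataset), check whether each language's nearest neighbour in the latent space belongs to the same classical family. All names and vectors below are illustrative assumptions, not values from the paper.

```python
from math import sqrt

# Toy per-language latent vectors (hypothetical; stand-ins for model embeddings).
embeddings = {
    "spanish":    (0.9, 0.1),
    "portuguese": (0.85, 0.15),
    "hindi":      (0.1, 0.9),
    "marathi":    (0.15, 0.85),
}
# Classical family labels to compare the latent space against.
families = {
    "spanish": "Romance", "portuguese": "Romance",
    "hindi": "Indo-Aryan", "marathi": "Indo-Aryan",
}

def dist(a, b):
    # Euclidean distance between two embedding vectors.
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbour(lang):
    # Closest other language in the latent space.
    others = [l for l in embeddings if l != lang]
    return min(others, key=lambda l: dist(embeddings[lang], embeddings[l]))

# Fraction of languages whose nearest latent neighbour shares their family:
# a crude agreement score between the latent space and classical findings.
agreement = sum(
    families[l] == families[nearest_neighbour(l)] for l in embeddings
) / len(embeddings)
```

On this toy data the latent space agrees perfectly with the family labels; with real embeddings the score would quantify how well the learned space recovers classical groupings.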
Related papers
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
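The chain-based construction described above could be sketched like this: each language is mapped into the shared space of the previous link using a few anchor word pairs, so the low-resource target is reached transitively. The alignment here is a simple mean-offset translation on toy 2-D vectors — an illustrative assumption, not the paper's actual mapping.

```python
def fit_offset(anchor_pairs):
    # anchor_pairs: list of (new_lang_vec, shared_space_vec) for anchor words.
    # Returns the mean translation that maps the new language into the shared space.
    dims = len(anchor_pairs[0][0])
    return tuple(
        sum(t[d] - s[d] for s, t in anchor_pairs) / len(anchor_pairs)
        for d in range(dims)
    )

def apply_offset(vec, offset):
    return tuple(v + o for v, o in zip(vec, offset))

# Step 1 of a hypothetical chain source -> A -> target: align related
# language A to the source space using two anchor word pairs.
offset_a = fit_offset([((1.0, 1.0), (0.0, 0.0)), ((3.0, 3.0), (2.0, 2.0))])
a_word_in_source_space = apply_offset((2.0, 2.0), offset_a)

# Step 2: align the low-resource target to A's already-aligned vector,
# so the target lands in the shared source space transitively.
offset_t = fit_offset([((5.0, 5.0), a_word_in_source_space)])
```

The key design point the chain exploits is that anchors between *related* languages are easier to find than anchors between the target and a distant source.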
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Robust Open-Set Spoken Language Identification and the CU MultiLang Dataset [2.048226951354646]
Open-set spoken language identification systems can detect when an input exhibits none of the original languages.
We implement a novel approach to open-set spoken language identification that uses MFCC and pitch features.
We present a spoken language identification system that achieves 91.76% accuracy on trained languages and has the capability to adapt to unknown languages on the fly.
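The open-set decision described above can be sketched with a common recipe (illustrative, not necessarily the paper's exact method): run a closed-set classifier over the known languages, then reject the input as "unknown" when the top softmax probability falls below a confidence threshold.

```python
from math import exp

def softmax(logits):
    # Numerically stable softmax over raw classifier scores.
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def identify(logits, labels, threshold=0.7):
    # Closed-set prediction, with an open-set reject option:
    # low peak confidence means none of the known languages fits well.
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else "unknown"

labels = ["english", "mandarin", "spanish"]
confident = identify([4.0, 0.5, 0.2], labels)   # one clear winner
uncertain = identify([1.0, 0.9, 0.8], labels)   # near-uniform scores
```

The threshold trades off closed-set accuracy against false acceptance of unseen languages; in practice it would be tuned on held-out data containing out-of-set examples.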
arXiv Detail & Related papers (2023-08-29T00:44:27Z)
- Meta-Learning a Cross-lingual Manifold for Semantic Parsing [75.26271012018861]
Localizing a semantic parser to support new languages requires effective cross-lingual generalization.
We introduce a first-order meta-learning algorithm to train a semantic parser with maximal sample efficiency during cross-lingual transfer.
Results across six languages on ATIS demonstrate that our combination of steps yields accurate semantic parsers sampling $\le$10% of source training data in each new language.
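The first-order meta-learning idea can be illustrated with a Reptile-style update on a single toy parameter (an illustrative sketch, not the paper's algorithm or its parser): the meta-parameter is repeatedly nudged toward the parameters obtained after a few inner gradient steps on each "language" task, yielding an initialization that adapts quickly to all of them.

```python
def inner_sgd(theta, target, steps=5, lr=0.5):
    # Inner task loss: (theta - target)^2, gradient 2*(theta - target).
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - target)
    return theta

def reptile(theta, task_targets, meta_lr=0.5, rounds=20):
    # First-order meta-update: move toward the task-adapted parameters,
    # never differentiating through the inner optimization.
    for _ in range(rounds):
        for target in task_targets:
            adapted = inner_sgd(theta, target)
            theta += meta_lr * (adapted - theta)
    return theta

# Two toy "languages" pull the parameter toward 1.0 and 3.0;
# the meta-solution settles in between, close to both tasks.
theta_star = reptile(0.0, [1.0, 3.0])
```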
arXiv Detail & Related papers (2022-09-26T10:42:17Z)
- Zero-Shot Dependency Parsing with Worst-Case Aware Automated Curriculum Learning [5.865807597752895]
We adopt a method from multi-task learning, which relies on automated curriculum learning, to dynamically optimize for parsing performance on outlier languages.
We show that this approach is significantly better than uniform and size-proportional sampling in the zero-shot setting.
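Worst-case aware sampling of the kind described above can be sketched with an exponential-weights recipe (an illustrative assumption, not the paper's exact scheme): languages with higher dev loss receive higher sampling probability, so training keeps focusing on the outlier languages rather than sampling uniformly or by corpus size.

```python
from math import exp

def sampling_weights(losses, temperature=1.0):
    # Exponential weighting: higher loss -> higher sampling probability.
    scores = [exp(l / temperature) for l in losses]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical per-language dev losses (made-up numbers for illustration).
lang_losses = {"en": 0.2, "tr": 1.5, "ja": 0.9}
weights = sampling_weights(list(lang_losses.values()))
# The worst-performing language gets the largest sampling weight.
hardest = max(zip(weights, lang_losses), key=lambda p: p[0])[1]
```

Recomputing the weights as dev losses change over training makes the curriculum dynamic, which is the property the zero-shot setting exploits.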
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
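The teacher-student transfer described above follows the generic distillation recipe, which can be sketched as follows (illustrative; the paper's setup distills a pre-trained multilingual NLP teacher into a speech student): the student is trained to match the teacher's soft output distribution, typically via a KL-divergence loss.

```python
from math import exp, log

def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    # KL(teacher || student): zero exactly when the student matches the teacher.
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

teacher = softmax([3.0, 1.0, 0.5])              # teacher's soft labels
matched = kl_divergence(teacher, teacher)        # student equals teacher
mismatched = kl_divergence(teacher, softmax([0.5, 1.0, 3.0]))
```

Soft labels carry the teacher's relative confidence across intents, which is richer supervision than one-hot labels alone.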
arXiv Detail & Related papers (2020-04-08T05:42:26Z)
- Multilingual Chart-based Constituency Parse Extraction from Pre-trained Language Models [21.2879567125422]
We propose a novel method for extracting complete (binary) parses from pre-trained language models.
By applying our method to multilingual PLMs, it becomes possible to induce non-trivial parses for sentences from nine languages.
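Chart-based extraction of a complete binary parse can be sketched with a generic CKY-style search over span scores (the paper derives such scores from PLM representations; the recursive search and toy scores below are illustrative assumptions):

```python
def best_parse(score, i, j):
    # score(i, j): plausibility that words i..j-1 form a constituent.
    # Returns (tree, total_score); leaves are spans, internal nodes are
    # (span, left_subtree, right_subtree).
    if j - i == 1:
        return (i, j), score(i, j)
    best = None
    for k in range(i + 1, j):           # try every binary split point
        left, ls = best_parse(score, i, k)
        right, rs = best_parse(score, k, j)
        total = score(i, j) + ls + rs
        if best is None or total > best[1]:
            best = (((i, j), left, right), total)
    return best

# Hypothetical span scores for a 3-word sentence that favour
# grouping words 1..2 into a constituent.
def toy_score(i, j):
    return {(1, 3): 2.0}.get((i, j), 0.0)

tree, total = best_parse(toy_score, 0, 3)
```

A production version would memoize the chart (classic CKY is O(n^3)); the exhaustive recursion here keeps the sketch short.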
arXiv Detail & Related papers (2020-04-08T05:42:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.