ISPA: Inter-Species Phonetic Alphabet for Transcribing Animal Sounds
- URL: http://arxiv.org/abs/2402.03269v1
- Date: Mon, 5 Feb 2024 18:27:27 GMT
- Title: ISPA: Inter-Species Phonetic Alphabet for Transcribing Animal Sounds
- Authors: Masato Hagiwara, Marius Miron, Jen-Yu Liu
- Abstract summary: We introduce ISPA (Inter-Species Phonetic Alphabet), a precise, concise, and interpretable system for transcribing animal sounds into text.
We show that established human language ML paradigms and models, such as language models, can be successfully applied to improve performance.
- Score: 6.751004034983776
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditionally, bioacoustics has relied on spectrograms and continuous, per-frame audio representations for the analysis of animal sounds, also serving as input to machine learning models. Meanwhile, the International Phonetic Alphabet (IPA) system has provided an interpretable, language-independent method for transcribing human speech sounds. In this paper, we introduce ISPA (Inter-Species Phonetic Alphabet), a precise, concise, and interpretable system designed for transcribing animal sounds into text. We compare acoustics-based and feature-based methods for transcribing and classifying animal sounds, demonstrating performance comparable to baseline methods that use continuous, dense audio representations. By representing animal sounds as text, we effectively treat them as a "foreign language," and we show that established human language ML paradigms and models, such as language models, can be successfully applied to improve performance.
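The paper's concrete ISPA notation is not reproduced in this summary, but the "foreign language" framing is easy to illustrate: once animal sounds are rendered as text, standard NLP tooling applies directly. Below is a minimal sketch in which the ISPA-style strings and species labels are hypothetical placeholders.

```python
# Minimal sketch: treating (hypothetical) ISPA transcriptions as text and
# classifying them with a standard NLP pipeline. The transcription strings
# and species labels are invented for illustration; only the idea -- animal
# sounds as a "foreign language" -- comes from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

ispa_transcriptions = [  # hypothetical ISPA-style token strings
    "A4 e B2 e A4",
    "C1 o o C1 r",
    "A4 e A4 e e",
    "C1 r o C1 o",
]
species = ["robin", "crow", "robin", "crow"]  # hypothetical labels

# Character n-grams sidestep the question of what the "words" of ISPA are.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(),
)
clf.fit(ispa_transcriptions, species)
print(clf.predict(["A4 e e A4 e"]))  # -> ['robin'] on this toy data
```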
Related papers
- Encoding of lexical tone in self-supervised models of spoken language [3.7270979204213446]
This paper aims to analyze the tone encoding capabilities of Spoken Language Models (SLMs).
We show that SLMs encode lexical tone to a significant degree even when they are trained on data from non-tonal languages.
We find that SLMs behave similarly to native and non-native human participants in tone and consonant perception studies.
arXiv Detail & Related papers (2024-03-25T15:28:38Z)
- Phonetic and Lexical Discovery of a Canine Language using HuBERT [40.578021131708155]
This paper explores potential communication patterns within dog vocalizations, moving beyond the barriers of traditional linguistic analysis.
We present a self-supervised approach based on HuBERT that enables accurate classification of phoneme labels.
We develop a web-based dog vocalization labeling system that highlights phoneme n-grams from the discovered vocabulary in user-uploaded dog audio.
arXiv Detail & Related papers (2024-02-25T04:35:45Z)
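The canine-language paper above builds on HuBERT-derived phoneme labels. A common recipe for obtaining discrete pseudo-phoneme labels from HuBERT (not necessarily that paper's exact setup) is to cluster frame-level features with k-means; a minimal sketch:

```python
# Sketch of the standard HuBERT discretization recipe: extract frame-level
# HuBERT features, cluster them with k-means, and use cluster IDs as
# pseudo-phoneme labels. The audio is a random stand-in, and real systems
# fit k-means on a large corpus rather than a single clip.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoFeatureExtractor, HubertModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960")

waveform = np.random.randn(16000).astype(np.float32)  # stand-in for 1 s of audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(**inputs).last_hidden_state.squeeze(0)  # (T, 768)

kmeans = KMeans(n_clusters=8, n_init=10).fit(frames.numpy())
units = kmeans.labels_  # one pseudo-phoneme ID per ~20 ms frame
print(units[:20])
```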
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics [33.070158866023]
Generative spoken language modeling (GSLM) uses learned symbols derived from data, rather than phonemes, for speech analysis and synthesis.
This paper presents findings on the effectiveness of GSLM's encoding and decoding at both the speech and spoken-language levels.
arXiv Detail & Related papers (2023-06-01T14:07:19Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
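Wav-BERT trains wav2vec 2.0 and BERT jointly, end to end; that framework is not reproduced here. As a loosely-coupled illustration of the two components it unifies, the sketch below simply chains a pre-trained wav2vec 2.0 CTC model into BERT:

```python
# Loosely-coupled sketch of the two components Wav-BERT unifies: wav2vec 2.0
# for acoustics and BERT for linguistics. The paper trains them end to end;
# here they are merely chained, which only illustrates the parts.
import numpy as np
import torch
from transformers import (Wav2Vec2Processor, Wav2Vec2ForCTC,
                          BertTokenizer, BertModel)

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
acoustic = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
linguistic = BertModel.from_pretrained("bert-base-uncased")

waveform = np.zeros(16000, dtype=np.float32)  # stand-in for real speech
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = acoustic(**inputs).logits
transcript = processor.batch_decode(torch.argmax(logits, dim=-1))[0]

# Feed the (possibly empty/noisy) transcript to BERT for a linguistic encoding.
with torch.no_grad():
    encoding = linguistic(**tokenizer(transcript, return_tensors="pt"))
print(repr(transcript), encoding.last_hidden_state.shape)
```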
- Differentiable Allophone Graphs for Language-Universal Speech Recognition [77.2981317283029]
Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages.
We present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings.
We build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language.
arXiv Detail & Related papers (2021-07-24T15:09:32Z)
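The core object in the allophone-graph paper above is a probabilistic phone-to-phoneme mapping per language. A toy sketch of that idea (inventory sizes and values invented): a row-stochastic mapping matrix carries universal phone posteriors to language-specific phoneme posteriors, and remains differentiable:

```python
# Toy sketch: universal phone posteriors mapped to one language's phonemes
# through a learnable, probabilistic phone-to-phoneme matrix. All sizes and
# values here are invented for illustration.
import torch

num_phones, num_phonemes = 5, 3                      # hypothetical inventories
phone_logits = torch.randn(10, num_phones)           # (frames, phones) from an encoder
phone_post = phone_logits.softmax(dim=-1)

# Rows: phones; columns: phonemes. Softmax keeps each row a valid conditional
# distribution P(phoneme | phone), and gradients flow through it.
mapping_logits = torch.randn(num_phones, num_phonemes, requires_grad=True)
mapping = mapping_logits.softmax(dim=-1)

phoneme_post = phone_post @ mapping                  # (frames, phonemes)
print(phoneme_post.sum(dim=-1))                      # each frame still sums to ~1.0
```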
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
Instead, we propose to predict self-supervised discrete representations learned from an unlabeled speech corpus.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo-label-based semi-supervised training strategy that uses a language model within an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
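The sentiment paper above relies on pseudo labels produced with a language model. As a hedged sketch of the general pseudo-labeling pattern (not the paper's exact end-to-end strategy), a pre-trained text sentiment classifier can label ASR transcripts of unannotated clips, keeping only confident predictions as training targets:

```python
# Sketch of the pseudo-labeling pattern: a pre-trained text sentiment model
# labels ASR transcripts of unlabeled speech; confident predictions become
# training targets for a speech-side model. Transcripts are hypothetical.
from transformers import pipeline

text_sentiment = pipeline("sentiment-analysis")  # pre-trained LM classifier

# Hypothetical ASR transcripts of unlabeled speech clips.
transcripts = ["i really loved the ending", "that was a waste of time"]

pseudo_labeled = []
for clip_id, text in enumerate(transcripts):
    pred = text_sentiment(text)[0]           # {'label': ..., 'score': ...}
    if pred["score"] > 0.9:                  # keep only confident pseudo-labels
        pseudo_labeled.append((clip_id, pred["label"]))

print(pseudo_labeled)  # (speech clip, pseudo sentiment label) pairs
# A speech sentiment model would then be trained on the audio of these clips.
```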
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning [11.552745999302905]
More than half of the 7,000 languages in the world are in imminent danger of going extinct.
It is relatively easy to obtain textual translations corresponding to speech.
We construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech.
arXiv Detail & Related papers (2020-06-04T12:21:48Z)
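CSTNet's architecture is not reproduced here, but the kind of contrastive objective it uses between speech and translation text can be sketched with a standard InfoNCE-style loss; the embeddings below are random stand-ins for encoder outputs:

```python
# Minimal InfoNCE-style contrastive loss between speech and text embeddings.
# The encoders are stand-ins (random tensors); only the objective is shown.
import torch
import torch.nn.functional as F

batch, dim = 4, 128
speech_emb = F.normalize(torch.randn(batch, dim), dim=-1)  # from an audio CNN
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)    # from a text encoder

logits = speech_emb @ text_emb.t() / 0.07  # similarity matrix with temperature
targets = torch.arange(batch)              # matched pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```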
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.