Synchronising speech segments with musical beats in Mandarin and English singing
- URL: http://arxiv.org/abs/2106.10045v1
- Date: Fri, 18 Jun 2021 10:32:27 GMT
- Title: Synchronising speech segments with musical beats in Mandarin and English singing
- Authors: Cong Zhang, Jian Zhu
- Abstract summary: The presence of musical beats was more dependent on segment duration than sonority.
Mandarin and English demonstrated cross-linguistic variations despite exhibiting common patterns.
- Score: 4.627414193046309
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Generating synthesised singing voice with models trained on speech data has
many advantages due to the models' flexibility and controllability. However,
since speech training data lack information about the temporal relationship
between segments and beats, the synthesised singing may sound off-beat at
times. Information on the temporal relationship between speech segments and
musical beats is therefore crucial. The current study investigated segment-beat
synchronisation in singing data, with hypotheses based on the linguistic
theories of the P-centre and the sonority hierarchy. A Mandarin corpus and an
English corpus of professional
singing data were manually annotated and analysed. The results showed that the
presence of musical beats was more dependent on segment duration than sonority.
However, the sonority hierarchy and the P-centre theory were highly related to
the location of beats. Mandarin and English demonstrated cross-linguistic
variations despite exhibiting common patterns.
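The abstract does not spell out the analysis procedure, so the following is only a minimal sketch, assuming interval-style segment annotations and a list of beat timestamps: it checks whether a beat falls inside each segment and where within the segment it lands, the kind of measurement relevant to duration, sonority, and P-centre effects. The segment labels, times, and sonority ranking below are hypothetical, not data from the paper.

```python
# Illustrative sketch only (not the authors' pipeline): segment-beat co-occurrence
# and relative beat location from interval annotations plus beat timestamps.
from dataclasses import dataclass
from typing import List, Optional

# Rough sonority ranking (higher = more sonorous), following the usual
# vowel > glide > liquid > nasal > fricative > stop ordering; an assumption here.
SONORITY_RANK = {"vowel": 5, "glide": 4, "liquid": 3, "nasal": 2, "fricative": 1, "stop": 0}

@dataclass
class Segment:
    label: str       # segment label, e.g. "n" or "a"
    seg_class: str   # sonority class, e.g. "nasal" or "vowel"
    start: float     # onset time in seconds
    end: float       # offset time in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start

def beat_position(seg: Segment, beats: List[float]) -> Optional[float]:
    """Relative position (0 = onset, 1 = offset) of the first beat inside the
    segment, or None if no beat falls within it."""
    for beat in beats:
        if seg.start <= beat < seg.end:
            return (beat - seg.start) / seg.duration
    return None

# Toy syllable /na/ with one annotated musical beat at 0.62 s.
segments = [Segment("n", "nasal", 0.50, 0.58), Segment("a", "vowel", 0.58, 0.95)]
beats = [0.62]

for seg in segments:
    print(seg.label, f"dur={seg.duration:.2f}s",
          f"sonority={SONORITY_RANK[seg.seg_class]}",
          f"beat_pos={beat_position(seg, beats)}")
```

In this toy example the beat lands inside the vowel, shortly after its onset; that is the kind of pattern a P-centre account would lead one to expect, although the actual distributions reported in the paper come from the annotated Mandarin and English corpora.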
Related papers
- Agent-Driven Large Language Models for Mandarin Lyric Generation [2.2221991003992967]
In tonal contour languages like Mandarin, pitch contours are influenced by both melody and tone, leading to variations in lyric-melody fit.
Our study confirms that lyricists and melody writers consider this fit during their composition process.
In this research, we developed a multi-agent system that decomposes the melody-to-lyric task into sub-tasks, with each agent controlling rhyme, syllable count, lyric-melody alignment, and consistency.
arXiv Detail & Related papers (2024-10-02T12:01:32Z)
- Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation [6.225927189801006]
We propose a novel framework that incorporates comprehensive modeling of both syntactic and acoustic cues that are associated with pausing patterns.
Remarkably, our framework possesses the capability to consistently generate natural speech even for considerably more extended and intricate out-of-domain (OOD) sentences.
arXiv Detail & Related papers (2024-04-03T09:17:38Z)
- Acoustic characterization of speech rhythm: going beyond metrics with recurrent neural networks [0.0]
We train a recurrent neural network on a language identification task over a large database of speech recordings in 21 languages.
The network was able to identify the language of 10-second recordings in 40% of the cases, and the language was in the top-3 guesses in two-thirds of the cases.
arXiv Detail & Related papers (2024-01-22T09:49:44Z)
- Unsupervised Melody-Guided Lyrics Generation [84.22469652275714]
We propose to generate pleasantly listenable lyrics without training on melody-lyric aligned data.
We leverage the crucial alignments between melody and lyrics and compile the given melody into constraints to guide the generation process.
arXiv Detail & Related papers (2023-05-12T20:57:20Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- AudioLM: a Language Modeling Approach to Audio Generation [59.19364975706805]
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure.
We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
arXiv Detail & Related papers (2022-09-07T13:40:08Z)
- VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices [4.167459103689587]
We address the problem of lip-voice synchronisation in videos containing a human face and voice.
Our approach determines whether the lip motion and the voice in a video are synchronised.
We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models.
arXiv Detail & Related papers (2022-04-05T10:02:39Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- A Melody-Unsupervision Model for Singing Voice Synthesis [9.137554315375919]
We propose a melody-unsupervision model that requires only audio-and-lyrics pairs without temporal alignment in training time.
We show that the proposed model is capable of being trained with speech audio and text labels but can generate singing voice in inference time.
arXiv Detail & Related papers (2021-10-13T07:42:35Z)
- Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural network-based visual lip-reading models.
We observe a strong correlation between these theories in cognitive psychology and our unique modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.