Adapting one's voice to different ambient environments and social situations is an essential part of human interaction. In robotics, the ability
to recognize speech in noisy and quiet environments has received significant
attention, but considering ambient cues in the production of social speech
features has been little explored. Our research aims to modify a robot's speech
to maximize acceptability in various social and acoustic contexts, starting
with a use case for service robots in varying restaurants. We created an
original dataset collected over Zoom with participants conversing in scripted
and unscripted tasks given 7 different ambient sounds and background images.
Voice conversion methods, in addition to altered Text-to-Speech that matched
ambient specific data, were used for speech synthesis tasks. We conducted a
subjective perception study that showed humans prefer synthetic speech that
matches ambience and social context, ultimately preferring more human-like
voices. This work provides three solutions to ambient and socially appropriate
synthetic voices: (1) a novel protocol to collect real contextual audio voice
data, (2) tools and directions to manipulate robot speech for appropriate
social and ambient specific interactions, and (3) insight into voice
conversion's role in flexibly altering robot speech to match different ambient
environments.
Humans have the innate ability to adapt their voice to different contexts and social situations.
Although we consider linguistic vocal phenomena to be our primary means of communication amongst humans, a significant portion of our communication results from non-linguistic vocal features that can completely alter the meaning of phrases [1].
For example, someone may say “it’s over.” at the end of a vacation with a sense of sadness, or they may say “It’s over!” with a sense of enthusiasm after completing an examination they have been studying for.
Given that the features associated with our speech are important for communicating in different contexts, it is important for robots to be able to communicate in a similar fashion. However, the ability of robots to adapt their voice to different contexts and social situations has not yet been thoroughly explored.
Fig. 1. A robot in a fine dining restaurant vs. a night club should adapt its voice to the ambience.

Robots are used every day in
vastly different contexts; therefore, the ability of a robot to successfully adapt to both the ambience and the social environment is important when integrating robots into humans' everyday lives [3], [4], [5].
These voices are exceedingly expensive and resource intensive to generate, and as such are often outside the means of individuals and small-scale companies developing interactive robots.
In these cases, developers often rely on widely available text-to-speech (TTS) services that allow only minor adjustments and voice selection and are, overall, considered flat and inexpressive.
Our solution is to use a data driven approach to generate the robot’s voice.
For example, using a corpus of human voices collected in different contexts, we use human speech features to modify those of a robot's voice.
Placing participants into a simulated environment in the comfort of their own home opens the door to collecting naturalistic data more easily and efficiently.
As such, the current study hopes to bridge the following gaps in the literature:
1) Implementing a novel protocol for collecting realistic contextual audio voice data
2) Investigating human voice adaptation to determine relevant features that can improve robot voices in different ambient and social contexts
3) Testing human perception to better understand how humans perceive robot voices, in particular:
(a) Comparing baseline TTS, adaptive TTS, voice conversion, and human voices,
(b) How humans perceive voice conversion as adaptive to the environment, and
(c) How humans perceive pitch in TTS against a common social environment
II. RELATED WORKS
A. HUMAN CONTEXTUAL VOCAL MODIFICATIONS
Most often, human vocal modifications serve to create 'deliberately clear speech' when the listener is, for any reason, experiencing reduced comprehension [6].
These modifications are often listener specific as is the case in speech directed towards infants and children [7], those who are hearing impaired [8], and machines [9].
Modifications may otherwise occur when the environment is causing auditory hindrance, as is the case with distant speakers [10], distorted transmission [11], or noisy spaces [11].
Vocal modifications are produced without conscious effort to elicit a specific auditory feature; rather, they are produced as a result of achieving the aforementioned goals.
As an example, one of the most well researched and understood vocal phenomena is the Lombard effect [14].
The Lombard effect is an involuntary increase in vocal effort, often due to the presence of background noise [11].
Although it is well understood that humans produce these vocal phenomena in response to ambience and context, the reproduction of these effects in generated speech is relatively new and sparsely studied.
1) Text-to-Speech: TTS has become an inexpensive and efficient means to create realistic voices for the purpose of simulating robotic behaviors [15], [16], [17].
Firstly, for many years the primary concern for TTS was intelligibility; this has resulted in state-of-the-art voices that can be mistaken for a human voice, yet they still lack the ability to adapt to both physical and social contexts.
Furthermore, TTS is rule-based and as such, it is often constrained by Speech Synthesis Markup Language (SSML).
Although the available features have broadened and include loudness, pitch, and rate of speech, it is not clear whether these features are sufficient for a robot to flexibly and automatically adapt its voice to context.
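As a rough illustration of the kind of control SSML exposes, the sketch below wraps the study's evaluation phrase in a prosody tag and synthesizes it with a cloud TTS service (Google Cloud Text-to-Speech is assumed here); the specific rate, pitch, and volume values are illustrative, not settings used in this work.

```python
# Sketch: adjusting rate, pitch, and loudness through SSML prosody attributes.
# Assumes the google-cloud-texttospeech client; prosody values are illustrative only.
from google.cloud import texttospeech

ssml = (
    "<speak>"
    '<prosody rate="90%" pitch="+2st" volume="-3dB">'
    "Hi there, I hope you're doing well."
    "</prosody>"
    "</speak>"
)

client = texttospeech.TextToSpeechClient()
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(ssml=ssml),
    voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16
    ),
)

# Write the synthesized waveform to disk for later mixing with ambient sound.
with open("pepper_utterance.wav", "wb") as f:
    f.write(response.audio_content)
```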
2) Voice Conversion: Voice conversion is a method whereby a source speaker's speech waveforms are adjusted to those of a target speaker [22], allowing the modification of the source speaker's style to match that of the target speaker [23].
Furthermore, it has become increasingly popular to use non-parallel speech [24], [25], keeping the underlying linguistic information, e.g., words, without restricting training utterances to contain the same underlying linguistic information [26].
A popular method for voice conversion is to make use of statistical methods, such as the commonly employed Gaussian Mixture Models (GMM) for parallel voice conversion tasks [22].
3) Robotics: Recently, a survey of robotics researchers found that the vast majority choose their voices by convenience rather than considering contextual and use case specific features [29].
This approach to synthesized voices can be problematic, as it is widely agreed that first impressions with a robot will determine the course of the user experience [30], [31], [32], [33].
In [3], participants rated the appropriateness of robot voices given different contexts including schools, restaurants, homes and hospitals.
They found that even given the same physical appearance, participants selected varying voices depending on context and concluded that a robot voice created for a specific context is likely not generalizable.
Some studies have suggested the incorporation of context-based methods such as sociophonetic inspired design [37] and acoustic-prosodic adaptation to match user pitch [38].
In addition, further research has attempted to reproduce the Lombard effect, relying on incremental adaptation of loudness to the context of distance and user targeting [39] or the adjustment of volume based on environmental noise levels [40].
Due to the trying times of the global pandemic, we created a pioneering method of virtual data collection using readily available tools that allows researchers to collect data with no physical human interaction.
Our dataset contains speech utterance data and extracted vocal features from 12 participants.
A. DATA COLLECTION PROTOCOL
During the pandemic, the ability to interact one-on-one in a public area became difficult, and was prohibited by governmental restrictions in several countries across the world.
Zoom is a teleconferencing program which allows individuals to communicate from anywhere in the world.
Using Zoom, we paired two participants and had them listen to ambient sounds while conversing with one another in the roles of a waiter and a restaurant-goer.
In addition to sound, participants were asked to change their Zoom background to an image that was pre-selected to match the given ambience (see Figure 2).
Between each ambient condition there was a 1-minute period to update participants' Zoom backgrounds and prepare for the next condition; this served as a washout to reduce carry-over effects from the previous condition.
Each ambience condition was further broken down into 2 subsets: (1) scripted and (2) unscripted; the assigned scripted roles were maintained for the unscripted condition.
The baseline condition was a bakery with no sound or image.
The ambient sounds can be listened to here7.
Scripted condition. Participants first read a brief summary of their character at the specific restaurant.
For example, in the fine dining condition the restaurant-goer was on a date, while at the noisy bar the restaurant-goer was with a group of friends to watch the Olympics.
Once each participant read the summary for their character, they then read from a script that was slightly tailored for the given ambience, i.e. food and drink choices matched what is usually offered at that given restaurant.
Unscripted condition. Participants were then told to remain in character and proceed with the initial scenario description of a waiter taking a customer’s order, but this time there was no script to read.
There were a total of 1545 female utterances, here defined as a single sentence, and 796 male utterances.
Altogether there were 2341 clips ranging from 1 to 7 seconds.
All recruited participants were undergraduate students at Simon Fraser University who had either a customer service background, experience in improv, or experience in theater.
To investigate these expectations we collected 5 spectral features: (a) median pitch, (b) pitch range, (c) shimmer, (d) jitter, and (e) spectral slope. Parselmouth was used to extract (a)-(d). Median pitch and pitch range (the difference between the minimum and maximum pitch in a given segment) are calculated in Hz after removing silences and obtaining pitch values from voiced utterances.
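A minimal sketch of how features (a)-(d) could be extracted with Parselmouth is shown below; the pitch floor/ceiling and the jitter/shimmer window parameters are assumptions, not the settings reported in this work.

```python
# Sketch: median pitch, pitch range, jitter, and shimmer via Parselmouth.
# Pitch floor/ceiling and Praat jitter/shimmer parameters are illustrative assumptions.
import numpy as np
import parselmouth
from parselmouth.praat import call

def spectral_features(wav_path, f0_min=75, f0_max=500):
    snd = parselmouth.Sound(wav_path)

    # (a) median pitch and (b) pitch range over voiced frames only
    pitch = snd.to_pitch(pitch_floor=f0_min, pitch_ceiling=f0_max)
    f0 = pitch.selected_array["frequency"]
    voiced = f0[f0 > 0]  # unvoiced frames are reported as 0 Hz
    median_pitch = float(np.median(voiced))
    pitch_range = float(voiced.max() - voiced.min())

    # (c) shimmer and (d) jitter from Praat's point process
    point_process = call(snd, "To PointProcess (periodic, cc)", f0_min, f0_max)
    jitter = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
    shimmer = call([snd, point_process], "Get shimmer (local)",
                   0, 0, 0.0001, 0.02, 1.3, 1.6)

    return {"median_pitch_hz": median_pitch, "pitch_range_hz": pitch_range,
            "jitter": jitter, "shimmer": shimmer}
```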
Right: Ambience voice adaptation approaches for a given ambience: (a) TTS adaptation using temporal and pitch features, or (b) voice conversion for spectral features.
3) Rate-of-Speech Features: Research has pointed to the decrease in speaking rate when intelligibility becomes increasingly important [46], for example, in loud environments.
We also posit that formal speech may be slower and clearer than informal speech.
Subsequently, we collected 3 rate-of-speech features: (a) voiced (silences removed) syllables per second, (b) overall (silences are not removed) syllables per second, and (c) pause rate. We computed syllables per second by taking the ratio of the number of syllables over the duration for both the voiced and overall utterances.
We defined pause rate as the number of pauses, where a pause is defined as a silence of at least 50 ms between words, over the duration of an entire utterance (both voiced and overall components).
Including the syllables per second for overall utterances along with pause rate may provide information regarding the length of pauses and how pauses may impact the length of an utterance [34].
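A minimal sketch of these three measures is given below; it assumes the syllable count for each utterance is already known (e.g., from its transcript) and uses a simple energy-based silence detector, both of which are assumptions rather than the exact procedure described above.

```python
# Sketch: voiced/overall syllables per second and pause rate.
# Syllable counts are assumed to come from transcripts; silence thresholds are illustrative.
import librosa

def rate_of_speech_features(wav_path, n_syllables, top_db=30, min_pause_s=0.05):
    y, sr = librosa.load(wav_path, sr=None)
    total_dur = len(y) / sr

    # Non-silent intervals (sample indices) from a simple dB threshold.
    intervals = librosa.effects.split(y, top_db=top_db)
    voiced_dur = sum((end - start) for start, end in intervals) / sr

    # (a) voiced and (b) overall syllables per second
    voiced_sps = n_syllables / voiced_dur if voiced_dur > 0 else 0.0
    overall_sps = n_syllables / total_dur

    # (c) pause rate: gaps of at least 50 ms between non-silent stretches,
    # normalised by the duration of the entire utterance.
    gaps = [(intervals[i + 1][0] - intervals[i][1]) / sr
            for i in range(len(intervals) - 1)]
    n_pauses = sum(1 for g in gaps if g >= min_pause_s)
    pause_rate = n_pauses / total_dur

    return {"voiced_sps": voiced_sps, "overall_sps": overall_sps,
            "pause_rate": pause_rate}
```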
IV. EXPERIMENTAL METHODS
CRANK is voice conversion software that implements several variations of VQ-VAE along with speaker adversarial training and generative adversarial networks [25].
For the current project, we used CRANK’s best performing model from [25], which is the CycleGAN VQ-VAE with Short Term Fourier Transform (STFT) loss with a speaker adversarial network.
The CycleGAN VQ-VAE is a least squares GAN (LSGAN) with a cyclic VQ-VAE.
A pretrained vocoder for speech synthesis was trained on the LJ speech dataset [47], which contains entirely female speakers, and is implemented using the Parallel WaveGAN vocoder repository [48].
Input features provided to CRANK are MLFB, pitch, aperiodicity, and spectrum [25].
A. PIPELINE
Our experimental pipeline can be found in Figure 2.
We tested several data-driven approaches, including (1) TTS adaptation based on speech rate, pause length, and pitch, and (2) voice conversion, first setting the TTS speech rate and pause length, then adjusting the voice's spectral components.
The initial procedures involved separating the human speakers by speaker ID then further separating each speaker’s clips into each of the 6 ambience conditions (speaker-ambience sample).
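In code, this grouping step might look like the sketch below; the filename convention encoding speaker ID and ambience is a hypothetical placeholder for however the dataset is actually organized.

```python
# Sketch: grouping clips into speaker-ambience samples.
# The "<speaker>_<ambience>_<n>.wav" naming convention is a hypothetical assumption.
from collections import defaultdict
from pathlib import Path

def group_speaker_ambience(clip_dir):
    batches = defaultdict(list)
    for wav in Path(clip_dir).glob("*.wav"):
        speaker_id, ambience, _ = wav.stem.split("_", 2)
        batches[(speaker_id, ambience)].append(wav)
    return batches
```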
Voice conversion using CRANK resulted in samples that were slowed down significantly compared to the original source and target speakers’ rate-of-speech.
Fig. 3. Radar graphs showing the differences amongst collected features across all female speakers in our dataset. The dotted circle represents the baseline ambience for all features.

As such, the audio samples generated from our voice conversion approach were sped up using Audacity's tempo change function, which maintains spectral envelope and pitch, to match the rate-of-speech of the original human speaker, 714.
In addition, the TTS samples and voice conversion samples were set to -10.0 dBFS so as to be normalized against the background sound, to compare the voice quality and rate-of-speech only.
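This post-processing could be approximated as in the sketch below, with librosa's pitch-preserving time stretch standing in for Audacity's tempo change and pydub handling the -10.0 dBFS normalization; these tools and the stretch rate are assumptions, not the exact toolchain used.

```python
# Sketch: speed up a clip without changing pitch, then normalise to -10 dBFS.
# librosa/pydub stand in for Audacity; the stretch rate and paths are illustrative.
import librosa
import soundfile as sf
from pydub import AudioSegment

def match_tempo_and_level(in_wav, out_wav, stretch_rate, target_dbfs=-10.0):
    # Time-stretch (rate > 1 speeds up) with a phase vocoder, preserving pitch.
    y, sr = librosa.load(in_wav, sr=None)
    y_fast = librosa.effects.time_stretch(y, rate=stretch_rate)
    sf.write(out_wav, y_fast, sr)

    # Normalise loudness so only voice quality and rate-of-speech differ.
    seg = AudioSegment.from_wav(out_wav)
    seg = seg.apply_gain(target_dbfs - seg.dBFS)
    seg.export(out_wav, format="wav")
```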
Temporal Adjustment. Initially, TTS-avg was selected as the source speaker.
However, given the perceptually significant differences in voice characteristics (e.g., natural pitch) between subjects, it was deemed more appropriate to manipulate the source speaker TTS to match that of an individual speaker.
The ambience specific pre-processing of TTS-714 was integral to the project as voice conversion software primarily uses human speech utterances as source speakers, whereas we are using a synthetic robot voice.
Adjustments to loudness and rate-of-speech were added before the voice conversion process.
This resulted in 211 temporally adjusted TTS-714 clips containing speech utterances of the scripted portions for both waiter and customer roles for each ambience condition.
This means that utterances can be generated that have never been heard by the trained model before, further adding socially and contextually appropriate data to our speaker-ambience training batches.
Finally, the waiter utterances for our source TTS were held out of training to be used as evaluation samples.
We then used 6 evaluation samples, 1 for each ambience.
B. PERCEPTION STUDY
The perception study leveraged Mechanical Turk and Survey Monkey with 25 Canadian participants who were fluent English speakers, with 100 Human Intelligence Tasks (HITs) completed with a 98% acceptance rate. The research questions were as follows:
1) How do generated voices compare to human voices within ambiences? (3a of research goals)
2) How are voice conversion generated voices perceived in context? (3b of research goals)
3) How are voice conversion generated voices perceived when paired with the incorrect ambience? (3b of research goals)
4) How does a data-driven pitch manipulation for TTS impact human perception? (3c of research goals)
Listeners were first asked to use headphones and calibrate their audio.
Next, for each of the above research questions, participants were told the provided audio sample was the voice of Pepper the robot, who was about to take their order at one of the 6 given ambience locations.
For example: "The atmosphere is warm, the music is slow and romantic, and the lights are dimmed. You have waited months to take your date out to this particular restaurant. In hopes to impress your date you wish to get the duck, the restaurant's staple item. Pepper, the robot, is going to take your order." After listening to Pepper's voice over the background sound, participants were asked to respond to 7 statements using a 7-point Likert scale ranging from 1 (strongly disagree) to 7 (strongly agree).
The following were the statements provided: (1) Pepper’s voice sounds socially appropriate for the scene (figure 1), (2) Pepper’s voice sounds robotic, (3) Pepper is aware of the surrounding ambience, (4) Pepper makes me feel comfortable, (5) Pepper makes me feel like I am in the given ambience location, (6) Pepper is too loud, and (7) Pepper is too quiet.
For the second radar graph, two similarly loud ambiences were compared.
Participants experienced a bright, lively restaurant ambience with family-style polka music and trumpet (140 BPM) or a dark night club with electronic music (125 BPM).
As expected in Lombard speech, both ambiences had a high median pitch and a lower-than-baseline pitch range; however, the pitch range for the lively restaurant was slightly higher.
It is possible that the joyful music with a large pitch range may have induced synchrony in vocal pitch patterns compared to the monotonous electronic beat.
Another avenue to investigate may be the level of white noise, which may be perceived as higher in the night club.
Shimmer, another feature representative of Lombard speech and associated with hoarseness, appears to be more pronounced in the night club than in the lively restaurant, perhaps another effect of the joyful ambience.
We once again see features associated with Lombard speech in the noisy ambience including higher energy, shimmer, median pitch and intensity with a decreased pitch range.
In contrast, the quiet bar had a higher spectral slope, pitch range, and speech rate, which may suggest the speakers had increased liveliness for this ambience.
Six treatments and a baseline, one for each of the ambiences, were applied to each of the study participants.
Each pair of study participants was independent; however, within the pair of waiter and customer we do not have independence, as synchrony and mimicking are expected to occur. There was no randomization of the order of ambiences; as such, the ambiences were applied in the same order for each experiment.
In future studies it would be beneficial to increase the number of participants and complete a full Latin Square Design to better understand carry over effect.
We completed repeated measures ANOVA (rANOVA) for each of the extracted voice features.
Due to the small sample size of participants and lack of randomization it is difficult to draw formal conclusions, yet, we suggest features that may prove useful and warrant further investigation.
Energy (p < 0.001), spectral slope (p < 0.001), max (p < 0.001) and mean (p < 0.001) intensity, pause rate (p = 0.002), and mean pitch (p = 0.06) were all significant at a significance threshold of α = 0.1.
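This per-feature analysis could be run as in the sketch below, assuming a long-format table with one row per speaker-ambience sample; the column names are hypothetical.

```python
# Sketch: repeated-measures ANOVA per extracted voice feature.
# Assumes a long-format DataFrame with hypothetical "speaker" and "ambience" columns
# plus one column per feature; statsmodels' AnovaRM handles the within-subject factor.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

def ranova_per_feature(df: pd.DataFrame, features):
    p_values = {}
    for feat in features:
        # One value per speaker per ambience condition (ambience is within-subject).
        result = AnovaRM(df, depvar=feat, subject="speaker",
                         within=["ambience"]).fit()
        p_values[feat] = result.anova_table["Pr > F"].iloc[0]
    return p_values

# Example usage (hypothetical column names):
# p = ranova_per_feature(df, ["energy", "spectral_slope", "pause_rate"])
```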
These ambiences were chosen due to their polarity in formality and loudness.
The voice conversion voice was rated the lowest for statement 1 (appropriateness), 3 (awareness), and 4 (comfort), followed by the TTS-bl; yet, TTS-bl was ranked as sounding the most robotic.
This is most likely due to the low quality samples generated by CRANK, which may indicate that more audio samples are required for each speaker-ambience training batch.
2) RQ 2: How are voice conversion generated voices perceived in context? Our 6 ambience-specific voice conversion samples were compared.
The quiet bar, noisy bar and night club (see Figure 5) were rated the highest for appropriateness, awareness, comfort and statement 5 (ambience feeling).
The fine dining condition was deemed to be the least socially appropriate, least comfortable and most robotic and the lively restaurant condition was deemed to have the least awareness and contextual appropriateness.
3) RQ 3: How are voice conversion generated voices perceived when paired with the incorrect ambience? Three components were tested: (1) the voice conversion sample for fine dining overlaid on the background sound for the café, (2) the voice conversion sample for the café overlaid on the background sound for fine dining, and (3) the voice conversion sample for the night club overlaid on the background sound for fine dining.
Component (1) resulted in a boost for statements on appropriateness, awareness, comfort and ambience feeling compared to being overlaid with their respective matching ambiences.
Component (2) also resulted in improvements compared to their respective correct pairings.
However, (1) was rated higher than (2).
This indicates that the fine dining voice may have been more suited to the café.
It is important to note that the fine dining condition was always first and so it may have taken time for participants to adjust to the experiment protocol.
The TTS-bl was rated the lowest for appropriateness, awareness, comfort and ambience feeling.
TTS-bl was deemed to be the most robotic sounding, which could contribute to why it had low ratings in other categories.
The results (as shown in Figure 6) indicate that the human-pitched TTS (TTS-low) was deemed more socially and contextually appropriate, as well as comforting, rating highest on appropriateness, comfort, and ambience feeling, and was rated as the least robotic sounding. The TTS-high condition was rated second highest for awareness, followed by TTS-low.
Altogether, using a data-driven method to alter pitch demonstrates that humans prefer a pitch that matches a specific human's pitch, particularly when that pitch matches the current social context and ambient environment.
VI. DISCUSSION AND LIMITATIONS
This work provided a novel protocol to collect realistic data in order to gain insight into how humans perceive robot voices that adapt to different ambient and social contexts.
Significant and notable features from a total of 12 speakers are also provided.
One main take away is humans prefer a human voice that matches the social and ambient context, suggesting that there is still a large gap to bridge between current TTS and human voice in these contextual scenarios.
Although humans may have preferred the TTS to voice conversion, we saw a preference for TTS that is data driven and corresponds to an individual speaker, altered to match the underlying ambient condition and social context.
The low perception ratings for voice conversion are likely due to the voice conversion’s quality, which could be attributed to the speaker-ambience batch sizes.
As voice conversion is a flexible and adaptive solution for speech synthesis, it shows promise as spectral features in our voice conversion samples were noticeably different between quiet and loud ambiences as described by raters in RQ 2 of the perception study.
In addition, the fine dining restaurant was introduced first and speakers were initially adjusting to the experiment setup; this could have impacted the features for this initial condition.
Fully randomizing the sequence conditions can be a solution.
Another limitation was that, in order to limit independent variables, there was only one phrase used for the perceptual experiments: "Hi there, I hope you're doing well."
ACKNOWLEDGMENT
The authors would like to thank Payam Jome Yazdian, Marine Chamoux, Susana Sanchez-Restrepo and Zhi Yuh Ou Yang for their valuable discussions on this work.
REFERENCES
[1] H. A. C. Maruri, S. Aslan, G. Stemmer, N. Alyuz, and L. Nachman, "Analysis of contextual voice changes in remote meetings," in Interspeech, 2021, pp. 2521–2525.
[2] A. Elkins and D. Derrick, "The sound of trust: Voice as a measurement of trust during interactions with embodied conversational agents," Group Decis. Negot., vol. 22, pp. 897–913, 2013.
[3] I. Torre, A. B. Latupeirissa, and C. McGinn, "How context shapes the appropriateness of a robot's voice," in ROMAN, 2020, pp. 215–222.
[4] S. Ivanov, U. Gretzel, K. Berezina, M. Sigala, and C. Webster, "Progress on robotics in hospitality and tourism: a review of the literature," J. Hosp. Tour. Technol., 2019.
[5] A. Henschel, G. Laban, and E. Cross, "What makes a robot social? A review of social robots from science fiction to a home or hospital near you," Curr. Robot. Rep., vol. 2, 2021.
[6] A. Bradlow, Confluent talker- and listener-oriented forces in clear speech production. Walter de Gruyter GmbH and Co. KG, 2008, pp. 241–274.
[7] D. Burnham, C. Kitamura, and U. Vollmer-Conna, "What's new, pussycat? On talking to babies and animals," Science, vol. 296, p. 1435, 2002.
[8] C. Lam and C. Kitamura, "Mommy, speak clearly: induced hearing loss shapes vowel hyperarticulation," Dev. Sci., vol. 15, no. 2, pp. 212–21, 2012.
[9] C. Mayo, V. Aubanel, and M. Cooke, "Effect of prosodic changes on speech intelligibility," in Interspeech, vol. 2, 2012.
[10] D. Pelegrin-Garcia, B. Smits, J. Brunskog, and C.-H. Jeong, "Vocal effort with changing talker-to-listener distance in different acoustic environments," J. Acoust. Soc., vol. 129, no. 4, pp. 1981–90, 2011.
[11] V. Hazan and R. Baker, "Acoustic-phonetic characteristics of speech produced with communicative intent to counter adverse listening conditions," J. Acoust. Soc., vol. 130, pp. 2139–52, 2011.
[12] J. A. Caballero, N. Vergis, X. Jiang, and M. D. Pell, "The sound of im/politeness," Speech Communication, vol. 102, pp. 39–53, 2018.
[13] M. Cooke, S. King, M. Garnier, and V. Aubanel, "The listening talker: A review of human and algorithmic context-induced modifications of speech," Comput. Speech Lang., vol. 28, no. 2, pp. 543–571, 2014.
[14] E. Lombard, "Le signe de l'élévation de la voix," Ann. Mal. de l'Oreille et du Larynx [etc.], vol. 37, pp. 101–119, 1911.
[15] S. D. Craig and N. L. Schroeder, "Text-to-speech software and learning: Investigating the relevancy of the voice effect," J. Educ. Comput. Res., vol. 57, no. 6, pp. 1534–1548, 2019.
[16] R. Vipperla, S. Park, K. Choo, S. Ishtiaq, K. Min, S. Bhattacharya, A. Mehrotra, A. G. C. P. Ramos, and N. D. Lane, "Bunched LPCNet: Vocoder for low-cost neural text-to-speech systems," 2020, [Online]. Available: arXiv:2008.04574.
[17] K.-G. Oh, C.-Y. Jung, Y.-G. Lee, and S.-J. Kim, "Real-time lip synchronization between text-to-speech (TTS) system and robot mouth," in ROMAN, 2010, pp. 620–625.
[18] D. Stanton, Y. Wang, and R. Skerry-Ryan, "Predicting expressive speaking style from text in end-to-end speech synthesis," in SLT, 2018, pp. 595–602.
[19] R. Liu, B. Sisman, G. Gao, and H. Li, "Expressive TTS training with frame and style reconstruction loss," CoRR, 2020.
[20] G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, A. Rosenberg, B. Ramabhadran, and Y. Wu, "Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior," 2020, [Online]. Available: arXiv:2002.03788.
[21] J. Alvarez, H. Francois, H. Sung, S. Choi, J. Jeong, K. Choo, K. Min, and S. Park, "Camnet: A controllable acoustic model for efficient, expressive, high-quality text-to-speech," Appl. Acoust., vol. 186, p. 108439, 2022.
[22] Z. Du, B. Sisman, K. Zhou, and H. Li, "Expressive voice conversion: A joint framework for speaker identity and emotional style transfer," 2021, [Online]. Available: arXiv:2107.03748.
[23] S. Yuan, P. Cheng, R. Zhang, W. Hao, Z. Gan, and L. Carin, "Improving zero-shot voice style transfer via disentangled representation learning," 2021, [Online]. Available: arXiv:2103.09420.
[24] B. Sisman, J. Yamagishi, S. King, and H. Li, "An overview of voice conversion and its challenges: From statistical modeling to deep learning," 2020, [Online]. Available: arXiv:2008.03648.
[25] K. Kobayashi, W.-C. Huang, Y.-C. Wu, P. L. Tobing, T. Hayashi, and T. Toda, "Crank: An open-source software for nonparallel voice conversion based on vector-quantized variational autoencoder," 2021, [Online]. Available: arXiv:2103.02858.
[26] H. Vu and M. Akagi, "Non-parallel voice conversion based on hierarchical latent embedding vector quantized variational autoencoder," in Interspeech, 2020, pp. 140–144.
[27] B. Sisman, M. Zhang, M. Dong, and H. Li, "On the study of generative adversarial networks for cross-lingual voice conversion," in ASRU, 2019, pp. 144–151.
[28] Y. Zhao, W.-C. Huang, X. Tian, J. Yamagishi, R. K. Das, T. Kinnunen, Z. Ling, and T. Toda, "Voice conversion challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion," 2020, [Online]. Available: arXiv:2008.12527.
[29] C. McGinn and I. Torre, "Can you tell the robot by the voice? An exploratory study on the role of voice in the perception of robots," in HRI, 2019, pp. 211–221.
[30] I. Torre, J. Goslin, L. White, and D. Zanatto, "Trust in artificial voices: A 'congruency effect' of first impressions and behavioural experience," in TMS, 2018.
[31] S.-L. Lee, I. Lau, S. Kiesler, and C. Y. Chiu, "Human mental models of humanoid robots," in ICRA, 2005, pp. 2767–2772.
[32] R. van den Brule, R. Dotsch, G. Bijlstra, D. Wigboldus, and P. Haselager, "Do robot performance and behavioral style affect human trust? A multi-method approach," Int. J. Soc. Robot., vol. 6, pp. 519–531, 2014.
[33] S. Kiesler, "Fostering common ground in human-robot interaction," in ROMAN, 2005, pp. 729–734.
[34] A. Matsufuji and A. Lim, "Perceptual effects of ambient sound on an artificial agent's rate of speech," in Companion of HRI, 2021, pp. 67–70.
[35] Y. Okuno, T. Kanda, M. Imai, H. Ishiguro, and N. Hagita, "Providing route directions: Design of robot's utterance, gesture, and timing," in HRI, 2009, pp. 53–60.
[36] A. Hönemann and P. Wagner, "Adaptive speech synthesis in a cognitive robotic service apartment: An overview and first steps towards voice selection," in ESSV, 2015.
[37] S. J. Sutton, P. Foulkes, D. Kirk, and S. Lawson, "Voice as a design material: Sociophonetic inspired design strategies in human-computer interaction," in ACM, 2019, pp. 1–14.
[38] N. Lubold, E. Walker, and H. Pon-Barry, "Effects of voice-adaptation and social dialogue on perceptions of a robotic learning companion," in HRI, 2016, pp. 255–262.
[39] K. Fischer, L. Naik, R. M. Langedijk, T. Baumann, M. Jelínek, and O. Palinko, Initiating Human-Robot Interactions Using Incremental Speech Adaptation. New York, NY, USA: ACM, 2021, pp. 421–425.
[40] A. Hayamizu, M. Imai, K. Nakamura, and K. Nakadai, "Volume adaptation and visualization by modeling the volume level in noisy environments for telepresence system," in Proceedings of the Second International Conference on Human-Agent Interaction. ACM, 2014, pp. 67–74.
[41] J. Sundberg and M. Nordenberg, "Effects of vocal loudness variation on spectrum balance as reflected by the alpha measure of long-term-average spectra of speech," J. Acoust. Soc., vol. 120, no. 1, pp. 453–7, 2006.
[42] A. Castellanos, J.-M. Benedí, and F. Casacuberta, "An analysis of general acoustic-phonetic features for Spanish speech produced with the Lombard effect," Speech Commun., vol. 20, no. 1, pp. 23–35, 1996.
[43] M. Müller, "Fundamentals of music processing: Audio, analysis, algorithms, applications," pp. 24–26, 2015.
[44] T. IR and P. A, "yeah," J. Speech Lang. Hear., vol. 63, no. 1, pp. 74–82, 2020.
[45] A. Lerch, An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics. Wiley, 2012.
[46] J. Krause and A. Panagiotopoulos, "Speaking clearly for older adults with normal hearing: The role of speaking rate," J. Speech Lang. Hearing, vol. 62, pp. 1–9, 2019.
[47] K. Ito and L. Johnson, "The LJ speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[48] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," 2020, [Online].