Speech separation has been shown effective for multi-talker speech
recognition. Under the ad hoc microphone array setup where the array consists
of spatially distributed asynchronous microphones, additional challenges must
be overcome as the geometry and number of microphones are unknown beforehand.
Prior studies show that, with a spatial-temporal interleaving structure, neural
networks can efficiently utilize the multi-channel signals of the ad hoc array.
In this paper, we further extend this approach to continuous speech separation.
Several techniques are introduced to enable speech separation for real
continuous recordings. First, we apply a transformer-based network for
spatio-temporal modeling of the ad hoc array signals. In addition, two methods
are proposed to mitigate a speech duplication problem during single talker
segments, which seems more severe in the ad hoc array scenarios. One method is
device distortion simulation for reducing the acoustic mismatch between
simulated training data and real recordings. The other is speaker counting to
detect the single speaker segments and merge the output signal channels.
Experimental results for AdHoc-LibriCSS, a new dataset consisting of continuous
recordings of concatenated LibriSpeech utterances obtained by multiple
different devices, show the proposed separation method can significantly
improve the ASR accuracy for overlapped speech with little performance
degradation for single talker segments.
Index Terms—ad hoc microphone array, speech separation, spatially distributed microphones, speaker counting
I. INTRODUCTION
In multi-talker automatic speech recognition (ASR), speech separation plays a critical role in improving the recognition accuracy since conventional ASR systems cannot handle overlapped speech.
While a microphone array with a known geometry has been widely used for far-field speech separation [1]–[4], some attempts have recently been made to utilize ad hoc microphone arrays for speech separation and overlapped speech recognition [5]–[8].
Compared with the fixed microphone array, the ad hoc microphone array, comprising multiple independent recording devices, provides more flexibility and allows users to use their own mobile devices, such as cellphones or laptops, to virtually form the microphone array system.
Moreover, the distributed devices can cover a wider space and thus provide more spatial diversity, which may be leveraged by the speech separation algorithms.
In [6], a guided source separation method was applied to the ad hoc array-based separation by using speaker diarization results, where a duplicate word reduction method was also proposed.
In [8], an ad hoc array-based target speech extraction method was proposed by selecting the 1-best or N-best channels for beamforming.
Transform-average-concatenate [12] and a two-stage-based method [13] were proposed for spatially unconstrained microphone arrays, but they were only evaluated with simulated data.
The previously proposed methods share a limitation that they require prior knowledge of utterance boundaries, which were often obtained from ground truth labels.
However, in a realistic conversation scenario, the boundary information of overlapped speech is not easily obtainable.
While [6] used a speaker diarization system to acquire the utterance boundaries, it was based on offline processing whereas streaming processing is desired in many applications.
In addition, in conversations, the speech overlap happens only occasionally.
Therefore, the separation system must not only deal with the overlapped speech but also preserve the speech quality for single speaker regions so as not to degrade the ASR accuracy.
For segments with no speaker overlaps, the incoming speech should be routed to one of the output channels, while the other output channels produce zero or negligible noise.
Moreover, two methods are introduced to mitigate the duplicate speech problem [4], [6] in single speaker regions, which becomes especially severe when the array consists of different microphones.
One is based on data augmentation using device distortion simulation to mimic the acoustic variations of different devices and thereby reduce the mismatch between training data and real recordings.
To enable ad hoc array-based CSS evaluation, we collected a new dataset of long-form multi-talker audio with different consumer devices including cell phones and laptops, which we call AdHoc-LibriCSS.
As with LibriCSS [15], LibriSpeech [17] utterances were concatenated and played back in different conference rooms from multiple loudspeakers to create meeting-like audio files.
Experimental results using this dataset are reported.
II. CONTINUOUS SPEECH SEPARATION WITH AD HOC MICROPHONE ARRAYS
A. Continuous speech separation
The CSS framework [14], [15], [18] attempts to cope with a long-form input signal including multiple partially overlapped or non-overlapped utterances in a streaming fashion.
CSS applies a sliding window to the input signal and performs separation within each window to produce a fixed number of separated signals (two in our experiments).
The window size and the window shift we use are 4s and 2s, respectively.
To make the output signal order consistent with that of the previous window position, the Euclidean distance is calculated between the separated signals of the current and previous windows over the overlapped frames between the two window positions for all possible output permutations.
The output order with the lowest distance is then selected.
The separated signals are then concatenated with the overlap-add technique.
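The stitching procedure described above can be summarized in a short sketch. This is a minimal illustration only, assuming the per-window separator outputs are available as NumPy arrays and that uniform overlap-add weighting is acceptable; the function name and interface are ours, not from the paper.

```python
import itertools
import numpy as np

def stitch_css_windows(window_outputs, hop):
    """Stitch per-window separated signals into continuous output streams.

    window_outputs: list of arrays of shape (2, win_len), one per sliding
    window position (windows are shifted by `hop` samples or frames).
    The output order of each window is aligned to the previous window by
    comparing the Euclidean distance over the overlapped region for all
    output permutations, then the windows are combined with overlap-add.
    """
    num_out, win_len = window_outputs[0].shape
    overlap = win_len - hop
    total_len = hop * (len(window_outputs) - 1) + win_len
    out = np.zeros((num_out, total_len))
    weight = np.zeros(total_len)

    prev = None
    for i, cur in enumerate(window_outputs):
        if prev is not None and overlap > 0:
            # Pick the permutation with the smallest distance over the
            # frames shared by the previous and current window positions.
            best = min(
                itertools.permutations(range(num_out)),
                key=lambda p: np.linalg.norm(prev[:, hop:] - cur[list(p), :overlap]),
            )
            cur = cur[list(best), :]
        start = i * hop
        out[:, start:start + win_len] += cur
        weight[start:start + win_len] += 1.0
        prev = cur
    return out / np.maximum(weight, 1e-8)
```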
B. Transformer-based spatio-temporal modeling
Fig. 1 shows the overall architecture and the spatio-temporal processing block of our separation model.
The model consists of stacked spatio-temporal processing blocks, which adopt a transformer-based (or, more precisely, transformer encoder-based) architecture [19].
The input to the separation model is a three-dimensional tensor comprising a multi-channel amplitude spectrogram, followed by global normalization [5].
In the spatio-temporal processing block, a cross-channel self-attention layer exploits nonlinear spatial correlation between different channels and was shown to be effective in [5].
A cross-frame self-attention layer allows the network to efficiently capture a long-range acoustic context [16], [20], [21].
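A minimal PyTorch sketch of one such block is given below, assuming a (batch, channels, frames, features) input. The feed-forward sublayer of [19] and other details are omitted, and the exact sublayer ordering is an assumption; the class name is ours.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """One processing block: cross-channel self-attention followed by
    cross-frame self-attention, each with a residual connection and
    layer normalization (transformer-encoder style)."""

    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.channel_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, channels, frames, dim)
        b, c, t, d = x.shape

        # Cross-channel self-attention: attend across microphones per frame.
        y = x.permute(0, 2, 1, 3).reshape(b * t, c, d)
        y = self.norm1(y + self.channel_attn(y, y, y)[0])
        x = y.reshape(b, t, c, d).permute(0, 2, 1, 3)

        # Cross-frame self-attention: attend across time within each channel.
        z = x.reshape(b * c, t, d)
        z = self.norm2(z + self.frame_attn(z, z, z)[0])
        return z.reshape(b, c, t, d)
```

Stacking several such blocks, as described in Sec. IV-C, yields the separation model backbone.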
III. ADDRESSING THE SPEECH DUPLICATION PROBLEM
In real meetings, single speaker regions occupy most of the meeting time [25].
Therefore, it is crucial for speech separation systems to preserve the audio quality for the single speaker regions while performing speech separation for the overlapped regions.
Models trained with permutation invariant training (PIT) [26] tend to generate zero signals when there are fewer speakers than the model’s output channels [1].
However, in the ad hoc microphone array settings, we observed that a resultant model still sometimes generated two output signals for a single speaker voice even when trained on both single- and multitalker segments.
This results in a high insertion error rate for ASR.
This problem is more severe for the ad hoc microphone arrays as the same single speaker voice captured by different microphones can be acoustically very different.
A. Data augmentation with device distortion simulation
Device distortion simulation is a data augmentation scheme to reduce the mismatch between simulated training data and real multi-channel recordings obtained with different devices.
Each step involves variable parameters, which are randomly chosen within a pre-set range for each microphone.
The implementation details are described in Sec. IV-B.
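As a rough illustration of per-microphone device distortion simulation, the sketch below applies a few plausible distortions (random gain, low-pass filtering to mimic differing frequency responses, and soft clipping) with randomly drawn parameters. The specific operations and parameter ranges here are assumptions for illustration, not the exact recipe used in this work.

```python
import numpy as np
from scipy.signal import butter, lfilter

def simulate_device_distortion(wave, sr=16000, rng=None):
    """Apply randomized, device-like distortions to one microphone signal.
    Each distortion parameter is drawn from a pre-set range per microphone;
    the ranges below are placeholders."""
    rng = rng or np.random.default_rng()

    # Random gain in dB to mimic different device sensitivities.
    gain_db = rng.uniform(-6.0, 6.0)
    y = wave * (10.0 ** (gain_db / 20.0))

    # Random low-pass filtering to mimic differing frequency responses.
    cutoff = rng.uniform(3500.0, 7500.0)  # Hz
    b, a = butter(4, cutoff / (sr / 2), btype="low")
    y = lfilter(b, a, y)

    # Soft clipping to mimic limited dynamic range.
    limit = rng.uniform(0.5, 1.0)
    y = limit * np.tanh(y / limit)
    return y
```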
B. Output signal merger based on speaker counting
To further mitigate the speaker duplication issue, we apply speaker counting in each CSS processing window.
When zero or one speaker is detected, the output signals of the separation model are merged into either one of the output channels by taking their sum.
We then produce a zero signal from the other channel.
Speaker counting is performed on a single randomly chosen channel signal to avoid speaker counting errors caused by the data mismatch between the multi-channel simulated training data and real recordings.
A transformer-BLSTM model similar to the speech separation model is trained for speaker counting.
The model structure is the same as Fig. 1 except that the speaker counting model does not have cross-channel self-attention layers as it is based on a single-channel input.
The model input is an STFT of a randomly chosen single-channel signal.
The model generates a frame-level speaker counting signal.
We examine two output types for speaker counting.
One model, which we call s1 in the experiment section, has a two-output linear layer followed by sigmoid nonlinearity for voice activity detection (VAD) for each speaker.
In both cases, we also add speech separation nodes and perform multi-task learning, which might help better align the speaker counting learning with speech separation.
For each CSS processing window, we determine whether there are multiple speakers in the currently processed window based on the model output and a predetermined threshold.
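A sketch of the per-window merging rule follows. Only the merge-and-zero behavior is taken from the description above; the reduction of the frame-level counting output to a single per-window decision and the threshold value are illustrative assumptions.

```python
import numpy as np

def merge_if_single_speaker(separated, frame_counts, threshold=1.5):
    """Per-window output merger based on speaker counting.

    separated:    (2, length) separated signals for the current CSS window.
    frame_counts: frame-level output of the counting model, reduced here to a
                  per-window average for the decision (illustrative choice).
    If at most one speaker is detected, the two outputs are summed into the
    first channel and the other channel is set to zero.
    """
    multiple_speakers = np.mean(frame_counts) > threshold
    if multiple_speakers:
        return separated
    merged = np.zeros_like(separated)
    merged[0] = separated.sum(axis=0)  # route all speech to one channel
    return merged
```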
IV. EXPERIMENTS
A. Evaluation data
Following the development of LibriCSS [15], we designed and recorded a new dataset, namely AdHoc-LibriCSS, for evaluation of ad hoc array-based speech separation and multi-talker speech recognition algorithms under acoustically realistic conditions.
The recordings were made with multiple devices such as cell phones and laptops.
As with LibriCSS, the new dataset comprises multiple mini-sessions.
Two different recording conditions are considered, which we refer to as 2-speaker and 5-speaker scenarios. The details of these two recording conditions are shown in Table I.

TABLE I
RECORDING SETUP DETAILS

                                      2-speaker        5-speaker
#loudspeakers                         2                5
room                                  personal office  meeting room
duration per mini-session             4 mins           10 mins
#subsets / #mini-sessions per subset  4/20             4/8
#recording devices                    5                5
There are four subsets, dev-no-overlap, dev-overlap, test-no-overlap, and test-overlap, where the dev-∗ and test-∗ subsets use the LibriSpeech dev-clean and test-clean utterances, respectively.
For each mini-session, we first sampled N ∈ {2, 5} speakers from the LibriSpeech dev or test set [17] while ensuring that each utterance from every speaker was used only once in the recording.
We then re-arranged and concatenated the utterances from each sampled speaker to form a simulated conversation, which was played by N loudspeakers placed in a room.
For each mini-session, all raw recordings from different devices were synchronized using cross-correlation before separation.
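A minimal sketch of such cross-correlation alignment is shown below, assuming only a constant sample offset between devices (clock drift and resampling are not handled); the function name is ours.

```python
import numpy as np
from scipy.signal import correlate

def align_to_reference(reference, recording):
    """Estimate the lag of `recording` relative to `reference` by
    cross-correlation and shift the recording so the two signals line up.
    Only a constant offset is compensated here."""
    n = min(len(reference), len(recording))
    corr = correlate(recording[:n], reference[:n], mode="full")
    lag = int(np.argmax(corr)) - (n - 1)  # positive lag: recording is late
    if lag > 0:
        aligned = recording[lag:]
    else:
        aligned = np.concatenate([np.zeros(-lag), recording])
    return aligned[:n]
```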
B. Training data
A training set consisting of 375 hours of artificially mixed speech was constructed for speech separation and speaker counting model training.
We divided the training data into five categories based on the overlap style as proposed in [1]: 40% for single speaker segments, 9% for inclusive overlap segments, 6% for sequential overlap segments, 36% for full overlap segments, and 9% for partial overlap segments.
Speaker and microphone locations as well as room dimensions were randomly determined to simulate the ad hoc array setting as described in [5], where room impulse responses were generated with the image method [27].
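For illustration, the overlap-style proportions listed above can be drawn per simulated mixture as follows; the mixture construction itself (utterance selection, reverberation, and device distortion simulation) is omitted, and the helper name is ours.

```python
import numpy as np

# Overlap-style proportions of the 375-hour training set described above.
OVERLAP_STYLES = {
    "single_speaker": 0.40,
    "inclusive_overlap": 0.09,
    "sequential_overlap": 0.06,
    "full_overlap": 0.36,
    "partial_overlap": 0.09,
}

def sample_overlap_style(rng=None):
    """Draw the overlap category for one simulated training mixture."""
    rng = rng or np.random.default_rng()
    styles = list(OVERLAP_STYLES)
    probs = list(OVERLAP_STYLES.values())
    return rng.choice(styles, p=probs)
```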
C. Training schemes
For a separation model, the input waveform of each channel was transformed into an STFT representation with 257 frequency bins every 16 ms. Layer normalization was performed on the input magnitude spectrum vectors.
Three spatio-temporal processing blocks were stacked.
The self-attention layers for spatial modeling and temporal modeling both had 128-dimensional embedding spaces and eight attention heads.
Two BLSTM layers and a final linear layer are stacked on top.
The VAD-based s1 model had a sigmoid activation function to produce two VAD signals.
For both models, we performed multi-task learning by using speech separation as an auxiliary task.
It should be noted that, for s1 model training, PIT was independently applied to speech separation and VAD estimation.
Both speaker counting models adopted an MSE loss for training.
The separation loss and the speaker counting loss were given an equal weight.
At test time, the separation output was ignored.
Model training was continued until the validation loss did not decrease for 10 consecutive epochs.
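A sketch of the training objective for the counting models is given below, reflecting the equal-weight multi-task setup and the independently applied PIT for the s1 model. The MSE form of the counting loss follows the description above; the exact form of the separation loss is not specified here, so its MSE formulation is an assumption.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_mse(pred, target):
    """Permutation-invariant MSE over the two-output dimension (dim 1)."""
    losses = []
    for perm in itertools.permutations(range(pred.shape[1])):
        losses.append(F.mse_loss(pred[:, list(perm)], target))
    return torch.stack(losses).min()

def counting_multitask_loss(sep_pred, sep_target, vad_pred, vad_target):
    """Equal-weight sum of the separation (auxiliary) loss and the
    counting/VAD loss, with PIT applied to each task independently
    as done for the s1 model."""
    return pit_mse(sep_pred, sep_target) + pit_mse(vad_pred, vad_target)
```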
D. Evaluation scheme
For each mini-session, the CSS module using the trained separation model generated two output streams, each of which was then processed by a speech recognizer.
Then, the recognition outputs were evaluated with asclite [28], [29], which can align multiple (two in this work) hypotheses against multiple reference transcriptions.
We used an in-house hybrid ASR system [30] with 5-gram decoding trained on 33k hours of audio, including close-talking, distant-microphone, and artificially corrupted speech.
E. Results and discussions
Tables II and III show the WER results for various overlap ratios for the 2-speaker and 5-speaker scenarios, respectively.
For the dev-overlap and test-overlap subsets, the results are broken down by the mini-session overlap ratio.
For each setting, we present the results of the following systems: (1) ASR applied to a randomly chosen channel without speech separation (ori); (2) ASR applied to the signals separated by the model trained without data augmentation (sep); (3) ASR applied to the signals separated by the model trained on device distortion simulated data (sep+dis); (4) systems performing speaker counting-based channel merger on top of (3) (sep+dis+spk-cnt).
The results show that the separation model improved the WER for highly overlapped cases, but it resulted in significant degradation for less overlapped cases without the proposed duplication mitigation methods.
However, the WER degradation for the no-overlap subsets was still significant for both the 2-speaker and 5-speaker cases.
The channel merger processing using speaker counting mostly solved this problem, resulting in significant WER improvement for the highly overlapped data without compromising the ASR accuracy for the no-overlap subset.
Among the two speaker counting schemes, the s2 system outperformed the s1 system in the 2-speaker scenario for almost all overlap conditions.
In the 5-speaker case, both models performed equally well.
V. CONCLUSIONS
We described a CSS system for ad hoc microphone arrays.
A transformer-based architecture was applied for separation.
To mitigate the speech duplicating problem for non-overlapped segments, we proposed data augmentation based on device distortion simulation to reduce the mismatch between training data and the real recordings obtained with spatially distributed devices.
The use of speaker counting was also introduced to further mitigate the issue.
Multi-talker ASR experiments were performed by using newly recorded AdHoc-LibriCSS, showing that the proposed system significantly improved the ASR accuracy for recordings including various degrees of overlaps while retaining the WER for non-overlapped speech.
REFERENCES
[1] T. Yoshioka, H. Erdogan, Z. Chen, and F. Alleva, "Multi-microphone neural speech separation for far-field multi-talker speech recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5739-5743.
[2] F. Bahmaninezhad, J. Wu, R. Gu, S. Zhang, Y. Xu, M. Yu, and D. Yu, "A comprehensive study of speech separation: Spectrogram vs waveform separation," in Proc. Interspeech, 2019, pp. 4574-4578.
[3] X. Chang, W. Zhang, Y. Qian, J. Le Roux, and S. Watanabe, "MIMO-SPEECH: End-to-end multi-channel multi-speaker speech recognition," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec. 2019.
[4] Z. Wang, P. Wang, and D. Wang, "Multi-microphone complex spectral mapping for utterance-wise and continuous speaker separation," arXiv, 2020.
[5] D. Wang, Z. Chen, and T. Yoshioka, "Neural speech separation using spatially distributed microphones," in Proc. Interspeech, 2020.
[6] S. Horiguchi, Y. Fujita, and K. Nagamatsu, "Utterance-wise meeting transcription system using asynchronous distributed microphones," in Proc. Interspeech, 2020.
[7] S. Horiguchi, Y. Fujita, and K. Nagamatsu, "Block-online guided source separation," arXiv, 2020.
[8] Z. Yang, S. Guan, and X. Zhang, "Deep ad-hoc beamforming based on speaker extraction for target-dependent speech separation," arXiv, 2020.
[9] Z. Liu, "Sound source separation with distributed microphone arrays in the presence of clock synchronization errors," in Proc. International Workshop for Acoustic Echo and Noise Control (IWAENC), 2008, pp. 14-17.
[10] S. Araki, N. Ono, K. Kinoshita, and M. Delcroix, "Meeting recognition with asynchronous distributed microphone array using block-wise refinement of mask-based MVDR beamformer," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 5694-5698.
[11] T. Yoshioka, D. Dimitriadis, A. Stolcke, W. Hinthorn, Z. Chen, M. Zeng, and X. Huang, "Meeting transcription using asynchronous distant microphones," in Proc. Interspeech, 2019, pp. 2968-2972.
[12] Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka, "End-to-end microphone permutation and number invariant multi-channel speech separation," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[13] N. Furnon, R. Serizel, I. Illina, and S. Essid, "Distributed speech separation in spatially unconstrained microphone arrays," arXiv, 2020.
[14] T. Yoshioka, I. Abramovski, C. Aksoylar, Z. Chen, M. David, D. Dimitriadis, Y. Gong, I. Gurvich, X. Huang, Y. Huang, A. Hurvitz, L. Jiang, S. Koubi, E. Krupka, I. Leichter, C. Liu, P. Parthasarathy, A. Vinnikov, L. Wu, X. Xiao, W. Xiong, H. Wang, Z. Wang, J. Zhang, Y. Zhao, and T. Zhou, "Advances in online audio-visual meeting transcription," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
[15] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, and J. Li, "Continuous speech separation: dataset and analysis," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[16] S. Chen, Y. Wu, Z. Chen, J. Wu, J. Li, T. Yoshioka, C. Wang, S. Liu, and M. Zhou, "Continuous speech separation with conformer," arXiv, 2020.
[17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206-5210.
[18] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, and F. Alleva, "Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks," in Proc. Interspeech, 2018, pp. 3038-3042.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. NeurIPS, 2017, pp. 1-11.
[20] C. Liu and Y. Sato, "Self-attention for multi-channel speech separation in noisy and reverberant environments," in Proceedings of APSIPA Annual Summit and Conference, 2020.
[21] C. Zheng, X. Peng, Y. Zhang, S. Srinivasan, and Y. Lu, "Interactive speech and noise modeling for speech enhancement," in Proceedings of AAAI Conference on Artificial Intelligence, 2021.
[23] J. Heymann, L. Drude, and R. Haeb-Umbach, "A generic neural acoustic beamforming architecture for robust multi-channel speech processing," Computer Speech and Language, vol. 46, pp. 374-385, 2017.
[24] C. Boeddeker, H. Erdogan, T. Yoshioka, and R. Haeb-Umbach, "Exploring practical aspects of neural mask-based beamforming for far-field speech recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6697-6701.
[25] Ö. Çetin and E. Shriberg, "Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: insights for automatic speech recognition," in INTERSPEECH, 2006.
[26] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 241-245.
[27] D. Diaz-Guerra, A. Miguel, and J. R. Beltran, "gpuRIR: A python library for room impulse response simulation with GPU acceleration," arXiv, 2018.
[28] J. Fiscus, J. Ajot, N. Radde, and C. Laprun, "Multiple dimension Levenshtein edit distance calculations for evaluating automatic speech recognition systems during simultaneous speech," in Proceedings of Language Resources and Evaluation (LREC), 2006.
[29] "https://github.com/usnistgov/sctk," 2018.
[30] S. Xue and Z. Yan, "Improving latency-controlled BLSTM acoustic models for online speech recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5340-5344.