Abstract: Speaker recognition deals with recognizing speakers by their speech.
Strategies related to speaker recognition may exploit speech timbre properties,
accent, speech patterns, and so on. Supervised speaker recognition has been
extensively investigated. However, on close review of the literature, we found
that unsupervised speaker recognition systems mostly depend on domain
adaptation policies. This paper introduces a speaker recognition strategy dealing
with unlabeled data, which generates clusterable embedding vectors from small
fixed-size speech frames. The unsupervised training strategy involves an
assumption that a small speech segment should include a single speaker.
Based on this assumption, we construct pairwise constraints to train twin
deep learning architectures with noise augmentation policies that generate
speaker embeddings. Without relying on domain adaptation, the process
produces clusterable speaker embeddings without supervision, which we name
unsupervised vectors (u-vectors). The evaluation is conducted on two popular
English-language speaker recognition datasets, TIMIT and LibriSpeech.
We also include a Bengali dataset, Bengali ASR, to illustrate the diversity of
domain shifts that speaker recognition systems face. Finally, we conclude that
the proposed approach achieves remarkable performance using pairwise