Speaker Fuzzy Fingerprints: Benchmarking Text-Based Identification in Multiparty Dialogues
- URL: http://arxiv.org/abs/2504.14963v1
- Date: Mon, 21 Apr 2025 08:44:33 GMT
- Title: Speaker Fuzzy Fingerprints: Benchmarking Text-Based Identification in Multiparty Dialogues
- Authors: Rui Ribeiro, Luísa Coheur, João P. Carvalho
- Abstract summary: We explore the use of fuzzy fingerprints from large pre-trained models to improve text-based speaker identification. We integrate speaker-specific tokens and context-aware modeling, demonstrating that conversational context significantly boosts accuracy. We analyze ambiguous utterances and propose a mechanism to detect speaker-agnostic lines.
- Score: 2.048226951354646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speaker identification using voice recordings leverages unique acoustic features, but this approach fails when only textual data is available. Few approaches have attempted to tackle the problem of identifying speakers solely from text, and the existing ones have primarily relied on traditional methods. In this work, we explore the use of fuzzy fingerprints from large pre-trained models to improve text-based speaker identification. We integrate speaker-specific tokens and context-aware modeling, demonstrating that conversational context significantly boosts accuracy, reaching 70.6% on the Friends dataset and 67.7% on the Big Bang Theory dataset. Additionally, we show that fuzzy fingerprints can approximate full fine-tuning performance with fewer hidden units, offering improved interpretability. Finally, we analyze ambiguous utterances and propose a mechanism to detect speaker-agnostic lines. Our findings highlight key challenges and provide insights for future improvements in text-based speaker identification.
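As an illustrative aside (not code from the paper), the fuzzy-fingerprint idea can be sketched as follows: each speaker is summarised by their top-k ranked features, each assigned a membership that decays with rank, and a new utterance is attributed to the speaker whose fingerprint it overlaps most. The feature source, decay function, similarity, and all names below are assumptions made purely for illustration.

```python
import numpy as np

def fuzzy_fingerprint(feature_scores, k=64):
    """Keep the top-k features by score; membership decays linearly with rank."""
    k = min(k, len(feature_scores))
    top = np.argsort(feature_scores)[::-1][:k]
    membership = 1.0 - np.arange(k) / k          # rank 0 -> 1.0, last rank -> 1/k
    return dict(zip(top.tolist(), membership))

def fuzzy_similarity(speaker_fp, utterance_fp):
    """Sum of minimum memberships over shared features, normalised by the utterance mass."""
    shared = speaker_fp.keys() & utterance_fp.keys()
    overlap = sum(min(speaker_fp[f], utterance_fp[f]) for f in shared)
    return overlap / (sum(utterance_fp.values()) or 1.0)

# Toy data: synthetic "activation" vectors stand in for hidden-unit statistics
# of a pre-trained model (an assumption here, not the paper's features).
rng = np.random.default_rng(0)
speaker_scores = {name: rng.random(512) for name in ["Ross", "Monica", "Chandler"]}
speaker_fps = {name: fuzzy_fingerprint(v) for name, v in speaker_scores.items()}

utterance_scores = speaker_scores["Monica"] + 0.05 * rng.random(512)
utterance_fp = fuzzy_fingerprint(utterance_scores)
prediction = max(speaker_fps, key=lambda s: fuzzy_similarity(speaker_fps[s], utterance_fp))
print(prediction)  # expected: "Monica" under this toy setup
```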
Related papers
- ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification [48.98768967435808]
We use a computational method to verify whether an utterance matches the identity of an enrolled speaker. Despite much success, speaker verification systems that offer explainable results are still lacking. This paper proposes a novel approach, the Explainable Phonetic Trait-Oriented (ExPO) network, to introduce the speaker's phonetic traits.
arXiv Detail & Related papers (2025-01-10T05:53:37Z)
- VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features [13.922091192207718]
Sarcasm recognition aims to identify hidden sarcastic, criticizing, and metaphorical information embedded in everyday dialogue.
We propose a novel approach that combines a lightweight depth attention module with a self-regulated ConvNet to concentrate on the most crucial features of visual data.
We have also conducted a cross-dataset analysis to test the adaptability of VyAnG-Net with unseen samples from another dataset, MUStARD++.
arXiv Detail & Related papers (2024-08-05T15:36:52Z)
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z)
- Personalizing Keyword Spotting with Speaker Information [11.4457776449367]
Keyword spotting systems often struggle to generalize to a diverse population with various accents and age groups.
We propose a novel approach that integrates speaker information into keyword spotting using Feature-wise Linear Modulation (FiLM); a minimal FiLM sketch follows this entry.
Our proposed approach requires only a small 1% increase in the number of parameters, with minimal impact on latency and computational cost.
arXiv Detail & Related papers (2023-11-06T12:16:06Z)
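As a hedged illustration of FiLM conditioning (not the paper's implementation), the sketch below predicts a per-feature scale and shift from a speaker embedding and applies them to keyword-spotting encoder features; every dimension and module name is made up.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift features with
    parameters predicted from a conditioning (speaker) embedding."""
    def __init__(self, feat_dim: int, cond_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, feat_dim)
        self.to_beta = nn.Linear(cond_dim, feat_dim)

    def forward(self, features, speaker_emb):
        # features: (batch, time, feat_dim); speaker_emb: (batch, cond_dim)
        gamma = self.to_gamma(speaker_emb).unsqueeze(1)  # broadcast over time
        beta = self.to_beta(speaker_emb).unsqueeze(1)
        return gamma * features + beta

# Toy usage with made-up dimensions.
film = FiLM(feat_dim=64, cond_dim=128)
x = torch.randn(8, 100, 64)   # keyword-spotting encoder features
spk = torch.randn(8, 128)     # speaker embedding from an external model
y = film(x, spk)              # speaker-conditioned features, same shape as x
```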
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Cross-Lingual Speaker Identification Using Distant Supervision [84.51121411280134]
We propose a speaker identification framework that addresses issues such as lack of contextual reasoning and poor cross-lingual generalization.
We show that the resulting model outperforms previous state-of-the-art methods on two English speaker identification benchmarks by up to 9% in accuracy, and by 5% when using only distant supervision.
arXiv Detail & Related papers (2022-10-11T20:49:44Z)
- Improving speaker de-identification with functional data analysis of f0 trajectories [10.809893662563926]
Formant modification is a simpler, yet effective method for speaker de-identification which requires no training data.
This study introduces a novel speaker de-identification method, which, in addition to simple formant shifts, manipulates f0 trajectories based on functional data analysis.
The proposed speaker de-identification method conceals plausibly identifying pitch characteristics in a phonetically controllable manner and improves formant-based speaker de-identification by up to 25%.
arXiv Detail & Related papers (2022-03-31T01:34:15Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts power-set encoded labels (a toy encoding sketch follows this entry).
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
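As a toy illustration of power-set label encoding (an assumption for illustration, not the SEND implementation), each frame's label can index the subset of speakers active in that frame:

```python
import numpy as np

def powerset_encode(activity):
    """activity: (frames, num_speakers) binary matrix -> (frames,) integer labels,
    where bit s of a label is set iff speaker s is active in that frame."""
    weights = 1 << np.arange(activity.shape[1])   # 1, 2, 4, ...
    return activity @ weights

activity = np.array([[1, 0, 0],    # only speaker 0
                     [1, 1, 0],    # overlap: speakers 0 and 1
                     [0, 0, 0]])   # silence
print(powerset_encode(activity))   # -> [1 3 0]
```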
- A Lightweight Speaker Recognition System Using Timbre Properties [0.5708902722746041]
We propose a lightweight text-independent speaker recognition model based on a random forest classifier.
It also introduces new features that are used for both speaker verification and identification tasks.
The prototype uses the seven most actively searched properties: boominess, brightness, depth, hardness, timbre, sharpness, and warmth (a minimal sketch follows this entry).
arXiv Detail & Related papers (2020-10-12T07:56:03Z)
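A minimal sketch of the timbre-property idea above (placeholder data and hyperparameters, not the paper's pipeline): each utterance is reduced to the seven properties listed, and a random forest predicts the speaker.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

PROPERTIES = ["boominess", "brightness", "depth", "hardness",
              "timbre", "sharpness", "warmth"]

rng = np.random.default_rng(0)
X_train = rng.random((200, len(PROPERTIES)))   # placeholder property vectors
y_train = rng.integers(0, 5, size=200)         # placeholder speaker labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(rng.random((1, len(PROPERTIES)))))  # predicted speaker id
```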
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)