Shared Multi-modal Embedding Space for Face-Voice Association
- URL: http://arxiv.org/abs/2512.04814v1
- Date: Thu, 04 Dec 2025 14:04:15 GMT
- Title: Shared Multi-modal Embedding Space for Face-Voice Association
- Authors: Christopher Simic, Korbinian Riedhammer, Tobias Bocklet
- Abstract summary: The FAME 2026 challenge comprises two demanding tasks: training face-voice associations and testing on languages on which the model was not trained. Our approach consists of separate uni-modal processing pipelines with general face and voice feature extraction, complemented by additional age-gender feature extraction to support prediction. Our approach achieved first place in the FAME 2026 challenge, with an average Equal-Error Rate (EER) of 23.99%.
- Score: 21.92195248206171
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The FAME 2026 challenge comprises two demanding tasks: training face-voice associations combined with a multilingual setting that includes testing on languages on which the model was not trained. Our approach consists of separate uni-modal processing pipelines with general face and voice feature extraction, complemented by additional age-gender feature extraction to support prediction. The resulting single-modal features are projected into a shared embedding space and trained with an Adaptive Angular Margin (AAM) loss. Our approach achieved first place in the FAME 2026 challenge, with an average Equal-Error Rate (EER) of 23.99%.
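The training step the abstract describes, projecting uni-modal features into a shared space and training with an angular-margin softmax, can be sketched as follows. This is a minimal NumPy illustration of the additive/adaptive angular margin (AAM) loss family the abstract names, not the authors' implementation; the margin and scale values are illustrative assumptions.

```python
import numpy as np

def aam_softmax_loss(embeddings, class_weights, labels, margin=0.2, scale=30.0):
    """Angular-margin softmax loss on L2-normalized embeddings.

    `margin` and `scale` are hypothetical hyperparameter values; the
    challenge system's actual settings are not given in the abstract.
    """
    # Normalize embeddings and class centers so logits are cosines.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = e @ w.T                                   # cosine to each class
    theta = np.arccos(np.clip(cos, -1.0, 1.0))      # angle to each class
    rows = np.arange(len(labels))
    # Add the angular margin only on the ground-truth class.
    adjusted = cos.copy()
    adjusted[rows, labels] = np.cos(theta[rows, labels] + margin)
    logits = scale * adjusted
    # Numerically stable cross-entropy over the margin-adjusted logits.
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[rows, labels] + 1e-12))
```

In a face-voice setup, face and voice embeddings of the same identity would share one set of class centers, so the margin pushes both modalities toward the same region of the shared space.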
Related papers
- Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge [0.0]
We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge. The BEHAVIOR Challenge is a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation. Our approach achieves 26% q-score across all 50 tasks on both public and private leaderboards.
arXiv Detail & Related papers (2025-12-07T18:08:45Z) - SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work [87.9341538630949]
The first Sign Language Production Challenge was held as part of the third SLRTP Workshop at CVPR 2025. The competition's aims are to evaluate architectures that translate from spoken language sentences to a sequence of skeleton poses. This paper presents the challenge design and the winning methodologies.
arXiv Detail & Related papers (2025-08-09T11:57:33Z) - Face-voice Association in Multilingual Environments (FAME) 2026 Challenge Evaluation Plan [24.480174322626155]
The Face-voice Association in Multilingual Environments (FAME) 2026 Challenge focuses on exploring face-voice association under a multilingual scenario. This report provides the details of the challenge, dataset, baseline models, and task details for the FAME Challenge.
arXiv Detail & Related papers (2025-08-06T16:09:47Z) - Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks [112.6716697906318]
We present Dynamic-SUPERB Phase-2, an open benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks, expanding the benchmark to a total of 180 tasks. Evaluation results show that no model performed well universally.
arXiv Detail & Related papers (2024-11-08T06:33:22Z) - System Description for the Displace Speaker Diarization Challenge 2023 [0.0]
This paper describes our solution for the Diarization of Speaker and Language in Conversational Environments Challenge (Displace 2023).
We used a combination of VAD for finding segments with speech, a ResNet-based CNN for feature extraction from these segments, and spectral clustering for grouping the features.
arXiv Detail & Related papers (2024-06-20T21:40:02Z) - The MuSe 2024 Multimodal Sentiment Analysis Challenge: Social Perception and Humor Recognition [64.5207572897806]
The Multimodal Sentiment Analysis Challenge (MuSe) 2024 addresses two contemporary multimodal affect and sentiment analysis problems.
In the Social Perception Sub-Challenge (MuSe-Perception), participants will predict 16 different social attributes of individuals.
The Cross-Cultural Humor Detection Sub-Challenge (MuSe-Humor) dataset expands upon the Passau Spontaneous Football Coach Humor dataset.
arXiv Detail & Related papers (2024-06-11T22:26:20Z) - Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z) - ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z) - A Study on the Integration of Pipeline and E2E SLU systems for Spoken Semantic Parsing toward STOP Quality Challenge [33.89616011003973]
We describe our proposed spoken semantic parsing system for the quality track (Track 1) in Spoken Language Understanding Grand Challenge.
Strong automatic speech recognition (ASR) models like Whisper and pretrained language models (LMs) like BART are utilized in our SLU framework to boost performance.
We also investigate output-level combination of various models, achieving an exact match accuracy of 80.8, which won 1st place at the challenge.
arXiv Detail & Related papers (2023-05-02T17:25:19Z) - The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests [66.24715220997547]
The ACM Multimedia 2023 Paralinguistics Challenge addresses two different problems for the first time under well-defined conditions.
In the Emotion Share Sub-Challenge, a regression on speech has to be made; and in the Requests Sub-Challenge, requests and complaints need to be detected.
We describe the Sub-Challenges, baseline feature extraction, and classifiers based on the usual ComParE features, the auDeep toolkit, and deep feature extraction from pre-trained CNNs using the DeepSpectrum toolkit.
arXiv Detail & Related papers (2023-04-28T14:42:55Z) - Burst2Vec: An Adversarial Multi-Task Approach for Predicting Emotion, Age, and Origin from Vocal Bursts [49.31604138034298]
Burst2Vec uses pre-trained speech representations to capture acoustic information from raw waveforms.
Our models achieve a relative 30% performance gain over baselines using pre-extracted features.
arXiv Detail & Related papers (2022-06-24T18:57:41Z) - Handshakes AI Research at CASE 2021 Task 1: Exploring different
approaches for multilingual tasks [0.22940141855172036]
The aim of the CASE 2021 Shared Task 1 was to detect and classify socio-political and crisis event information in a multilingual setting.
Our submission contained entries in all of the subtasks, and the scores obtained validated our research findings.
arXiv Detail & Related papers (2021-10-29T07:58:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.