Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan
- URL: http://arxiv.org/abs/2404.09342v3
- Date: Mon, 22 Jul 2024 08:43:12 GMT
- Title: Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan
- Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Salman Tahir, Rohan Kumar Das, Muhammad Zaigham Zaheer, Marta Moscati, Markus Schedl, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf
- Abstract summary: The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under the unique condition of a multilingual scenario.
This report provides details of the challenge, dataset, baselines, and tasks for the FAME Challenge.
- Score: 29.23176868272216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advances in technology have led to the use of multimodal systems in various real-world applications, among which audio-visual systems are some of the most widely used. In recent years, associating the face and voice of a person has gained attention due to the unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under the unique condition of a multilingual scenario. This condition is inspired by the fact that half of the world's population is bilingual, and people most often communicate in multilingual settings. The challenge uses the Multilingual Audio-Visual (MAV-Celeb) dataset for exploring face-voice association in multilingual environments. This report provides details of the challenge, dataset, baselines, and tasks for the FAME Challenge.
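To make the verification-style task concrete, here is a minimal sketch of how face-voice association is typically scored: embeddings from a face encoder and a voice encoder are compared by cosine similarity, and trial scores are summarised by the equal error rate (EER), a standard metric for such verification protocols. The embeddings, dimensions, and labels below are random placeholders, not challenge data.

```python
# Minimal sketch of cross-modal face-voice verification scoring,
# assuming precomputed face and voice embeddings (hypothetical encoders).
# Pair label is 1 if the face and voice belong to the same identity.
import numpy as np

def cosine_scores(face_emb: np.ndarray, voice_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between paired face and voice embeddings."""
    face_emb = face_emb / np.linalg.norm(face_emb, axis=1, keepdims=True)
    voice_emb = voice_emb / np.linalg.norm(voice_emb, axis=1, keepdims=True)
    return np.sum(face_emb * voice_emb, axis=1)

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: operating point where false-acceptance rate equals false-rejection rate."""
    order = np.argsort(scores)[::-1]               # sort scores high -> low
    labels = labels[order]
    positives = labels.sum()
    negatives = len(labels) - positives
    fa = np.cumsum(labels == 0) / negatives        # mismatched pairs accepted so far
    fr = 1.0 - np.cumsum(labels == 1) / positives  # matched pairs still rejected
    idx = np.argmin(np.abs(fa - fr))
    return float((fa[idx] + fr[idx]) / 2)

# Toy usage with random 128-dim embeddings over 10 trial pairs.
rng = np.random.default_rng(0)
faces, voices = rng.normal(size=(10, 128)), rng.normal(size=(10, 128))
labels = rng.integers(0, 2, size=10)
print(f"EER: {equal_error_rate(cosine_scores(faces, voices), labels):.3f}")
```

At the EER operating point, the rate of wrongly accepted mismatched pairs equals the rate of wrongly rejected matched pairs, giving a single threshold-free summary of verification accuracy.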
Related papers
- Contrastive Learning-based Chaining-Cluster for Multilingual Voice-Face Association [24.843733099049015]
This paper introduces our novel solution to the Face-Voice Association in Multilingual Environments (FAME) 2024 challenge.
It focuses on a contrastive learning-based chaining-cluster method to enhance face-voice association.
We conducted extensive experiments to investigate the impact of language on face-voice association.
The results demonstrate the superior performance of our method and validate the robustness and effectiveness of the proposed approach.
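For orientation, a minimal sketch of the general contrastive idea behind such methods follows: matched face-voice pairs are pulled together and mismatched pairs pushed apart via a symmetric InfoNCE-style loss. This illustrates only the contrastive ingredient, not the paper's specific chaining-cluster procedure; the batch size, embedding dimension, and temperature are illustrative.

```python
# Illustrative symmetric InfoNCE loss for face-voice association.
import torch
import torch.nn.functional as F

def face_voice_contrastive_loss(face_emb: torch.Tensor,
                                voice_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Pull matched face/voice pairs together, push mismatched ones apart."""
    face_emb = F.normalize(face_emb, dim=1)
    voice_emb = F.normalize(voice_emb, dim=1)
    logits = face_emb @ voice_emb.T / temperature  # pairwise similarity matrix
    targets = torch.arange(len(face_emb))          # i-th face matches i-th voice
    # Symmetric cross-entropy: face->voice and voice->face retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy batch: 8 identities, 256-dim embeddings from hypothetical encoders.
loss = face_voice_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())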
arXiv Detail & Related papers (2024-08-04T13:24:36Z)
- A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge [16.813582262700415]
The objective of the challenge is to establish a multi-speaker, multi-lingual Indic Text-to-Speech system with voice cloning capabilities.
The system was trained using challenge data and fine-tuned for few-shot voice cloning on target speakers.
arXiv Detail & Related papers (2024-06-22T10:49:36Z)
- Summary of the DISPLACE Challenge 2023 -- DIarization of SPeaker and LAnguage in Conversational Environments [28.618333018398122]
In multilingual societies, where multiple languages are spoken within a small geographic vicinity, informal conversations often involve a mix of languages.
Existing speech technologies may be inefficient in extracting information from such conversations, where the speech data is rich in diversity with multiple languages and speakers.
The DISPLACE challenge constitutes an open call for evaluating and benchmarking speaker and language diarization technologies under this challenging condition.
arXiv Detail & Related papers (2023-11-21T12:23:58Z)
- Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond [89.54151859266202]
The 2023 Multilingual Speech Universal Performance Benchmark (ML-SUPERB) Challenge expands upon the acclaimed SUPERB framework.
The challenge garnered 12 model submissions and 54 language corpora, resulting in a comprehensive benchmark encompassing 154 languages.
The findings indicate that merely scaling models is not the definitive solution for multilingual speech tasks.
arXiv Detail & Related papers (2023-10-09T08:30:01Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
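A rough sketch of the joint masking idea follows, assuming mel-spectrogram inputs and a phoneme token sequence; the mask ratios, span length, and mask token are illustrative guesses, not the paper's actual settings.

```python
# Rough sketch of joint masking: randomly mask contiguous spans of the
# spectrogram and random tokens of the phoneme sequence before pretraining.
import numpy as np

rng = np.random.default_rng(0)

def mask_spectrogram(spec: np.ndarray, ratio: float = 0.15,
                     span: int = 10) -> tuple[np.ndarray, np.ndarray]:
    """Zero out random contiguous frame spans; return masked spec and frame mask."""
    frames = spec.shape[0]
    mask = np.zeros(frames, dtype=bool)
    while mask.mean() < ratio:
        start = rng.integers(0, max(frames - span, 1))
        mask[start:start + span] = True
    masked = spec.copy()
    masked[mask] = 0.0
    return masked, mask

def mask_phonemes(phonemes: list[str], ratio: float = 0.15,
                  mask_token: str = "<MASK>") -> list[str]:
    """Replace a random subset of phoneme tokens with a mask token."""
    return [mask_token if rng.random() < ratio else p for p in phonemes]

spec = rng.normal(size=(200, 80))  # 200 frames x 80 mel bins (illustrative)
masked_spec, frame_mask = mask_spectrogram(spec)
print(frame_mask.mean(), mask_phonemes(["HH", "AH", "L", "OW"]))
```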
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- xGQA: Cross-Lingual Visual Question Answering [100.35229218735938]
xGQA is a new multilingual evaluation benchmark for the visual question answering task.
We extend the established English GQA dataset to 7 typologically diverse languages.
We propose new adapter-based approaches to make multimodal transformer-based models multilingual.
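For illustration, a minimal bottleneck adapter of the kind commonly inserted into transformer layers is sketched below; the hidden and bottleneck dimensions are illustrative, and this is the generic adapter recipe rather than the paper's exact configuration.

```python
# Minimal bottleneck adapter, the generic recipe for adapting a frozen
# pretrained transformer to new languages with few trainable parameters.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project, non-linearity, up-project, plus a residual connection."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual keeps base behaviour

# One adapter per transformer layer; only adapter weights are trained per
# target language while the frozen base model is shared across languages.
hidden = torch.randn(2, 16, 768)  # (batch, tokens, hidden)
print(Adapter()(hidden).shape)
```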
arXiv Detail & Related papers (2021-09-13T15:58:21Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z) - That Sounds Familiar: an Analysis of Phonetic Representations Transfer
Across Languages [72.9927937955371]
We use resources available in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z) - Cross-modal Speaker Verification and Recognition: A Multilingual
Perspective [29.314358875442778]
The aim of this paper is to answer two closely related questions: "Is face-voice association language independent?" and "Can a speaker be recognised irrespective of the spoken language?"
To answer them, we collected a Multilingual Audio-Visual dataset containing human speech clips of 154 identities with 3 language annotations extracted from various videos uploaded online.
arXiv Detail & Related papers (2020-04-28T19:15:23Z)