Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish
- URL: http://arxiv.org/abs/2408.00005v1
- Date: Thu, 18 Jul 2024 21:32:12 GMT
- Title: Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish
- Authors: Michał Junczyk
- Abstract summary: Speech datasets available in the public domain are often underutilized because of challenges in discoverability and interoperability.
A comprehensive framework has been designed to survey, catalog, and curate available speech datasets.
This research constitutes the most extensive comparison to date of both commercial and free ASR systems for the Polish language.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speech datasets available in the public domain are often underutilized because of challenges in discoverability and interoperability. A comprehensive framework has been designed to survey, catalog, and curate available speech datasets, which allows replicable evaluation of automatic speech recognition (ASR) systems. A case study focused on the Polish language was conducted; the framework was applied to curate more than 24 datasets and evaluate 25 combinations of ASR systems and models. This research constitutes the most extensive comparison to date of both commercial and free ASR systems for the Polish language. It draws insights from 600 system-model-test set evaluations, marking a significant advancement in both scale and comprehensiveness. The results of surveys and performance comparisons are available as interactive dashboards (https://huggingface.co/spaces/amu-cai/pl-asr-leaderboard) along with curated datasets (https://huggingface.co/datasets/amu-cai/pl-asr-bigos-v2, https://huggingface.co/datasets/pelcra/pl-asr-pelcra-for-bigos) and the open challenge call (https://poleval.pl/tasks/task3). Tools used for evaluation are open-sourced (https://github.com/goodmike31/pl-asr-bigos-tools), facilitating replication and adaptation for other languages, as well as continuous expansion with new datasets and systems.
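The abstract describes evaluating many system-model-test-set combinations, which in ASR benchmarking typically reduces to computing word error rate (WER) per combination and aggregating. Below is a minimal, self-contained sketch of that loop; the `systems`/`test_sets` structures and the `transcribe` callable are hypothetical illustrations, not the framework's actual API (the real tooling lives at the open-sourced repository linked above).

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev holds the previous row of the edit-distance DP table
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j - 1] + (r != h),  # substitution (or match)
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
            )
        prev = curr
    return prev[len(hyp)] / max(len(ref), 1)


def evaluate(systems: dict, test_sets: dict) -> dict:
    """Average WER for every (system, test set) combination.

    systems: {name: transcribe_fn}; test_sets: {name: [(audio, reference), ...]}.
    Both structures are hypothetical; the real framework reads curated datasets.
    """
    scores = {}
    for sys_name, transcribe in systems.items():
        for set_name, samples in test_sets.items():
            wers = [wer(ref, transcribe(audio)) for audio, ref in samples]
            scores[(sys_name, set_name)] = sum(wers) / len(wers)
    return scores
```

With 25 system-model combinations and 24 curated test sets, a loop of this shape yields the roughly 600 evaluations the paper reports.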
Related papers
- Optimizing RAG Pipelines for Arabic: A Systematic Analysis of Core Components [0.0]
Retrieval-Augmented Generation (RAG) has emerged as a powerful architecture for combining the precision of retrieval systems with the fluency of large language models.
This study presents a comprehensive empirical evaluation of state-of-the-art RAG components, including chunking strategies, embedding models, rerankers, and language models, across a diverse set of Arabic datasets.
arXiv Detail & Related papers (2025-06-01T00:04:58Z) - AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models [84.65095045762524]
We present three desiderata for a good benchmark for language models.
benchmark reveals new trends in model rankings not shown by previous benchmarks.
We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering.
arXiv Detail & Related papers (2024-07-11T10:03:47Z) - Datasets for Multilingual Answer Sentence Selection [59.28492975191415]
We introduce new high-quality datasets for AS2 in five European languages (French, German, Italian, Portuguese, and Spanish).
Results indicate that our datasets are pivotal in producing robust and powerful multilingual AS2 models.
arXiv Detail & Related papers (2024-06-14T16:50:29Z) - FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research [70.6584488911715]
Retrieval-augmented generation (RAG) has attracted considerable research attention.
Existing RAG toolkits are often heavyweight and inflexible, failing to meet the customization needs of researchers.
Our toolkit has implemented 16 advanced RAG methods and gathered and organized 38 benchmark datasets.
arXiv Detail & Related papers (2024-05-22T12:12:40Z) - SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation [7.640323749917747]
SpeechColab Leaderboard is a general-purpose, open-source platform designed for ASR evaluation.
We report a comprehensive benchmark, unveiling the current state-of-the-art panorama for ASR systems.
We quantify how distinct nuances in the scoring pipeline influence the final benchmark outcomes.
arXiv Detail & Related papers (2024-03-13T02:41:53Z) - Pseudo-Labeling for Domain-Agnostic Bangla Automatic Speech Recognition [10.244515100904144]
In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset.
We developed a labeled Bangla speech dataset of more than 20k hours covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios.
We benchmarked the trained ASR with publicly available datasets and compared it with other available models.
Our results demonstrate the efficacy of the model trained on pseudo-labeled data, both on the designed test set and on publicly available Bangla datasets.
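A common way to keep pseudo-labels usable at this scale is to retain only utterances where independent decoding passes agree. The sketch below illustrates that idea with a simple string-similarity filter; the utterance tuple format, the `min_agreement` threshold, and the use of `difflib` are illustrative assumptions, not the paper's actual filtering method.

```python
from difflib import SequenceMatcher


def filter_pseudo_labels(utterances, min_agreement=0.95):
    """Keep pseudo-labels where two independent decoding passes closely agree.

    utterances: list of (utt_id, hyp_pass_a, hyp_pass_b) tuples; hypothetical format.
    Returns (utt_id, transcript) pairs that pass the agreement threshold.
    """
    kept = []
    for utt_id, hyp_a, hyp_b in utterances:
        # SequenceMatcher.ratio() is 1.0 for identical strings, near 0.0 for disjoint ones
        if SequenceMatcher(None, hyp_a, hyp_b).ratio() >= min_agreement:
            kept.append((utt_id, hyp_a))
    return kept
```

In practice such filters trade dataset size for label quality: a stricter threshold discards more audio but yields cleaner training transcripts.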
arXiv Detail & Related papers (2023-11-06T15:37:14Z) - ADMUS: A Progressive Question Answering Framework Adaptable to Multiple Knowledge Sources [9.484792817869671]
We present ADMUS, a progressive knowledge base question answering framework designed to accommodate a wide variety of datasets.
Our framework supports the seamless integration of new datasets with minimal effort, only requiring creating a dataset-related micro-service at a negligible cost.
arXiv Detail & Related papers (2023-08-09T08:46:39Z) - Going beyond research datasets: Novel intent discovery in the industry setting [60.90117614762879]
This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform.
We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision.
We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv.
arXiv Detail & Related papers (2023-05-09T14:21:29Z) - ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z) - Integrating Categorical Features in End-to-End ASR [1.332560004325655]
All-neural, end-to-end ASR systems convert speech input to text units using a single trainable neural network model.
E2E models require large amounts of paired speech-text data that is expensive to obtain.
We propose a simple yet effective way to integrate categorical features into the E2E model.
arXiv Detail & Related papers (2021-10-06T20:07:53Z) - Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence.
Self-supervised models trained on our data, despite being automatically constructed, achieve downstream performance similar to that of models trained on existing video datasets of comparable scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z) - CRSLab: An Open-Source Toolkit for Building Conversational Recommender System [57.208266345350474]
The conversational recommender system (CRS) has received much attention in the research community.
Existing studies on CRS vary in scenarios, goals, and techniques, lacking a unified, standardized implementation or comparison.
We propose an open-source CRS toolkit CRSLab, which provides a unified framework with highly-decoupled modules to develop CRSs.
arXiv Detail & Related papers (2021-01-04T13:10:31Z) - The OARF Benchmark Suite: Characterization and Implications for Federated Learning Systems [41.90546696412147]
Open Application Repository for Federated Learning (OARF) is a benchmark suite for federated machine learning systems.
OARF mimics more realistic application scenarios with publicly available data sets as different data silos in image, text and structured data.
arXiv Detail & Related papers (2020-06-14T10:11:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.