Fairness and underspecification in acoustic scene classification: The
case for disaggregated evaluations
- URL: http://arxiv.org/abs/2110.01506v1
- Date: Mon, 4 Oct 2021 15:23:01 GMT
- Title: Fairness and underspecification in acoustic scene classification: The
case for disaggregated evaluations
- Authors: Andreas Triantafyllopoulos, Manuel Milling, Konstantinos Drossos, Björn W. Schuller
- Abstract summary: Underspecification and fairness in machine learning (ML) applications have recently become two prominent issues in the ML community.
We argue for the need for a more holistic evaluation process for acoustic scene classification (ASC) models through disaggregated evaluations.
We demonstrate the effectiveness of the proposed evaluation process in uncovering underspecification and fairness problems in standard ML architectures trained on two widely-used ASC datasets.
- Score: 6.186191586944725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Underspecification and fairness in machine learning (ML) applications have
recently become two prominent issues in the ML community. Acoustic scene
classification (ASC) applications have so far remained unaffected by this
discussion, but are now becoming increasingly used in real-world systems where
fairness and reliability are critical aspects. In this work, we argue for the
need for a more holistic evaluation process for ASC models through disaggregated
evaluations. This entails taking into account performance differences across
several factors, such as city, location, and recording device. Although these
factors play a well-understood role in the performance of ASC models, most
works report single evaluation metrics taking into account all different strata
of a particular dataset. We argue that metrics computed on specific
sub-populations of the underlying data contain valuable information about the
expected real-world behaviour of proposed systems, and their reporting could
improve the transparency and trustability of such systems. We demonstrate the
effectiveness of the proposed evaluation process in uncovering
underspecification and fairness problems exhibited by several standard ML
architectures when trained on two widely-used ASC datasets. Our evaluation
shows that all examined architectures exhibit large biases across all factors
taken into consideration, and in particular with respect to the recording
location. Additionally, different architectures exhibit different biases even
though they are trained with the same experimental configurations.
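As a concrete illustration of the disaggregated evaluation argued for above, the sketch below computes an ASC model's overall accuracy together with per-stratum accuracies for each metadata factor (city, recording location, recording device) and the gap between the best- and worst-performing stratum. This is a minimal sketch, not the authors' code: the column names (y_true, y_pred, city, location_id, device) and the pandas/scikit-learn usage are illustrative assumptions.

```python
# Minimal sketch of a disaggregated ASC evaluation.
# Assumes a DataFrame with ground-truth labels, model predictions, and the
# metadata factors discussed in the abstract; column names are illustrative.
import pandas as pd
from sklearn.metrics import accuracy_score

def disaggregated_report(df: pd.DataFrame,
                         factors=("city", "location_id", "device")) -> None:
    """Print overall accuracy, per-stratum accuracies, and best-worst gaps."""
    overall = accuracy_score(df["y_true"], df["y_pred"])
    print(f"overall accuracy: {overall:.3f}")
    for factor in factors:
        per_stratum = (
            df.groupby(factor)[["y_true", "y_pred"]]
              .apply(lambda g: accuracy_score(g["y_true"], g["y_pred"]))
              .sort_values()
        )
        gap = per_stratum.max() - per_stratum.min()
        print(f"\n{factor}: best-worst accuracy gap = {gap:.3f}")
        print(per_stratum.round(3).to_string())

# Hypothetical usage with a predictions file containing the columns above:
# df = pd.read_csv("asc_predictions.csv")
# disaggregated_report(df)
```

Reporting the full per-stratum table, or at least the worst-performing stratum and the gap, alongside the single aggregate score is the kind of disaggregated information the abstract argues should accompany standard ASC evaluations.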
Related papers
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models [58.57987316300529]
Large Language Models (LLMs) are increasingly deployed to handle various natural language processing (NLP) tasks.
To evaluate the biases exhibited by LLMs, researchers have recently proposed a variety of datasets.
We propose CEB, a Compositional Evaluation Benchmark that covers different types of bias across different social groups and tasks.
arXiv Detail & Related papers (2024-07-02T16:31:37Z)
- FairLENS: Assessing Fairness in Law Enforcement Speech Recognition [37.75768315119143]
We propose a novel and adaptable evaluation method to examine the fairness disparity between different models.
We conducted fairness assessments on 1 open-source and 11 commercially available state-of-the-art ASR models.
arXiv Detail & Related papers (2024-05-21T19:23:40Z)
- CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models [49.16989035566899]
Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources.
This paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios.
arXiv Detail & Related papers (2024-01-30T14:25:32Z)
- Deconstructing Self-Supervised Monocular Reconstruction: The Design Decisions that Matter [63.5550818034739]
This paper presents a framework to evaluate state-of-the-art contributions to self-supervised monocular depth estimation.
It includes pretraining, backbone, architectural design choices and loss functions.
We re-implement, validate and re-evaluate 16 state-of-the-art contributions and introduce a new dataset.
arXiv Detail & Related papers (2022-08-02T14:38:53Z)
- On Generalisability of Machine Learning-based Network Intrusion Detection Systems [0.0]
In this paper, we evaluate seven supervised and unsupervised learning models on four benchmark NIDS datasets.
Our investigation indicates that none of the considered models is able to generalise over all studied datasets, and that, overall, unsupervised learning methods generalise better than supervised learning models in the considered scenarios.
arXiv Detail & Related papers (2022-05-09T08:26:48Z)
- What are the best systems? New perspectives on NLP Benchmarking [10.27421161397197]
We propose a new procedure to rank systems based on their performance across different tasks.
Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task (a minimal rank-aggregation sketch appears after this list).
We show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure.
arXiv Detail & Related papers (2022-02-08T11:44:20Z)
- Towards Ubiquitous Indoor Positioning: Comparing Systems across Heterogeneous Datasets [1.3814679165245243]
The evaluation of Indoor Positioning Systems (IPS) mostly relies on local deployments in the researchers' or partners' facilities.
The emergence of public datasets is pushing IPS evaluation toward a level of comparability similar to that of machine-learning models.
This paper proposes a way to evaluate IPSs in multiple scenarios, that is validated with three use cases.
arXiv Detail & Related papers (2021-09-20T11:37:36Z)
- Through the Data Management Lens: Experimental Analysis and Evaluation of Fair Classification [75.49600684537117]
Data management research is showing an increasing presence and interest in topics related to data and algorithmic fairness.
We contribute a broad analysis of 13 fair classification approaches and additional variants, over their correctness, fairness, efficiency, scalability, and stability.
Our analysis highlights novel insights on the impact of different metrics and high-level approach characteristics on different aspects of performance.
arXiv Detail & Related papers (2021-01-18T22:55:40Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
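For the NLP benchmarking entry above, which derives a final system ordering by aggregating per-task rankings, the sketch below illustrates one classical social-choice aggregation, the Borda count. It is only an illustration of the general idea under invented scores and system names; the cited paper may rely on a different consensus rule.

```python
# Illustrative Borda-count aggregation of per-task rankings into one ordering.
# The scores and system names are invented for the example (higher is better).
from collections import defaultdict

scores = {
    "task_A": {"sys1": 0.81, "sys2": 0.79, "sys3": 0.85},
    "task_B": {"sys1": 0.62, "sys2": 0.70, "sys3": 0.58},
    "task_C": {"sys1": 0.91, "sys2": 0.88, "sys3": 0.90},
}

borda = defaultdict(int)
for task_scores in scores.values():
    # Within each task, the worst system gets 0 points and the best gets n-1.
    ranked = sorted(task_scores, key=task_scores.get)
    for points, system in enumerate(ranked):
        borda[system] += points

final_order = sorted(borda, key=borda.get, reverse=True)
print("aggregated ranking:", final_order)  # ['sys1', 'sys3', 'sys2'] here
```

One design motivation for rank aggregation is that it avoids averaging raw metric values measured on incommensurable scales across tasks, which is why the summary above contrasts it with the mean-aggregation procedure.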