A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research
- URL: http://arxiv.org/abs/2512.08371v3
- Date: Fri, 12 Dec 2025 09:24:40 GMT
- Title: A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research
- Authors: Simon Chung, Colby J. Vorland, Donna L. Maney, Andrew W. Brown
- Abstract summary: We present a novel sampling algorithm that takes label dependencies into account. We applied this approach to a sample of research articles labeled with 64 biomedical topic categories.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Datasets may contain observations with multiple labels. If the labels are not mutually exclusive and vary greatly in frequency, it is challenging to obtain a sample that includes enough observations with scarcer labels to support inferences about those labels and that deviates from the population frequencies in a known manner. In this paper, we model the multi-label problem with an underlying multivariate Bernoulli distribution. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate the parameters of the multivariate Bernoulli distribution and to calculate weights for each label combination. This approach ensures that the weighted sample acquires the characteristics of the target distribution while accounting for label dependencies. We applied this approach to a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve the frequency order of the categories, reduce the frequency differences between the most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.
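The general idea in the abstract (estimate a joint distribution over label combinations, then weight each combination toward a target distribution) can be illustrated with a minimal Python sketch. This is not the authors' algorithm. It assumes a saturated empirical estimate of the joint probabilities and a hypothetical tempered target q(c) proportional to p(c) ** alpha, and it samples with replacement. The function names, the `alpha` parameter, and the toy data are all illustrative assumptions. Note that tempering preserves the frequency order of label combinations, which does not by itself guarantee the order of the individual category frequencies.

```python
import random
from collections import Counter


def combination_pmf(labels):
    """Empirical probability of each observed label combination.

    `labels` is a list of 0/1 tuples, one tuple per observation. The joint
    pmf over these tuples is a saturated multivariate Bernoulli estimate.
    """
    counts = Counter(labels)
    n = len(labels)
    return {combo: c / n for combo, c in counts.items()}


def tempered_weights(pmf, alpha=0.5):
    """Per-combination weights w(c) = q(c) / p(c).

    Assumption (not from the paper): the target q(c) is proportional to
    p(c) ** alpha, which flattens the distribution when 0 < alpha < 1.
    """
    z = sum(p ** alpha for p in pmf.values())
    return {combo: (p ** alpha / z) / p for combo, p in pmf.items()}


def weighted_subsample(labels, size, alpha=0.5, seed=0):
    """Draw `size` observation indices with replacement, weighted by combination."""
    weights = tempered_weights(combination_pmf(labels), alpha)
    rng = random.Random(seed)
    return rng.choices(
        range(len(labels)),
        weights=[weights[combo] for combo in labels],
        k=size,
    )


def marginal_frequencies(labels):
    """Share of observations carrying each individual label."""
    n = len(labels)
    return [sum(column) / n for column in zip(*labels)]


# Toy data (hypothetical): three labels, one dominant combination.
labels = [(1, 0, 0)] * 80 + [(1, 1, 0)] * 15 + [(0, 1, 1)] * 5
chosen = weighted_subsample(labels, size=1000, alpha=0.5, seed=0)
print("population:", marginal_frequencies(labels))
print("sub-sample:", marginal_frequencies([labels[i] for i in chosen]))
```

Because the weights are attached to whole label combinations rather than to single labels, co-occurring labels are up-weighted or down-weighted together, which is one simple way to respect label dependencies. Setting `alpha=1` reproduces the population distribution, and smaller values push the sample toward equal combination frequencies.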
Related papers
- Label Distribution Learning with Biased Annotations by Learning Multi-Label Representation [120.97262070068224]
Multi-label learning (MLL) has gained attention for its ability to represent real-world data. Label Distribution Learning (LDL) faces challenges in collecting accurate label distributions.
arXiv Detail & Related papers (2025-02-03T09:04:03Z)
- Toward Robustness in Multi-label Classification: A Data Augmentation Strategy against Imbalance and Noise [31.917931364881625]
Multi-label classification poses challenges due to imbalanced and noisy labels in training data.
We propose a unified data augmentation method, named BalanceMix, to address these challenges.
Our approach includes two samplers for imbalanced labels, generating minority-augmented instances with high diversity.
arXiv Detail & Related papers (2023-12-12T09:09:45Z)
- Understanding Label Bias in Single Positive Multi-Label Learning [20.09309971112425]
It is possible to train effective multi-label classifiers using only one positive label per image.
Standard benchmarks for SPML are derived from traditional multi-label classification datasets.
This work introduces protocols for studying label bias in SPML and provides new empirical results.
arXiv Detail & Related papers (2023-05-24T21:41:08Z)
- Class-Distribution-Aware Pseudo Labeling for Semi-Supervised Multi-Label Learning [97.88458953075205]
Pseudo-labeling has emerged as a popular and effective approach for utilizing unlabeled data.
This paper proposes a novel solution called Class-Aware Pseudo-Labeling (CAP) that performs pseudo-labeling in a class-aware manner.
arXiv Detail & Related papers (2023-05-04T12:52:18Z)
- Bridging the Gap between Model Explanations in Partially Annotated Multi-label Classification [85.76130799062379]
We study how false negative labels affect the model's explanation.
We propose to boost the attribution scores of the model trained with partial labels to make its explanation resemble that of the model trained with full labels.
arXiv Detail & Related papers (2023-04-04T14:00:59Z)
- Label distribution learning via label correlation grid [9.340734188957727]
We propose a Label Correlation Grid (LCG) to model the uncertainty of label relationships.
Our network learns the LCG to accurately estimate the label distribution for each instance.
arXiv Detail & Related papers (2022-10-15T03:58:15Z)
- To Aggregate or Not? Learning with Separate Noisy Labels [28.14966756980763]
This paper addresses the question of whether one should aggregate separate noisy labels into single ones or use them separately as given.
We theoretically analyze the performance of both approaches under the empirical risk minimization framework.
Our theorems conclude that label separation is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient.
arXiv Detail & Related papers (2022-06-14T21:32:26Z)
- Instance-Dependent Partial Label Learning [69.49681837908511]
Partial label learning is a typical weakly supervised learning problem.
Most existing approaches assume that the incorrect labels in each training example are randomly picked as the candidate labels.
In this paper, we consider instance-dependent partial label learning and assume that each example is associated with a latent label distribution consisting of a real number for each label.
arXiv Detail & Related papers (2021-10-25T12:50:26Z)
- Integrating Unsupervised Clustering and Label-specific Oversampling to Tackle Imbalanced Multi-label Data [13.888344214818733]
Clustering is performed to identify the key distinct and locally connected regions of a multi-label dataset.
Only the minority points within a cluster are used to generate the synthetic minority points that are used for oversampling.
Experiments using 12 multi-label datasets and several multi-label algorithms show that the proposed method performed very well.
arXiv Detail & Related papers (2021-09-25T19:00:00Z)
- Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces [64.23172847182109]
We show that different negative sampling schemes implicitly trade off performance on dominant versus rare labels.
We provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance.
arXiv Detail & Related papers (2021-05-12T15:40:13Z)
- A Study on the Autoregressive and non-Autoregressive Multi-label Learning [77.11075863067131]
We propose a self-attention based variational encoder-model to extract the label-label and label-feature dependencies jointly.
Our model can therefore be used to predict all labels in parallel while still including both label-label and label-feature dependencies.
arXiv Detail & Related papers (2020-12-03T05:41:44Z)