openXDATA: A Tool for Multi-Target Data Generation and Missing Label
Completion
- URL: http://arxiv.org/abs/2007.13889v1
- Date: Mon, 27 Jul 2020 22:05:53 GMT
- Title: openXDATA: A Tool for Multi-Target Data Generation and Missing Label
Completion
- Authors: Felix Weninger, Yue Zhang, Rosalind W. Picard
- Abstract summary: A common problem in machine learning is dealing with datasets that have disjoint label spaces and missing labels.
In this work, we introduce the openXDATA tool, which completes the missing labels in partially labelled or unlabelled datasets.
We show the ability to estimate both categorical and continuous labels for all of the datasets, at rates approaching the ground-truth values.
- Score: 23.14045574165086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A common problem in machine learning is to deal with datasets with disjoint
label spaces and missing labels. In this work, we introduce the openXDATA tool
that completes the missing labels in partially labelled or unlabelled datasets
in order to generate multi-target data with labels in the joint label space of
the datasets. To this end, we designed and implemented the cross-data label
completion (CDLC) algorithm that uses a multi-task shared-hidden-layer DNN to
iteratively complete the sparse label matrix of the instances from the
different datasets. We apply the new tool to estimate labels across four
emotion datasets: one labeled with discrete emotion categories (e.g., happy,
sad, angry), one labeled with continuous values along arousal and valence
dimensions, one with both kinds of labels, and one unlabeled. Testing with
drop-out of true labels, we show the ability to estimate both categorical and
continuous labels for all of the datasets, at rates approaching the ground
truth values. openXDATA is available under the GNU General Public License from
https://github.com/fweninger/openXDATA.
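The cross-data label completion (CDLC) loop can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the "shared hidden layer" is a plain linear map trained by a few gradient steps, both tasks are treated as regression, and all names, dimensions, and hyperparameters are hypothetical. Missing entries of the sparse label matrix (marked NaN) are filled with the model's predictions after each training round, and the completed entries feed the next round.

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedHiddenNet:
    """Tiny stand-in for the multi-task shared-hidden-layer DNN:
    one linear shared layer, one linear head per label task."""

    def __init__(self, n_in, n_hidden, n_tasks, lr=0.01):
        self.W = rng.normal(0.0, 0.1, (n_in, n_hidden))       # shared layer
        self.heads = rng.normal(0.0, 0.1, (n_tasks, n_hidden))  # per-task heads
        self.lr = lr

    def predict(self, X, task):
        return X @ self.W @ self.heads[task]

    def fit_task(self, X, y, task, steps=100):
        # plain gradient descent on squared error for this task's labelled rows
        for _ in range(steps):
            H = X @ self.W
            err = H @ self.heads[task] - y
            n = len(y)
            grad_head = H.T @ err / n
            grad_W = (X.T @ err)[:, None] @ self.heads[task][None, :] / n
            self.heads[task] -= self.lr * grad_head
            self.W -= self.lr * grad_W

def cdlc(X, Y, rounds=2):
    """Iteratively complete the sparse label matrix Y (NaN = missing)."""
    Y = Y.copy()
    n_tasks = Y.shape[1]
    net = SharedHiddenNet(X.shape[1], n_hidden=8, n_tasks=n_tasks)
    for _ in range(rounds):
        for t in range(n_tasks):           # train on currently known labels
            known = ~np.isnan(Y[:, t])
            net.fit_task(X[known], Y[known, t], t)
        for t in range(n_tasks):           # fill missing entries with predictions
            missing = np.isnan(Y[:, t])
            Y[missing, t] = net.predict(X[missing], t)
    return Y

# Toy setting: two "datasets" with disjoint label spaces, stacked together.
X = rng.normal(size=(60, 5))
true = np.stack([X @ rng.normal(size=5), X @ rng.normal(size=5)], axis=1)
Y = true.copy()
Y[:30, 0] = np.nan    # first half lacks task-0 labels
Y[30:, 1] = np.nan    # second half lacks task-1 labels
completed = cdlc(X, Y)
print(np.isnan(completed).any())    # → False: the label matrix is complete
```

In the paper the model is a multi-task DNN and the categorical task uses classification outputs; the sketch only shows the iterative structure of completing the joint label space.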
Related papers
- Exploiting Conjugate Label Information for Multi-Instance Partial-Label Learning [61.00359941983515]
Multi-instance partial-label learning (MIPL) addresses scenarios where each training sample is represented as a multi-instance bag associated with a candidate label set containing one true label and several false positives.
ELIMIPL exploits the conjugate label information to improve the disambiguation performance.
arXiv Detail & Related papers (2024-08-26T15:49:31Z)
- You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling [60.27812493442062]
We show the importance of investigating labeled data quality to improve any pseudo-labeling method.
Specifically, we introduce a novel data characterization and selection framework called DIPS to extend pseudo-labeling.
We demonstrate the applicability and impact of DIPS for various pseudo-labeling methods across an extensive range of real-world datasets.
arXiv Detail & Related papers (2024-06-19T17:58:40Z)
- Complementary to Multiple Labels: A Correlation-Aware Correction Approach [65.59584909436259]
We show theoretically how the estimated transition matrix in multi-class CLL could be distorted in multi-labeled cases.
We propose a two-step method to estimate the transition matrix from candidate labels.
arXiv Detail & Related papers (2023-02-25T04:48:48Z)
- TransPOS: Transformers for Consolidating Different POS Tagset Datasets [0.8399688944263843]
This paper considers two datasets that label part-of-speech (POS) tags under different tagging schemes.
It proposes a novel supervised architecture employing Transformers to tackle the problem of consolidating two completely disjoint datasets.
arXiv Detail & Related papers (2022-09-24T08:43:53Z)
- Learning Semantic Segmentation from Multiple Datasets with Label Shifts [101.24334184653355]
This paper proposes UniSeg, an effective approach to automatically train models across multiple datasets with differing label spaces.
Specifically, we propose two losses that account for conflicting and co-occurring labels to achieve better generalization performance in unseen domains.
arXiv Detail & Related papers (2022-02-28T18:55:19Z)
- ECLARE: Extreme Classification with Label Graph Correlations [13.429436351837653]
This paper presents ECLARE, a scalable deep learning architecture that incorporates not only label text, but also label correlations, to offer accurate real-time predictions within a few milliseconds.
ECLARE offers predictions that are 2 to 14% more accurate on both publicly available benchmark datasets as well as proprietary datasets for a related products recommendation task sourced from the Bing search engine.
arXiv Detail & Related papers (2021-07-31T15:13:13Z)
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
- A Study on the Autoregressive and non-Autoregressive Multi-label Learning [77.11075863067131]
We propose a self-attention based variational encoder-model to extract the label-label and label-feature dependencies jointly.
Our model can therefore be used to predict all labels in parallel while still including both label-label and label-feature dependencies.
arXiv Detail & Related papers (2020-12-03T05:41:44Z)
- Unsupervised Multi-label Dataset Generation from Web Data [2.267916014951237]
This paper presents a system towards the generation of multi-label datasets from web data in an unsupervised manner.
The generation of a single-label dataset uses an unsupervised noise reduction phase (clustering and selection of clusters using anchors), obtaining 85% correctly labeled images.
An unsupervised label augmentation process is then performed to assign new labels to the images in the dataset using the class activation maps and the uncertainty associated with each class.
arXiv Detail & Related papers (2020-05-12T08:57:59Z)
- Beyond without Forgetting: Multi-Task Learning for Classification with Disjoint Datasets [27.570773346794613]
Multi-task Learning (MTL) for classification with disjoint datasets aims to explore MTL when one task only has one labeled dataset.
Inspired by semi-supervised learning, we use unlabeled datasets with pseudo labels to facilitate each task.
We propose our MTL with Selective Augmentation (MTL-SA) method to select the training samples in unlabeled datasets with confident pseudo labels and close data distribution to the labeled dataset.
arXiv Detail & Related papers (2020-03-15T03:19:18Z)
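Several of the papers above (e.g., DIPS and MTL-SA) build on confidence-based pseudo-label selection: only unlabeled samples whose predicted class probability clears a threshold are promoted to training data. A minimal, hypothetical sketch of that shared idea, with an illustrative threshold value:

```python
import numpy as np

def select_confident(probs, threshold=0.9):
    """Keep unlabeled samples whose top class probability exceeds the
    threshold; return their indices and hard pseudo labels."""
    conf = probs.max(axis=1)            # confidence = top class probability
    keep = conf >= threshold
    return np.flatnonzero(keep), probs.argmax(axis=1)[keep]

# Predicted class distributions for three unlabeled samples (two classes)
probs = np.array([[0.95, 0.05],
                  [0.60, 0.40],
                  [0.08, 0.92]])
idx, labels = select_confident(probs)
print(idx, labels)    # → [0 2] [0 1]: the middle sample is too uncertain
```

The selected samples would then be added to the labeled pool and the model retrained, as in standard self-training.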
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.