Finding Friends and Flipping Frenemies: Automatic Paraphrase Dataset Augmentation Using Graph Theory
- URL: http://arxiv.org/abs/2011.01856v1
- Date: Tue, 3 Nov 2020 17:18:03 GMT
- Title: Finding Friends and Flipping Frenemies: Automatic Paraphrase Dataset Augmentation Using Graph Theory
- Authors: Hannah Chen, Yangfeng Ji, David Evans
- Abstract summary: We construct a paraphrase graph from the provided sentence pair labels, and create an augmented dataset by directly inferring labels from the original sentence pairs using a transitivity property.
We evaluate our methods on paraphrase models trained using these datasets starting from a pretrained BERT model, and find that the automatically-enhanced training sets result in more accurate models.
- Score: 21.06607915149245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most NLP datasets are manually labeled and so suffer from inconsistent labeling
or limited size. We propose methods for automatically improving datasets by
viewing them as graphs with expected semantic properties. We construct a
paraphrase graph from the provided sentence pair labels, and create an
augmented dataset by directly inferring labels from the original sentence pairs
using a transitivity property. We use structural balance theory to identify
likely mislabelings in the graph, and flip their labels. We evaluate our
methods on paraphrase models trained using these datasets starting from a
pretrained BERT model, and find that the automatically-enhanced training sets
result in more accurate models.
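To make the two graph operations concrete, below is a minimal, hypothetical Python sketch of label inference on a signed paraphrase graph; the function and variable names are illustrative assumptions, not the authors' code. It treats paraphrase pairs as +1 edges and non-paraphrase pairs as -1 edges, infers a label for an unlabeled pair whenever it closes a length-two path containing at least one positive edge, and flags triangles whose edge-sign product is negative as unbalanced, i.e., likely to contain a mislabeled pair.

```python
# Illustrative sketch (not the authors' implementation) of transitive label
# inference and balance-based mislabel detection on a signed paraphrase graph.
from itertools import combinations

# Toy dataset: (sentence_a, sentence_b, label); 1 = paraphrase, 0 = not.
pairs = [("s1", "s2", 1), ("s2", "s3", 1), ("s1", "s4", 0)]

# Signed graph: +1 edge for a paraphrase pair, -1 for a non-paraphrase pair.
sign = {frozenset((a, b)): (1 if y == 1 else -1) for a, b, y in pairs}
nodes = {s for a, b, _ in pairs for s in (a, b)}

def augment_by_transitivity(sign, nodes):
    """Infer a label for each unlabeled pair that closes a length-two path.

    paraphrase o paraphrase -> paraphrase; paraphrase o non-paraphrase ->
    non-paraphrase. Two negatives are left uninferred, since two
    non-paraphrase relations say nothing about the third pair."""
    inferred = {}
    for a, b, c in combinations(nodes, 3):
        for x, y, z in ((a, b, c), (b, c, a), (c, a, b)):
            e1, e2 = frozenset((x, y)), frozenset((y, z))
            e3 = frozenset((x, z))
            if e1 in sign and e2 in sign and e3 not in sign:
                if sign[e1] == 1 or sign[e2] == 1:
                    inferred[e3] = sign[e1] * sign[e2]
    return inferred

def unbalanced_triangles(sign, nodes):
    """Structural balance: a triangle is balanced iff the product of its
    edge signs is positive. Unbalanced triangles indicate a likely
    mislabeled pair, a candidate for label flipping."""
    bad = []
    for a, b, c in combinations(nodes, 3):
        edges = [frozenset(p) for p in ((a, b), (b, c), (a, c))]
        if all(e in sign for e in edges):
            if sign[edges[0]] * sign[edges[1]] * sign[edges[2]] < 0:
                bad.append((a, b, c))
    return bad

# s1~s3 is inferred as a paraphrase; s2~s4 as a non-paraphrase.
print(augment_by_transitivity(sign, nodes))
```

In the paper, pairs inferred this way augment the training set and balance-flipped labels repair it, before fine-tuning from pretrained BERT.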
Related papers
- Bayesian-guided Label Mapping for Visual Reprogramming [20.27639343292564]
One-to-one mappings may overlook the complex relationship between pretrained and downstream labels.
Motivated by this observation, we propose a Bayesian-guided Label Mapping (BLM) method.
Experiments conducted on both pretrained vision models (e.g., ResNeXt) and vision-language models (e.g., CLIP) demonstrate the superior performance of BLM over existing label mapping methods.
arXiv Detail & Related papers (2024-10-31T15:20:43Z)
- You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling [60.27812493442062]
We show the importance of investigating labeled data quality to improve any pseudo-labeling method.
Specifically, we introduce a novel data characterization and selection framework called DIPS to extend pseudo-labeling.
We demonstrate the applicability and impact of DIPS for various pseudo-labeling methods across an extensive range of real-world datasets.
arXiv Detail & Related papers (2024-06-19T17:58:40Z)
- Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and Uncurated Unlabeled Data [70.25049762295193]
We introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated data during training.
We propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels for unlabeled data.
Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance.
arXiv Detail & Related papers (2023-07-17T08:31:59Z)
- Semi-Supervised Graph Imbalanced Regression [17.733488328772943]
We propose a semi-supervised framework to progressively balance training data and reduce model bias via self-training.
Results demonstrate that the proposed framework significantly reduces the error of predicted graph properties.
arXiv Detail & Related papers (2023-05-20T04:11:00Z)
- Label Dependencies-aware Set Prediction Networks for Multi-label Text Classification [0.0]
We leverage Graph Convolutional Networks and construct an adjacency matrix based on the statistical relations between labels.
We enhance recall ability by applying the Bhattacharyya distance to the output distributions of the set prediction networks.
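The Bhattacharyya distance mentioned here has a standard closed form, D_B(p, q) = -ln Σ_i sqrt(p_i q_i); a minimal sketch of applying it to two predicted label distributions (illustrative only, not the paper's code):

```python
# Bhattacharyya distance between two discrete probability distributions,
# e.g., a pair of predicted label distributions. Illustrative sketch only.
import numpy as np

def bhattacharyya_distance(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """D_B(p, q) = -ln( sum_i sqrt(p_i * q_i) ).

    Zero when p == q; grows as the distributions' overlap shrinks."""
    p = p / p.sum()  # normalize defensively
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient in [0, 1]
    return float(-np.log(bc + eps))

# Two predicted distributions over four labels:
p = np.array([0.7, 0.1, 0.1, 0.1])
q = np.array([0.6, 0.2, 0.1, 0.1])
print(bhattacharyya_distance(p, q))  # small: the distributions overlap heavily
```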
arXiv Detail & Related papers (2023-04-14T09:31:17Z)
- Learned Label Aggregation for Weak Supervision [8.819582879892762]
We propose a data programming approach that aggregates weak supervision signals to generate labeled data easily.
The quality of the generated labels depends on a label aggregation model that combines the noisy labels from all labeling functions (LFs) to infer the ground-truth labels.
We show the model can be trained using synthetically generated data and design an effective architecture for the model.
arXiv Detail & Related papers (2022-07-27T14:36:35Z)
- Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition [98.25592165484737]
We propose a more effective pseudo-labeling scheme, called Cross-Model Pseudo-Labeling (CMPL).
CMPL achieves 17.6% and 25.1% Top-1 accuracy on Kinetics-400 and UCF-101, respectively, using only the RGB modality and 1% labeled data.
arXiv Detail & Related papers (2021-12-17T18:59:41Z)
- SLADE: A Self-Training Framework For Distance Metric Learning [75.54078592084217]
We present a self-training framework, SLADE, to improve retrieval performance by leveraging additional unlabeled data.
We first train a teacher model on the labeled data and use it to generate pseudo labels for the unlabeled data.
We then train a student model on both the labeled data and the pseudo labels to generate the final feature embeddings.
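A minimal sketch of this teacher-student pipeline, with scikit-learn classifiers standing in for the paper's deep metric-learning models (illustrative assumptions, not the authors' code):

```python
# Teacher-student pseudo-labeling in miniature: train a teacher on labeled
# data, pseudo-label the unlabeled pool, then train a student on the union.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(100, 8))
y_labeled = (X_labeled[:, 0] > 0).astype(int)  # synthetic labels
X_unlabeled = rng.normal(size=(300, 8))

# Step 1: train a teacher on the labeled data.
teacher = LogisticRegression().fit(X_labeled, y_labeled)

# Step 2: use the teacher to pseudo-label the unlabeled pool.
pseudo_labels = teacher.predict(X_unlabeled)

# Step 3: train a student on labeled + pseudo-labeled data; in SLADE the
# student's embeddings serve as the final retrieval features.
X_all = np.vstack([X_labeled, X_unlabeled])
y_all = np.concatenate([y_labeled, pseudo_labels])
student = LogisticRegression().fit(X_all, y_all)
```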
arXiv Detail & Related papers (2020-11-20T08:26:10Z)
- Handling Missing Data with Graph Representation Learning [62.59831675688714]
We propose GRAPE, a graph-based framework for feature imputation as well as label prediction.
Under GRAPE, the feature imputation is formulated as an edge-level prediction task and the label prediction as a node-level prediction task.
Experimental results on nine benchmark datasets show that GRAPE yields 20% lower mean absolute error for imputation tasks and 10% lower for label prediction tasks.
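A sketch of the bipartite graph this formulation implies, with one node per sample, one per feature, an edge weighted by each observed value, and missing entries as the edge-level prediction targets (my reading of the summary, not the authors' code):

```python
# Bipartite observation graph for feature imputation: samples on one side,
# features on the other; observed values become weighted edges, and missing
# entries become edge-level targets. Illustrative sketch only.
import numpy as np
import networkx as nx

X = np.array([[1.0, np.nan, 3.0],
              [np.nan, 2.0, 0.5]])  # rows = samples, cols = features

G = nx.Graph()
n_samples, n_features = X.shape
samples = [f"sample_{i}" for i in range(n_samples)]
features = [f"feature_{j}" for j in range(n_features)]
G.add_nodes_from(samples, bipartite=0)
G.add_nodes_from(features, bipartite=1)

missing = []  # edge-level prediction targets for imputation
for i in range(n_samples):
    for j in range(n_features):
        if np.isnan(X[i, j]):
            missing.append((samples[i], features[j]))
        else:
            G.add_edge(samples[i], features[j], weight=X[i, j])

# Imputation = predicting a value for each pair in `missing`;
# label prediction = a per-sample-node task on the same graph.
print(list(G.edges(data=True)), missing)
```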
arXiv Detail & Related papers (2020-10-30T17:59:13Z)
- Data Cleansing with Contrastive Learning for Vocal Note Event Annotations [1.859931123372708]
We propose a novel data cleansing model for time-varying, structured labels.
Our model is trained in a contrastive learning manner by automatically creating local deformations of likely correct labels.
We demonstrate that the accuracy of a transcription model improves greatly when trained using our proposed strategy.
arXiv Detail & Related papers (2020-08-05T12:24:37Z)
- Semi-Supervised Models via Data Augmentation for Classifying Interactive Affective Responses [85.04362095899656]
We present semi-supervised models with data augmentation (SMDA), a semi-supervised text classification system to classify interactive affective responses.
For labeled sentences, we performed data augmentation to make the label distributions uniform and computed the supervised loss during training.
For unlabeled sentences, we explored self-training by regarding low-entropy predictions over unlabeled sentences as pseudo labels.
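A minimal sketch of that self-training step: keep only unlabeled examples whose predicted distribution has low entropy and take the argmax as the pseudo label (the names and threshold are illustrative assumptions, not the paper's values):

```python
# Select confident pseudo labels by entropy-thresholding the predicted
# class distributions of unlabeled examples. Illustrative sketch only.
import numpy as np

def select_pseudo_labels(probs: np.ndarray, max_entropy: float = 0.3):
    """probs: (n_examples, n_classes) predicted distributions.

    Returns the indices of low-entropy examples and their argmax labels."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    keep = entropy < max_entropy
    return np.flatnonzero(keep), probs[keep].argmax(axis=1)

probs = np.array([[0.97, 0.02, 0.01],   # confident -> kept
                  [0.40, 0.35, 0.25]])  # uncertain -> dropped
idx, labels = select_pseudo_labels(probs)
print(idx, labels)  # [0] [0]
```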
arXiv Detail & Related papers (2020-04-23T05:02:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.