Snippext: Semi-supervised Opinion Mining with Augmented Data
- URL: http://arxiv.org/abs/2002.03049v1
- Date: Fri, 7 Feb 2020 23:54:23 GMT
- Title: Snippext: Semi-supervised Opinion Mining with Augmented Data
- Authors: Zhengjie Miao, Yuliang Li, Xiaolan Wang, Wang-Chiew Tan
- Abstract summary: Snippext is an opinion mining system developed over a language model that is fine-tuned through semi-supervised learning with augmented data.
A novelty of Snippext is its clever use of a two-pronged approach to achieve state-of-the-art (SOTA) performance with little labeled training data.
- Score: 22.07271774127334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online services are interested in solutions to opinion mining, which is the
problem of extracting aspects, opinions, and sentiments from text. One method
to mine opinions is to leverage the recent success of pre-trained language
models which can be fine-tuned to obtain high-quality extractions from reviews.
However, fine-tuning language models still requires a non-trivial amount of
training data. In this paper, we study the problem of how to significantly
reduce the amount of labeled training data required in fine-tuning language
models for opinion mining. We describe Snippext, an opinion mining system
developed over a language model that is fine-tuned through semi-supervised
learning with augmented data. A novelty of Snippext is its clever use of a
two-pronged approach to achieve state-of-the-art (SOTA) performance with little
labeled training data through: (1) data augmentation to automatically generate
more labeled training data from existing ones, and (2) a semi-supervised
learning technique to leverage the massive amount of unlabeled data in addition
to the (limited amount of) labeled data. We show with extensive experiments
that Snippext performs comparably and can even exceed previous SOTA results on
several opinion mining tasks with only half the training data required.
Furthermore, it achieves new SOTA results when all training data are leveraged.
Compared to a baseline pipeline, we found that Snippext extracts
significantly more fine-grained opinions, which enable new opportunities for
downstream applications.
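The two-pronged recipe above can be sketched in miniature: a toy augmentation operator that swaps out tokens while keeping the label sequence aligned, plus a confidence-thresholded pseudo-labeling pass over unlabeled data. The synonym table, tag names, and `predict` interface are illustrative assumptions, not Snippext's actual augmentation operators or its semi-supervised objective.

```python
import random

# Prong 1: data augmentation. Generate extra labeled examples by
# replacing O-labeled tokens with synonyms. The synonym table is a toy
# stand-in for the augmentation operators the paper actually uses.
SYNONYMS = {"great": ["excellent", "superb"], "bad": ["poor", "awful"]}

def augment(tokens, labels, rng):
    """Return a copy of (tokens, labels) with O-labeled tokens swapped
    for synonyms, keeping the label sequence aligned."""
    new_tokens = []
    for tok, lab in zip(tokens, labels):
        if lab == "O" and tok in SYNONYMS:
            new_tokens.append(rng.choice(SYNONYMS[tok]))
        else:
            new_tokens.append(tok)
    return new_tokens, list(labels)

# Prong 2: semi-supervised pseudo-labeling. Label unlabeled examples
# with the current model and keep only confident predictions.
def pseudo_label(unlabeled, predict, threshold=0.9):
    kept = []
    for tokens in unlabeled:
        labels, confidence = predict(tokens)
        if confidence >= threshold:
            kept.append((tokens, labels))
    return kept
```

Both outputs would then be mixed into the fine-tuning set, so the model sees far more (noisy) supervision than the original labeled data alone provides.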
Related papers
- Scalable Extraction of Training Data from (Production) Language Models [93.7746567808049]
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset.
We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT.
arXiv Detail & Related papers (2023-11-28T18:47:03Z) - D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that repeating data intelligently consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z) - Bag of Tricks for Training Data Extraction from Language Models [98.40637430115204]
We investigate and benchmark tricks for improving training data extraction using a publicly available dataset.
The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction.
arXiv Detail & Related papers (2023-02-09T06:46:42Z) - Weakly Supervised Scene Text Detection using Deep Reinforcement Learning [6.918282834668529]
We propose a weak supervision method for scene text detection that makes use of reinforcement learning (RL).
The reward received by the RL agent is estimated by a neural network, instead of being inferred from ground-truth labels.
We then use our proposed system in a weakly- and semi-supervised training on real-world data.
arXiv Detail & Related papers (2022-01-13T10:15:42Z) - From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection [8.602181445598776]
We propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection.
Our experiments show that the proposed pipeline boosts performance significantly, achieving a new state of the art on the Vietnamese Hate Speech Detection campaign with an F1 score of 0.7221.
arXiv Detail & Related papers (2021-02-24T09:30:55Z) - DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
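DAGA's core idea of linearizing labeled sentences (folding the tag labels into the token stream so an ordinary language model can be trained on, and generate, labeled data) can be sketched as follows. The BIO tag names and the round-trip helper are illustrative, not DAGA's exact scheme.

```python
def linearize(tokens, tags):
    """Insert each non-O label token before its word, turning a tagged
    sentence into one flat sequence a plain language model can learn."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)
        out.append(tok)
    return " ".join(out)

def delinearize(sequence, tagset):
    """Recover (tokens, tags) from a linearized sequence, so sentences
    sampled from the trained LM become new labeled training data."""
    tokens, tags = [], []
    pending = "O"
    for piece in sequence.split():
        if piece in tagset:
            pending = piece
        else:
            tokens.append(piece)
            tags.append(pending)
            pending = "O"
    return tokens, tags
```

A language model trained on such linearized strings can then sample novel sequences, which are delinearized back into synthetic tagged sentences for low-resource tagging tasks.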
arXiv Detail & Related papers (2020-11-03T07:49:15Z) - Ranking Creative Language Characteristics in Small Data Scenarios [52.00161818003478]
We adapt the DirectRanker to provide a new deep model for ranking creative language with small data.
Our experiments with sparse training data show that while the performance of standard neural ranking approaches collapses with small datasets, DirectRanker remains effective.
arXiv Detail & Related papers (2020-10-23T18:57:47Z) - Self-training Improves Pre-training for Natural Language Understanding [63.78927366363178]
We study self-training as another way to leverage unlabeled data through semi-supervised learning.
We introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data.
Our approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks.
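A rough sketch of the retrieval step behind this kind of self-training: compute a task-specific query embedding as the mean of the labeled-sentence embeddings, then rank a large unlabeled bank by cosine similarity to that query. The bag-of-characters encoder below is a toy stand-in for the sentence encoder SentAugment actually uses; only the retrieval logic is the point.

```python
import math

def embed(sentence):
    # Toy bag-of-characters embedding (normalized letter counts),
    # standing in for a real sentence encoder.
    vec = [0.0] * 26
    for ch in sentence.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def task_query(labeled_sentences):
    """Mean of the labeled-data embeddings: the task-specific query."""
    vecs = [embed(s) for s in labeled_sentences]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def retrieve(query, unlabeled, k=2):
    """Rank the unlabeled bank by cosine similarity to the query and
    return the top-k sentences for pseudo-labeling."""
    scored = sorted(
        unlabeled,
        key=lambda s: -sum(a * b for a, b in zip(query, embed(s))))
    return scored[:k]
```

The retrieved sentences are then pseudo-labeled by the current model and added to the training set, which is what makes the self-training loop scale.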
arXiv Detail & Related papers (2020-10-05T17:52:25Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep the resulting dataset tractable, we propose to apply a dataset distillation strategy to compress it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.