Snippext: Semi-supervised Opinion Mining with Augmented Data
- URL: http://arxiv.org/abs/2002.03049v1
- Date: Fri, 7 Feb 2020 23:54:23 GMT
- Title: Snippext: Semi-supervised Opinion Mining with Augmented Data
- Authors: Zhengjie Miao, Yuliang Li, Xiaolan Wang, Wang-Chiew Tan
- Abstract summary: Snippext is an opinion mining system developed over a language model that is fine-tuned through semi-supervised learning with augmented data.
A novelty of Snippext is its clever use of a two-prong approach to achieve state-of-the-art (SOTA) performance with little labeled training data.
- Score: 22.07271774127334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online services are interested in solutions to opinion mining, which is the
problem of extracting aspects, opinions, and sentiments from text. One method
to mine opinions is to leverage the recent success of pre-trained language
models which can be fine-tuned to obtain high-quality extractions from reviews.
However, fine-tuning language models still requires a non-trivial amount of
training data. In this paper, we study the problem of how to significantly
reduce the amount of labeled training data required in fine-tuning language
models for opinion mining. We describe Snippext, an opinion mining system
developed over a language model that is fine-tuned through semi-supervised
learning with augmented data. A novelty of Snippext is its clever use of a
two-prong approach to achieve state-of-the-art (SOTA) performance with little
labeled training data through: (1) data augmentation to automatically generate
more labeled training data from existing ones, and (2) a semi-supervised
learning technique to leverage the massive amount of unlabeled data in addition
to the (limited amount of) labeled data. We show with extensive experiments
that Snippext performs comparably to, and can even exceed, previous SOTA results on
several opinion mining tasks with only half the training data required.
Furthermore, it achieves new SOTA results when all training data are leveraged.
Compared to a baseline pipeline, we found that Snippext extracts significantly more fine-grained opinions, which enable new opportunities for downstream applications.
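To make the two-prong approach concrete, the sketch below illustrates the two ingredients the abstract names: generating additional labeled examples by perturbing existing ones, and converting confident predictions on unlabeled reviews into extra training pairs. This is a minimal illustration, not Snippext's implementation; the token-swap augmentation and the `model.predict` interface are hypothetical stand-ins for the operators and semi-supervised objective described in the paper.

```python
import random

def augment(tokens, labels, vocab, p_replace=0.1):
    """Prong 1 (illustrative): build an extra labeled example by swapping a few
    tokens that carry no aspect/opinion tag; the tag sequence is reused as-is."""
    new_tokens = []
    for tok, lab in zip(tokens, labels):
        if lab == "O" and random.random() < p_replace:
            new_tokens.append(random.choice(vocab))  # substitute a random vocabulary word
        else:
            new_tokens.append(tok)
    return new_tokens, list(labels)

def pseudo_label(model, unlabeled_texts, threshold=0.9):
    """Prong 2 (illustrative): a generic self-training step that keeps only
    high-confidence predictions on unlabeled reviews as extra training pairs.
    `model.predict(text) -> (label, confidence)` is a hypothetical interface."""
    extra = []
    for text in unlabeled_texts:
        label, confidence = model.predict(text)
        if confidence >= threshold:
            extra.append((text, label))
    return extra

if __name__ == "__main__":
    tokens = ["the", "pasta", "was", "amazing"]
    labels = ["O", "B-ASPECT", "O", "B-OPINION"]
    filler_vocab = ["really", "quite", "very", "honestly"]
    print(augment(tokens, labels, filler_vocab, p_replace=0.5))
```

Per the abstract, both kinds of extra examples are used to fine-tune the underlying language model rather than a separate classifier.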
Related papers
- Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models [107.24906866038431]
We propose REWIRE, REcycling the Web with guIded REwrite, to enrich low-quality documents so that they become useful for training.
We show that mixing high-quality raw texts and our rewritten texts leads to improvements of 1.0, 1.3, and 2.5 percentage points, respectively, across 22 diverse tasks.
arXiv Detail & Related papers (2025-06-05T07:12:12Z)
- TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment [30.93798042712827]
Training language models (LMs) and their application agents is increasingly costly due to large datasets and models.
We propose a pipeline to refine text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns.
Our experiments show that leaner pre-training boosts LM learning efficiency.
arXiv Detail & Related papers (2024-12-31T16:08:15Z)
- Investigating the Impact of Semi-Supervised Methods with Data Augmentation on Offensive Language Detection in Romanian Language [2.2823100315094624]
Offensive language detection is a crucial task in today's digital landscape.
Building robust offensive language detection models requires large amounts of labeled data.
Semi-supervised learning offers a feasible solution by utilizing labeled and unlabeled data.
arXiv Detail & Related papers (2024-07-29T15:02:51Z)
- Scalable Extraction of Training Data from (Production) Language Models [93.7746567808049]
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset.
We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT.
arXiv Detail & Related papers (2023-11-28T18:47:03Z)
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that intelligently repeating data consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
- Bag of Tricks for Training Data Extraction from Language Models [98.40637430115204]
We investigate and benchmark tricks for improving training data extraction using a publicly available dataset.
The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction.
arXiv Detail & Related papers (2023-02-09T06:46:42Z)
- Weakly Supervised Scene Text Detection using Deep Reinforcement Learning [6.918282834668529]
We propose a weak supervision method for scene text detection that makes use of reinforcement learning (RL).
The reward received by the RL agent is estimated by a neural network, instead of being inferred from ground-truth labels.
We then apply our proposed system to weakly- and semi-supervised training on real-world data.
arXiv Detail & Related papers (2022-01-13T10:15:42Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
- Ranking Creative Language Characteristics in Small Data Scenarios [52.00161818003478]
We adapt the DirectRanker to provide a new deep model for ranking creative language with small data.
Our experiments with sparse training data show that while the performance of standard neural ranking approaches collapses with small datasets, DirectRanker remains effective.
arXiv Detail & Related papers (2020-10-23T18:57:47Z)
- Self-training Improves Pre-training for Natural Language Understanding [63.78927366363178]
We study self-training as another way to leverage unlabeled data through semi-supervised learning.
We introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data.
Our approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks.
arXiv Detail & Related papers (2020-10-05T17:52:25Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
We further propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.