A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging
- URL: http://arxiv.org/abs/2004.14312v1
- Date: Wed, 29 Apr 2020 16:36:38 GMT
- Title: A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging
- Authors: Shabnam Behzad, Amir Zeldes
- Abstract summary: We study how a state-of-the-art tagging model trained on different genres performs on Web content from unfiltered Reddit forum discussions.
Our results show that even small amounts of in-domain data can outperform the contribution of data from other Web domains.
- Score: 10.609715843964263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Part of speech tagging is a fundamental NLP task often regarded as solved for
high-resource languages such as English. Current state-of-the-art models have
achieved high accuracy, especially on the news domain. However, when these
models are applied to other corpora with different genres, and especially
user-generated data from the Web, we see substantial drops in performance. In
this work, we study how a state-of-the-art tagging model trained on different
genres performs on Web content from unfiltered Reddit forum discussions. More
specifically, we use data from multiple sources: OntoNotes, a large benchmark
corpus with 'well-edited' text, the English Web Treebank with 5 Web genres, and
GUM, with 7 further genres other than Reddit. We report results from training
on different splits of this data and testing on Reddit. Our results show that
even small amounts of in-domain data can outperform the contribution of data
from other Web domains that is an order of magnitude larger. To make
progress on out-of-domain tagging, we also evaluate an ensemble approach using
multiple single-genre taggers as input features to a meta-classifier. We
present state-of-the-art performance on tagging Reddit data, along with an
error analysis of these models' results, and offer a typology of the most
common error types among them, broken down by training corpus.
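The ensemble is essentially a stacking setup: each pre-trained single-genre tagger predicts a tag per token, and the meta-classifier learns from those predictions which tagger to trust. Below is a minimal sketch of that idea in Python with scikit-learn; the tagger names (onto, ewt, gum), the toy votes, and the choice of logistic regression are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal stacking sketch: per-token tag votes from hypothetical
# single-genre taggers are one-hot encoded and fed to a meta-classifier.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy per-token predictions from three assumed single-genre taggers;
# in practice these would come from taggers trained on OntoNotes,
# the EWT genres, and the individual GUM genres.
tagger_votes = [
    {"onto": "NOUN", "ewt": "NOUN", "gum": "NOUN"},
    {"onto": "NOUN", "ewt": "VERB", "gum": "VERB"},  # taggers disagree
    {"onto": "ADJ",  "ewt": "ADJ",  "gum": "NOUN"},
    {"onto": "VERB", "ewt": "VERB", "gum": "VERB"},
]
gold_tags = ["NOUN", "VERB", "ADJ", "VERB"]  # gold labels for the same tokens

# One-hot encode the categorical votes and fit a simple meta-classifier.
meta = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
meta.fit(tagger_votes, gold_tags)

# At test time, combine the genre taggers' votes on unseen Reddit tokens.
print(meta.predict([{"onto": "NOUN", "ewt": "VERB", "gum": "VERB"}]))
```

Under this framing, the finding that small amounts of in-domain data go a long way corresponds to the meta-classifier needing only a modest annotated Reddit sample to learn how reliable each genre tagger is.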
Related papers
- Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data [3.2771631221674333]
We model the detection of topic-related content as a binary classification task.
Using only a few hundred annotated data points per topic, we detect content related to three German policies.
arXiv Detail & Related papers (2024-07-23T14:31:59Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- A Curriculum Learning Approach for Multi-domain Text Classification Using Keyword Weight Ranking [17.71297141482757]
We propose to use a curriculum learning strategy based on keyword weight ranking to improve the performance of multi-domain text classification models.
Experimental results on the Amazon review and FDU-MTL datasets show that this strategy effectively improves the performance of multi-domain text classification models.
arXiv Detail & Related papers (2022-10-27T03:15:26Z)
- Detect Hate Speech in Unseen Domains using Multi-Task Learning: A Case Study of Political Public Figures [7.52579126252489]
We propose a new Multi-Task Learning (MTL) pipeline that trains simultaneously across multiple hate speech datasets.
We show strong results when examining generalization error in train-test splits and substantial improvements when predicting on previously unseen datasets.
We also assemble a novel dataset, dubbed PubFigs, focusing on the problematic speech of American Public Political Figures.
arXiv Detail & Related papers (2022-08-22T21:13:38Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model with character language models trained on varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects the model's size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Is POS Tagging Necessary or Even Helpful for Neural Dependency Parsing? [22.93722845643562]
We show that POS tagging can still significantly improve parsing performance when using the Stack joint framework.
Considering that it is much cheaper to annotate POS tags than parse trees, we also investigate the utilization of large-scale heterogeneous POS tag data.
arXiv Detail & Related papers (2020-03-06T13:47:30Z)