FreSaDa: A French Satire Data Set for Cross-Domain Satire Detection
- URL: http://arxiv.org/abs/2104.04828v1
- Date: Sat, 10 Apr 2021 18:21:53 GMT
- Title: FreSaDa: A French Satire Data Set for Cross-Domain Satire Detection
- Authors: Radu Tudor Ionescu, Adrian Gabriel Chifu
- Abstract summary: FreSaDa is a French Satire Data Set composed of 11,570 articles from the news domain.
We employ two classification methods as baselines for our new data set.
- Score: 18.059360820527687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce FreSaDa, a French Satire Data Set, which is
composed of 11,570 articles from the news domain. In order to avoid reporting
unreasonably high accuracy rates due to the learning of characteristics
specific to publication sources, we divided our samples into training,
validation and test, such that the training publication sources are distinct
from the validation and test publication sources. This gives rise to a
cross-domain (cross-source) satire detection task. We employ two classification
methods as baselines for our new data set, one based on low-level features
(character n-grams) and one based on high-level features (average of CamemBERT
word embeddings). As an additional contribution, we present an unsupervised
domain adaptation method based on regarding the pairwise similarities (given by
the dot product) between the training samples and the validation samples as
features. By including these domain-specific features, we attain significant
improvements for both character n-grams and CamemBERT embeddings.
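Below is a minimal sketch of this setup, assuming toy placeholder data, a TF-IDF character n-gram vectorizer, and a logistic-regression classifier standing in for the paper's exact baselines; the key step is concatenating each sample's dot-product similarities to the unlabeled validation samples as extra domain-specific features.
```python
# Minimal sketch of the FreSaDa domain-adaptation idea; not the authors'
# exact pipeline. Data, vectorizer settings, and classifier are placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder data; FreSaDa provides the real train/validation/test splits
# with disjoint publication sources.
train_texts = ["un article satirique inventé", "un article factuel ordinaire"]
train_labels = [1, 0]  # 1 = satire, 0 = regular news
valid_texts = ["un papier d'une source de validation"]
test_texts = ["un papier d'une source de test"]

# Low-level baseline features: character n-grams.
vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
X_train = vec.fit_transform(train_texts).toarray()
X_valid = vec.transform(valid_texts).toarray()
X_test = vec.transform(test_texts).toarray()

# Domain-specific features: dot-product similarities between each sample
# and the unlabeled validation samples from the unseen sources.
S_train = X_train @ X_valid.T
S_test = X_test @ X_valid.T

# Concatenate original and domain-specific features.
X_train_aug = np.hstack([X_train, S_train])
X_test_aug = np.hstack([X_test, S_test])

clf = LogisticRegression(max_iter=1000).fit(X_train_aug, train_labels)
print(clf.predict(X_test_aug))
```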
Related papers
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets [1.734165485480267]
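placeholder-see-32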
We propose a new tool for automatically annotating text using written guidelines without providing training samples.
Our results show that the prompt-based approach is comparable with the fine-tuned BERT but without any annotated training data.
Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and elimination of the need for pre-labeled training data.
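As an illustration of labeling without annotated training data (a stand-in, not the paper's exact tool), the Hugging Face zero-shot-classification pipeline can score text against candidate topic labels derived from written guidelines:
```python
# Illustrative zero-shot topic labeling; model choice and labels are
# placeholders, not the paper's exact prompt-based setup.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

tweet = "Die neue Studie zeigt steigende Temperaturen in Europa."
labels = ["climate", "politics", "sports"]  # labels from written guidelines

result = classifier(tweet, candidate_labels=labels)
print(result["labels"][0], result["scores"][0])  # top label and its score
```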
arXiv Detail & Related papers (2024-06-26T10:44:02Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
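One common way to realize such an adversarial objective is a DANN-style gradient reversal layer; the sketch below is illustrative and not necessarily the paper's exact algorithm, with placeholder dimensions and features.
```python
# DANN-style gradient reversal: the encoder is trained to fool a domain
# classifier, encouraging domain-invariant features. Illustrative only.
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing into the encoder.
        return -ctx.lambd * grad_output, None

encoder = nn.Sequential(nn.Linear(768, 256), nn.ReLU())
domain_head = nn.Linear(256, 2)  # predicts source vs. target domain

x = torch.randn(8, 768)              # placeholder retriever features
domains = torch.randint(0, 2, (8,))  # placeholder domain labels

feats = encoder(x)
domain_logits = domain_head(GradReverse.apply(feats, 1.0))
adv_loss = nn.functional.cross_entropy(domain_logits, domains)
adv_loss.backward()  # encoder gradients are reversed w.r.t. domain loss
```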
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- Boosting Few-Shot Text Classification via Distribution Estimation [38.99459686893034]
We propose two simple yet effective strategies to estimate the distributions of the novel classes by utilizing unlabeled query samples.
Specifically, we first assume each class or sample follows a Gaussian distribution, and estimate its parameters using the original support set and the nearest few query samples.
Then, we augment the labeled samples by sampling from the estimated distribution, which can provide sufficient supervision for training the classification model.
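A minimal sketch of this idea for one class, with placeholder feature dimensions and sample counts:
```python
# Fit a Gaussian per class from the support set plus the nearest unlabeled
# query samples, then sample extra training points. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
support = rng.normal(size=(5, 64))    # few labeled samples of one class
queries = rng.normal(size=(100, 64))  # unlabeled query samples

# Pick the query samples closest to the class prototype (support mean).
prototype = support.mean(axis=0)
dists = np.linalg.norm(queries - prototype, axis=1)
nearest = queries[np.argsort(dists)[:10]]

# Estimate the class distribution from support + nearest queries.
pool = np.vstack([support, nearest])
mean = pool.mean(axis=0)
cov = np.cov(pool, rowvar=False) + 1e-3 * np.eye(pool.shape[1])

# Sample augmented examples to supervise the classifier.
augmented = rng.multivariate_normal(mean, cov, size=50)
print(augmented.shape)
```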
arXiv Detail & Related papers (2023-03-26T05:58:39Z)
- CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address the distribution shift between training and test data by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed Class-Aware Feature Alignment (CAFA), which encourages a model to learn target representations in a class-discriminative manner while mitigating the distribution shift at test time.
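A hypothetical sketch of one such class-aware alignment term, pulling each test feature toward precomputed source statistics of its predicted class; the paper's exact loss may differ.
```python
# Class-aware alignment sketch: align test features with the source mean
# of their pseudo-labeled class. All tensors are placeholders.
import torch

def class_aware_alignment_loss(feats, logits, class_means):
    # feats: (B, D) test-time features; class_means: (C, D) source means.
    preds = logits.argmax(dim=1)       # pseudo-labels from the classifier
    target_means = class_means[preds]  # (B, D) per-sample class means
    return ((feats - target_means) ** 2).sum(dim=1).mean()

feats = torch.randn(8, 256, requires_grad=True)
logits = torch.randn(8, 10)
class_means = torch.randn(10, 256)  # precomputed from source data
loss = class_aware_alignment_loss(feats, logits, class_means)
loss.backward()
```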
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
- Semi-Supervised Domain Generalization with Stochastic StyleMatch [90.98288822165482]
In real-world applications, we might have only a few labels available from each source domain due to high annotation cost.
In this work, we investigate semi-supervised domain generalization, a more realistic and practical setting.
Our proposed approach, StyleMatch, is inspired by FixMatch, a state-of-the-art semi-supervised learning method based on pseudo-labeling.
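A minimal sketch of the FixMatch-style pseudo-labeling that StyleMatch builds on, with a placeholder model and pre-augmented inputs:
```python
# FixMatch-style loss: confident predictions on weakly augmented unlabeled
# inputs supervise predictions on strongly augmented versions.
import torch
import torch.nn.functional as F

def fixmatch_loss(model, weak_x, strong_x, threshold=0.95):
    with torch.no_grad():
        probs = F.softmax(model(weak_x), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()  # keep confident predictions only
    loss = F.cross_entropy(model(strong_x), pseudo, reduction="none")
    return (mask * loss).mean()

model = torch.nn.Linear(32, 4)  # placeholder classifier
weak_x, strong_x = torch.randn(8, 32), torch.randn(8, 32)
print(fixmatch_loss(model, weak_x, strong_x))
```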
arXiv Detail & Related papers (2021-06-01T16:00:08Z)
- Summary-Source Proposition-level Alignment: Task, Datasets and Supervised Baseline [94.0601799665342]
Aligning sentences in a reference summary with their counterparts in source documents has been shown to be a useful auxiliary summarization task.
We propose establishing summary-source alignment as an explicit task, while introducing two major novelties.
We create a novel training dataset for proposition-level alignment, derived automatically from available summarization evaluation data.
We present a supervised proposition alignment baseline model, showing improved alignment-quality over the unsupervised approach.
arXiv Detail & Related papers (2020-09-01T17:27:12Z)
- BREEDS: Benchmarks for Subpopulation Shift [98.90314444545204]
We develop a methodology for assessing the robustness of models to subpopulation shift.
We leverage the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions.
Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity.
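A toy sketch of such a subpopulation split, with a hypothetical two-superclass hierarchy in place of the actual ImageNet one:
```python
# Assign disjoint subclasses of each superclass to train vs. test so that
# test images come from unseen subpopulations. Hierarchy is a toy example.
hierarchy = {
    "dog": ["beagle", "husky", "poodle", "corgi"],
    "cat": ["siamese", "persian", "tabby", "sphynx"],
}

train_sub, test_sub = {}, {}
for superclass, subclasses in hierarchy.items():
    half = len(subclasses) // 2
    train_sub[superclass] = subclasses[:half]  # seen subpopulations
    test_sub[superclass] = subclasses[half:]   # unseen subpopulations

# A model trained to predict superclasses from train_sub images is then
# evaluated on images drawn only from test_sub.
print(train_sub, test_sub)
```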
arXiv Detail & Related papers (2020-08-11T17:04:47Z)
- Inductive Unsupervised Domain Adaptation for Few-Shot Classification via Clustering [16.39667909141402]
Few-shot classification tends to struggle when it needs to adapt to diverse domains.
We introduce a framework, DaFeC, to improve Domain adaptation performance for Few-shot classification via Clustering.
Our approach outperforms previous work with absolute gains in classification accuracy of 4.95%, 9.55%, 3.99%, and 11.62% across four evaluation settings.
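A minimal sketch of clustering-based pseudo-labeling on unlabeled target-domain features (illustrative, not DaFeC's exact procedure):
```python
# Cluster unlabeled target-domain features, treat cluster assignments as
# pseudo-labels, and train a classifier on them. Features are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
target_feats = rng.normal(size=(200, 64))  # placeholder encoder features

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(target_feats)
pseudo_labels = kmeans.labels_

# Train on the pseudo-labeled target features.
clf = LogisticRegression(max_iter=1000).fit(target_feats, pseudo_labels)
```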
arXiv Detail & Related papers (2020-06-23T08:17:48Z)
- Heavy-tailed Representations, Text Polarity Classification & Data Augmentation [11.624944730002298]
We develop a novel method to learn a heavy-tailed embedding with desirable regularity properties.
A classifier dedicated to the tails of the proposed embedding is obtained, whose performance outperforms the baseline.
Numerical experiments on synthetic and real text data demonstrate the relevance of the proposed framework.
arXiv Detail & Related papers (2020-03-25T19:24:05Z)