Data Science Kitchen at GermEval 2021: A Fine Selection of Hand-Picked
Features, Delivered Fresh from the Oven
- URL: http://arxiv.org/abs/2109.02383v1
- Date: Mon, 6 Sep 2021 12:00:29 GMT
- Title: Data Science Kitchen at GermEval 2021: A Fine Selection of Hand-Picked
Features, Delivered Fresh from the Oven
- Authors: Niclas Hildebrandt and Benedikt Boenninghoff and Dennis Orth and
Christopher Schymura
- Abstract summary: This paper presents the contribution of the Data Science Kitchen to the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments.
We combine semantic and writing style embeddings derived from pre-trained deep neural networks with additional numerical features, specifically designed for this task.
Our best submission achieved macro-averaged F1-scores of 66.8%, 69.9% and 72.5% for the identification of toxic, engaging, and fact-claiming comments, respectively.
- Score: 4.435835732946953
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the contribution of the Data Science Kitchen to the
GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming
comments. The task aims at extending the identification of offensive language
by including additional subtasks that identify comments which should be
prioritized for fact-checking by moderators and community managers. Our
contribution focuses on a feature-engineering approach with a conventional
classification backend. We combine semantic and writing style embeddings
derived from pre-trained deep neural networks with additional numerical
features, specifically designed for this task. Ensembles of Logistic Regression
classifiers and Support Vector Machines are used to derive predictions for each
subtask via a majority voting scheme. Our best submission achieved
macro-averaged F1-scores of 66.8%, 69.9% and 72.5% for the identification of
toxic, engaging, and fact-claiming comments, respectively.
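
The majority voting scheme and the macro-averaged F1 metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `majority_vote` and `macro_f1`, the tie-breaking rule, and the toy labels are all assumptions made for illustration. The sketch also uses the common per-class-average definition of macro F1, which may differ from the shared task's official scorer.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine the label lists of several classifiers into one label per comment."""
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all classifiers must predict the same number of items")
    voted = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Tie-break (an assumption, not from the paper): smallest tied label wins.
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical binary predictions (1 = toxic) from three classifiers on four comments.
ensemble_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
voted_labels = majority_vote(ensemble_predictions)
gold_labels = [1, 0, 1, 0]
score = macro_f1(gold_labels, voted_labels)
```

With an odd number of binary classifiers, as here, ties cannot occur, so the tie-breaking rule only matters for even-sized ensembles or multi-class labels.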
Related papers
- USTHB at NADI 2023 shared task: Exploring Preprocessing and Feature Engineering Strategies for Arabic Dialect Identification [0.0]
We investigate the effects of surface preprocessing, morphological preprocessing, FastText vector model, and the weighted concatenation of TF-IDF features.
During the evaluation phase, our system demonstrates noteworthy results, achieving an F1 score of 62.51%.
(2023-12-16T20:23:53Z)
- A ML-LLM pairing for better code comment classification [0.0]
We answer the code comment classification shared task challenge by providing a two-fold evaluation.
Our best model, which took second place in the shared task, is a Neural Network with a Macro-F1 score of 88.401% on the provided seed data.
(2023-10-13T12:43:13Z)
- Grounded Keys-to-Text Generation: Towards Factual Open-Ended Generation [92.1582872870226]
We propose a new grounded keys-to-text generation task.
The task is to generate a factual description about an entity given a set of guiding keys, and grounding passages.
Inspired by recent QA-based evaluation measures, we propose an automatic metric, MAFE, for factual correctness of generated descriptions.
(2022-12-04T23:59:41Z)
- Association Graph Learning for Multi-Task Classification with Category Shifts [68.58829338426712]
We focus on multi-task classification, where related classification tasks share the same label space and are learned simultaneously.
We learn an association graph to transfer knowledge among tasks for missing classes.
Our method consistently performs better than representative baselines.
(2022-10-10T12:37:41Z)
- UU-Tax at SemEval-2022 Task 3: Improving the generalizability of language models for taxonomy classification through data augmentation [0.0]
This paper addresses the SemEval-2022 Task 3 PreTENS: Presupposed Taxonomies evaluating Neural Network Semantics.
The goal of the task is to identify if a sentence is deemed acceptable or not, depending on the taxonomic relationship that holds between a noun pair contained in the sentence.
We propose an effective way to enhance the robustness and the generalizability of language models for better classification.
(2022-10-07T07:41:28Z)
- BEIKE NLP at SemEval-2022 Task 4: Prompt-Based Paragraph Classification for Patronizing and Condescending Language Detection [13.944149742291788]
PCL detection task is aimed at identifying language that is patronizing or condescending towards vulnerable communities in the general media.
In this paper, we give an introduction to our solution, which exploits the power of prompt-based learning on paragraph classification.
(2022-08-02T08:38:47Z)
- Overview of ADoBo 2021: Automatic Detection of Unassimilated Borrowings in the Spanish Press [8.950918531231158]
This paper summarizes the main findings of the ADoBo 2021 shared task, proposed in the context of IberLef 2021.
In this task, we invited participants to detect lexical borrowings (coming mostly from English) in Spanish newswire texts.
We provided participants with an annotated corpus of lexical borrowings which we split into training, development and test splits.
(2021-10-29T11:07:59Z)
- Detecting Handwritten Mathematical Terms with Sensor Based Data [71.84852429039881]
We propose a solution to the UbiComp 2021 Challenge by Stabilo in which handwritten mathematical terms are supposed to be automatically classified.
The input data set contains data of different writers, with label strings constructed from a total of 15 different possible characters.
(2021-09-12T19:33:34Z)
- Out-distribution aware Self-training in an Open World Setting [62.19882458285749]
We leverage unlabeled data in an open world setting to further improve prediction performance.
We introduce out-distribution aware self-training, which includes a careful sample selection strategy.
Our classifiers are by design out-distribution aware and can thus distinguish task-related inputs from unrelated ones.
(2020-12-21T12:25:04Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
(2020-10-13T21:33:24Z)
- SemEval-2020 Task 10: Emphasis Selection for Written Text in Visual Media [50.29389719723529]
We present the main findings and compare the results of SemEval-2020 Task 10, Emphasis Selection for Written Text in Visual Media.
The goal of this shared task is to design automatic methods for emphasis selection.
The analysis of systems submitted to the task indicates that BERT and RoBERTa were the most common choice of pre-trained models used.
(2020-08-07T17:24:53Z)