Pseudo-Labels Are All You Need
- URL: http://arxiv.org/abs/2208.09243v1
- Date: Fri, 19 Aug 2022 09:52:41 GMT
- Title: Pseudo-Labels Are All You Need
- Authors: Bogdan Kostić and Mathis Lucka and Julian Risch
- Abstract summary: We present our submission to the Text Complexity DE Challenge 2022.
The goal is to predict the complexity of a German sentence for German learners at level B.
We find that the pseudo-label-based approach gives impressive results yet requires little to no adjustment to the specific task.
- Score: 3.52359746858894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically estimating the complexity of texts for readers has a variety of
applications, such as recommending texts with an appropriate complexity level
to language learners or supporting the evaluation of text simplification
approaches. In this paper, we present our submission to the Text Complexity DE
Challenge 2022, a regression task where the goal is to predict the complexity
of a German sentence for German learners at level B. Our approach relies on
more than 220,000 pseudo-labels created from the German Wikipedia and other
corpora to train Transformer-based models, and refrains from any feature
engineering or any additional, labeled data. We find that the
pseudo-label-based approach gives impressive results yet requires little to no
adjustment to the specific task and therefore could be easily adapted to other
domains and tasks.
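The recipe in the abstract lends itself to a compact sketch: a teacher model scores unlabeled sentences, and a student Transformer regressor is then trained on the resulting pseudo-labels. The sketch below is a minimal illustration under our own assumptions (the deepset/gbert-base checkpoint, the file path, batch sizes, and the single training pass are placeholders), not the authors' exact pipeline.

```python
# Minimal pseudo-label pipeline for sentence-complexity regression.
# Assumptions: a teacher already fine-tuned on the small labeled challenge
# data exists at "path/to/teacher" (hypothetical), and unlabeled German
# sentences come from a plain-text dump (hypothetical file name).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-base")  # German BERT; our choice

# Step 1: the teacher assigns a pseudo complexity score to each unlabeled sentence.
teacher = AutoModelForSequenceClassification.from_pretrained(
    "path/to/teacher", num_labels=1  # single-output regression head
).to(device).eval()

with open("wikipedia_sentences.txt") as f:  # hypothetical unlabeled corpus
    sentences = [line.strip() for line in f if line.strip()]

pseudo_labels = []
with torch.no_grad():
    for i in range(0, len(sentences), 64):
        batch = tokenizer(sentences[i:i + 64], padding=True, truncation=True,
                          max_length=128, return_tensors="pt").to(device)
        pseudo_labels.extend(teacher(**batch).logits.squeeze(-1).tolist())

# Step 2: fine-tune a fresh student regressor on the pseudo-labeled pairs.
student = AutoModelForSequenceClassification.from_pretrained(
    "deepset/gbert-base", num_labels=1, problem_type="regression"
).to(device).train()
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
loss_fn = torch.nn.MSELoss()

for i in range(0, len(sentences), 16):
    batch = tokenizer(sentences[i:i + 16], padding=True, truncation=True,
                      max_length=128, return_tensors="pt").to(device)
    targets = torch.tensor(pseudo_labels[i:i + 16], device=device)
    loss = loss_fn(student(**batch).logits.squeeze(-1), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice one would shuffle and batch the pseudo-labeled pairs properly, hold out the labeled challenge data for validation, and possibly iterate the teacher-student loop; the point here is only the shape of the pipeline.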
Related papers
- SEMQA: Semi-Extractive Multi-Source Question Answering [94.04430035121136]
We introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion.
We create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions.
arXiv Detail & Related papers (2023-11-08T18:46:32Z)
- Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond [66.07002187192448]
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task.
We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol.
We find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens.
arXiv Detail & Related papers (2023-10-09T00:20:59Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Natural Language Decomposition and Interpretation of Complex Utterances [47.30126929007346]
We introduce an approach to handle complex-intent-bearing utterances from a user via a process of hierarchical natural language decomposition.
Our approach uses a pre-trained language model to decompose a complex utterance into a sequence of simpler natural language steps.
Experiments show that the proposed approach enables the interpretation of complex utterances with almost no complex training data.
arXiv Detail & Related papers (2023-05-15T14:35:00Z)
- Measuring Annotator Agreement Generally across Complex Structured, Multi-object, and Free-text Annotation Tasks [79.24863171717972]
Inter-annotator agreement (IAA) is a key metric for quality assurance.
Measures exist for simple categorical and ordinal labeling tasks, but little work has considered more complex labeling tasks.
Krippendorff's alpha, best known for use with simpler labeling tasks, does have a distance-based formulation with broader applicability; its standard form is sketched after this entry.
arXiv Detail & Related papers (2022-12-15T20:12:48Z)
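For reference, the distance-based formulation alluded to above is the standard coincidence-matrix definition of Krippendorff's alpha, reproduced below in textbook notation rather than anything taken from the paper: delta is a distance between label values c and k, o_ck are observed coincidence counts, and n_c, n_k, n are marginals.

```latex
% Alpha is one minus the ratio of observed to expected disagreement;
% swapping in richer distance functions \delta extends it to complex labels.
\alpha = 1 - \frac{D_o}{D_e},
\qquad
D_o = \frac{1}{n} \sum_{c} \sum_{k} o_{ck}\, \delta^2(c,k),
\qquad
D_e = \frac{1}{n(n-1)} \sum_{c} \sum_{k} n_c\, n_k\, \delta^2(c,k)
```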
- Lexical Complexity Controlled Sentence Generation [6.298911438929862]
We introduce the novel task of lexical complexity controlled sentence generation.
It has enormous potential in domains such as graded reading, language teaching, and language acquisition.
We propose a simple but effective approach for this task based on a complexity embedding; one plausible form is sketched after this entry.
arXiv Detail & Related papers (2022-11-26T11:03:56Z)
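The phrase "complexity embedding" suggests one natural construction, sketched below as our guess rather than the paper's verified design: a learned vector per target complexity level, added to every token embedding so the model can condition generation on the requested level.

```python
# A guess at a "complexity embedding" (not the paper's verified design):
# one learned vector per complexity level, broadcast over the sequence.
import torch
import torch.nn as nn

class ComplexityConditionedEmbedding(nn.Module):
    def __init__(self, vocab_size=32000, n_levels=5, dim=512):
        super().__init__()
        self.tokens = nn.Embedding(vocab_size, dim)
        self.levels = nn.Embedding(n_levels, dim)  # one vector per target level

    def forward(self, token_ids, level_id):
        # (batch, seq, dim) + (batch, 1, dim): the level vector is added
        # to every token embedding in the sequence.
        return self.tokens(token_ids) + self.levels(level_id).unsqueeze(1)

emb = ComplexityConditionedEmbedding()
out = emb(torch.randint(0, 32000, (2, 10)), torch.tensor([0, 4]))
print(out.shape)  # torch.Size([2, 10, 512])
```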
- Domain Adaptation in Multilingual and Multi-Domain Monolingual Settings for Complex Word Identification [0.27998963147546146]
Complex word identification (CWI) is a cornerstone process towards proper text simplification.
CWI is highly dependent on context, and its difficulty is compounded by the scarcity of available datasets.
We propose a novel training technique for the CWI task based on domain adaptation to improve the target character and context representations.
arXiv Detail & Related papers (2022-05-15T13:21:02Z)
- Uniform Complexity for Text Generation [4.867923281108005]
We introduce Uniform Complexity for Text Generation (UCTG), a new benchmark that challenges generative models to maintain uniform linguistic properties with respect to their prompts.
We find that models such as GPT-2 struggle to preserve the complexity of their input prompts in their generations, even when finetuned on professionally written texts; a minimal drift check is sketched after this entry.
arXiv Detail & Related papers (2022-04-11T15:19:47Z)
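A minimal version of the drift check UCTG motivates could look like the sketch below; the choice of Flesch reading ease (via the textstat package) as the complexity proxy is our assumption, not the benchmark's actual metric suite.

```python
# Compare a readability score of the prompt against the model's continuation;
# a large gap is the kind of complexity drift UCTG is designed to expose.
from transformers import pipeline
import textstat

generator = pipeline("text-generation", model="gpt2")

prompt = ("The committee deliberated extensively on the ramifications "
          "of the proposed amendment.")
full_text = generator(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"]
continuation = full_text[len(prompt):]  # strip the echoed prompt

prompt_score = textstat.flesch_reading_ease(prompt)
cont_score = textstat.flesch_reading_ease(continuation)
print(f"prompt: {prompt_score:.1f}  continuation: {cont_score:.1f}  "
      f"drift: {cont_score - prompt_score:+.1f}")
```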
- ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations [97.27005783856285]
This paper introduces ASSET, a new dataset for assessing sentence simplification in English.
We show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.
arXiv Detail & Related papers (2020-05-01T16:44:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.