cs60075_team2 at SemEval-2021 Task 1 : Lexical Complexity Prediction using Transformer-based Language Models pre-trained on various text corpora
- URL: http://arxiv.org/abs/2106.02340v1
- Date: Fri, 4 Jun 2021 08:42:00 GMT
- Title: cs60075_team2 at SemEval-2021 Task 1 : Lexical Complexity Prediction using Transformer-based Language Models pre-trained on various text corpora
- Authors: Abhilash Nandy, Sayantan Adak, Tanurima Halder, Sai Mahesh Pokala
- Abstract summary: This paper describes the performance of the team cs60075_team2 at SemEval 2021 Task 1 - Lexical Complexity Prediction.
The main contribution of this paper is to fine-tune transformer-based language models pre-trained on several text corpora.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper describes the performance of the team cs60075_team2 at SemEval
2021 Task 1 - Lexical Complexity Prediction. The main contribution of this
paper is to fine-tune transformer-based language models pre-trained on several
text corpora, some general (e.g., Wikipedia, BooksCorpus), some drawn from the
corpora from which the CompLex dataset was extracted, and others from specific
domains such as finance and law. We perform ablation studies on the selection
of transformer models and on how their individual complexity scores are
aggregated into the final complexity score. Our method achieves a best Pearson
correlation of $0.784$ in sub-task 1 (single words) and $0.836$ in sub-task 2
(multi-word expressions).
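The abstract's recipe has two moving parts: per-model complexity regression over (context, target word) pairs and aggregation of the per-model scores. Below is a minimal sketch of both, assuming a Hugging Face-style single-output regression head; the checkpoint names, sentences, and gold labels are illustrative stand-ins, not the authors' actual models or data.

```python
# A minimal sketch, not the authors' exact pipeline: transformer regressors
# score a (context, target word) pair, and the per-model scores are averaged
# into an ensemble prediction, mirroring the fine-tune-and-aggregate recipe
# in the abstract. All names and numbers below are illustrative assumptions.
import numpy as np
import torch
from scipy.stats import pearsonr
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def predict_complexity(model_name, contexts, targets):
    """Score each (context, target word) pair with one transformer regressor."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # num_labels=1 gives a single-output regression head (it would be trained
    # with an MSE loss when fine-tuned on the continuous CompLex labels).
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=1
    )
    model.eval()
    # Encode the sentence and the target word as a sentence pair.
    batch = tokenizer(
        contexts, targets, padding=True, truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        return model(**batch).logits.squeeze(-1).numpy()


# Stand-ins for models pre-trained on different corpora (general, in-domain,
# finance, law, ...). Note these hub checkpoints come with randomly
# initialized regression heads; in practice each would be fine-tuned first.
checkpoints = ["bert-base-uncased", "roberta-base"]

contexts = [
    "The enzyme catalyses the hydrolysis of peptide bonds.",
    "The committee approved the budget.",
    "He walked to the shop.",
]
targets = ["hydrolysis", "budget", "shop"]
gold = np.array([0.62, 0.35, 0.10])  # toy continuous complexity labels

# Aggregate by simple averaging; the paper's ablations compare which models
# to include and how their individual scores are combined.
ensemble = np.mean(
    [predict_complexity(ckpt, contexts, targets) for ckpt in checkpoints], axis=0
)
print("ensemble scores:", ensemble)
print("Pearson r vs. gold:", pearsonr(ensemble, gold)[0])
```

Simple averaging is only one aggregation scheme; weighted combinations are an obvious alternative the ablation setup described above could compare.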
Related papers
- USB: A Unified Summarization Benchmark Across Tasks and Domains [68.82726887802856]
We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks.
We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models.
arXiv Detail & Related papers (2023-05-23T17:39:54Z)
- Transformer Based Implementation for Automatic Book Summarization [0.0]
Document Summarization is the procedure of generating a meaningful and concise summary of a given document.
This work is an attempt to use Transformer-based techniques for abstractive summary generation.
arXiv Detail & Related papers (2023-01-17T18:18:51Z) - Conciseness: An Overlooked Language Task [11.940413163824887]
We define the task and show that it is different from related tasks such as summarization and simplification.
We demonstrate that conciseness is a difficult task for which zero-shot setups with large neural language models often do not perform well.
arXiv Detail & Related papers (2022-11-08T09:47:11Z)
- Zemi: Learning Zero-Shot Semi-Parametric Language Models from Multiple Tasks [77.90900650816046]
We introduce Zemi, a zero-shot semi-parametric language model.
We train Zemi with a novel semi-parametric multitask prompted training paradigm.
Specifically, we augment the multitask training and zero-shot evaluation with retrieval from a large-scale task-agnostic unlabeled corpus.
arXiv Detail & Related papers (2022-10-01T04:08:50Z)
- Blessing of Class Diversity in Pre-training [54.335530406959435]
We prove that when the classes of the pre-training task are sufficiently diverse, pre-training can significantly improve the sample efficiency of downstream tasks.
Our proof relies on a vector-form Rademacher complexity chain rule for composite function classes and a modified self-concordance condition.
arXiv Detail & Related papers (2022-09-07T20:10:12Z)
- Domain Adaptation in Multilingual and Multi-Domain Monolingual Settings for Complex Word Identification [0.27998963147546146]
Complex word identification (CWI) is a cornerstone process towards proper text simplification.
CWI is highly dependent on context, and its difficulty is compounded by the scarcity of available datasets.
We propose a novel training technique for the CWI task based on domain adaptation to improve the target character and context representations.
arXiv Detail & Related papers (2022-05-15T13:21:02Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attention for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
- LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for Lexical Complexity Prediction [4.86331990243181]
This paper describes team LCP-RIT's submission to SemEval-2021 Task 1: Lexical Complexity Prediction (LCP).
Our system uses logistic regression and a wide range of linguistic features to predict the complexity of single words in this dataset.
We evaluate the results in terms of mean absolute error, mean squared error, Pearson correlation, and Spearman correlation.
arXiv Detail & Related papers (2021-05-18T18:55:04Z)
- Pre-training for Abstractive Document Summarization by Reinstating Source Text [105.77348528847337]
This paper presents three pre-training objectives which allow us to pre-train a Seq2Seq-based abstractive summarization model on unlabeled text.
Experiments on two benchmark summarization datasets show that all three objectives can improve performance upon baselines.
arXiv Detail & Related papers (2020-04-04T05:06:26Z)
- CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data [13.224233182417636]
This paper presents the first English dataset for continuous lexical complexity prediction.
We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts.
arXiv Detail & Related papers (2020-03-16T03:54:22Z)
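For a sense of how the CompLex entry's Likert annotations become continuous prediction targets, here is a minimal sketch assuming the commonly described scheme: ratings of 1-5 are mapped linearly onto [0, 1] and averaged over annotators. The helper names and toy ratings are ours, not from the paper.

```python
# A minimal sketch, assuming the commonly described CompLex annotation scheme:
# each annotator rates a target word on a 5-point Likert scale
# (1 = very easy ... 5 = very difficult), ratings are mapped linearly onto
# [0, 1], and the per-word annotations are averaged into one continuous score.
def likert_to_unit_interval(rating: int) -> float:
    """Map a 1-5 Likert rating onto [0, 1]: 1 -> 0.0, 3 -> 0.5, 5 -> 1.0."""
    return (rating - 1) / 4


def complexity_score(ratings: list) -> float:
    """Average the mapped ratings of all annotators for one target word."""
    return sum(likert_to_unit_interval(r) for r in ratings) / len(ratings)


print(complexity_score([2, 3, 3, 4]))  # toy annotations -> 0.5
```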