Domain Adaptation in Multilingual and Multi-Domain Monolingual Settings
for Complex Word Identification
- URL: http://arxiv.org/abs/2205.07283v1
- Date: Sun, 15 May 2022 13:21:02 GMT
- Title: Domain Adaptation in Multilingual and Multi-Domain Monolingual Settings
for Complex Word Identification
- Authors: George-Eduard Zaharia, Răzvan-Alexandru Smădu,
Dumitru-Clementin Cercel, Mihai Dascalu
- Abstract summary: Complex word identification (CWI) is a cornerstone process towards proper text simplification.
CWI is highly dependent on context, while its difficulty is compounded by the scarcity of available datasets.
We propose a novel training technique for the CWI task based on domain adaptation to improve the target character and context representations.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Complex word identification (CWI) is a cornerstone process towards proper
text simplification. CWI is highly dependent on context, while its difficulty
is compounded by the scarcity of available datasets, which vary greatly in terms
of domains and languages. As such, it becomes increasingly difficult to
develop a robust model that generalizes across a wide array of input examples.
In this paper, we propose a novel training technique for the CWI task based on
domain adaptation to improve the target character and context representations.
This technique addresses the problem of working with multiple domains, inasmuch
as it creates a way of smoothing the differences between the explored datasets.
Moreover, we propose text simplification as an auxiliary task that complements
lexical complexity prediction. Our model obtains a boost of up to 2.42% in
terms of Pearson Correlation Coefficient over vanilla training techniques on
the CompLex dataset from the Lexical Complexity Prediction 2021 shared task.
At the same time, we obtain an increase of 3% in Pearson scores in a
cross-lingual setup relying on the Complex Word Identification 2018 dataset.
In addition, our model yields
state-of-the-art results in terms of Mean Absolute Error.
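The abstract does not spell out the adaptation mechanism. A common way to "smooth the differences" between datasets is domain-adversarial training with a gradient reversal layer: the shared encoder is pushed toward domain-invariant features while a domain classifier still learns to discriminate. The NumPy sketch below illustrates only this general pattern; the linear encoder, loss choices, and all variable names are illustrative assumptions, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one feature vector per example, a scalar complexity target,
# and a binary domain label (e.g. which CWI dataset the example came from).
X = rng.normal(size=(8, 16))                            # batch of 8 examples
y_task = rng.uniform(size=(8, 1))                       # complexity scores in [0, 1]
y_dom = rng.integers(0, 2, size=(8, 1)).astype(float)   # domain labels

W_enc = rng.normal(scale=0.1, size=(16, 8))   # shared encoder
w_task = rng.normal(scale=0.1, size=(8, 1))   # complexity regressor head
w_dom = rng.normal(scale=0.1, size=(8, 1))    # domain classifier head
lam, lr = 0.1, 0.01                           # reversal strength, step size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass.
h = X @ W_enc                   # shared representation
pred_task = h @ w_task          # complexity prediction (MSE loss)
pred_dom = sigmoid(h @ w_dom)   # domain prediction (BCE loss)

# Backward pass (manual gradients).
g_task = 2 * (pred_task - y_task) / len(X)   # dMSE / dprediction
g_dom = (pred_dom - y_dom) / len(X)          # dBCE / dlogit

grad_w_task = h.T @ g_task
grad_w_dom = h.T @ g_dom

# Gradient reversal: the encoder receives the task gradient as usual but the
# *negated* domain gradient, so it drifts toward features the domain
# classifier cannot separate, while the domain head itself trains normally.
grad_h = g_task @ w_task.T - lam * (g_dom @ w_dom.T)
grad_W_enc = X.T @ grad_h

W_enc -= lr * grad_W_enc
w_task -= lr * grad_w_task
w_dom -= lr * grad_w_dom    # no reversal for the domain head's own update
```

Note the single sign flip on the domain term in `grad_h`: that one line is the entire "adversarial" ingredient; everything else is ordinary supervised training.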
Related papers
- Split and Rephrase with Large Language Models [2.499907423888049]
The Split and Rephrase (SPRP) task consists of splitting complex sentences into a sequence of shorter grammatical sentences.
We evaluate large language models on the task, showing that they can provide large improvements over the state of the art on the main metrics.
arXiv Detail & Related papers (2023-12-18T10:16:37Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Adversarial Adaptation for French Named Entity Recognition [21.036698406367115]
We propose a Transformer-based NER approach for French, using adversarial adaptation to similar domain or general corpora.
Our approach allows learning better features using large-scale unlabeled corpora from the same domain or mixed domains.
We also show that adversarial adaptation to large-scale unlabeled corpora can help mitigate the performance dip incurred on using Transformer models pre-trained on smaller corpora.
arXiv Detail & Related papers (2023-01-12T18:58:36Z)
- ASDOT: Any-Shot Data-to-Text Generation with Pretrained Language Models [82.63962107729994]
Any-Shot Data-to-Text (ASDOT) is a new approach flexibly applicable to diverse settings.
It consists of two steps, data disambiguation and sentence fusion.
Experimental results show that ASDOT consistently achieves significant improvement over baselines.
arXiv Detail & Related papers (2022-10-09T19:17:43Z)
- Unsupervised Mismatch Localization in Cross-Modal Sequential Data [5.932046800902776]
We develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal data.
We propose a hierarchical Bayesian deep learning model, named mismatch localization variational autoencoder (ML-VAE), that decomposes the generative process of the speech into hierarchically structured latent variables.
Our experimental results show that ML-VAE successfully locates the mismatch between text and speech, without the need for human annotations.
arXiv Detail & Related papers (2022-05-05T14:23:27Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Mixed Attention Transformer for Leveraging Word-Level Knowledge to Neural Cross-Lingual Information Retrieval [15.902630454568811]
We propose a novel Mixed Attention Transformer (MAT) that incorporates external word level knowledge, such as a dictionary or translation table.
By encoding the translation knowledge into an attention matrix, the model with MAT is able to focus on the mutually translated words in the input sequence.
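The mechanism sketched above (a translation table encoded as an attention bias) can be illustrated in a few lines. The toy dictionary, the additive-bias formulation, and the bias strength below are illustrative assumptions, not the MAT paper's exact design:

```python
import numpy as np

# Toy bilingual dictionary: query-language word -> document-language word.
dictionary = {"gato": "cat", "negro": "black"}

query = ["gato", "negro"]
doc = ["the", "black", "cat"]

# Binary translation matrix: M[i, j] = 1 if query[i] translates to doc[j].
M = np.zeros((len(query), len(doc)))
for i, q in enumerate(query):
    for j, d in enumerate(doc):
        if dictionary.get(q) == d:
            M[i, j] = 1.0

# Ordinary scaled dot-product attention scores between (random) embeddings...
rng = np.random.default_rng(0)
Q = rng.normal(size=(len(query), 4))
K = rng.normal(size=(len(doc), 4))
scores = Q @ K.T / np.sqrt(4)

# ...plus the translation matrix as an additive bias, so mutually translated
# word pairs receive extra attention mass after the softmax.
alpha = 2.0  # bias strength (illustrative)
biased = scores + alpha * M
attn = np.exp(biased) / np.exp(biased).sum(axis=1, keepdims=True)
```

With this additive bias, "gato" shifts attention toward "cat" and "negro" toward "black" regardless of how uninformative the raw embedding similarities are, which is the intuition behind injecting word-level translation knowledge into cross-lingual retrieval.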
arXiv Detail & Related papers (2021-09-07T00:33:14Z)
- X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented Compositional Semantic Parsing [51.81533991497547]
Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries.
We present X2Parser, a transferable Cross-Lingual and Cross-Domain framework for TCSP.
We propose to predict flattened intents and slots representations separately and cast both prediction tasks into sequence labeling problems.
arXiv Detail & Related papers (2021-06-07T16:40:05Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.