Cross-lingual Transfer for Text Classification with Dictionary-based
Heterogeneous Graph
- URL: http://arxiv.org/abs/2109.04400v2
- Date: Fri, 10 Sep 2021 03:26:31 GMT
- Title: Cross-lingual Transfer for Text Classification with Dictionary-based
Heterogeneous Graph
- Authors: Nuttapong Chairatanakul, Noppayut Sriwatanasakdi, Nontawat
Charoenphakdee, Xin Liu, Tsuyoshi Murata
- Abstract summary: Cross-lingual text classification requires task-specific training data in high-resource source languages.
Collecting such training data can be infeasible because of the labeling cost, task characteristics, and privacy concerns.
This paper proposes an alternative solution that uses only task-independent word embeddings of high-resource languages and bilingual dictionaries.
- Score: 10.64488240379972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In cross-lingual text classification, task-specific training
data must be available in high-resource source languages, where the task is
identical to that of the low-resource target language. However, collecting
such training data can be infeasible because of labeling costs, task
characteristics, and privacy concerns. This paper proposes an alternative
solution that uses only task-independent word embeddings of high-resource
languages and bilingual dictionaries. First, we construct a dictionary-based
heterogeneous graph (DHG) from bilingual dictionaries. This opens the
possibility of using graph neural networks for cross-lingual transfer. The
remaining challenge is the heterogeneity of DHG, because multiple languages
are involved. To address this challenge, we propose the dictionary-based
heterogeneous graph neural network (DHGNet), which handles the heterogeneity
of DHG through two-step aggregation: word-level aggregation followed by
language-level aggregation. Experimental results demonstrate that our method
outperforms pretrained models even though it does not have access to large
corpora. Furthermore, it performs well even when dictionaries contain many
incorrect translations. This robustness allows the use of a wider range of
dictionaries, such as automatically constructed and crowdsourced
dictionaries, which are convenient for real-world applications.
Related papers
- Sinhala-English Parallel Word Dictionary Dataset [0.554780083433538]
We introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages.
arXiv Detail & Related papers (2023-08-04T10:21:35Z)
- Personalized Dictionary Learning for Heterogeneous Datasets [6.8438089867929905]
We introduce a relevant yet challenging problem named Personalized Dictionary Learning (PerDL)
The goal is to learn sparse linear representations from heterogeneous datasets that share some commonality.
In PerDL, we model each dataset's shared and unique features as global and local dictionaries.
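As a rough illustration of that global/local split (hypothetical names and data; the actual PerDL formulation enforces sparsity, which the plain least squares below does not):

```python
# Illustrative sketch of coding one dataset's signal over a shared global
# dictionary stacked with a dataset-specific local dictionary.
import numpy as np

rng = np.random.default_rng(1)
d, m_g, m_l = 8, 5, 3                       # signal dim, global atoms, local atoms
D_global = rng.standard_normal((d, m_g))    # shared across datasets
D_local = rng.standard_normal((d, m_l))     # specific to this dataset
y = rng.standard_normal(d)                  # a signal from this dataset

D = np.hstack([D_global, D_local])          # code over [D_global | D_local]
coeffs, *_ = np.linalg.lstsq(D, y, rcond=None)
print("global coefficients:", coeffs[:m_g])
print("local coefficients: ", coeffs[m_g:])
```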
arXiv Detail & Related papers (2023-05-24T16:31:30Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and to label the raw target sentence.
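A toy sketch of marker-based label projection in this spirit; the tag format, the `insert_markers`/`extract_spans` helpers, and the identity "translation" are illustrative assumptions, not CROP's exact scheme:

```python
# Sketch of projecting entity labels through a tagged-sequence translation.
import re

def insert_markers(tokens, spans):
    """Wrap labeled spans, e.g. [('PER', 0, 2)], in inline tags."""
    out = list(tokens)
    for label, start, end in sorted(spans, key=lambda s: -s[1]):
        out.insert(end, f"</{label}>")
        out.insert(start, f"<{label}>")
    return " ".join(out)

def extract_spans(marked_text):
    """Recover (label, surface) pairs from a marked translation."""
    return re.findall(r"<(\w+)>\s*(.*?)\s*</\1>", marked_text)

marked = insert_markers(["Marie", "Curie", "lived", "in", "Paris"],
                        [("PER", 0, 2), ("LOC", 4, 5)])
# A multilingual translation model would map `marked` to the target
# language while keeping the tags; here we reuse it unchanged.
print(extract_spans(marked))  # [('PER', 'Marie Curie'), ('LOC', 'Paris')]
```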
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- DICTDIS: Dictionary Constrained Disambiguation for Improved NMT [50.888881348723295]
We present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries.
We demonstrate the utility of DictDis via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering.
arXiv Detail & Related papers (2022-10-13T13:04:16Z)
- Dict-BERT: Enhancing Language Model Pre-training with Dictionary [42.0998323292348]
Pre-trained language models (PLMs) aim to learn universal language representations by conducting self-supervised training tasks on large-scale corpora.
In this work, we focus on enhancing language model pre-training by leveraging definitions of rare words in dictionaries.
We propose two novel self-supervised pre-training tasks on word and sentence-level alignment between input text sequence and rare word definitions.
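The input-construction side of that idea can be sketched as follows (the toy dictionary, the `[SEP]` convention, and the rare-word lookup are assumptions; the paper's two alignment objectives are not implemented here):

```python
# Sketch of appending dictionary definitions of rare words to the input,
# so pre-training tasks can align text with the definitions.
RARE_VOCAB_DEFS = {
    "serendipity": "the occurrence of events by chance in a happy way",
}

def augment_with_definitions(tokens, defs=RARE_VOCAB_DEFS):
    """Return the text plus appended definitions of any rare words found."""
    appended = [f"{w} : {defs[w]}" for w in tokens if w in defs]
    suffix = " [SEP] " + " [SEP] ".join(appended) if appended else ""
    return " ".join(tokens) + suffix

print(augment_with_definitions(["pure", "serendipity", "led", "us", "here"]))
```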
arXiv Detail & Related papers (2021-10-13T04:29:14Z)
- Cross-lingual Text Classification with Heterogeneous Graph Neural Network [2.6936806968297913]
Cross-lingual text classification aims at training a classifier on the source language and transferring the knowledge to target languages.
Recent multilingual pretrained language models (mPLM) achieve impressive results in cross-lingual classification tasks.
We propose a simple yet effective method to incorporate heterogeneous information within and across languages for cross-lingual text classification.
arXiv Detail & Related papers (2021-05-24T12:45:42Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
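A minimal sketch of the underlying min-max loop on a toy logistic model, assuming an FGSM-style inner step; all values are illustrative, and this is not the paper's semi-supervised procedure:

```python
# Adversarial training sketch: the inner step perturbs the input in the
# direction that increases the loss; the outer step trains on that input.
import numpy as np

def loss_and_grads(w, x, y):
    p = 1.0 / (1.0 + np.exp(-w @ x))          # sigmoid prediction
    p = np.clip(p, 1e-12, 1.0 - 1e-12)        # numerical safety
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss, (p - y) * x, (p - y) * w     # loss, dL/dw, dL/dx

rng = np.random.default_rng(2)
w, x = rng.standard_normal(4), rng.standard_normal(4)
y, eps, lr = 1.0, 0.1, 0.5
for _ in range(20):
    _, _, gx = loss_and_grads(w, x, y)
    x_adv = x + eps * np.sign(gx)             # inner max: worst-case perturbation
    _, gw, _ = loss_and_grads(w, x_adv, y)
    w -= lr * gw                              # outer min: update on adversarial input
print("final adversarial loss:", loss_and_grads(w, x_adv, y)[0])
```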
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Classification of Chinese Handwritten Numbers with Labeled Projective Dictionary Pair Learning [1.8594711725515674]
We design class-specific dictionaries incorporating three factors: discriminability, sparsity and classification error.
We adopt a new feature space, i.e., histogram of oriented gradients (HOG), to generate the dictionary atoms.
Results demonstrated enhanced classification performance $(\sim 98\%)$ compared to state-of-the-art deep learning techniques.
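As a sketch of the feature side mentioned above, HOG descriptors can be computed with scikit-image; the parameters below are common defaults, not necessarily the paper's:

```python
# Sketch of generating HOG feature vectors from digit images; class-specific
# dictionaries would then be learned over these vectors, not raw pixels.
import numpy as np
from skimage.feature import hog

rng = np.random.default_rng(3)
digits = rng.random((10, 32, 32))  # stand-in for handwritten digit images

features = np.stack([
    hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    for img in digits
])
print(features.shape)  # one HOG descriptor per image
```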
arXiv Detail & Related papers (2020-03-26T01:43:59Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages with different word orders.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)