Constructing Cross-lingual Consumer Health Vocabulary with Word-Embedding from Comparable User Generated Content
- URL: http://arxiv.org/abs/2206.11612v2
- Date: Mon, 1 Apr 2024 18:18:51 GMT
- Title: Constructing Cross-lingual Consumer Health Vocabulary with Word-Embedding from Comparable User Generated Content
- Authors: Chia-Hsuan Chang, Lei Wang, Christopher C. Yang,
- Abstract summary: The open-access and collaborative consumer health vocabulary (OAC CHV) is the controlled vocabulary for addressing such a challenge.
This research proposes a cross-lingual automatic term recognition framework for extending the English CHV into a cross-lingual one.
- Score: 2.4316589174722485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The online health community (OHC) is the primary channel for laypeople to share health information. To analyze the health consumer-generated content (HCGC) from the OHCs, identifying the colloquial medical expressions used by laypeople is a critical challenge. The open-access and collaborative consumer health vocabulary (OAC CHV) is the controlled vocabulary for addressing such a challenge. Nevertheless, OAC CHV is only available in English, limiting its applicability to other languages. This research proposes a cross-lingual automatic term recognition framework for extending the English CHV into a cross-lingual one. Our framework requires an English HCGC corpus and a non-English (i.e., Chinese in this study) HCGC corpus as inputs. Two monolingual word vector spaces are determined using the skip-gram algorithm so that each space encodes common word associations from laypeople within a language. Based on the isometry assumption, the framework aligns two monolingual spaces into a bilingual word vector space, where we employ cosine similarity as a metric for identifying semantically similar words across languages. The experimental results demonstrate that our framework outperforms the other two large language models in identifying CHV across languages. Our framework only requires raw HCGC corpora and a limited size of medical translations, reducing human efforts in compiling cross-lingual CHV.
Related papers
- Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment [50.80949663719335]
multilingual sentence encoders (MSEs) are commonly obtained by training multilingual language models to map sentences from different languages into a shared semantic space.<n>MSEs are subject to curse of multilinguality, a loss of monolingual representational accuracy due to parameter sharing.<n>We train the cross-lingual adapters with two different types of data to resolve the conflicting requirements of different cross-lingual tasks.
arXiv Detail & Related papers (2024-07-20T13:56:39Z) - Grammatical Error Correction for Code-Switched Sentences by Learners of English [5.653145656597412]
We conduct the first exploration into the use of Grammar Error Correction systems on CSW text.
We generate synthetic CSW GEC datasets by translating different spans of text within existing GEC corpora.
We then investigate different methods of selecting these spans based on CSW ratio, switch-point factor and linguistic constraints.
Our best model achieves an average increase of 1.57 $F_0.5$ across 3 CSW test sets without affecting the model's performance on a monolingual dataset.
arXiv Detail & Related papers (2024-04-18T20:05:30Z) - Evolution of Efficient Symbolic Communication Codes [0.0]
The paper explores how the human natural language structure can be seen as a product of evolution of inter-personal communication code.
It aims to maximise such culture-agnostic and cross-lingual metrics such as anti-entropy, compression factor and cross-split F1 score.
arXiv Detail & Related papers (2023-06-04T15:33:16Z) - Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence
Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z) - Detecting Cross-Language Plagiarism using Open Knowledge Graphs [7.378348990383349]
We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis.
CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata.
It reliably disambiguates homonyms and scales to allow its application to Web-scale document collections.
arXiv Detail & Related papers (2021-11-18T15:23:27Z) - Monolingual and Cross-Lingual Acceptability Judgments with the Italian
CoLA corpus [2.418273287232718]
We describe the ItaCoLA corpus, containing almost 10,000 sentences with acceptability judgments.
We also present the first cross-lingual experiments, aimed at assessing whether multilingual transformerbased approaches can benefit from using sentences in two languages during fine-tuning.
arXiv Detail & Related papers (2021-09-24T16:18:53Z) - A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - VECO: Variable and Flexible Cross-lingual Pre-training for Language
Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Investigating Language Impact in Bilingual Approaches for Computational
Language Documentation [28.838960956506018]
This paper investigates how the choice of translation language affects the posterior documentation work.
We create 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment.
Our results suggest that incorporating clues into the neural models' input representation increases their translation and alignment quality.
arXiv Detail & Related papers (2020-03-30T10:30:34Z) - Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves crosslingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.