MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain
Acronym Extraction
- URL: http://arxiv.org/abs/2202.09694v1
- Date: Sat, 19 Feb 2022 23:08:38 GMT
- Title: MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain
Acronym Extraction
- Authors: Amir Pouran Ben Veyseh, Nicole Meister, Seunghyun Yoon, Rajiv Jain,
Franck Dernoncourt, Thien Huu Nguyen
- Abstract summary: Identifying acronyms and their expanded forms is necessary for various NLP applications.
One limitation of existing acronym extraction (AE) research is that it is limited to the English language and certain domains.
The lack of annotated datasets in multiple languages and domains has been a major obstacle to research in this area.
- Score: 66.60031336330547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Acronym extraction (AE) is the task of identifying acronyms and their expanded
forms in text, which is necessary for various NLP applications. Despite major
progress on this task in recent years, one limitation of existing AE research
is that it is limited to the English language and certain domains (i.e.,
scientific and biomedical). As such, the challenges of AE in other languages and
domains are mainly unexplored. The lack of annotated datasets in multiple languages
and domains has been a major obstacle to research in this area. To address
this limitation, we propose a new dataset for multilingual multi-domain AE.
Specifically, 27,200 sentences in 6 typologically different languages and 2
domains, i.e., Legal and Scientific, are manually annotated for AE. Our
extensive experiments on the proposed dataset show that AE in different
languages and different learning settings has unique challenges, emphasizing
the necessity of further research on multilingual and multi-domain AE.
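To make the task definition concrete, here is a minimal sketch of what an AE system returns for one sentence: pairs of an acronym and its expanded form. It is a simple parenthesis-and-initials heuristic for English, not the models or annotation procedure used in the paper, and the names `PAREN_ACRONYM` and `extract_acronyms` are hypothetical.

```python
import re

# Hypothetical heuristic, not the paper's method: look for "long form (ACRONYM)".
PAREN_ACRONYM = re.compile(r"\(([A-Z][A-Za-z]{1,9})\)")


def extract_acronyms(sentence):
    """Return (acronym, expansion) pairs found with a parenthesis heuristic."""
    pairs = []
    for match in PAREN_ACRONYM.finditer(sentence):
        acronym = match.group(1)
        # Words that precede the opening parenthesis.
        words = sentence[:match.start()].split()
        # Candidate expansion: one preceding word per acronym letter.
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        # Accept only if the word initials spell out the acronym.
        if len(candidate) == len(acronym) and initials == acronym.upper():
            pairs.append((acronym, " ".join(candidate)))
    return pairs


if __name__ == "__main__":
    text = "Acronym extraction (AE) is studied in natural language processing (NLP)."
    print(extract_acronyms(text))
    # [('AE', 'Acronym extraction'), ('NLP', 'natural language processing')]
```

A rule like this depends on English capitalization and word order, which gives one plausible reason why AE in other languages and domains may need dedicated annotated data.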
Related papers
- LexGen: Domain-aware Multilingual Lexicon Generation [40.97738267067852]
We propose a new model to generate dictionary words for 6 Indian languages in the multi-domain setting.
Our model consists of domain-specific and domain-generic layers that encode information.
We release a new benchmark dataset across 6 Indian languages that span 8 diverse domains.
arXiv Detail & Related papers (2024-05-18T07:02:43Z)
- MEMD-ABSA: A Multi-Element Multi-Domain Dataset for Aspect-Based Sentiment Analysis [23.959356414518957]
We propose a large-scale Multi-Element Multi-Domain dataset (MEMD) that covers the four elements across five domains.
We evaluate generative and non-generative baselines on multiple ABSA subtasks under the open domain setting.
arXiv Detail & Related papers (2023-06-29T14:03:49Z)
- Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey [100.24095818099522]
Large language models (LLMs) have significantly advanced the field of natural language processing (NLP).
They provide a highly useful, task-agnostic foundation for a wide range of applications.
However, directly applying LLMs to solve sophisticated problems in specific domains meets many hurdles.
arXiv Detail & Related papers (2023-05-30T03:00:30Z)
- MINION: a Large-Scale and Diverse Dataset for Multilingual Event Detection [65.46122357928041]
Event Detection (ED) is the task of identifying and classifying trigger words of event mentions in text.
Main questions include how well existing ED models perform on different languages, how challenging ED is in other languages, and how well ED knowledge and annotation can be transferred across languages.
We introduce a new large-scale multilingual dataset for ED (called MINION) that consistently annotates events for 8 different languages.
arXiv Detail & Related papers (2022-11-11T02:09:51Z)
- Crossing the Conversational Chasm: A Primer on Multilingual Task-Oriented Dialogue Systems [51.328224222640614]
Current state-of-the-art ToD models based on large pretrained neural language models are data hungry.
Data acquisition for ToD use cases is expensive and tedious.
arXiv Detail & Related papers (2021-04-17T15:19:56Z)
- FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones.
We conduct quantitative experiments and in-depth analyses in this new setting, which benchmarks the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z)
- What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation [74.42107665213909]
Acronyms are the short forms of phrases that facilitate conveying lengthy sentences in documents and serve as one of the mainstays of writing.
Due to their importance, identifying acronyms and corresponding phrases (AI) and finding the correct meaning of each acronym (i.e., acronym disambiguation (AD)) are crucial for text understanding.
Despite the recent progress on this task, there are some limitations in the existing datasets which hinder further improvement.
arXiv Detail & Related papers (2020-10-28T00:12:36Z)