Detecting Multiword Expression Type Helps Lexical Complexity Assessment
- URL: http://arxiv.org/abs/2005.05692v1
- Date: Tue, 12 May 2020 11:25:07 GMT
- Title: Detecting Multiword Expression Type Helps Lexical Complexity Assessment
- Authors: Ekaterina Kochmar, Sian Gooding, and Matthew Shardlow
- Abstract summary: Multiword expressions (MWEs) represent lexemes that should be treated as single lexical units due to their idiosyncratic nature.
Multiple NLP applications have been shown to benefit from MWE identification; however, research on the lexical complexity of MWEs is still an under-explored area.
- Score: 11.347177310504737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiword expressions (MWEs) represent lexemes that should be treated as
single lexical units due to their idiosyncratic nature. Multiple NLP
applications have been shown to benefit from MWE identification; however, the
research on lexical complexity of MWEs is still an under-explored area. In this
work, we re-annotate the Complex Word Identification Shared Task 2018 dataset
of Yimam et al. (2017), which provides complexity scores for a range of
lexemes, with the types of MWEs. We release the MWE-annotated dataset with this
paper, and we believe this dataset represents a valuable resource for the text
simplification community. In addition, we investigate which types of
expressions are most problematic for native and non-native readers. Finally, we
show that a lexical complexity assessment system benefits from the information
about MWE types.
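The abstract does not spell out how MWE type information is fed into the complexity system, but the general idea can be illustrated with a minimal sketch: treat the MWE type as one more categorical feature alongside simple surface features of the target phrase. The feature names, type labels, and toy data below are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch: adding a (hypothetical) MWE-type label as a categorical
# feature in a lexical complexity regressor. Not the authors' actual system.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

def featurize(target, mwe_type):
    """Simple surface features plus the MWE type label (assumed annotation)."""
    tokens = target.split()
    return {
        "length_chars": len(target),     # longer targets tend to be harder
        "num_tokens": len(tokens),       # single word vs. multiword expression
        "mwe_type": mwe_type or "NONE",  # categorical MWE type, e.g. "idiom"
    }

# Toy training data: (target phrase, MWE type, gold complexity score in [0, 1]).
train = [
    ("kick the bucket", "idiom", 0.8),
    ("take off", "verb-particle", 0.4),
    ("apple", None, 0.1),
]

X = [featurize(t, m) for t, m, _ in train]
y = [score for _, _, score in train]

# DictVectorizer one-hot encodes the string-valued MWE type feature.
model = make_pipeline(DictVectorizer(sparse=False), Ridge(alpha=1.0))
model.fit(X, y)

# Predict complexity for an unseen multiword expression.
print(model.predict([featurize("make up", "verb-particle")]))
```

Whether such a feature helps in practice depends on the learner and the rest of the feature set; the paper establishes the benefit empirically on the re-annotated CWI 2018 data.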
Related papers
- MINERS: Multilingual Language Models as Semantic Retrievers [23.686762008696547]
This paper introduces MINERS, a benchmark designed to evaluate the ability of multilingual language models in semantic retrieval tasks.
We create a comprehensive framework to assess the robustness of LMs in retrieving samples across over 200 diverse languages.
Our results demonstrate that retrieving semantically similar embeddings alone yields performance competitive with state-of-the-art approaches.
arXiv Detail & Related papers (2024-06-11T16:26:18Z) - TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
TACT contains challenging instructions that demand stitching information scattered across one or more texts.
We construct this dataset by leveraging an existing dataset of texts and their associated tables.
We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z) - Analyzing the Role of Semantic Representations in the Era of Large Language Models [104.18157036880287]
We investigate the role of semantic representations in the era of large language models (LLMs).
We propose an AMR-driven chain-of-thought prompting method, which we call AMRCoT.
We find that it is difficult to predict which input examples AMR may help or hurt on, but errors tend to arise with multi-word expressions.
arXiv Detail & Related papers (2024-05-02T17:32:59Z) - A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, liberating the model from its reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z) - Extracting Polymer Nanocomposite Samples from Full-Length Documents [6.25070848511355]
This paper investigates the use of large language models (LLMs) for extracting sample lists of polymer nanocomposites (PNCs) from full-length materials science research papers.
The challenge lies in the complex nature of PNC samples, which have numerous attributes scattered throughout the text.
arXiv Detail & Related papers (2024-03-01T03:51:56Z) - SEMQA: Semi-Extractive Multi-Source Question Answering [94.04430035121136]
We introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion.
We create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions.
arXiv Detail & Related papers (2023-11-08T18:46:32Z) - Not Enough Labeled Data? Just Add Semantics: A Data-Efficient Method for Inferring Online Health Texts [0.0]
We employ Abstract Meaning Representation (AMR) graphs to model low-resource health NLP tasks.
AMRs are well suited to model online health texts as they represent multi-sentence inputs, abstract away from complex terminology, and model long-distance relationships.
Our experiments show that we can improve performance on 6 low-resource health NLP tasks by augmenting text embeddings with semantic graph embeddings.
arXiv Detail & Related papers (2023-09-18T15:37:30Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z) - Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both older and the most recent language models.
We show that the already competitive results achieved by SOTA LMs/MLMs can be substantially improved if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z) - Predicting Lexical Complexity in English Texts [6.556254680121433]
The first step in most text simplification approaches is to predict which words are considered complex for a given target population.
This task is commonly referred to as Complex Word Identification (CWI) and it is often modelled as a supervised classification problem.
Training such systems requires annotated datasets in which words, and sometimes multi-word expressions, are labelled for complexity.
arXiv Detail & Related papers (2021-02-17T14:05:30Z) - Zero-Shot Clinical Acronym Expansion via Latent Meaning Cells [2.5374060352463697]
We introduce Latent Meaning Cells, a deep latent variable model which learns contextualized representations of words by combining local lexical context and metadata.
We evaluate the model on the task of zero-shot clinical acronym expansion across three datasets.
arXiv Detail & Related papers (2020-09-29T00:28:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.