CoAM: Corpus of All-Type Multiword Expressions
- URL: http://arxiv.org/abs/2412.18151v1
- Date: Tue, 24 Dec 2024 04:09:33 GMT
- Title: CoAM: Corpus of All-Type Multiword Expressions
- Authors: Yusuke Ide, Joshua Tanner, Adam Nohejl, Jacob Hoffman, Justin Vasselli, Hidetaka Kamigaito, Taro Watanabe
- Abstract summary: Multiword expressions (MWEs) refer to idiomatic sequences of multiple words.
Existing datasets for MWE identification are inconsistently annotated, limited to a single type of MWE, or limited in size.
CoAM is a dataset of 1.3K sentences constructed through a multi-step process to enhance data quality.
- Score: 21.451123924562598
- Abstract: Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation. Existing datasets for MWE identification are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences constructed through a multi-step process of human annotation, human review, and automated consistency checking designed to enhance data quality. MWEs in CoAM are tagged with MWE types, such as Noun and Verb, to enable fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form, including discontinuous ones. Through experiments using CoAM, we find that a fine-tuned large language model outperforms the current state-of-the-art approach for MWE identification. Furthermore, analysis using our MWE-type-tagged data reveals that Verb MWEs are easier to identify than Noun MWEs across approaches.
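To make the task concrete, here is a minimal sketch of MWE identification framed as BIO token tagging with a pretrained transformer; the backbone model and three-label scheme are illustrative assumptions, not the exact setup evaluated in the paper.

```python
# Minimal sketch: MWE identification as BIO token tagging.
# Backbone and label set are illustrative assumptions, not the
# CoAM paper's exact setup.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-MWE", "I-MWE"]  # outside / begins / continues an MWE
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

sentence = "She decided to give up sugar once and for all."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, num_labels)

predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, label_id in zip(tokens, predictions):
    print(f"{token}\t{labels[label_id]}")  # head is untrained here: labels are random
```

Note that plain BIO tags cannot represent discontinuous MWEs (e.g., "give sugar up"), which is one reason the annotation interface described above supports MWEs in any form.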
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
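As a rough illustration of the Matryoshka-style compression in the entry above, the sketch below pools one visual token sequence down to several nested budgets; the shapes, pooling operator, and budgets are assumptions for illustration, not the MME architecture.

```python
# Sketch: compress a visual token sequence to multiple granularities,
# Matryoshka-style, so one encoder yields embeddings of several sizes.
# Shapes and the average-pooling operator are illustrative assumptions.
import torch
import torch.nn.functional as F

def compress_visual_tokens(tokens: torch.Tensor, num_kept: int) -> torch.Tensor:
    """Pool (batch, n_tokens, dim) down to (batch, num_kept, dim)."""
    pooled = F.adaptive_avg_pool1d(tokens.transpose(1, 2), num_kept)
    return pooled.transpose(1, 2)

visual_tokens = torch.randn(2, 576, 768)  # e.g., a ViT patch-token sequence
for budget in (1, 4, 16, 64):             # nested "Matryoshka" budgets
    compressed = compress_visual_tokens(visual_tokens, budget)
    print(budget, tuple(compressed.shape))
```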
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, adapts a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to decompose text into multiple concepts for multilingual semantic matching, freeing the model from its reliance on NER models.
We conduct comprehensive experiments on the English datasets QQP and MRPC, and the Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- MWE as WSD: Solving Multiword Expression Identification with Word Sense Disambiguation [0.0]
Recent approaches to word sense disambiguation (WSD) utilize encodings of the sense gloss (definition) to improve performance.
In this work we demonstrate that this approach can be adapted for use in multiword expression (MWE) identification by training models which use gloss and context information.
Our approach substantially improves precision, outperforming the state-of-the-art in MWE identification on the DiMSUM dataset by up to 1.9 F1 points and achieving competitive results on the PARSEME 1.1 English dataset.
arXiv Detail & Related papers (2023-03-12T09:35:42Z)
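The gloss-plus-context idea in the MWE-as-WSD entry above can be sketched as a bi-encoder that scores a candidate expression in context against a dictionary gloss; the backbone, mean pooling, and cosine scoring are illustrative assumptions, not the paper's architecture.

```python
# Sketch: score a candidate MWE in context against its gloss with a
# bi-encoder. Backbone, pooling, and scoring are illustrative
# assumptions, not the paper's exact architecture.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)  # mean-pooled sentence vector

context = "He finally kicked the bucket after a long illness."
gloss = "kick the bucket: to die (idiomatic)."
score = torch.cosine_similarity(embed(context), embed(gloss), dim=0)
print(f"gloss-context similarity: {score.item():.3f}")  # high score -> tag as MWE
```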
- BERT(s) to Detect Multiword Expressions [9.710464466895521]
Multiword expressions (MWEs) are groups of words in which the meaning of the whole is not derived from the meanings of its parts.
In this paper, we explore state-of-the-art neural transformers in the task of detecting MWEs.
arXiv Detail & Related papers (2022-08-16T16:32:23Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged collocation tokens have topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
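The collocation tokenization step described in the entry above can be sketched with gensim's Phrases model: word pairs that co-occur unusually often are merged into single tokens before topic modeling. The toy corpus and thresholds below are assumptions for illustration.

```python
# Sketch: merge frequent collocations into single tokens so a topic
# model treats them as one unit. Toy corpus and thresholds are
# illustrative assumptions.
from gensim.models.phrases import Phrases

docs = [
    ["latent", "dirichlet", "allocation", "finds", "topics"],
    ["we", "train", "latent", "dirichlet", "allocation", "on", "text"],
    ["latent", "dirichlet", "allocation", "is", "a", "topic", "model"],
]
bigrams = Phrases(docs, min_count=1, threshold=1)  # permissive for the toy corpus
merged_docs = [bigrams[doc] for doc in docs]
print(merged_docs[0])  # e.g., ['latent_dirichlet', 'allocation', 'finds', 'topics']
```

The merged documents can then replace the raw unigram corpus as input to a topic model such as gensim's LdaModel.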
- Ultra-Fine Entity Typing with Weak Supervision from a Masked Language Model [39.031515304057585]
Recent work extends fine-grained entity typing with a richer, ultra-fine set of types.
We propose to obtain training data for ultra-fine entity typing by using a BERT masked language model (MLM).
Given a mention in a sentence, our approach constructs an input for BERT so that it predicts context-dependent hypernyms of the mention, which can be used as type labels.
arXiv Detail & Related papers (2021-06-08T04:43:28Z)
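The masked-hypernym construction in the entry above can be sketched with a fill-mask pipeline: a pattern is inserted after the mention so that BERT's [MASK] prediction is a plausible hypernym, usable as a type label. The specific "and other [MASK]" pattern is assumed here for illustration; the paper engineers its own templates.

```python
# Sketch: elicit hypernyms of a mention from a BERT masked language
# model. The inserted pattern is an illustrative assumption, not the
# paper's exact template.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# Original sentence: "He repaired the violin in his workshop."
prompt = "He repaired the violin and other [MASK] in his workshop."
for candidate in fill_mask(prompt, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
# expect hypernym-like fillers such as "instruments" or "items"
```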
- Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the multilingual entity linking (MEL) problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
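The token-by-token generation described in the mGENRE entry above looks roughly as follows; the checkpoint name and the [START]/[END] mention markup are assumed to match the authors' public release, so treat this as a sketch rather than verified usage.

```python
# Sketch: autoregressive entity linking. The model generates the
# canonical entity name with beam search. Checkpoint and mention
# markup are assumed from the public release.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/mgenre-wl")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wl")

text = "[START] Einstein [END] era un fisico tedesco."  # mention is marked
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, num_return_sequences=5)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# e.g., ['Albert Einstein >> it', ...]: entity name plus language code
```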
- MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and to make low-confidence predictions when there is not enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z)
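The two behaviors MASKER encourages, reconstructing keywords and staying uncertain without context, can be sketched as an auxiliary loss that pushes predictions on context-deprived input toward the uniform distribution. The KL-to-uniform term below is a simplified stand-in for the paper's exact losses.

```python
# Sketch: MASKER-style low-confidence regularization. Alongside the
# usual classification loss, predictions on keyword-only (context-
# removed) input are pushed toward uniform. Simplified stand-in for
# the paper's exact formulation.
import torch
import torch.nn.functional as F

def low_confidence_loss(logits: torch.Tensor) -> torch.Tensor:
    """KL(uniform || p) per example: zero when the prediction is uniform."""
    log_probs = F.log_softmax(logits, dim=-1)
    num_classes = logits.size(-1)
    kl = -log_probs.mean(dim=-1) - torch.log(torch.tensor(float(num_classes)))
    return kl.mean()

# Illustrative step with random tensors standing in for model outputs:
logits_full = torch.randn(8, 4)      # model(full document)
logits_keywords = torch.randn(8, 4)  # model(keywords only, context removed)
targets = torch.randint(0, 4, (8,))
loss = F.cross_entropy(logits_full, targets) + 0.1 * low_confidence_loss(logits_keywords)
print(loss.item())
```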
- Detecting Multiword Expression Type Helps Lexical Complexity Assessment [11.347177310504737]
Multiword expressions (MWEs) represent lexemes that should be treated as single lexical units due to their idiosyncratic nature.
Multiple NLP applications have been shown to benefit from MWE identification; however, research on the lexical complexity of MWEs remains underexplored.
arXiv Detail & Related papers (2020-05-12T11:25:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.