COMPILING: A Benchmark Dataset for Chinese Complexity Controllable
Definition Generation
- URL: http://arxiv.org/abs/2209.14614v1
- Date: Thu, 29 Sep 2022 08:17:53 GMT
- Title: COMPILING: A Benchmark Dataset for Chinese Complexity Controllable
Definition Generation
- Authors: Jiaxin Yuan, Cunliang Kong, Chenhui Xie, Liner Yang, Erhong Yang
- Abstract summary: This paper proposes a novel task of generating definitions for a word with controllable complexity levels.
We introduce COMPILING, a dataset providing detailed information about Chinese definitions, in which each definition is labeled with its complexity level.
- Score: 2.935516292500541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The definition generation task aims to generate a word's definition within a
specific context automatically. However, owing to the lack of datasets for
different complexities, the definitions produced by models tend to keep the
same complexity level. This paper proposes a novel task of generating
definitions for a word with controllable complexity levels. Correspondingly, we
introduce COMPILING, a dataset providing detailed information about Chinese
definitions, in which each definition is labeled with its complexity level. The
COMPILING dataset includes 74,303 words and 106,882 definitions. To the best of
our knowledge, it is the largest dataset of the Chinese definition generation
task. We select various representative generation methods as baselines for this
task and conduct evaluations, which illustrate that our dataset effectively
assists models in generating different complexity-level
definitions. We believe that the COMPILING dataset will benefit further
research in complexity controllable definition generation.
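As an illustration of how complexity-controllable generation is commonly set up, the sketch below formats training pairs with a complexity control token prepended to the source sequence. This is a hypothetical example: the token names, field names, and the sample word/definitions are illustrative assumptions, not taken from the COMPILING dataset or the paper's actual method.

```python
# Hypothetical sketch: formatting examples for complexity-controllable
# definition generation with control tokens, a common recipe for
# controllable seq2seq models. All names here are illustrative.

COMPLEXITY_TOKENS = {1: "<simple>", 2: "<medium>", 3: "<complex>"}

def build_input(word: str, context: str, level: int) -> str:
    """Prefix the source sequence with a complexity control token so a
    decoder can condition its output on the requested level."""
    token = COMPLEXITY_TOKENS[level]
    return f"{token} word: {word} context: {context}"

# One word can map to several target definitions, one per complexity level.
example = {
    "word": "photosynthesis",
    "context": "Plants carry out photosynthesis in their leaves.",
    "definitions": {
        1: "how plants make food from sunlight",
        3: "the process by which green plants synthesize carbohydrates "
           "from carbon dioxide and water using light energy",
    },
}

# Build (source, target) training pairs, one per labeled complexity level.
pairs = [
    (build_input(example["word"], example["context"], level), target)
    for level, target in example["definitions"].items()
]
for src, tgt in pairs:
    print(src, "->", tgt)
```

At inference time, switching the control token would request a definition at a different complexity level from the same trained model.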
Related papers
- Benchmarking Complex Instruction-Following with Multiple Constraints Composition [72.82640456309821]
How to evaluate the complex-instruction-following ability of large language models (LLMs) has become a critical research problem.
Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints.
We propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints.
arXiv Detail & Related papers (2024-07-04T14:50:45Z)
- A General Model for Aggregating Annotations Across Simple, Complex, and Multi-Object Annotation Tasks [51.14185612418977]
A strategy to improve label quality is to ask multiple annotators to label the same item and aggregate their labels.
While a variety of bespoke models have been proposed for specific tasks, our work is the first to introduce aggregation methods that generalize across many diverse complex tasks.
This article extends our prior work with investigation of three new research questions.
arXiv Detail & Related papers (2023-12-20T21:28:35Z)
- Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
- DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners [5.256237513030104]
The DetermiNet dataset comprises 250,000 synthetically generated images and captions based on 25 determiners.
The task is to predict bounding boxes to identify objects of interest, constrained by the semantics of the given determiner.
We find that current state-of-the-art visual grounding models do not perform well on the dataset.
arXiv Detail & Related papers (2023-09-07T05:13:52Z)
- Thinking Like an Annotator: Generation of Dataset Labeling Instructions [59.603239753484345]
We introduce a new task, Labeling Instruction Generation, to address missing publicly available labeling instructions.
We take a reasonably annotated dataset and: 1) generate a set of examples that are visually representative of each category in the dataset; 2) provide a text label that corresponds to each of the examples.
This framework acts as a proxy to human annotators that can help to both generate a final labeling instruction set and evaluate its quality.
arXiv Detail & Related papers (2023-06-24T18:32:48Z)
- Assisting Language Learners: Automated Trans-Lingual Definition Generation via Contrastive Prompt Learning [25.851611353632926]
The standard definition generation task requires automatically producing monolingual definitions.
We propose a novel task of Trans-Lingual Definition Generation (TLDG), which aims to generate definitions in another language.
arXiv Detail & Related papers (2023-06-09T17:32:45Z)
- Deep Sequence Models for Text Classification Tasks [0.007329200485567826]
Natural Language Processing (NLP) equips machines to understand diverse and complicated human languages.
Common text classification applications include information retrieval, news topic modeling, theme extraction, sentiment analysis, and spam detection.
Sequence models such as RNNs, GRUs, and LSTMs are a breakthrough for tasks with long-range dependencies.
Results were strong, with most models performing in the range of 80% to 94%.
arXiv Detail & Related papers (2022-07-18T18:47:18Z)
- Multitasking Framework for Unsupervised Simple Definition Generation [5.2221935174520056]
We propose a novel task of Simple Definition Generation to help language learners and low literacy readers.
A significant challenge of this task is the lack of learners' dictionaries in many languages.
We propose a multitasking framework SimpDefiner that only requires a standard dictionary with complex definitions and a corpus containing arbitrary simple texts.
arXiv Detail & Related papers (2022-03-24T08:16:04Z)
- Data-to-text Generation with Variational Sequential Planning [74.3955521225497]
We consider the task of data-to-text generation, which aims to create textual output from non-linguistic input.
We propose a neural model enhanced with a planning component responsible for organizing high-level information in a coherent and meaningful way.
We infer latent plans sequentially with a structured variational model, while interleaving the steps of planning and generation.
arXiv Detail & Related papers (2022-02-28T13:17:59Z)
- CDM: Combining Extraction and Generation for Definition Modeling [8.487707405248242]
We propose to combine extraction and generation for definition modeling.
The method first extracts self- and correlative definitional information about target terms from the Web.
It then generates the final definitions by incorporating the extracted definitional information.
arXiv Detail & Related papers (2021-11-14T08:03:18Z)
- Structured Prediction as Translation between Augmented Natural Languages [109.50236248762877]
We propose a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks.
Instead of tackling the problem by training task-specific discriminative models, we frame it as a translation task between augmented natural languages.
Our approach can match or outperform task-specific models on all tasks, and in particular, achieves new state-of-the-art results on joint entity and relation extraction.
arXiv Detail & Related papers (2021-01-14T18:32:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content presented here (including all information) and is not responsible for any consequences.