MGen: Millions of Naturally Occurring Generics in Context
- URL: http://arxiv.org/abs/2509.26160v1
- Date: Tue, 30 Sep 2025 12:13:51 GMT
- Title: MGen: Millions of Naturally Occurring Generics in Context
- Authors: Gustavo Cilleruelo, Emily Allaway, Barry Haddow, Alexandra Birch
- Abstract summary: MGen is a dataset of over 4 million naturally occurring generic and quantified sentences. We analyze the features of generics sentences in the dataset, with interesting insights. MGen is the biggest and most diverse dataset of naturally occurring generic sentences.
- Score: 75.4707956240456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: MGen is a dataset of over 4 million naturally occurring generic and quantified sentences extracted from diverse textual sources. Sentences in the dataset have long context documents, corresponding to websites and academic papers, and cover 11 different quantifiers. We analyze the features of generics sentences in the dataset, with interesting insights: generics can be long sentences (averaging over 16 words) and speakers often use them to express generalisations about people. MGen is the biggest and most diverse dataset of naturally occurring generic sentences, opening the door to large-scale computational research on genericity. It is publicly available at https://gustavocilleruelo.com/mgen
Related papers
- AmalREC: A Dataset for Relation Extraction and Classification Leveraging Amalgamation of Large Language Models [0.7381551917607596]
This study has focused on the following major questions: (i) how to generate sentences from relations, (ii) how to compare and rank them, (iii) can we combine strengths of individual methods and amalgamate them to generate an even better quality of sentences, and (iv) how to evaluate the final dataset?
arXiv Detail & Related papers (2024-12-29T10:36:33Z) - Generics are puzzling. Can language models find the missing piece? [70.14604603488178]
We study the implicit quantification and context-sensitivity of generics by leveraging language models as models of language. We create ConGen, a dataset of 2873 naturally occurring generic and quantified sentences in context. Our experiments show generics are more context-sensitive than determiner quantifiers and about 20% of naturally occurring generics we analyze express weak generalisations.
arXiv Detail & Related papers (2024-12-15T21:30:21Z) - PersonalSum: A User-Subjective Guided Personalized Summarization Dataset for Large Language Models [3.516029765200171]
We propose a high-quality, personalized, manually annotated abstractive summarization dataset called PersonalSum.
This dataset is the first to investigate whether the focus of public readers differs from the generic summaries generated by Large Language Models.
arXiv Detail & Related papers (2024-10-04T20:12:39Z) - Compositional Generalization for Data-to-Text Generation [86.79706513098104]
We propose a novel model that addresses compositional generalization by clustering predicates into groups.
Our model generates text in a sentence-by-sentence manner, relying on one cluster of predicates at a time.
It significantly outperforms T5 baselines across all evaluation metrics.
arXiv Detail & Related papers (2023-12-05T13:23:15Z) - On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z) - Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages [21.018996007110324]
This dataset includes 41.8 million news articles in 14 different Indic languages (and English).
To the best of our knowledge, this is the largest collection of curated articles for Indic languages currently available.
arXiv Detail & Related papers (2023-05-10T03:07:17Z) - Penguins Don't Fly: Reasoning about Generics through Instantiations and Exceptions [73.56753518339247]
We present a novel framework informed by linguistic theory to generate exemplars -- specific cases when a generic holds true or false.
We generate 19k exemplars for 650 generics and show that our framework outperforms a strong GPT-3 baseline by 12.8 precision points.
arXiv Detail & Related papers (2022-05-23T22:45:53Z) - SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z) - GenericsKB: A Knowledge Base of Generic Statements [18.68800894936855]
We present a new resource for the NLP community, namely a large (3.5M+ sentence) knowledge base of *generic statements*.
This is the first large resource to contain *naturally occurring* generic sentences, as opposed to extracted or crowdsourced triples.
All GenericsKB sentences are annotated with their topical term, surrounding context (sentences), and a (learned) confidence.
arXiv Detail & Related papers (2020-05-02T00:08:42Z)