CoSMo: A constructor specification language for Abstract Wikipedia's
content selection process
- URL: http://arxiv.org/abs/2308.02539v1
- Date: Tue, 1 Aug 2023 13:57:23 GMT
- Title: CoSMo: A constructor specification language for Abstract Wikipedia's
content selection process
- Authors: Kutz Arrieta and Pablo R. Fillottrani and C. Maria Keet
- Abstract summary: Representing snippets of information abstractly is a task that needs to be performed for various purposes.
For the Abstract Wikipedia project, requirements analysis revealed that such an abstract representation requires multilingual modelling.
There is no modelling language that meets any of the three features, let alone a combination.
Following a rigorous language design process, we created CoSMo, a novel Content Selection Modeling language.
- Score: 2.298932494750101
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Representing snippets of information abstractly is a task that needs to be
performed for various purposes, such as database view specification and the
first stage in the natural language generation pipeline for generative AI from
structured input, i.e., the content selection stage to determine what needs to
be verbalised. For the Abstract Wikipedia project, requirements analysis
revealed that such an abstract representation requires multilingual modelling,
content selection covering declarative content and functions, and both classes
and instances. There is no modelling language that meets any of the three
features, let alone a combination. Following a rigorous language design process
inclusive of broad stakeholder consultation, we created CoSMo, a novel Content
Selection Modeling language that meets these and other
requirements so that it may be useful both in Abstract Wikipedia and in
other contexts. We describe the design process, rationale and choices, the
specification, and preliminary evaluation of the language.
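To make the content selection stage concrete, the following Python sketch illustrates, in a hedged way, what selecting content from structured input for later verbalisation can look like: declarative facts are filtered by property, a function-derived fact is added, and both class-level and instance-level statements are carried along. This is not CoSMo syntax; the data layout, the property names, and the helper functions are assumptions made purely for illustration.
```python
# Illustrative sketch only -- NOT CoSMo syntax. It mimics the content selection
# stage: given structured statements about an entity, choose the facts to
# verbalise, including one function-derived fact. All names are assumptions.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Statement:
    subject: str   # instance or class identifier (e.g. a Wikidata item id)
    prop: str      # property name
    value: object  # literal value or another identifier

def select_content(
    statements: list[Statement],
    wanted_props: set[str],
    derived: Optional[dict[str, Callable[[list[Statement]], object]]] = None,
) -> list[Statement]:
    """Return the declarative facts to verbalise, plus any function-derived facts."""
    selected = [s for s in statements if s.prop in wanted_props]
    for name, fn in (derived or {}).items():
        selected.append(Statement(statements[0].subject, name, fn(statements)))
    return selected

if __name__ == "__main__":
    facts = [
        Statement("Q937", "instance of", "human"),   # class-level membership
        Statement("Q937", "date of birth", 1879),
        Statement("Q937", "date of death", 1955),
        Statement("Q937", "occupation", "physicist"),
    ]

    def age_at_death(fs: list[Statement]) -> int:
        birth = next(s.value for s in fs if s.prop == "date of birth")
        death = next(s.value for s in fs if s.prop == "date of death")
        return death - birth   # a "function" in the sense of derived content

    # Select two declarative facts and one derived fact for verbalisation.
    for s in select_content(facts, {"occupation", "date of birth"},
                            derived={"age at death": age_at_death}):
        print(s)
```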
Related papers
- Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition [50.86415025650168]
Masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge.
We propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch.
arXiv Detail & Related papers (2025-03-24T14:53:35Z) - Generating Continuations in Multilingual Idiomatic Contexts [2.0849578298972835]
We test the ability of generative language models (LMs) to understand nuanced language containing non-compositional figurative text.
We conduct experiments using datasets in two distinct languages (English and Portuguese) under three different training settings.
Our results suggest that the models are only slightly better at generating continuations for literal contexts than idiomatic contexts, with exceedingly small margins.
arXiv Detail & Related papers (2023-10-31T05:40:33Z) - To token or not to token: A Comparative Study of Text Representations
for Cross-Lingual Transfer [23.777874316083984]
We propose a Language Quotient scoring metric that provides a weighted representation of zero-shot and few-shot evaluation combined.
Our analysis reveals that image-based models excel in cross-lingual transfer when languages are closely related and share visually similar scripts.
In dependency parsing tasks, where word relationships play a crucial role, models with a character-level focus outperform others.
arXiv Detail & Related papers (2023-10-12T06:59:10Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - $\mu$PLAN: Summarizing using a Content Plan as Cross-Lingual Bridge [72.64847925450368]
Cross-lingual summarization consists of generating a summary in one language given an input document in a different language.
This work presents $\mu$PLAN, an approach to cross-lingual summarization that uses an intermediate planning step as a cross-lingual bridge.
arXiv Detail & Related papers (2023-05-23T16:25:21Z) - ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for
Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z) - Multilingual Generative Language Models for Zero-Shot Cross-Lingual
Event Argument Extraction [80.61458287741131]
We present a study on leveraging multilingual pre-trained generative language models for zero-shot cross-lingual event argument extraction (EAE).
By formulating EAE as a language generation task, our method effectively encodes event structures and captures the dependencies between arguments.
Our proposed model finetunes multilingual pre-trained generative language models to generate sentences that fill in the language-agnostic template with arguments extracted from the input passage.
arXiv Detail & Related papers (2022-03-15T23:00:32Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language-aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - Examining Cross-lingual Contextual Embeddings with Orthogonal Structural
Probes [0.2538209532048867]
A novel Orthogonal Structural Probe (Limisiewicz and Mareček, 2021) allows us to answer this question for specific linguistic features.
We evaluate syntactic (UD) and lexical (WordNet) structural information encoded in mBERT's contextual representations for nine diverse languages.
We successfully apply our findings to zero-shot and few-shot cross-lingual parsing.
arXiv Detail & Related papers (2021-09-10T15:03:11Z) - Language-Agnostic Representation Learning of Source Code from Structure
and Context [43.99281651828355]
We propose a new model, which jointly learns on Context and Structure of source code.
We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages.
arXiv Detail & Related papers (2021-03-21T06:46:06Z) - Breaking Writer's Block: Low-cost Fine-tuning of Natural Language
Generation Models [62.997667081978825]
We describe a system that fine-tunes a natural language generation model for the problem of solving Writer's Block.
The proposed fine-tuning obtains excellent results, even with a small number of epochs and a total cost of USD 150.
arXiv Detail & Related papers (2020-12-19T11:19:11Z) - Scalable Cross-lingual Document Similarity through Language-specific
Concept Hierarchies [0.0]
This paper presents an unsupervised document similarity algorithm that does not require parallel or comparable corpora.
The algorithm annotates topics automatically created from documents in a single language with cross-lingual labels.
Experiments performed on the English, Spanish and French editions of the JRC-Acquis corpus reveal promising results on classifying and sorting documents by similar content.
arXiv Detail & Related papers (2020-12-15T10:42:40Z)